This program will transfer tables from MySQL like databases to Redshift.
Build project using SBT and create fat jar using sbt assembly
.
This is spark application and requires platform which can run spark on it like yarn, spark-cluster & mesos.
Submitting spark job on yarn :-
spark-submit \
--class com.goibibo.sqlshift.SQLShift \
--master yarn \
sqlshift-assembly-0.1.jar -td table-details.json
Submitting spark job locally :-
spark-submit \
--class com.goibibo.sqlshift.SQLShift \
--master local[*] \
sqlshift-assembly-0.1.jar -td table-details.json
More Options
MySQL to Redshift DataPipeline
Usage: SQLShift [options]
-td, --table-details <path to Json>
Table details json file path including
-mail, --mail-details <path to properties file>
Mail details property file path(For enabling mail)
-aof, --alert-on-failure
Alert only when fails
-rc, --retry-count <count>
How many times to retry on failed transfers
-lmr, --log-metrics-reporter
Enable metrics reporting in logs
-jmr, --jmx-metrics-reporter
Enable metrics reporting through JMX
-mws, --metrics-window-size <value>
Metrics window size in seconds. Default: 5 seconds
--help Usage Of arguments
[
{
"mysql":
{
"db": "",
"hostname": "",
"portno": "",
"username": "",
"password": ""
},
"tables": [
{
"name": "",
"incremental": "",
"mergeKey": "",
"addColumn": "",
"partitions": "",
"isSplittable": true / false,
"isAppendOnly": true / false,
"preLoadCmd":"",
"postLoadCmd":""
}
],
"redshift":
{
"schema": "",
"hostname": "",
"portno": "",
"username": "",
"password": ""
},
"s3":
{
"location": "",
"accessKey": "",
"secretKey": ""
}
}
]
alert.host =
alert.port = 25
alert.username =
alert.password =
alert.to =
alert.cc =
alert.subject =
localhost
with listening port 3306
. Server has user root
with
admin
as password. Now we have to migrate complete table test
from default
database to redshift in schema base
which is running on remote host.[
{
"mysql": {
"db": "default",
"hostname": "localhost",
"portno": 3306,
"username": "root",
"password": "admin"
},
"tables": [
{
"name": "test"
}
],
"redshift": {
"schema": "base",
"hostname": "redshift.amazonaws.com",
"portno": 5439,
"username": "admin",
"password": "admin"
},
"s3": {
"location": "s3a://bucket/",
"accessKey": "access key",
"secretKey": "secret key"
}
}
]
Note[1]- Full dump of tables is good but if table is too large(like in 10s of GBs) then it can take time. So if you want to lift small part of table then you can do it using incremental facility provided.
Incremental:- In incremental just add where
condition without where
clause in configuration. Let's say table test
has
modified_at
column which is timestamp, changes when an entry/record is updated or created and we need data of day.
Value in incremental should be SQL which is compatible with MYSQL.
[
{
"mysql": {...},
"tables": [
{
"name": "test",
"incremental": "modified_at BETWEEN '2016-12-30 00:00:00' AND '2016-12-30 23:59:59'"
}
],
"redshift": {...},
"s3": {...}
}
]
Note[2]- Program will works better if you have only one integer primary key in MySQL Table otherwise you have to
provide distkey
which should be integer and unique. This distkey
helps program to split data and process it in
parallel. If your table doesn't have any of these options then add "isSplittable": false
field to json so it will not
split table.
Note[3]- It is better to have an auto-incremental column in MySQL which should be only PRIMARY KEY because it will help application process data in parallel.
Non-Splittable:- If table is non-splittable then add "isSplittable": false
. But large table needs to be splitted otherwise it will
became impossible to migrate table in real-time. By default this field is false
if not passed.
[
{
"mysql": {...},
"tables": [
{
"name": "test",
"isSplittable": true
}
],
"redshift": {...},
"s3": {...}
}
]
Add Columns:- If you wanted additional column in redshift table with some processing which can done in SQL
compatible with Redshift. Suppose you want to extract year
and month
from created_at
column you can do it in
following way:-
[
{
"mysql": {...},
"tables": [
{
"name": "test",
"addColumn": "EXTRACT(year from created_at)::INT AS year, EXTRACT(month from created_at)::INT AS month"
}
],
"redshift": {...},
"s3": {...}
}
]
Please use the following support channels: