go-etl's datax is a data synchronization tool. It currently supports synchronization between mainstream relational databases such as MySQL, Postgres, Oracle, SQL Server, and DB2, as well as file formats like CSV and XLSX.
You can download the 64-bit version of the binary program for Windows or Linux operating systems from the latest release.
Obtain the datax binary program from the downloaded package. On Linux, as shown in the Makefile, run `export LD_LIBRARY_PATH=/home/ibmdb/clidriver/lib`. This client library can be downloaded from IBM DB2; without it, datax will not run.
Additionally, for Oracle, you need to download the corresponding 64-bit version of the ODBC dependency from Oracle.
Calling datax is straightforward; you can simply invoke it as follows:

```bash
datax -c config.json
```

-c specifies the data source configuration file.

A return value of 0 together with the message "run success" indicates successful execution. A return value of 1 together with the message "run fail" indicates failure, along with the reason.
The data source configuration file is a JSON file that combines data sources. For example, to synchronize data from MySQL to Postgres:

```json
{
    "core" : {
        "container": {
            "job":{
                "id": 1,
                "sleepInterval":100
            }
        }
    },
    "job":{
        "content":[
            {
                "reader":{
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "test",
                        "password": "test",
                        "column": ["*"],
                        "connection": {
                            "url": "tcp(192.168.15.130:3306)/source?parseTime=false",
                            "table": {
                                "db":"source",
                                "name":"type_table"
                            }
                        },
                        "where": ""
                    }
                },
                "writer":{
                    "name": "postgreswriter",
                    "parameter": {
                        "username": "postgres",
                        "password": "123456",
                        "writeMode": "insert",
                        "column": ["*"],
                        "preSql": [],
                        "connection": {
                            "url": "postgres://192.168.15.130:5432/postgres?sslmode=disable&connect_timeout=2",
                            "table": {
                                "schema":"destination",
                                "name":"type_table"
                            }
                        },
                        "batchTimeout": "1s",
                        "batchSize":1000
                    }
                },
                "transformer":[]
            }
        ]
    }
}
```
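To illustrate the shape of this file, the sketch below unmarshals the reader and writer plugin names from such a config using Go's standard `encoding/json`. The struct types here are illustrative only, not go-etl's actual internal types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobConfig mirrors just enough of the configuration above to pull out the
// reader and writer plugin names. Illustrative only.
type jobConfig struct {
	Job struct {
		Content []struct {
			Reader struct {
				Name string `json:"name"`
			} `json:"reader"`
			Writer struct {
				Name string `json:"name"`
			} `json:"writer"`
		} `json:"content"`
	} `json:"job"`
}

// readerWriterNames returns one "reader -> writer" pair per content entry.
func readerWriterNames(data []byte) ([]string, error) {
	var cfg jobConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	pairs := make([]string, 0, len(cfg.Job.Content))
	for _, c := range cfg.Job.Content {
		pairs = append(pairs, c.Reader.Name+" -> "+c.Writer.Name)
	}
	return pairs, nil
}

func main() {
	data := []byte(`{"job":{"content":[{"reader":{"name":"mysqlreader"},"writer":{"name":"postgreswriter"}}]}}`)
	pairs, err := readerWriterNames(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(pairs) // [mysqlreader -> postgreswriter]
}
```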
The configurations for the reader and writer are as follows:
| Type | Data Source | Reader (Read) | Writer (Write) | Documentation |
| --- | --- | --- | --- | --- |
| Relational Database | MySQL/Mariadb/Tidb | √ | √ | Read, Write |
| Relational Database | Postgres/Greenplum | √ | √ | Read, Write |
| Relational Database | DB2 LUW | √ | √ | Read, Write |
| Relational Database | SQL Server | √ | √ | Read, Write |
| Relational Database | Oracle | √ | √ | Read, Write |
| Relational Database | Sqlite3 | √ | √ | Read, Write |
| Unstructured Stream | CSV | √ | √ | Read, Write |
| Unstructured Stream | XLSX (excel) | √ | √ | Read, Write |
Note: On Linux, as indicated in the Makefile, `export LD_LIBRARY_PATH=${DB2HOME}/lib`.
- Initialize the database using `cmd/datax/examples/mysql/init.sql` for testing purposes
- Start the MySQL synchronization command:

```bash
datax -c examples/mysql/config.json
```

- Initialize the database using `cmd/datax/examples/postgres/init.sql` for testing purposes
- Start the PostgreSQL synchronization command:

```bash
datax -c examples/postgres/config.json
```
- Before use, download the corresponding DB2 ODBC library, e.g., via `make dependencies` on Linux and `release.bat` on Windows
- Note: On Linux, as indicated in the Makefile, `export LD_LIBRARY_PATH=${DB2HOME}/lib`
- Note: On Windows, as indicated in release.bat, `set path=%path%;%GOPATH%\src\github.com\ibmdb\go_ibm_db\clidriver\bin`
- Initialize the database using `cmd/datax/examples/db2/init.sql` for testing purposes
- Start the synchronization command:

```bash
datax -c examples/db2/config.json
```
- Before use, download the corresponding Oracle Instant Client. For example, to connect to Oracle 11g, it is recommended to download version 12.x.
- Note: On Linux, `export LD_LIBRARY_PATH=/opt/oracle/instantclient_21_1:$LD_LIBRARY_PATH`. Additionally, install `libaio`.
- Note: On Windows, `set path=%path%;%GOPATH%\oracle\instantclient_21_1`. Oracle Instant Client 19 no longer supports Windows 7.
- Initialize the database using `cmd/datax/examples/oracle/init.sql` for testing purposes
- Start the synchronization command:

```bash
datax -c examples/oracle/config.json
```
- Initialize the database using `cmd/datax/examples/sqlserver/init.sql` for testing purposes
- Start the SQL Server synchronization command:

```bash
datax -c examples/sqlserver/config.json
```
- Initialize the database using `cmd/datax/examples/csvpostgres/init.sql` for testing purposes
- Start the synchronization command:

```bash
datax -c examples/csvpostgres/config.json
```
- Initialize the database using `cmd/datax/examples/csvpostgres/init.sql` for testing purposes
- Start the synchronization command:

```bash
datax -c examples/xlsxpostgres/config.json
```
- Initialize the database using `cmd/datax/examples/csvpostgres/init.sql` for testing purposes
- Start the synchronization command:

```bash
datax -c examples/postgrescsv/config.json
```
- Initialize the database using `cmd/datax/examples/csvpostgres/init.sql` for testing purposes
- Start the synchronization command:

```bash
datax -c examples/postgresxlsx/config.json
```
- Before use, download the sqlite3 dynamic library from the SQLite Download Page.
- Note: On Windows, `set path=%path%;/opt/sqlite/sqlite3.dll`.
- Initialize the database using `cmd/datax/examples/sqlite3/init.sql` for testing purposes
- In `examples/sqlite3/config.json`, `url` is the path of the sqlite3 database file. On Windows it can be `E:\sqlite3\test.db`; on Linux it can be `/sqlite3/test.db`.
- Start the sqlite3 synchronization command:

```bash
datax -c examples/sqlite3/config.json
```
In addition to the above examples, all data sources listed in the go-etl features can be used interchangeably. Configurations can be set up for data sources such as MySQL to PostgreSQL, MySQL to Oracle, Oracle to DB2, etc.
```json
{
    "job":{
        "setting":{
            "pool":{
                "maxOpenConns":8,
                "maxIdleConns":8,
                "connMaxIdleTime":"40m",
                "connMaxLifetime":"40m"
            },
            "retry":{
                "type":"ntimes",
                "strategy":{
                    "n":3,
                    "wait":"1s"
                },
                "ignoreOneByOneError":true
            }
        }
    }
}
```
- `maxOpenConns`: Maximum number of open connections
- `maxIdleConns`: Maximum number of idle connections
- `connMaxIdleTime`: Maximum idle time for a connection
- `connMaxLifetime`: Maximum lifetime of a connection
- `ignoreOneByOneError`: Whether to ignore individual retry errors
- Retry types and strategies:
  - Type `ntimes`: Retry a fixed number of times. Strategy: `"strategy":{"n":3,"wait":"1s"}`, where `n` is the number of retries and `wait` is the waiting time between retries.
  - Type `forever`: Retry indefinitely. Strategy: `"strategy":{"wait":"1s"}`, where `wait` is the waiting time between retries.
  - Type `exponential`: Exponential backoff retry. Strategy: `"strategy":{"init":"100ms","max":"4s"}`, where `init` is the initial waiting time and `max` is the maximum waiting time.
```bash
datax -c examples/global/config.json
```
Splitting assumes that data is evenly distributed over the split key, so a well-chosen split key can make synchronization faster. Additionally, to avoid slow queries for the maximum and minimum values on large tables, preset maximum and minimum values can be supplied.
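For an integer split key, the idea can be sketched as dividing the key interval [min, max] into contiguous ranges, one per task. This is an illustrative sketch only; go-etl's actual splitter also handles date and string keys:

```go
package main

import "fmt"

// splitRanges divides the inclusive integer interval [min, max] into n
// contiguous sub-ranges so each task can read one range. Illustrative
// sketch, not go-etl's actual splitting code.
func splitRanges(min, max, n int64) [][2]int64 {
	total := max - min + 1
	base := total / n
	rem := total % n
	ranges := make([][2]int64, 0, n)
	lo := min
	for i := int64(0); i < n; i++ {
		size := base
		if i < rem {
			size++ // spread the remainder over the first ranges
		}
		hi := lo + size - 1
		ranges = append(ranges, [2]int64{lo, hi})
		lo = hi + 1
	}
	return ranges
}

func main() {
	fmt.Println(splitRanges(1, 100, 4)) // [[1 25] [26 50] [51 75] [76 100]]
}
```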
- Generate MySQL data using a program and create `split.csv`:

```bash
cd cmd/datax/examples/split
go run main.go
```

- Use `init.sql` to create the table
- Synchronize to the MySQL database:

```bash
cd ../..
datax -c examples/split/csv.json
```

- Modify `examples/split/config.json` to set the split key to `id,dt,str`
- Synchronize MySQL data with integer, date, and string types using the split key:

```bash
datax -c examples/split/config.json
```
`preSql` and `postSql` are sets of SQL statements executed before and after writing data, respectively. In this example, a full import is used:

- Before writing data, a temporary table is created.
- After writing data, the original table is deleted and the temporary table is renamed to the new table.

```bash
datax -c examples/prePostSql/config.json
```
Previously, the `byte` and `record` settings under speed did not take effect. With the introduction of flow control, `byte` and `record` are now effective: `byte` limits the size of cached messages in bytes, while `record` limits the number of cached messages. If `byte` is set too low, the cache becomes too small and data synchronization can fail. When `byte` is 0 or negative, the limiter is disabled. For example, `byte` can be set to 10485760, which is 10MB (10×1024×1024).
```json
{
    "job":{
        "setting":{
            "speed":{
                "byte":10485760,
                "record":1024,
                "channel":4
            }
        }
    }
}
```
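The limiting logic described above can be sketched as a simple check against the two caps. This is an illustrative sketch under the stated semantics (a non-positive `byte` disables byte limiting), not go-etl's actual flow-control implementation:

```go
package main

import "fmt"

// exceedsSpeedLimit reports whether a buffered batch breaks the configured
// caps. A non-positive byteCap disables byte limiting, mirroring the
// behaviour described above. Illustrative sketch only.
func exceedsSpeedLimit(bufferedBytes, bufferedRecords, byteCap, recordCap int64) bool {
	if byteCap > 0 && bufferedBytes > byteCap {
		return true
	}
	if recordCap > 0 && bufferedRecords > recordCap {
		return true
	}
	return false
}

func main() {
	byteCap := int64(10 * 1024 * 1024) // 10485760, the 10MB cap shown above
	fmt.Println(exceedsSpeedLimit(11_000_000, 100, byteCap, 1024)) // true: bytes exceed the cap
	fmt.Println(exceedsSpeedLimit(11_000_000, 100, 0, 1024))       // false: byteCap of 0 disables byte limiting
}
```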
- Generate `src.csv` using a program and initiate the flow control test:

```bash
cd cmd/datax/examples/limit
go run main.go
cd ../..
datax -c examples/limit/config.json
```
The database reader uses `querySql` to query the database.

2.2 Multi-Task Data Synchronization
- Configure the data source configuration file, such as syncing from MySQL to PostgreSQL:

```json
{
    "core" : {
        "container": {
            "job":{
                "id": 1,
                "sleepInterval":100
            }
        }
    },
    "job":{
        "content":[
            {
                "reader":{
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "test",
                        "password": "test",
                        "column": ["*"],
                        "connection": {
                            "url": "tcp(192.168.15.130:3306)/source?parseTime=false",
                            "table": {
                                "db":"source",
                                "name":"type_table"
                            }
                        },
                        "where": ""
                    }
                },
                "writer":{
                    "name": "postgreswriter",
                    "parameter": {
                        "username": "postgres",
                        "password": "123456",
                        "writeMode": "insert",
                        "column": ["*"],
                        "preSql": [],
                        "connection": {
                            "url": "postgres://192.168.15.130:5432/postgres?sslmode=disable&connect_timeout=2",
                            "table": {
                                "schema":"destination",
                                "name":"type_table"
                            }
                        },
                        "batchTimeout": "1s",
                        "batchSize":1000
                    }
                },
                "transformer":[]
            }
        ]
    }
}
```
The source-destination configuration wizard file is a CSV file. Each row of configuration is set as follows:

`path[table],path[table]`

Each column can be a path or a table name. Note that every table must have a database name or schema name, which is configured in the data source configuration file.
```bash
datax -c tools/testData/xlsx.json -w tools/testData/wizard.csv
```

-c specifies the data source configuration file, and -w specifies the source-destination configuration wizard file. Execution generates a set of configuration files in the directory of the data source configuration file, one per row of the wizard file, named specified_data_source_config_file1.json, specified_data_source_config_file2.json, ..., specified_data_source_config_file[n].json.
Additionally, an execution script is generated in the current directory: run.bat on Windows and run.sh on Linux.
You can run the test data in cmd/datax/testData:

```bash
cd cmd/datax
datax -c testData/xlsx.json -w testData/wizard.csv
```

The result generates a set of configuration files in the testData directory, one per row of wizard.csv, named xlsx1.json, xlsx2.json, ..., xlsx[n].json.
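The naming scheme for the generated files can be sketched as a small helper. This is illustrative only and not part of the datax wizard's actual code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// generatedConfigNames reproduces the naming scheme described above:
// base "xlsx.json" with n wizard rows yields xlsx1.json ... xlsx[n].json.
func generatedConfigNames(base string, n int) []string {
	ext := filepath.Ext(base)             // ".json"
	stem := strings.TrimSuffix(base, ext) // "xlsx"
	names := make([]string, n)
	for i := range names {
		names[i] = fmt.Sprintf("%s%d%s", stem, i+1, ext)
	}
	return names
}

func main() {
	fmt.Println(generatedConfigNames("xlsx.json", 3)) // [xlsx1.json xlsx2.json xlsx3.json]
}
```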
```bash
datax -h
```

Help display:

```
Usage of datax:
  -c string
        config (default "config.json")
  -http string
        http
  -w string
        wizard
```

-http adds a listening port, such as 8080. After enabling it, access 127.0.0.1:8080/metrics to get real-time throughput.
```bash
datax version
```

Display: `version number (git commit: git commit number) compiled by go version go version number`

```
v0.1.0 (git commit: c82eb302218f38cd3851df4b425256e93f85160d) compiled by go version go1.16.5 windows/amd64
```
```bash
datax -http :8443 -c examples/limit/config.json
```

Use a web browser to access http://127.0.0.1:8443/metrics to get the current monitoring data:

```json
{"jobID":1,"metrics":[{"taskGroupID":0,"metrics":[{"taskID":0,"channel":{"totalByte":2461370,"totalRecord":128624,"byte":3820,"record":191}}]}]}
```
- `totalByte`: Total number of bytes synchronized
- `totalRecord`: Total number of records synchronized
- `byte`: Number of bytes currently synchronized in the channel
- `record`: Number of records currently synchronized in the channel
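The metrics payload can also be consumed programmatically. The sketch below sums `totalByte` across all task groups using Go's standard `encoding/json`; the struct types are illustrative, not go-etl's internal ones:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricsPayload mirrors the JSON shape shown above. Illustrative only.
type metricsPayload struct {
	JobID      int64 `json:"jobID"`
	TaskGroups []struct {
		TaskGroupID int64 `json:"taskGroupID"`
		Tasks       []struct {
			TaskID  int64 `json:"taskID"`
			Channel struct {
				TotalByte   int64 `json:"totalByte"`
				TotalRecord int64 `json:"totalRecord"`
				Byte        int64 `json:"byte"`
				Record      int64 `json:"record"`
			} `json:"channel"`
		} `json:"metrics"`
	} `json:"metrics"`
}

// totalBytes sums totalByte over every task in every task group.
func totalBytes(payload []byte) (int64, error) {
	var m metricsPayload
	if err := json.Unmarshal(payload, &m); err != nil {
		return 0, err
	}
	var sum int64
	for _, g := range m.TaskGroups {
		for _, t := range g.Tasks {
			sum += t.Channel.TotalByte
		}
	}
	return sum, nil
}

func main() {
	sample := []byte(`{"jobID":1,"metrics":[{"taskGroupID":0,"metrics":[{"taskID":0,"channel":{"totalByte":2461370,"totalRecord":128624,"byte":3820,"record":191}}]}]}`)
	sum, err := totalBytes(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(sum) // 2461370
}
```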