Ground to cloud data integration tool
Used to load ad-hoc query results from PostgreSQL to Amazon-Redshift.
Works from Windows CLI (command line).
NOTE: This script requires the Windows PostgreSQL client (bin\psql.exe) for data extraction.
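For illustration only, a minimal sketch (not the actual extractor.py; host, user, and database names are placeholders) of how psql.exe can stream query results as CSV to stdout:

```python
# Sketch: stream query results from psql.exe as CSV (placeholder connection values).
import os
import subprocess

# PGRES_CLIENT_HOME is set with surrounding quotes in the examples below, so strip them.
psql = os.path.join(os.environ['PGRES_CLIENT_HOME'].strip('"'), 'bin', 'psql.exe')
copy_cmd = r"\copy (SELECT * FROM crime_test) TO STDOUT WITH (FORMAT csv, DELIMITER ',')"

# psql reads the password from the PGPASSWORD environment variable.
proc = subprocess.Popen(
    [psql, '-h', 'localhost', '-U', 'postgres', '-d', 'testdb', '-c', copy_cmd],
    stdout=subprocess.PIPE)
# proc.stdout is now a readable data pipe of CSV rows.
```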
Features:
- Loads PostgreSQL table (or query) data to Amazon-Redshift.
- Automatically uploads compressed data to S3 before appending it to the Redshift table.
- AWS Access Keys are not passed as arguments.
- You can modify the default Python extractor.py and loader.py.
- Written using Python/boto/psycopg2/PyInstaller.
Related tools:
- Oracle -> S3 data loader | TableHunter for Oracle | DataWorm
- Oracle -> Redshift data loader
- MySQL -> Redshift data loader
- CSV -> Redshift data loader
- EC2 Metrics Plotter
- Oracle -> Oracle data loader
- Oracle -> MySQL data loader
- CSV -> S3 data uploader
- Stream/pipe/load PostgreSQL table data to Amazon-Redshift.
- The tool connects to the source PostgreSQL DB and opens a data pipe for reading.
- The data stream is compressed and pumped to S3 using multipart upload (a minimal boto sketch follows this list).
- Optional upload to Reduced Redundancy storage (not RR by default).
- Optional "make it public" after upload (private by default).
- If the S3 bucket doesn't exist, it will be created.
- You can control the region where the new bucket is created.
- Streamed data can be tee'd (dumped to disk) during the load.
- If not set, the S3 key defaults to the input query file name.
- Data is loaded into Redshift from S3 using the COPY command.
- The target Redshift table has to exist.
- It's a Python/boto/psycopg2 script.
- Boto S3 docs: http://boto.cloudhackers.com/en/latest/ref/s3.html
- psycopg2 docs: http://initd.org/psycopg/docs/
- Executable is created using [pyInstaller](http://www.pyinstaller.org/).
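A minimal sketch, assuming the boto 2 API, of the compress-and-multipart-upload flow described above (the function and variable names such as stream_to_s3 and CHUNK_SIZE are illustrative, not the tool's actual loader code):

```python
# Sketch: gzip-compress a data pipe chunk by chunk and push it to S3 via multipart upload.
import gzip
import io

import boto

CHUNK_SIZE = 10 << 21  # ~20 MB, mirrors the default --s3_write_chunk_size


def stream_to_s3(source_fp, bucket_name, key_name, use_rr=False, make_public=False):
    conn = boto.connect_s3()  # credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    bucket = conn.lookup(bucket_name) or conn.create_bucket(bucket_name)
    mp = bucket.initiate_multipart_upload(key_name, reduced_redundancy=use_rr)
    part_num = 0
    try:
        while True:
            raw = source_fp.read(CHUNK_SIZE)
            if not raw:
                break
            buf = io.BytesIO()
            with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
                gz.write(raw)
            buf.seek(0)
            part_num += 1
            # Simplified: real code must keep every part except the last >= 5 MB.
            mp.upload_part_from_file(buf, part_num)
        mp.complete_upload()
    except Exception:
        mp.cancel_upload()
        raise
    if make_public:
        bucket.get_key(key_name).make_public()
```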
Audience: Database/ETL developers, Data Integrators, Data Engineers, Business Analysts, AWS Developers, SysOps.
## Designated Environment: Pre-Prod (UAT/QA/DEV)
c:\Python35-32\PROJECTS\PostgreSQL2redshift>dist\PostgreSQL_to_Redshift_loader.exe
#############################################################################
#PostgreSQL-to-Redshift Data Loader (v1.2, beta, 04/05/2016 15:11:53) [64bit]
#Copyright (c): 2016 Alex Buzunov, All rights reserved.
#Agreement: Use this tool at your own risk. Author is not liable for any damages
# or losses related to the use of this software.
################################################################################
Usage:
set AWS_ACCESS_KEY_ID=test_key
set AWS_SECRET_ACCESS_KEY=test_secret_key
set PGPASSWORD=test123
set PGRES_CLIENT_HOME="C:\Program Files\PostgreSQL\9.5"
set REDSHIFT_CONNECT_STRING="dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"
postgresql_to_redshift_loader.exe [<pgres_query_file>] [<pgres_col_delim>] [<pgres_add_header>]
[<s3_bucket_name>] [<s3_key_name>] [<s3_use_rr>] [<s3_public>]
--pgres_query_file -- SQL query to execute in the source PostgreSQL DB.
--pgres_col_delim -- CSV column delimiter for the downstream load (,).
--pgres_quote -- Enclose values in quotes (").
--pgres_add_header -- Add header line to CSV file (False).
--pgres_lame_duck -- Limit rows for trial upload (1000).
--create_data_dump -- Use it if you want to persist streamed data on your filesystem.
--s3_bucket_name -- S3 bucket name (always set it).
--s3_location -- New bucket location name (us-west-2).
Set it if you are creating a new bucket.
--s3_key_name -- CSV file name (to store query results on S3).
If <s3_key_name> is not specified, the PostgreSQL query file name (pgres_query_file) is used.
--s3_use_rr -- Use reduced redundancy storage (False).
--s3_write_chunk_size -- Chunk size for multipart upload to S3 (10<<21, ~20MB).
--s3_public -- Make uploaded file public (False).
--red_to_table -- Target Amazon-Redshift table name.
--red_col_delim -- CSV column delimiter for the upstream load (,).
--red_quote -- Set it if input values are quoted.
--red_timeformat -- Timestamp format for Redshift ('MM/DD/YYYY HH12:MI:SS').
--red_ignoreheader -- Skip the header line in the input stream.
PostgreSQL data uploaded to S3 is always compressed (gzip).
Boto S3 docs: http://boto.cloudhackers.com/en/latest/ref/s3.html
psycopg2 docs: http://initd.org/psycopg/docs/
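For reference, a minimal sketch (assuming the connect string above can be passed to psycopg2 as a DSN as-is) of picking up the connection settings from the environment:

```python
# Sketch: read the connection settings from the environment and open a Redshift connection.
import os

import psycopg2

dsn = os.environ['REDSHIFT_CONNECT_STRING'].strip('"')
conn = psycopg2.connect(dsn)  # Redshift speaks the PostgreSQL wire protocol
cur = conn.cursor()
cur.execute('SELECT current_user')
print(cur.fetchone())
```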
Set the following environment variables (for all tests):
set AWS_ACCESS_KEY_ID=test_key
set AWS_SECRET_ACCESS_KEY=test_secret_key
set PGPASSWORD=test123
set PGRES_CLIENT_HOME="C:\Program Files\PostgreSQL\9.5"
set REDSHIFT_CONNECT_STRING="dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"
The PostgreSQL table crime_test contains data from the data.gov Crime dataset.
In this example the complete table crime_test gets uploaded to Amazon S3 as a compressed CSV file.
Contents of the file table_query.sql:
SELECT * FROM crime_test;
A temporary dump file is also created for analysis (by default, no files are created).
Use -s, --create_data_dump
to dump streamed data.
If the target bucket does not exist, it will be created in a user-controlled region.
Use the argument -t, --s3_location
to set the target region name.
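A minimal sketch, assuming the boto 2 API, of the get-or-create bucket behaviour with the region taken from --s3_location (the helper name get_or_create_bucket is illustrative):

```python
# Sketch: create the target bucket in a chosen region if it does not exist yet.
import boto


def get_or_create_bucket(name, location='us-west-2'):
    conn = boto.connect_s3()
    bucket = conn.lookup(name)  # returns None when the bucket does not exist
    if bucket is None:
        bucket = conn.create_bucket(name, location=location)
    return bucket
```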
Contents of the file test.bat:
dist-64bit\PostgreSQL_to_redshift_loader.exe ^
-q table_query.sql ^
-d "," ^
-b test_bucket ^
-k postgresql_table_export ^
-r ^
-o crime_test ^
-m "DD/MM/YYYY HH12:MI:SS" ^
-s
Executing test.bat:
c:\Python35-32\PROJECTS\PostgreSQL2redshift>dist-64bit\PostgreSQL_to_redshift_loader.exe -q table_query.sql -d "," -b test_bucket -k postgresql_table_export -r -o crime_test -m "DD/MM/YYYY HH12:MI:SS" -s
Uploading results of "table_query.sql" to existing bucket "test_bucket"
Started reading from PostgreSQL (1.25 sec).
Dumping data to: c:\Python35-32\PROJECTS\Ora2redshift\data_dump\table_query\test_bucket\postgresql_table_export.20160408_203221.gz
1 chunk 10.0 MB [11.36 sec]
2 chunk 10.0 MB [11.08 sec]
3 chunk 10.0 MB [11.14 sec]
4 chunk 10.0 MB [11.12 sec]
5 chunk 877.66 MB [0.96 sec]
Size: Uncompressed: 40.86 MB
Size: Compressed : 8.95 MB
Elapsed: PostgreSQL+S3 :69.12 sec.
Elapsed: S3->Redshift :3.68 sec.
--------------------------------
Total elapsed: 72.81 sec.
You can modify the default Redshift COPY command this script uses.
Open the file include\loader.py and modify the sql variable on line 24.
sql="""
COPY %s FROM '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
DELIMITER '%s'
FORMAT CSV %s
GZIP
%s
%s;
COMMIT;
...
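For illustration, a minimal sketch (not the actual loader.py; the table, bucket, key, and time format are taken from the example above, and the S3 path is a placeholder) of executing such a COPY against Redshift with psycopg2:

```python
# Sketch: run a COPY like the template above against Redshift (placeholder S3 path).
import os

import psycopg2

copy_sql = """
    COPY crime_test FROM 's3://test_bucket/postgresql_table_export.gz'
    CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
    DELIMITER ',' CSV GZIP
    TIMEFORMAT 'DD/MM/YYYY HH12:MI:SS';
""" % (os.environ['AWS_ACCESS_KEY_ID'], os.environ['AWS_SECRET_ACCESS_KEY'])

conn = psycopg2.connect(os.environ['REDSHIFT_CONNECT_STRING'].strip('"'))
cur = conn.cursor()
cur.execute(copy_sql)
conn.commit()
```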
https://github.com/pydemo/teardown