CSV File Loader for Amazon Redshift DB.

A ground-to-cloud data integration tool: loads a CSV file into an Amazon Redshift table from the command line.

Features:

  • Loads a CSV file into an Amazon Redshift table.
  • Stages your data in Amazon S3 before the insert into Redshift.
  • No AWS CLI required.
  • The Python script runs on both Linux and Windows.
  • The COPY command is configurable via the loader script.
  • Ships as a standalone executable (csv_loader_for_redshift.exe); no Python installation needed.
  • Runs from a vanilla command prompt on 64-bit Windows.
  • AWS access keys are read from environment variables, never passed as command-line arguments.
  • Written in Python with boto and psycopg2.
  • Compiled with PyInstaller.

Purpose

  • Ad-hoc CSV file loads into an Amazon Redshift table.

How it works

The loader optionally gzips the input CSV, uploads it to the given S3 bucket, then connects to your Redshift cluster and issues a COPY command that points at the uploaded object, as sketched below.
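
A minimal sketch of that flow, assuming boto (v2) and psycopg2 are installed and the three environment variables from the Usage section are set. The function name stage_and_copy and the hard-coded COPY options are illustrative, not the script's actual interface:

import os, gzip, shutil
import boto
import psycopg2
from boto.s3.key import Key

def stage_and_copy(csv_path, bucket_name, table):
    # gzip the input file (the real tool does this only with --gzip_source_file)
    gz_path = csv_path + '.gz'
    with open(csv_path, 'rb') as f_in, gzip.open(gz_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

    # stage the file in S3; boto reads the AWS keys from the environment
    key = Key(boto.connect_s3().get_bucket(bucket_name))
    key.key = os.path.basename(gz_path)  # input filename becomes the S3 key
    key.set_contents_from_filename(gz_path)

    # issue COPY on the cluster via psycopg2
    con = psycopg2.connect(os.environ['REDSHIFT_CONNECT_STRING'])
    cur = con.cursor()
    cur.execute(
        "COPY %s FROM 's3://%s/%s' "
        "CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' "
        "DELIMITER ',' CSV GZIP IGNOREHEADER 1;"
        % (table, bucket_name, key.key,
           os.environ['AWS_ACCESS_KEY_ID'],
           os.environ['AWS_SECRET_ACCESS_KEY']))
    con.commit()
    con.close()

stage_and_copy('Crime.csv', 'testbucket', 'test')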

Audience

Database/ETL developers, data integrators, data engineers, business analysts, AWS developers, and DevOps engineers.

Designated Environment

Pre-Prod (UAT/QA/DEV)

Usage

>dist-64bit\csv_loader_for_redshift.exe
################################################################################
#CSV-to-Redshift Data Loader (v1.2, beta, 04/05/2016 15:11:53) [64bit]
#Copyright (c): 2016 Alex Buzunov, All rights reserved.
#Agreement: Use this tool at your own risk. Author is not liable for any damages
#           or losses related to the use of this software.
################################################################################
Usage:
  set AWS_ACCESS_KEY_ID=<your access key>
  set AWS_SECRET_ACCESS_KEY=<your secret key>
  set REDSHIFT_CONNECT_STRING="dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"
  csv_loader_for_redshift.exe <file_to_transfer> <bucket_name> [<use_rr>] [<public>]
                                                 [<delim>] [<quote>] [<to_table>] [<gzip_source_file>]

        --use_rr -- Use reduced redundancy storage (False).
        --public -- Make uploaded files public (False).
        --delim  -- CSV file delimiter (',').
        --quote  -- CSV quote ('"').
        --to_table -- Target Amazon Redshift table name.
        --timeformat -- Timestamp format (MM/DD/YYYY HH12:MI:SS).
        --ignoreheader -- Number of leading lines to ignore (0).
        --gzip_source_file -- gzip the input CSV file before upload to Amazon S3 (False).

        Input filename is used as the S3 key name.

        Boto S3 docs: http://boto.cloudhackers.com/en/latest/ref/s3.html
        psycopg2 docs: http://initd.org/psycopg/docs/

Example

Environment variables

  • Set the following environment variables (used by all tests):
set AWS_ACCESS_KEY_ID=<your access key>
set AWS_SECRET_ACCESS_KEY=<your secret key>

set REDSHIFT_CONNECT_STRING="dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"  
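
On Linux, export the same variables instead:

export AWS_ACCESS_KEY_ID=<your access key>
export AWS_SECRET_ACCESS_KEY=<your secret key>
export REDSHIFT_CONNECT_STRING="dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"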

CSV file upload into a Redshift table

The short flags correspond to the long options on the usage screen (-r --use_rr, -p --public, -d --delim, -t --to_table, -z --gzip_source_file, -i --ignoreheader):

cd c:\Python35-32\PROJECTS\csv2redshift
csv_loader_for_redshift.exe Crime.csv testbucket ^
	-r ^
	-p ^
	-d "," ^
	-t test ^
	-z ^
	-i 1

Result

S3        | Crime.csv.gz | 100%
Redshift  | test         | DONE
Time elapsed: 45.7 seconds

Controlling timestamp format

Use -m/--timeformat "MM/DD/YYYY HH12:MI:SS" to control the timestamp format.
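
For example, the same load as above with an explicit timestamp format:

csv_loader_for_redshift.exe Crime.csv testbucket -t test -i 1 -m "MM/DD/YYYY HH12:MI:SS"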

Skipping the header

Use -i/--ignoreheader 1 to set the number of leading lines to ignore in the input file.

Target Redshift table DDL

drop table test;
create table test (
  Incident_ID VARCHAR(20), CR_Number VARCHAR(20), Dispatch_Date_Time TIMESTAMP,
  Class VARCHAR(10), Class_Description VARCHAR(100),
  Police_District_Name VARCHAR(40), Block_Address VARCHAR(100),
  City VARCHAR(40), State VARCHAR(8), Zip_Code VARCHAR(10),
  Agency VARCHAR(40), Place VARCHAR(40), Sector VARCHAR(10), Beat VARCHAR(10), PRA VARCHAR(10),
  Start_Date_Time TIMESTAMP, End_Date_Time TIMESTAMP,
  Latitude VARCHAR(20), Longitude VARCHAR(20),
  Police_District_Number VARCHAR(50), Location VARCHAR(80), Address_Number VARCHAR(30));

Test data

Modifying the default loader script

You can modify the default Redshift COPY command this script uses.

Open include\loader.py and edit the sql variable on line 24:

	sql="""
COPY %s FROM '%s' 
	CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' 
	DELIMITER '%s' 
	FORMAT CSV %s 
	GZIP 
	%s 
	%s; 
	COMMIT;
	...
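
For reference, this is roughly how that template renders for the example load above. Which option fills each trailing %s (quote, ignoreheader, timeformat) is an assumption here, not taken from the actual script:

# Illustrative rendering of the COPY template; the placeholder
# assignments below are assumed, not read from include\loader.py.
sql = """
COPY %s FROM '%s'
    CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
    DELIMITER '%s'
    FORMAT CSV %s
    GZIP
    %s
    %s;
COMMIT;
"""
print(sql % ('test', 's3://testbucket/Crime.csv.gz',
             '<your access key>', '<your secret key>', ',',
             'QUOTE AS \'"\'',                       # --quote
             'IGNOREHEADER 1',                       # --ignoreheader
             "TIMEFORMAT 'MM/DD/YYYY HH12:MI:SS'"))  # --timeformat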

Teardown

Teardown scripts: https://github.com/pydemo/teardown

Other scripts

:-))
