Airflow Pipeline

ETL pipeline built with Airflow. The project loads data from S3 buckets into Amazon Redshift, writing it into staging, fact, and dimension tables. Using the PostgresOperator and PythonOperator, the fact and dimension tables of a star schema are created.
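
For example, one of the table-creation tasks could look like the sketch below; the connection id, the table definition, and the dag variable are illustrative assumptions, not the exact ones used in this repository.

from airflow.operators.postgres_operator import PostgresOperator

# Sketch only: 'redshift' is an assumed Airflow connection id and the schema
# is illustrative; 'dag' is the DAG object defined in the main DAG file.
create_users_table = PostgresOperator(
    task_id='create_users_table',
    postgres_conn_id='redshift',
    sql="""
        CREATE TABLE IF NOT EXISTS users (
            user_id    INT PRIMARY KEY,
            first_name VARCHAR,
            last_name  VARCHAR,
            level      VARCHAR
        );
    """,
    dag=dag,
)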

The project also runs data quality checks on the dimension and fact tables, checking for null values and empty tables.
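
A check of this kind can be written with a PythonOperator and a PostgresHook, roughly as sketched below; the callable, the connection id, and the table list are assumptions for illustration.

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def check_tables_not_empty(tables, redshift_conn_id='redshift', **kwargs):
    # Fail the task if any table has no rows; a similar query can be used
    # to count NULL values in key columns.
    hook = PostgresHook(postgres_conn_id=redshift_conn_id)
    for table in tables:
        records = hook.get_records("SELECT COUNT(*) FROM {}".format(table))
        if not records or records[0][0] < 1:
            raise ValueError("Data quality check failed: {} is empty".format(table))

run_quality_checks = PythonOperator(
    task_id='run_data_quality_checks',
    python_callable=check_tables_not_empty,
    op_kwargs={'tables': ['users', 'songs', 'artists', 'time', 'songplays']},  # illustrative list
    provide_context=True,
    dag=dag,  # DAG object defined in the main DAG file
)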

Graph View

Main DAG

(screenshot: graph view of the main DAG)

Schedule:

from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 11, 1),
    'end_date': datetime(2018, 11, 30),
    'email_on_retry': False,
    'email_on_failure': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'depends_on_past': False
}

Besides the default parameters above, the DAG runs hourly with at most one active run at a time.
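
Put together, the DAG definition implied by these settings looks roughly like the sketch below; the dag_id and description are assumptions.

from airflow import DAG

dag = DAG(
    'airflow_pipeline',            # assumed dag_id
    default_args=default_args,     # the dictionary shown above
    description='Load data from S3 into Redshift and build a star schema',
    schedule_interval='@hourly',   # run once every hour
    max_active_runs=1,             # at most one active run at a time
)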

The start_date from the execution context is used to load data from S3 (CSV files). All hooks are created with flexibility in mind: you can choose whether to delete existing rows or simply append data to the dimension and fact tables, and you can use a different connection id as well. Just make sure the hook fits your purpose.
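
The sketch below shows one way such a flexible loader can be written as a custom operator; the class name LoadDimensionOperator and its parameters are hypothetical stand-ins, not the repository's actual code.

from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class LoadDimensionOperator(BaseOperator):
    """Hypothetical sketch: load a dimension table, choosing the connection id
    and whether to delete existing rows first or just append."""

    @apply_defaults
    def __init__(self, redshift_conn_id='redshift', table='', select_sql='',
                 append_data=False, *args, **kwargs):
        super(LoadDimensionOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.select_sql = select_sql
        self.append_data = append_data

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        if not self.append_data:
            # delete-load mode: empty the table before inserting
            redshift.run("TRUNCATE TABLE {}".format(self.table))
        redshift.run("INSERT INTO {} {}".format(self.table, self.select_sql))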

Subdag

(screenshot: graph view of the dimension-loading subdag)

This subdag writes all of the dimension tables. It runs with LocalExecutor() so that all of its tasks execute at the same time.
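
Wiring the subdag into the main DAG might look like the sketch below; the factory function load_dimensions_subdag and the task id are assumptions.

from airflow.executors.local_executor import LocalExecutor
from airflow.operators.subdag_operator import SubDagOperator

load_dimension_tables = SubDagOperator(
    task_id='load_dimension_tables',
    subdag=load_dimensions_subdag(        # hypothetical factory that returns the subdag
        parent_dag_name='airflow_pipeline',
        task_id='load_dimension_tables',
        default_args=default_args,
    ),
    executor=LocalExecutor(),             # run the subdag's tasks in parallel
    dag=dag,
)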

Development

Want to contribute? Great! Please feel free to open issues and submit pull requests.
