This project demonstrates an end-to-end ETL (extract, transform, load) pipeline that moves Reddit data into an Amazon Redshift data warehouse. It combines Apache Airflow, Celery, PostgreSQL, AWS S3, AWS Glue, Amazon Athena, and Amazon Redshift, and the whole stack is containerized with Docker so the workflow is reproducible and easy to deploy.
- Reddit API: The data source; posts are pulled from Reddit via its API (see the extraction sketch after this list).
- Apache Airflow: Orchestrates the pipeline, scheduling tasks and enforcing their dependencies so data moves through each stage in order (a minimal DAG sketch follows this list).
- PostgreSQL: Used as the metadata database for Apache Airflow.
- Celery: Distributed task queue that lets Airflow execute tasks asynchronously across workers.
- Docker: Packages the services as containers for consistent deployment.
- S3 Buckets:
  - Raw Storage: Stores the raw data extracted from Reddit.
  - Transformed Storage: Stores transformed data ready for further processing and querying.
- AWS Glue:
  - Data Catalog: Maintains metadata for the datasets stored in S3.
  - Crawlers: Crawl the S3 data and populate the Data Catalog.
  - ETL jobs: Transform the data and load it from S3 into Redshift (a boto3 snippet for triggering a crawler and job follows this list).
- Amazon Athena: Queries the data stored in S3 with standard SQL (see the Athena snippet below).
- Amazon Redshift: A data warehouse where the final transformed data is stored for analysis.
- AWS IAM: Manages access and permissions for AWS services.
- Visualization: The final data in Redshift can be explored with any of the following BI tools:
  - Power BI
  - Amazon QuickSight
  - Tableau
  - Looker Studio
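
To make the extraction step concrete, here is a minimal sketch of pulling posts from Reddit and landing them in the raw-storage bucket. It assumes the PRAW client library, credentials in `REDDIT_CLIENT_ID`/`REDDIT_CLIENT_SECRET` environment variables, and hypothetical bucket and subreddit names; the project's actual extraction code may differ.

```python
import json
import os

import boto3
import praw

RAW_BUCKET = "reddit-etl-raw"   # hypothetical raw-storage bucket
SUBREDDIT = "dataengineering"   # hypothetical subreddit

def extract_posts(limit: int = 100) -> list[dict]:
    """Pull the newest posts from a subreddit via the Reddit API."""
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit-etl-pipeline/0.1",
    )
    return [
        {
            "id": post.id,
            "title": post.title,
            "score": post.score,
            "num_comments": post.num_comments,
            "created_utc": post.created_utc,
        }
        for post in reddit.subreddit(SUBREDDIT).new(limit=limit)
    ]

def upload_raw(posts: list[dict], key: str) -> None:
    """Write the raw extract as JSON into the raw-storage S3 bucket."""
    boto3.client("s3").put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(posts))

if __name__ == "__main__":
    upload_raw(extract_posts(), key="raw/reddit_posts.json")
```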
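Orchestration in Airflow comes down to declaring the three stages as tasks and chaining them. The sketch below is illustrative only: the `pipeline.tasks` module and the callables imported from it are hypothetical stand-ins for the project's real task functions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module: stand-ins for the project's real task callables.
from pipeline.tasks import extract_reddit, trigger_glue_job, load_to_redshift

with DAG(
    dag_id="reddit_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit", python_callable=extract_reddit)
    transform = PythonOperator(task_id="transform_with_glue", python_callable=trigger_glue_job)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Linear dependency chain: extract -> transform -> load.
    extract >> transform >> load
```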
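The Glue crawler and ETL job can be triggered programmatically with boto3, which is how an Airflow task would typically kick off the transformation step. The region, crawler name, and job name below are assumptions.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Re-crawl the raw bucket so new data lands in the Data Catalog.
glue.start_crawler(Name="reddit-raw-crawler")  # hypothetical crawler name

# Run the Glue ETL job that writes structured files to the transformed bucket.
run = glue.start_job_run(JobName="reddit-transform-job")  # hypothetical job name
print("Glue job run id:", run["JobRunId"])
```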
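Once the crawler has cataloged the data, Athena can query it in place. A minimal example, assuming a hypothetical `reddit_db` database, `reddit_posts` table, and results bucket:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

# Ad-hoc SQL over the cataloged S3 data; all names below are hypothetical.
response = athena.start_query_execution(
    QueryString="SELECT title, score FROM reddit_posts ORDER BY score DESC LIMIT 10",
    QueryExecutionContext={"Database": "reddit_db"},
    ResultConfiguration={"OutputLocation": "s3://reddit-etl-athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```

With these components in place, the pipeline itself runs in three stages: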
- Extraction: Reddit data is extracted using the Reddit API and saved to the raw storage S3 bucket.
- Transformation: AWS Glue processes the raw data into a structured format and writes it to the transformed-storage bucket.
- Loading: The transformed data is loaded into Amazon Redshift for analysis (see the loading sketch below).
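
For the loading step, Redshift's `COPY` command is the standard bulk-load path from S3. Below is a sketch using psycopg2; the table name, bucket path, and IAM role ARN are assumptions, and the role must grant the cluster read access to the transformed bucket (per the IAM component above).

```python
import os

import psycopg2

# Connection details and object names are assumptions -- substitute your own.
conn = psycopg2.connect(
    host=os.environ["REDSHIFT_HOST"],
    port=5439,
    dbname="dev",
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"],
)

# COPY bulk-loads the transformed Parquet files straight from S3;
# the IAM role must be attached to the cluster with S3 read access.
copy_sql = """
    COPY reddit_posts
    FROM 's3://reddit-etl-transformed/posts/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # commits on clean exit from the `with` block
conn.close()
```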