This project implements an Extract-Load pipeline for customer reviews of airlines published on Skytrax (AirlineQuality.com). The pipeline extracts 100,000+ reviews from 500+ airlines worldwide and uses Apache Airflow, AWS S3, Snowflake, and Docker to load the data into Snowflake before transformation using dbt-.
.
├── airflow/ # Airflow configuration and DAGs
│ ├── dags/ # Airflow DAG definitions
│ ├── tasks/ # Custom task implementations
│ ├── plugins/ # Custom Airflow plugins
│ └── logs/ # Airflow execution logs
├── astronomer/ # Astronomer Directory
├── data/ # Data files
│ └── raw_data.csv # Source data file
├── docker/ # Docker configuration
│ ├── docker-compose.yaml
│ └── Dockerfile
├── .env # Environment variables
├── requirements.txt # Python dependencies
└── Makefile # Project automation commands
- Data Source: Skytrax (AirlineQuality.com) - 500+ airlines
- Data Processing: Python 3.12 with Pandas, BeautifulSoup, Requests
- Workflow Orchestration: Apache Airflow, Astronomer
- Data Warehouse: Snowflake
- Data Lake: AWS S3 for staging
- Containerization: Docker
The project processes comprehensive airline review data scraped from Skytrax (AirlineQuality.com), which contains detailed information about customer flight experiences across 500+ airlines worldwide.
- Data Crawling
  - Automatically discover and crawl customer reviews from all airlines listed on Skytrax
  - Extract 100,000+ reviews from 500+ airlines including major carriers worldwide
  - Store raw data with airline identification in raw_data.csv
  - Features intelligent pagination and error handling for robust data collection
- Data Cleaning & Transformation
  - Process and clean the raw multi-airline data
  - Standardize formats across different airline review structures
  - Handle missing values and normalize rating systems
  - Generate cleaned dataset with consistent schema
- Staging in S3
  - Upload cleaned multi-airline data to AWS S3 bucket (upload_cleaned_data_to_s3)
  - Store in staging area for Snowflake ingestion
  - Maintain data versioning and audit trail
  - Support large-scale data processing with partitioning by airline
- Snowflake Loading
  - Use Snowflake COPY operator to load data from S3
  - Transform and load into target tables with airline-specific schemas
  - Implement incremental loading strategy for continuous data updates
  - Support analytics across multiple airlines and comparative analysis
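The hand-off from S3 staging to Snowflake loading can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the key layout, the stage name `airline_stage`, the table name `raw_reviews`, and both helper functions are hypothetical. Only the general shape of Snowflake's `COPY INTO <table> FROM '@stage/path'` statement is standard.

```python
import re
from datetime import date


def build_stage_key(airline: str, run_date: date, prefix: str = "staging") -> str:
    """Return an S3 object key partitioned by airline and run date (hypothetical layout)."""
    # Slugify the airline name so it is safe to use inside an object key.
    slug = re.sub(r"[^a-z0-9]+", "-", airline.lower()).strip("-")
    return f"{prefix}/{slug}/{run_date.isoformat()}/reviews.csv"


def build_copy_statement(table: str, stage: str, key: str) -> str:
    """Build a Snowflake COPY INTO statement for one staged CSV file."""
    return (
        f"COPY INTO {table}\n"
        # The stage location is quoted because the path contains hyphens.
        f"FROM '@{stage}/{key}'\n"
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '\"')\n"
        "ON_ERROR = 'ABORT_STATEMENT'"
    )


# Table and stage names below are placeholders, not taken from the project.
key = build_stage_key("British Airways", date(2024, 1, 15))
print(key)
print(build_copy_statement("raw_reviews", "airline_stage", key))
```

Putting the run date in the key gives each load its own file, which is one simple way to keep a versioned audit trail; whether the project uses this layout is not stated in the source.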
- Multi-airline data validation checks
- Comprehensive error handling and logging across all airline sources
- Pipeline monitoring and alerting for large-scale operations
- Snowflake data quality monitoring with airline-specific metrics
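A validation check of the kind listed above might look like the sketch below. The column names (`airline`, `review_text`, `rating`), the 1 to 10 rating range, and both function names are assumptions made for illustration; the project's real schema and checks may differ.

```python
from collections import Counter

REQUIRED_FIELDS = ("airline", "review_text", "rating")  # hypothetical column names


def validate_reviews(rows, min_rating=1, max_rating=10):
    """Split rows into valid rows and (row_index, reason) rejections."""
    valid, rejected = [], []
    for index, row in enumerate(rows):
        # A field counts as missing when it is absent, None, or an empty string.
        missing = [f for f in REQUIRED_FIELDS if row.get(f) in (None, "")]
        if missing:
            rejected.append((index, f"missing: {', '.join(missing)}"))
            continue
        try:
            rating = float(row["rating"])
        except (TypeError, ValueError):
            rejected.append((index, "rating is not numeric"))
            continue
        if not min_rating <= rating <= max_rating:
            rejected.append((index, "rating out of range"))
            continue
        valid.append(row)
    return valid, rejected


def rejects_per_airline(rows, rejected):
    """Count rejected rows per airline as a simple airline-specific quality metric."""
    return Counter(rows[i].get("airline") or "<unknown>" for i, _ in rejected)


sample = [
    {"airline": "Air A", "review_text": "Good flight", "rating": "8"},
    {"airline": "Air A", "review_text": "", "rating": "7"},
    {"airline": "Air B", "review_text": "Late", "rating": "11"},
]
valid_rows, rejected_rows = validate_reviews(sample)
print(len(valid_rows), rejected_rows)
print(rejects_per_airline(sample, rejected_rows))
```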
Located in airflow/dags/:
- DAG definitions for multi-airline data processing workflows
- Task scheduling and dependency management for large-scale operations
- Error handling and retry logic for reliable data extraction
- Snowflake data loading and transformation tasks for all airlines
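As a rough illustration of the retry and dependency settings such a DAG could use, the sketch below builds a `default_args` dictionary and a linear task order using only the standard library. The keys `retries`, `retry_delay`, `retry_exponential_backoff`, and `max_retry_delay` are standard Airflow operator arguments, but the values and all task names except `upload_cleaned_data_to_s3` are assumptions.

```python
from datetime import timedelta

# Retry settings that would be passed to a DAG as DAG(..., default_args=default_args).
# The specific values are illustrative, not taken from the project.
default_args = {
    "retries": 3,                              # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),       # wait before the first retry
    "retry_exponential_backoff": True,         # lengthen the wait on each retry
    "max_retry_delay": timedelta(minutes=30),  # cap on the backoff
}

# Hypothetical task ids mirroring the four pipeline stages.
PIPELINE_STAGES = [
    "crawl_reviews",
    "clean_reviews",
    "upload_cleaned_data_to_s3",
    "load_into_snowflake",
]

# Each pair is (upstream, downstream); inside a DAG file this would be
# expressed as `upstream_task >> downstream_task`.
dependencies = list(zip(PIPELINE_STAGES, PIPELINE_STAGES[1:]))
for upstream, downstream in dependencies:
    print(f"{upstream} >> {downstream}")
```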
Located in airflow/tasks/:
- Multi-airline scraper: Automated discovery and extraction from 500+ airlines
- Data processing: Transformation logic for diverse airline data formats
- S3 upload operations: Large-scale data transfer and storage
- Snowflake operations: Data loading and unloading for analytics
- Custom operators: Business logic for airline review processing
- Utility functions: Data handling for high-volume operations
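The scraper and data processing tasks above could be sketched roughly as below. Everything here is hypothetical: the URL pattern, the function names, and the review fields are assumptions, and plain Python is used to keep the example dependency-free, whereas the project itself uses Requests, BeautifulSoup, and Pandas. The page fetcher is passed in as a function so the pagination logic can be exercised without network access.

```python
def page_url(airline_slug: str, page: int) -> str:
    """URL of one review page (assumed pattern, not confirmed by the project)."""
    return f"https://www.airlinequality.com/airline-reviews/{airline_slug}/page/{page}/"


def crawl_airline(airline_slug, fetch_reviews, max_pages=500):
    """Collect reviews page by page until an empty page, an error, or max_pages.

    fetch_reviews(url) is a caller-supplied function returning a list of dicts.
    """
    reviews = []
    for page in range(1, max_pages + 1):
        try:
            batch = fetch_reviews(page_url(airline_slug, page))
        except OSError as exc:
            # Stop this airline but keep what was collected so far.
            print(f"{airline_slug}: stopping at page {page}: {exc}")
            break
        if not batch:
            break  # an empty page marks the end of the listing
        # Tag every review with its airline for later analysis.
        reviews.extend({**review, "airline": airline_slug} for review in batch)
    return reviews


def normalize_rating(raw, scale_max=10):
    """Map a raw rating such as '8' or '4/5' onto a 0..scale_max scale, else None."""
    if raw is None:
        return None
    text = str(raw).strip()
    value, _, out_of = text.partition("/") if "/" in text else (text, "", str(scale_max))
    try:
        return round(float(value) / float(out_of) * scale_max, 2)
    except (ValueError, ZeroDivisionError):
        return None  # treat unparseable ratings as missing


# Fake two-page listing standing in for real HTTP responses.
fake_pages = {
    page_url("test-air", 1): [{"text": "a", "rating": "4/5"}, {"text": "b", "rating": "8"}],
    page_url("test-air", 2): [{"text": "c", "rating": "n/a"}],
}
collected = crawl_airline("test-air", lambda url: fake_pages.get(url, []))
ratings = [normalize_rating(r["rating"]) for r in collected]
print(len(collected), ratings)
```

Catching `OSError` is a deliberate choice in this sketch: the Requests library's `RequestException` derives from it, so a real fetcher built on Requests would be covered without importing the library here.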
Located in airflow/plugins/:
- Custom hooks and operators for airline data processing
- Extended Airflow functionality for web scraping at scale
- Integration with Snowflake and S3 services for big data workflows
- pandas==1.5.3 - Data processing and analysis
- requests - HTTP library for web scraping
- beautifulsoup4 - HTML parsing for airline review extraction
- apache-airflow-providers-snowflake - Snowflake integration
- snowflake-connector-python - Direct Snowflake connectivity
- boto3==1.35.0 - AWS S3 operations
- apache-airflow-providers-amazon - AWS services integration
- Comprehensive Coverage: Extracts reviews from 500+ airlines on Skytrax
- Scalable Architecture: Handles 100,000+ reviews with robust error handling
- Airline Identification: Each review tagged with airline name for analysis
- Intelligent Discovery: Automatically finds and processes all available airlines
- Production Ready: Containerized with Docker for reliable deployment