This repo contains the final project implemented for the Data Engineering Zoomcamp course.
The aim of this project is to analyse the for-hire taxi data in New York City for 2022. The project analyses the FHV (For-Hire Vehicle) data to answer the following questions:
- Which providers offer for-hire taxi services, and what is their market share?
- How are taxi hires distributed across each month of 2022, broken down by service provider?
The NYC Taxi For-Hire Vehicle (FHV) dataset is used. This dataset is updated monthly and includes the time, location, and descriptive categorizations of the trip records for the FHV high-volume data. To know more about the dataset, click Here.
The following components were utilized to implement the required solution:
- Data Ingestion: data extracted using the Python `requests` module and the NYC Taxi data API (see the ingestion sketch after this list)
- Infrastructure as Code: Terraform
- Workflow orchestration: Airflow
- Data Lake: Google Cloud Storage
- Data Warehouse: Google BigQuery
- Data Pipeline: Spark batch processing
- Data Transformation: Spark via Google Dataproc
- Reporting: Google Looker Studio
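To illustrate the ingestion component, the sketch below shows how one month of FHVHV trip data can be downloaded with `requests`. It is a minimal sketch rather than the project's exact DAG code; the download URL pattern and local paths are assumptions.

```python
import requests
from pathlib import Path

# Assumed URL pattern for the monthly NYC TLC FHVHV trip data files.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_{year}-{month:02d}.parquet"


def download_month(year: int, month: int, out_dir: str = "data") -> Path:
    """Download one month of FHVHV trip data and return the local file path."""
    url = BASE_URL.format(year=year, month=month)
    out_path = Path(out_dir) / f"fhvhv_tripdata_{year}-{month:02d}.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Stream the response to avoid loading the whole parquet file into memory.
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)
    return out_path


if __name__ == "__main__":
    for month in range(1, 13):
        print(f"Downloaded {download_month(2022, month)}")
```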
- Install the following tools:
- Terraform
- Google Cloud SDK
- docker + docker-compose v2
- In GCP, create a service account with the following roles:
- BigQuery Admin
- Storage Admin
- Storage Object Admin
- Dataproc Admin
- Download the service account authentication file and save it as `$HOME/.google/credentials/google_credentials_project.json`.
- Ensure that the following APIs are enabled:
- Compute Engine API
- Cloud Dataproc API
- Cloud Dataproc Control API
- BigQuery API
- BigQuery Storage API
- Identity and Access Management (IAM) API
- IAM Service Account Credentials API
- Perform the following to set up the required cloud infrastructure
cd terraform
terraform init
terraform plan
terraform apply
cd ..
- Set up Airflow to perform data ingestion:
cd airflow
docker-compose build
docker-compose up airflow-init
docker-compose up -d
- Go to the Airflow UI at `localhost:8080` and enable the `FHVHV_DATA_ETL` DAG.
- This DAG will ingest the month-wise FHVHV data for 2022 and upload it to the data lake (GCS).
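The upload-to-GCS step of this DAG boils down to something like the following sketch, which uses the `google-cloud-storage` client; the bucket name and object path shown here are hypothetical placeholders, not the project's actual configuration.

```python
from google.cloud import storage


def upload_to_gcs(bucket_name: str, local_file: str, object_name: str) -> None:
    """Upload a local parquet file to the GCS data lake."""
    client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS for auth
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_file)


# Hypothetical bucket and object names for illustration only.
upload_to_gcs(
    bucket_name="fhvhv-data-lake",
    local_file="data/fhvhv_tripdata_2022-01.parquet",
    object_name="raw/fhvhv_tripdata_2022-01.parquet",
)
```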
- Install and set up Spark. Follow this.
- Enable and run the `Spark_FHVHV_ETL` DAG.
- This DAG performs the following steps:
- Create a Dataproc cluster.
- Upload the Spark code to GCS.
- Submit a Spark job to Dataproc for transformation and analysis, and save the processed data back to GCS after partitioning.
- Load the processed data from GCS into BigQuery, clustered by month.
- Delete the Dataproc cluster.
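As an illustration of how these steps can be chained with the Airflow Google provider operators, the sketch below mirrors the structure described above. It is not the project's exact DAG; the project ID, region, bucket, cluster, table, and column names are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

# Placeholder values -- substitute the project's real configuration.
PROJECT_ID = "my-gcp-project"
REGION = "us-central1"
CLUSTER_NAME = "fhvhv-spark-cluster"
BUCKET = "fhvhv-data-lake"

with DAG("spark_fhvhv_etl_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    )

    upload_spark_code = LocalFilesystemToGCSOperator(
        task_id="upload_spark_code_to_gcs",
        src="dags/spark/fhvhv_transform.py",  # hypothetical local path of the Spark script
        dst="code/fhvhv_transform.py",
        bucket=BUCKET,
    )

    submit_spark_job = DataprocSubmitJobOperator(
        task_id="submit_spark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/code/fhvhv_transform.py"},
        },
    )

    load_to_bigquery = GCSToBigQueryOperator(
        task_id="load_processed_data_to_bigquery",
        bucket=BUCKET,
        source_objects=["processed/fhvhv/*.parquet"],
        destination_project_dataset_table=f"{PROJECT_ID}.fhvhv.trips_2022",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
        cluster_fields=["pickup_month"],  # hypothetical month column used for clustering
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_dataproc_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
    )

    create_cluster >> upload_spark_code >> submit_spark_job >> load_to_bigquery >> delete_cluster
```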
The resulting report can also be viewed at this link.