This project demonstrates the design and implementation of a modern, end-to-end data platform on the cloud. It ingests raw NYC Taxi trip data, transforms it using a robust analytics engineering workflow, models it according to the Medallion Architecture, and orchestrates the entire pipeline for production, making it ready for business intelligence and analysis.
The platform follows a modern ELT (Extract, Load, Transform) paradigm, leveraging the strengths of cloud-native services for scalability and performance.
Data Flow:
Raw Data (Parquet) -> GCS (Data Lake) -> Databricks (Compute & Lakehouse) -> dbt Cloud (Transformation & Orchestration) -> Gold Layer (BI-Ready Tables)
This project goes beyond a standard tutorial by incorporating real-world challenges and modern data engineering best practices.
- Handling Evolving Data Schemas: The project uses the latest NYC Taxi dataset (2025 data), which includes new columns like `cbd_congestion_fee` and `airport_fee`. The pipeline is designed to be resilient to these changes using schema evolution techniques (see the first sketch after this list).
- Adapting to New Business Rules: The source data introduced a new `payment_type` code (`0` for "Flex Fare trip"). The dbt models and macros were updated to correctly interpret and document this new business logic (see the second sketch after this list).
- Professional Project Structure (Medallion Architecture): The Databricks environment is organized using separate `prod` and `dev` catalogs. The production data flows through a structured Medallion Architecture:
  - Bronze (`prod.bronze`): Raw, untouched data loaded from the data lake.
  - Silver (`prod.silver`): Cleaned, conformed, and integrated data (dimensions and facts).
  - Gold (`prod.gold`): Business-ready, aggregated data marts for analytics.
- CI/CD Automation: The project is configured for production using dbt Cloud. A production job runs on a schedule to refresh the data, and a Continuous Integration (CI) job automatically tests any new code changes in a pull request before they are merged.
- Advanced dbt Customization: The project uses custom macros for environment-aware schema generation, keeping development clean and isolated while maintaining a structured production environment.
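A minimal sketch of the schema-evolution approach on the Bronze layer, assuming Databricks `COPY INTO`; the table name and bucket path are placeholders, and the repo's actual load scripts may differ:

```sql
-- Let COPY INTO absorb new Parquet columns (e.g., cbd_congestion_fee)
-- instead of failing when the source schema drifts.
COPY INTO prod.bronze.green_tripdata           -- table name assumed
FROM 'gs://<your-bucket>/green/'               -- placeholder path
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true')        -- merge schemas across input files
COPY_OPTIONS ('mergeSchema' = 'true');         -- evolve the target table's schema
```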
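And a sketch of how the new `payment_type` code can be handled in the `get_payment_type_description` macro listed in the project tree below (codes 1-6 follow the TLC data dictionary; the repo's exact SQL may differ):

```jinja
{% macro get_payment_type_description(payment_type) -%}
    case cast({{ payment_type }} as int)
        when 0 then 'Flex Fare trip'   -- new code introduced in the 2025 data
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
    end
{%- endmacro %}
```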
This project utilizes the publicly available NYC Taxi and Limousine Commission (TLC) Trip Record Data.
- Dataset: NYC TLC Trip Record Data
- Data Used:
- Green Taxi Trip Records: Parquet files for Jan-Mar 2025.
- Yellow Taxi Trip Records: Parquet files for Jan-Mar 2025.
- Taxi Zone Lookup Table: A CSV file mapping `LocationID` to boroughs and zone names.
The dataset is rich, containing millions of records with details on trip durations, distances, fares, locations, and payment types. For this project, the raw Parquet and CSV files were initially staged in a Google Cloud Storage (GCS) bucket, which served as the data lake.
During development, several real-world challenges were encountered and solved, showcasing key data engineering skills.
- Problem: Initial queries failed because of case-sensitivity issues (`"VendorID"` vs. `vendorid`) and headers being read as data.
- Solution: The problem was fixed at the root by standardizing all column names to `snake_case` in the `CREATE TABLE` DDL. The `COPY INTO` command was updated with the `'header' = 'true'` option to explicitly skip header rows, ensuring a clean and predictable schema for the Bronze layer. A sketch follows.
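A sketch of the fix, using the taxi zone lookup as an example (table and bucket names are placeholders):

```sql
-- snake_case column names in the DDL avoid case-sensitivity surprises
CREATE TABLE IF NOT EXISTS prod.bronze.taxi_zone_lookup (
    locationid   INT,
    borough      STRING,
    zone         STRING,
    service_zone STRING
);

-- 'header' = 'true' makes COPY INTO skip the CSV header row
COPY INTO prod.bronze.taxi_zone_lookup
FROM 'gs://<your-bucket>/taxi_zone_lookup.csv'  -- placeholder path
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');
```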
- Problem: The `unique` test on the `fact_trips` primary key (`tripid`) was failing. The surrogate key `(vendorid, pickup_datetime)` was not unique when combining Green and Yellow taxi data.
- Solution: The business key was enriched to be globally unique by adding `service_type`. The new key, `(vendorid, pickup_datetime, service_type)`, guarantees that a trip from a Green taxi and a Yellow taxi at the same time will have a unique `tripid` (see the sketch below).
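With the `dbt_utils` package (declared in `packages.yml`), the enriched key can be generated like this (an excerpt-style sketch; `trips_unioned` is an assumed CTE name):

```sql
-- models/core/fact_trips.sql (excerpt, column list abbreviated)
select
    -- hash of the three-part business key; unique across both services
    {{ dbt_utils.generate_surrogate_key(
        ['vendorid', 'pickup_datetime', 'service_type']
    ) }} as tripid,
    vendorid,
    pickup_datetime,
    service_type
from trips_unioned   -- assumed CTE that unions green and yellow trips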
- Problem: By default, dbt would create `gold` tables in the development environment (`dev.dbt_bistp_gold`), cluttering the dev space.
- Solution: A custom `generate_schema_name` macro was implemented. This macro forces all development models into a single dev schema (`dev.dbt_bistp`) while allowing production models to be built in their correct Medallion layers (`prod.silver`, `prod.gold`).
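Production models opt into their Medallion layer via a custom schema config. One common way to set it, shown here as an assumption about this repo (it could equally be set per-folder in `dbt_project.yml`), is a model-level config:

```sql
-- models/core/dm_monthly_zone_revenue.sql (hypothetical excerpt)
{{ config(materialized='table', schema='gold') }}

select * from {{ ref('fact_trips') }}  -- the real model aggregates; abbreviated here
```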
Snippet: `macros/generate_schema_name.sql`

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if target.database == 'dev' -%}
        {# In dev, ALWAYS use the default schema, ignoring any custom schema #}
        {{ default_schema }}
    {%- else -%}
        {# In prod, respect the custom schema config (e.g., 'gold') #}
        {%- if custom_schema_name is none -%}
            {{ default_schema }}
        {%- else -%}
            {{ custom_schema_name }}
        {%- endif -%}
    {%- endif -%}
{%- endmacro %}
```

The Directed Acyclic Graph (DAG) shows the flow of data and dependencies from the raw sources to the final data mart.
The project follows a standard, scalable dbt project structure.
```
├── dbt_project.yml
├── packages.yml
├── macros
│   ├── generate_schema_name.sql
│   ├── get_payment_type_description.sql
│   └── macros_properties.yml
├── models
│   ├── core
│   │   ├── dim_zones.sql
│   │   ├── dm_monthly_zone_revenue.sql
│   │   ├── fact_trips.sql
│   │   └── schema.yml
│   └── staging
│       ├── schema.yml
│       ├── stg_green_tripdata.sql
│       └── stg_yellow_tripdata.sql
└── seeds
    ├── seeds_properties.yml
    └── taxi_zone_lookup.csv
```
The project is fully automated using dbt Cloud jobs.
- Production Job:
  - Trigger: Runs on a daily schedule.
  - Command: `dbt build --vars '{is_test_run: false}'` (how the variable is used is sketched after this list)
  - Action: Performs a full refresh of the data warehouse, running all seeds, models, and tests to ensure data quality and freshness. It also generates the project documentation.
- Continuous Integration (CI) Job:
  - Trigger: Runs automatically on every pull request to the `main` branch.
  - Command: `dbt build --select state:modified+ --defer --state target/`
  - Action: Builds and tests only the modified models and their downstream dependencies, providing rapid feedback and preventing broken code from being merged into production.
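The `is_test_run` variable that the production job pins to `false` is typically wired into the staging models like this (a sketch; the source name and row limit are assumptions):

```sql
-- models/staging/stg_green_tripdata.sql (excerpt)
select *
from {{ source('staging', 'green_tripdata') }}
-- dev runs default to a small sample for fast iteration;
-- the scheduled prod job passes {is_test_run: false} to build in full
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
```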
To replicate this project, follow these steps:
- Prerequisites:
  - A GCP account (for GCS and IAM credentials).
  - A Databricks account (Community Edition is sufficient).
  - A dbt Cloud account (Developer plan is sufficient).
  - A GitHub account.
- Clone the Repository:
  ```bash
  git clone https://github.com/bISTP/nyc-taxi-databricks-dbt
  ```
- Cloud Setup:
  - Create a GCS/S3 bucket and upload the raw NYC Taxi Parquet files and the `taxi_zone_lookup.csv`.
  - In Databricks, create the `prod` and `dev` catalogs.
  - Run the initial SQL scripts to create the Bronze layer tables in `prod.bronze` and load the data using the `COPY INTO` command.
- dbt Cloud Configuration:
  - Create a new dbt Cloud project and connect it to your GitHub repository.
  - Configure the connection to your Databricks workspace.
  - Set up a "Development" environment pointing to your `dev` catalog.
  - Set up a "Production" environment pointing to your `prod` catalog and `silver` schema.
- Run dbt:
  ```bash
  # Install dependencies
  dbt deps
  # Load seed data
  dbt seed
  # Run and test all models
  dbt build
  ```
The data is now modeled, tested, and ready for consumption. The next logical steps are:
- Initial Exploration in Databricks: Use Databricks Notebooks to perform initial data analysis, query the Gold layer tables, and generate preliminary visualizations to uncover trends.
- Business Intelligence with Power BI: Connect Power BI directly to the Databricks SQL Warehouse. The `prod.gold.dm_monthly_zone_revenue` table is optimized for this purpose and can be used to build a comprehensive, interactive dashboard for business users to explore revenue trends, service type performance, and zone-based metrics (a sample query follows).
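Once connected, Power BI (or any SQL client) can query the mart directly; a sketch, with column names assumed for illustration:

```sql
-- Top revenue zones for January 2025 (column names are assumptions)
select revenue_zone,
       service_type,
       revenue_monthly_total_amount
from prod.gold.dm_monthly_zone_revenue
where revenue_month = date '2025-01-01'
order by revenue_monthly_total_amount desc
limit 10;
```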


