DBT was used to perform SQL transformations on our data, including changing column data types and renaming the columns. Additionally, the large staging table was split into fact and dimension tables as part of the data modeling process.
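For illustration, a staging-style dbt model that performs this kind of casting and renaming looks roughly like the following. This is a minimal sketch; the source, table, and column names are hypothetical rather than the project's actual schema:

```sql
-- models/staging/stg_crypto_prices.sql (sketch; source and column names are hypothetical)
with source as (
    select * from {{ source('staging', 'crypto_prices') }}
)

select
    cast(trade_id as integer)          as trade_id,
    cast(symbol as string)             as coin_symbol,   -- column renamed
    cast(price as numeric)             as price_usd,      -- data type changed
    cast(volume as numeric)            as volume_24h,
    cast(trade_timestamp as timestamp) as traded_at
from source
```

Downstream fact and dimension models then select from staging models like this one, each keeping only the columns relevant to that table.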
- Navigate to the `dbt` folder inside the project repo: `cd crypto_analytics_engineering/dbt`
- Edit the `profile.yml` file to specify your GCP project id and save.
- Navigate to the staging files directory (`cd crypto_analytics_engineering/dbt/models/staging`) and edit the `source.yml` file to specify your GCP project id as the BigQuery database name. A sketch of both files is shown after this list.
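For reference, the fields being edited look roughly like the following. This is a minimal sketch with placeholder values; the profile name, dataset, key file path, and source/table names are assumptions and may differ from this repo's actual files:

```yaml
# profile.yml (sketch) - point dbt at your GCP project
crypto_analytics_engineering:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project-id          # <-- your GCP project id
      dataset: crypto_data                  # BigQuery dataset dbt writes to
      keyfile: /path/to/service-account-key.json
      threads: 4
---
# models/staging/source.yml (sketch) - the "database" field is the BigQuery project
version: 2
sources:
  - name: staging
    database: your-gcp-project-id           # <-- your GCP project id
    schema: crypto_data
    tables:
      - name: crypto_prices
```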
In this project, the DBT job was configured to run within Airflow, allowing Airflow to orchestrate the entire extraction, loading, and transformation process. However, the DBT transformation job could also be separated from the extraction and loading, as DBT Cloud can schedule models and tests, execute them in the correct order, and send notifications upon failure, all without Airflow.
In practice, a better approach might involve first dockerizing the DBT transformations and pushing the image to a container registry such as Google Artifact Registry. Then, the KubernetesPodOperator in Airflow can pull the image and run the dockerized transformations. This method avoids dependency conflicts, isolates the code, and keeps the infrastructure components modular; a minimal sketch of the pattern is shown below. For more details on this approach, you can read the full article here.
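A minimal sketch of that pattern, assuming the dbt project has already been baked into an image and pushed to Artifact Registry (the DAG id, namespace, and image path below are hypothetical placeholders):

```python
# Sketch of an Airflow task that runs dockerized dbt transformations in a pod.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="dbt_transformations",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dbt = KubernetesPodOperator(
        task_id="run_dbt_models",
        name="dbt-run",
        namespace="airflow",
        # Image built from the dbt project and pushed to Google Artifact Registry
        image="us-central1-docker.pkg.dev/your-gcp-project/dbt/crypto-dbt:latest",
        cmds=["dbt"],
        arguments=["run", "--profiles-dir", "/dbt", "--project-dir", "/dbt"],
        get_logs=True,
        is_delete_operator_pod=True,
    )
```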
- Generating a custom dbt schema: how to change the default dataset name created by DBT in BigQuery from `dbt_{user_name}` to a custom name (see the macro sketch after this list).
- Learn more about dbt in the docs or take a dbt course.
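For the custom schema point above, the usual approach is to override dbt's built-in `generate_schema_name` macro. A minimal sketch, placed in the project's `macros/` folder, which makes dbt use a model's configured custom schema as-is instead of prefixing it with the `dbt_{user_name}` development schema:

```sql
-- macros/generate_schema_name.sql (sketch)
-- When a model sets a custom schema (e.g. +schema: core in dbt_project.yml),
-- use that name as-is instead of dbt's default <target_schema>_<custom_schema>
-- concatenation, so models land in a BigQuery dataset with a clean custom name.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```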