In today's data-driven world, data plays a pivotal role in shaping decisions within organizations. The sheer volume of data generated requires data engineers to centralize it efficiently, clean and model it to fit specific business requirements, and make it easily accessible to data consumers.
The aim of this project is to build an automated data pipeline that retrieves cryptocurrency data from the CoinCap API, processes and transforms it for analysis, and presents key metrics on a near-real-time* dashboard. The dashboard provides users with valuable insights into the dynamic cryptocurrency market.
*near-real-time because the data is loaded from the source and processed every 5 minutes rather than instantly
The data used in this project was obtained from the CoinCap API, which provides real-time pricing and market activity for over 1,000 cryptocurrencies.
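As an illustration of what the extraction step can look like, below is a minimal sketch that pulls one snapshot from the CoinCap API and writes it to Parquet. It assumes the v2 `/assets` endpoint and the requests/pandas/pyarrow stack; the function names and output path are illustrative, not the project's actual code.

```python
# Minimal sketch of the extraction step, not the project's actual code.
# Assumptions: the CoinCap v2 /assets endpoint and the requests/pandas/pyarrow
# stack; function names and the output path are illustrative only.
from datetime import datetime, timezone

import pandas as pd
import requests


def fetch_coincap_assets() -> pd.DataFrame:
    """Pull the latest asset snapshot from the CoinCap API as a DataFrame."""
    response = requests.get("https://api.coincap.io/v2/assets", timeout=30)
    response.raise_for_status()
    payload = response.json()  # {"data": [...], "timestamp": <ms since epoch>}
    df = pd.DataFrame(payload["data"])
    # keep the API timestamp so each 5-minute snapshot can be told apart later
    df["snapshot_ts"] = pd.to_datetime(payload["timestamp"], unit="ms", utc=True)
    return df


def write_parquet(df: pd.DataFrame) -> str:
    """Write the snapshot to a timestamped Parquet file and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = f"coincap_assets_{stamp}.parquet"
    df.to_parquet(path, index=False)  # needs pyarrow (or fastparquet) installed
    return path


if __name__ == "__main__":
    print(write_parquet(fetch_coincap_assets()))
```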
- Cloud: Google Cloud Platform (GCP)
- Infrastructure as Code (IaC): Terraform
- Containerization: Docker, Docker Compose
- Workflow Orchestration: Apache Airflow
- Data Lake: Google Cloud Storage (GCS)
- Data Warehouse: BigQuery
- Data Transformation: Data Build Tool (DBT)
- Visualization: Looker Studio
- Programming Language: Python (batch processing), SQL (data transformation)
Project Map:
- Provisioning Resources: Terraform is used to set up the necessary GCP resources, including a Compute Engine instance, GCS bucket, and BigQuery datasets
- Data Extraction: Every 5 minutes, JSON data is retrieved from the CoinCap API and converted to Parquet format for optimized storage and processing
- Data Loading: The converted data is stored in Google Cloud Storage, the data lake, and then loaded into BigQuery, the data warehouse.
- Data Transformation: DBT is connected to BigQuery to transform the raw data, and the processed data is loaded back into BigQuery. The entire ELT process is automated and orchestrated with Apache Airflow (a minimal DAG sketch follows this list)
- Reporting: The transformed dataset is used to create an analytical report and visualizations in Looker Studio
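To make the orchestration above concrete, here is a minimal Airflow 2.x sketch of how the 5-minute ELT schedule could be wired up. The DAG id, task ids, and callables are hypothetical placeholders, not the project's actual DAG.

```python
# Hedged sketch of the 5-minute orchestration described above (Airflow 2.x).
# The dag_id, task ids, and callables below are placeholders, not the
# project's actual implementation.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_stage(**context):
    """Placeholder: pull JSON from CoinCap, convert to Parquet, upload to GCS."""


def load_to_bigquery(**context):
    """Placeholder: load the staged Parquet file into the raw BigQuery dataset."""


def run_dbt_models(**context):
    """Placeholder: trigger the DBT transformations against BigQuery."""


with DAG(
    dag_id="crypto_elt_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(minutes=5),  # the "near-real-time" cadence
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=1)},
) as dag:
    extract = PythonOperator(task_id="extract_and_stage", python_callable=extract_and_stage)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    transform = PythonOperator(task_id="run_dbt_models", python_callable=run_dbt_models)

    extract >> load >> transform
```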
Disclaimer: This is only a pet project. Please, do not use this dashboard for actual financial decisions. T for thanks!
Below are the steps to reproduce this pipeline in the cloud. Note that Windows/WSL/Git Bash was used locally for this project.
- If you don't have a GCP account already, create a free trial account (you get $300 in free credits) by following the steps in this guide
- Create a new project on GCP (see guide) and take note of your Project ID, as it will be needed at later stages of the project
- Next, enable the necessary APIs for the project, create and configure a service account, and generate an auth key. While all of this can be done via the GCP Web UI (see), Terraform will be used to run these processes (somebody say DevOps, hehehe), so skip this for now.
- If you haven't already, download and install the Google Cloud SDK for local setup. You can follow this installation guide.
- You might need to restart your system before gcloud can be used via the CLI. Check that the installation was successful by running
gcloud -v
in your terminal to view the installed gcloud version
- Run
gcloud auth login
to authenticate the Google Cloud SDK with your Google account
The SSH key will be used to connect to and access the GCP virtual machine via the local (Linux) terminal. In your terminal, run the command
ssh-keygen -t rsa -f ~/.ssh/<whatever-you-want-to-name-your-key> -C <the-username-that-you-want-on-your-VM> -b 2048
ex: ssh-keygen -t rsa -f ~/.ssh/ssh_key -C aayomide -b 2048
Follow the terraform reproduce guide
Create a file called config within the .ssh directory in your home folder and paste the following information:
HOST <vm-name-to-use-when-connecting>
Hostname <external-ip-address> # check the terraform output in the CLI or navigate to GCP > Compute Engine > VM instances.
User <username used when running the ssh-keygen command> # it is also the same as the gce_ssh_user
IdentityFile <absolute-path-to-your-private-ssh-key-on-local-machine>
LocalForward 8080 localhost:8080 # forward traffic from local port 8080 to port 8080 on the remote server where Airflow is running
LocalForward 8888 localhost:8888 # forward traffic from local port 8888 to port 8888 on the remote server where Jupyter Notebook is running
for example
HOST cryptolytics_vm
Hostname 35.225.33.44
User aayomide
IdentityFile c:/Users/aayomide/.ssh/ssh_key
LocalForward 8080 localhost:8080
LocalForward 8888 localhost:8888
Afterward, connect to the virtual machine via your local terminal by running ssh cryptolytics_vm.
You can also access the VM via VS Code as shown here
Note: the value of the external IP address changes as you turn the VM instance on and off
Follow the dbt how-to-reproduce guide
Follow the airflow how-to-reproduce guide
- Log in to Looker Studio using your Google account
- Click on "Blank report" and select the "BigQuery" data connector
- Choose your data source (project -> dataset), which in this case is "prod_coins_dataset"
- Use Apache Kafka to stream the data in real-time
- Perform advanced data transformation using DBT or even PySpark
- Implement more robust error handling with try/except blocks and write more thorough data quality tests in DBT
- Add a pipeline alerting & monitoring feature (one possible shape is sketched below)
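For the alerting item, one possible (hypothetical) shape is an Airflow on_failure_callback attached through default_args; the notification body below is just a placeholder to be swapped for email, Slack, or another channel.

```python
# Possible shape for the alerting idea above: an Airflow failure callback.
# The notification itself is a placeholder; replace the print() with email,
# Slack, or any other channel. Names here are hypothetical.
from datetime import timedelta


def notify_on_failure(context):
    """Called by Airflow when a task fails; context carries task/run metadata."""
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed for run {context['run_id']}")


# attach the callback (plus retries) via the DAG's default_args
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=1),
    "on_failure_callback": notify_on_failure,
}
```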
- Michael Shoemaker's animated dataflow architecture YouTube tutorial
- Instead of installing the Peek screen recorder via the terminal as done in the video, I downloaded the app from the Microsoft store here