This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. Everything is containerized using Docker for ease of deployment and scalability.
- Data Source: We use the Binance API to fetch the current average price of six popular cryptocurrencies.
- Apache Airflow: Orchestrates the pipeline and stores the fetched data in a PostgreSQL database (a minimal DAG sketch follows this list).
- Apache Kafka and Zookeeper: Stream the data from PostgreSQL to the processing engine (a producer sketch follows this list).
- Control Center and Schema Registry: Provide monitoring and schema management for our Kafka streams.
- Apache Spark: Processes the streamed data with its master and worker nodes (a Spark-to-Cassandra sketch follows this list).
- Cassandra: Stores the processed data.
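
The ingestion step can be sketched as a single Airflow task that calls Binance's public `/api/v3/avgPrice` endpoint and inserts each quote into PostgreSQL. The symbol list, table name, connection id, and schedule below are illustrative assumptions rather than the repository's exact code.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Assumed symbol list -- the project may track a different set of six coins.
SYMBOLS = ["BTCUSDT", "ETHUSDT", "BNBUSDT", "XRPUSDT", "ADAUSDT", "SOLUSDT"]


def fetch_and_store():
    """Fetch the current average price for each symbol and insert it into PostgreSQL."""
    hook = PostgresHook(postgres_conn_id="postgres_default")  # assumed connection id
    for symbol in SYMBOLS:
        resp = requests.get(
            "https://api.binance.com/api/v3/avgPrice",
            params={"symbol": symbol},
            timeout=10,
        )
        resp.raise_for_status()
        price = float(resp.json()["price"])
        hook.run(
            "INSERT INTO crypto_prices (symbol, price, fetched_at) VALUES (%s, %s, %s)",
            parameters=(symbol, price, datetime.utcnow()),
        )


with DAG(
    dag_id="crypto_price_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # assumed interval (Airflow 2.4+ "schedule" argument)
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_and_store", python_callable=fetch_and_store)
```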
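
Handing the stored rows off to Kafka can be sketched with a plain JSON producer like the one below; the actual pipeline may serialize with Avro through the Schema Registry instead. The broker address, topic name, table, and database credentials are assumptions.

```python
import json

import psycopg2
from kafka import KafkaProducer  # kafka-python

# Broker address, topic name, table, and credentials below are assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

conn = psycopg2.connect(host="localhost", dbname="airflow", user="airflow", password="airflow")
with conn, conn.cursor() as cur:
    cur.execute("SELECT symbol, price, fetched_at FROM crypto_prices")
    for symbol, price, fetched_at in cur:
        # One JSON message per price row.
        producer.send(
            "crypto_prices",
            {"symbol": symbol, "price": float(price), "fetched_at": fetched_at.isoformat()},
        )

producer.flush()
```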
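
On the processing side, a rough sketch of the Spark job: it consumes the Kafka topic with Structured Streaming and appends each micro-batch to Cassandra through the spark-cassandra-connector (both the Kafka SQL package and the connector are assumed to be on Spark's classpath). Host names, keyspace, table, and checkpoint path are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Host names, topic, keyspace, and table below are assumptions about the compose setup.
spark = (
    SparkSession.builder.appName("crypto_stream")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

schema = StructType([
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
    StructField("fetched_at", StringType()),
])

# Read the JSON messages produced above from the Kafka topic.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "crypto_prices")
    .option("startingOffsets", "earliest")
    .load()
)
parsed = raw.select(from_json(col("value").cast("string"), schema).alias("data")).select("data.*")


def write_to_cassandra(batch_df, batch_id):
    """Append each micro-batch to a Cassandra table via the spark-cassandra-connector."""
    (
        batch_df.write.format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="crypto", table="prices")
        .save()
    )


query = (
    parsed.writeStream.foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/checkpoints/crypto")
    .start()
)
query.awaitTermination()
```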
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
- Clone the GitHub repository
  ```bash
  git clone https://github.com/nhattan040102/Crypto_Market_Streaming_Data_pipeline.git
  ```
- Navigate to the project directory
  ```bash
  cd Crypto_Market_Streaming_Data_pipeline
  ```
- Build the Airflow image from the Dockerfile
  ```bash
  docker build -t my_airflow_img .
  ```
- Run Docker Compose to start the services
  ```bash
  docker compose up -d --build
  ```