Utilizing my background and love for Apache Airflow and data, I built a real-time data streaming pipeline that covers every phase from data ingestion to processing and, finally, storage.
The system is built with Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra, all neatly containerized with Docker for an end-to-end project!
The project is designed with the following components (illustrative sketches of the main stages follow this list):
- Data Source: randomuser.me API to generate random user data for the pipeline.
- Apache Airflow: For orchestrating the pipeline and storing fetched data in a PostgreSQL database.
- Apache Kafka and Zookeeper: For streaming data from PostgreSQL to the processing engine.
- Control Center and Schema Registry: For monitoring and schema management of the Kafka streams. The Schema Registry keeps the schemas of the messages, and Control Center provides a UI for inspecting the topics and data flowing through the Kafka cluster, which is coordinated by Zookeeper.
- Apache Spark: For data processing with master and worker nodes.
- Cassandra: Where the processed data will be stored.
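To make the ingestion stage concrete, here is a minimal sketch of an Airflow DAG whose task fetches a random user from the randomuser.me API and writes a few fields to PostgreSQL. The DAG id, schedule, connection id (`postgres_default`), table name (`users`), and selected columns are illustrative assumptions, not the repository's actual code.

```python
import requests
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def fetch_and_store_user():
    """Fetch one random user from randomuser.me and store a few fields in PostgreSQL."""
    response = requests.get("https://randomuser.me/api/", timeout=10)
    response.raise_for_status()
    user = response.json()["results"][0]

    # 'postgres_default' is a hypothetical Airflow connection id; the table and
    # columns are illustrative, not the project's real schema.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    hook.run(
        "INSERT INTO users (first_name, last_name, email, country) VALUES (%s, %s, %s, %s)",
        parameters=(
            user["name"]["first"],
            user["name"]["last"],
            user["email"],
            user["location"]["country"],
        ),
    )


with DAG(
    dag_id="user_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_and_store_user",
        python_callable=fetch_and_store_user,
    )
```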
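The hand-off from PostgreSQL to Kafka could then look roughly like the sketch below, which reads rows back out of the table above and publishes them as JSON messages to a Kafka topic. The broker address, database credentials, and topic name (`users`) are assumptions made for illustration only.

```python
import json

import psycopg2
from kafka import KafkaProducer

# Broker address, credentials, and topic name are illustrative assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

conn = psycopg2.connect(host="localhost", dbname="airflow", user="airflow", password="airflow")
with conn, conn.cursor() as cur:
    # Read the rows written by the Airflow task and publish each one as a JSON message.
    cur.execute("SELECT first_name, last_name, email, country FROM users")
    for first_name, last_name, email, country in cur:
        producer.send("users", {
            "first_name": first_name,
            "last_name": last_name,
            "email": email,
            "country": country,
        })

producer.flush()
```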
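On the processing side, a Spark Structured Streaming job subscribes to the Kafka topic, parses the JSON payload, and appends the records to Cassandra. The sketch below assumes the same message schema as above; the broker host, keyspace, table name, and connector package versions are placeholders and should be matched to your own docker-compose setup.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("UserStreamProcessor")
    # Connector packages and versions must match the Spark version you run.
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# Expected shape of the JSON messages on the topic (an assumption).
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("country", StringType()),
])

# Read the raw Kafka stream and parse each message's JSON payload.
users = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "users")
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Continuously append the parsed records to a Cassandra table.
query = (
    users.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/spark_checkpoint")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)

query.awaitTermination()
```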
To run the project locally:
- Clone the repository:
git clone https://github.com/prekshivyas/DataStreamingETL.git
- Navigate to the project directory:
cd DataStreamingETL
- Run Docker Compose to spin up the services:
docker-compose up
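If you prefer to keep the stack running in the background, `docker-compose up -d` starts the same services in detached mode. Once the containers are up, the Airflow and Confluent Control Center web UIs are typically reachable on localhost; the exact ports are defined in the repository's docker-compose.yml.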
