Spotify Big Data Streaming

📌 Overview

Spotify Big Data Streaming is a data pipeline project that processes, transforms, and analyzes event data using big data tools. The transformed data is modeled as a star schema (Fact and Dimension tables), so analytical queries can filter and aggregate a central fact table by its dimensions.

⚙️ Tech Stack

Event Sim → Generates simulated event data and produces it to Kafka.
Kafka → Acts as a message broker to handle real-time event streaming.
Apache Spark → Consumes Kafka messages, processes them, and stores raw data in HDFS.
Hadoop HDFS → Stores raw and transformed data across different layers (Bronze, Silver, Gold).
dbt (Data Build Tool) → Transforms data in HDFS using the Spark adapter.
ClickHouse → A high-performance columnar database for data warehousing.
Metabase → A business intelligence tool to create visualizations and charts.

🔄 Data Processing Workflow

1️⃣ Ingestion & Storage (Bronze Layer)

The pipeline starts with Kafka, which receives the raw event data generated by Event Sim.
Apache Spark consumes this data from Kafka and stores it in HDFS (Bronze Layer) in Parquet format.
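
The ingestion step above could look roughly like the following with Spark Structured Streaming. This is a minimal sketch, not the project's actual script: the function names are hypothetical, and the broker address, topic name, and HDFS root passed in are placeholders chosen by the caller. The option keys themselves are the standard ones for Spark's Kafka source and file sink.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source (values are caller-supplied placeholders)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def bronze_sink_options(bronze_root, topic):
    """Parquet sink options: one output dir and one checkpoint dir per topic.

    The directory layout is an assumption made for this sketch.
    """
    root = bronze_root.rstrip("/")
    return {
        "path": f"{root}/{topic}",
        "checkpointLocation": f"{root}/_checkpoints/{topic}",
    }


def start_bronze_stream(spark, bootstrap_servers, topic, bronze_root):
    """Copy raw Kafka records into the Bronze layer as Parquet.

    `spark` is an existing SparkSession started with the Kafka connector package.
    """
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as bytes; this sketch keeps it as a string
    # together with the broker timestamp and leaves JSON parsing for later.
    events = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")
    return (
        events.writeStream.format("parquet")
        .options(**bronze_sink_options(bronze_root, topic))
        .outputMode("append")
        .start()
    )
```

The two option helpers are kept separate from the streaming call so they can be checked without a running cluster; `start_bronze_stream` only does work when given a real SparkSession.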

2️⃣ Transformation (Silver Layer)

Using dbt, the raw data is cleaned, transformed, and structured into Fact and Dimension tables based on a Star Schema.
This processed data is stored in the HDFS Silver Layer.
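
To make the Fact/Dimension split concrete, here is a small plain-Python sketch of the idea. In the project this logic lives in dbt models running on Spark, so the code below only illustrates the modeling step; the event fields (`user_id`, `artist`, `song`, `ts`) and table names are assumptions, not the project's real schema.

```python
def build_star_schema(events):
    """Split flat events into one dimension table and one fact table."""
    dim_songs = {}   # natural key (artist, song) -> dimension row
    fact_plays = []
    for event in events:
        # Light cleaning: trim stray whitespace before building keys.
        artist = event["artist"].strip()
        song = event["song"].strip()
        natural_key = (artist, song)
        if natural_key not in dim_songs:
            # Surrogate keys are assigned in first-seen order, starting at 1.
            dim_songs[natural_key] = {
                "song_key": len(dim_songs) + 1,
                "artist": artist,
                "song": song,
            }
        # The fact row keeps only the surrogate key plus the measures/context.
        fact_plays.append({
            "song_key": dim_songs[natural_key]["song_key"],
            "user_id": event["user_id"],
            "ts": event["ts"],
        })
    return list(dim_songs.values()), fact_plays


# Hypothetical sample events; note the trailing space that cleaning removes.
events = [
    {"user_id": 1, "artist": "Artist A", "song": "Song X ", "ts": 1700000000},
    {"user_id": 2, "artist": "Artist A", "song": "Song X", "ts": 1700000060},
    {"user_id": 1, "artist": "Artist B", "song": "Song Y", "ts": 1700000120},
]
dim_songs, fact_plays = build_star_schema(events)
```

Descriptive attributes are stored once in the dimension table, while the fact table holds one narrow row per event that points to it by key.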

3️⃣ Business-Ready Data (Gold Layer)

Further transformations are applied using dbt to create aggregated, business-ready data in the HDFS Gold Layer.
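
As an illustration of what "aggregated, business-ready" can mean, the sketch below rolls a fact table up to daily play counts per song. It is plain Python standing in for a dbt model; the metric, the column names, and the sample rows are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_plays_per_song(fact_plays):
    """Count plays per (UTC date, song_key), sorted by date then key."""
    counts = Counter()
    for row in fact_plays:
        # `ts` is assumed to be a Unix timestamp in seconds.
        day = datetime.fromtimestamp(row["ts"], tz=timezone.utc).date().isoformat()
        counts[(day, row["song_key"])] += 1
    return [
        {"play_date": day, "song_key": key, "plays": n}
        for (day, key), n in sorted(counts.items())
    ]


# Hypothetical Silver-layer fact rows spanning two UTC days.
fact_plays = [
    {"song_key": 1, "ts": 1700000000},
    {"song_key": 1, "ts": 1700000060},
    {"song_key": 2, "ts": 1700000120},
    {"song_key": 1, "ts": 1700086400},
]
gold = daily_plays_per_song(fact_plays)
```

Pre-aggregating like this keeps dashboard queries small, because they read one row per day and song instead of one row per event.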

4️⃣ Data Warehousing & Analytics

The Gold Layer data is loaded into ClickHouse, enabling fast analytical queries.
Metabase is connected to ClickHouse to build insightful dashboards and visualizations.
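
The README does not say how the Gold data reaches ClickHouse. One possible approach, shown here purely as an assumption, is to create a MergeTree table and fill it with ClickHouse's `hdfs` table function reading the Gold Parquet files. The sketch only builds the SQL strings; the table name, columns, and HDFS path are hypothetical.

```python
def create_table_sql(table, columns, order_by):
    """Build a CREATE TABLE statement for a MergeTree table."""
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    keys = ", ".join(order_by)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n    {cols}\n) "
        f"ENGINE = MergeTree ORDER BY ({keys})"
    )


def load_from_hdfs_sql(table, hdfs_glob):
    """Build an INSERT that reads Parquet files through the hdfs table function."""
    return f"INSERT INTO {table} SELECT * FROM hdfs('{hdfs_glob}', 'Parquet')"


# Hypothetical Gold table matching a daily-plays aggregate.
ddl = create_table_sql(
    "daily_song_plays",
    [("play_date", "Date"), ("song_key", "UInt32"), ("plays", "UInt64")],
    order_by=["play_date", "song_key"],
)
load = load_from_hdfs_sql(
    "daily_song_plays",
    "hdfs://namenode:9000/gold/daily_song_plays/*.parquet",
)
```

Ordering the table by the columns that dashboards filter on most (here the date) lets ClickHouse skip unrelated data; Metabase then queries the table like any other ClickHouse table.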

🚀 Key Features

✔️ Real-time Data Streaming with Kafka
✔️ Scalable Data Storage using HDFS
✔️ Transformations with dbt following Star Schema
✔️ Fast Querying with ClickHouse
✔️ Intuitive Data Visualizations with Metabase

Together, these components cover the full data path, from event ingestion in Kafka to dashboards in Metabase.
