MsnzmT/Spotify-BigData-Streaming


Spotify Big Data Streaming

Data Flow

📌 Overview

Spotify Big Data Streaming is a data pipeline project that processes, transforms, and analyzes simulated event data using big data tools. The transformed data is modeled as a star schema, which keeps storage organized and analytical queries efficient.

⚙️ Tech Stack

  • Event Sim → Generates simulated event data and produces it to Kafka.
  • Kafka → Acts as a message broker to handle real-time event streaming.
  • Apache Spark → Consumes Kafka messages, processes them, and stores raw data in HDFS.
  • Hadoop HDFS → Stores raw and transformed data across different layers (Bronze, Silver, Gold).
  • dbt (Data Build Tool) → Transforms data in HDFS using the Spark adapter.
  • ClickHouse → A high-performance columnar database for data warehousing.
  • Metabase → A business intelligence tool to create visualizations and charts.

🔄 Data Processing Workflow

1️⃣ Ingestion & Storage (Bronze Layer)

  • The pipeline starts with Kafka, receiving raw event data generated by Event Sim.
  • Apache Spark consumes this data from Kafka and stores it in HDFS (Bronze Layer) in Parquet format.
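To make the ingestion step more concrete, here is a minimal plain-Python sketch of decoding one raw Kafka message value and choosing a date-partitioned Bronze location for it. It is an illustration only, not the project's Spark job: the field name `ts` (assumed to be epoch milliseconds), the `/bronze/events` root, and the `date=` partition layout are all assumptions, and `bronze_partition_path` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def bronze_partition_path(raw_message: bytes, root: str = "/bronze/events") -> str:
    """Return a date-partitioned directory for one raw Kafka message value."""
    # Kafka delivers message values as bytes; here they are assumed to be JSON.
    event = json.loads(raw_message)
    # Assumption: `ts` holds the event time as epoch milliseconds.
    event_time = datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc)
    # Hypothetical layout: one directory per UTC calendar day.
    return f"{root}/date={event_time:%Y-%m-%d}"


sample = json.dumps({"ts": 1700000000000, "userId": "42", "page": "NextSong"}).encode()
print(bronze_partition_path(sample))  # /bronze/events/date=2023-11-14
```

Partitioning raw files by event date is one common choice because downstream jobs can then read only the days they need.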

2️⃣ Transformation (Silver Layer)

  • Using dbt, the raw data is cleaned, transformed, and structured into Fact and Dimension tables based on a Star Schema.
  • This processed data is stored in the HDFS Silver Layer.
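The project performs this step in dbt models. The plain-Python sketch below only illustrates the shape of a star schema split: flat events become one dimension table with surrogate keys and one fact table that references them. The field names (`userId`, `level`, `song`, `ts`) and the table names `dim_user` and `fact_plays` are assumptions made for the example.

```python
def build_star(events):
    """Split flat events into a user dimension and a plays fact table."""
    dim_user = {}    # natural key (userId) -> dimension row
    fact_plays = []  # one row per event, pointing at the dimension by surrogate key
    for event in events:
        # Create the dimension row the first time a user is seen;
        # the surrogate key is simply the next integer.
        user = dim_user.setdefault(
            event["userId"],
            {
                "user_key": len(dim_user) + 1,
                "user_id": event["userId"],
                "level": event["level"],
            },
        )
        fact_plays.append(
            {"user_key": user["user_key"], "song": event["song"], "ts": event["ts"]}
        )
    return list(dim_user.values()), fact_plays


events = [
    {"userId": "u1", "level": "free", "song": "Song A", "ts": 1700000000000},
    {"userId": "u2", "level": "paid", "song": "Song B", "ts": 1700000001000},
    {"userId": "u1", "level": "free", "song": "Song C", "ts": 1700000002000},
]
dims, facts = build_star(events)
print(len(dims), [row["user_key"] for row in facts])
```

Keeping descriptive attributes such as `level` in the dimension means the fact table stays narrow, which is the main point of the star layout.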

3️⃣ Business-Ready Data (Gold Layer)

  • Further transformations are applied using dbt to create aggregated, business-ready data in the HDFS Gold Layer.
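As an illustration of what an aggregated, business-ready table could look like, the sketch below counts plays per artist per day in plain Python. The project itself builds its Gold tables with dbt; the metric, the field names `artist` and `ts` (assumed epoch milliseconds), and the function `daily_plays_by_artist` are assumptions chosen for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_plays_by_artist(fact_rows):
    """Aggregate play events into one row per (UTC day, artist)."""
    counts = Counter()
    for row in fact_rows:
        # Assumption: `ts` is epoch milliseconds; bucket by UTC calendar day.
        day = datetime.fromtimestamp(row["ts"] / 1000, tz=timezone.utc).date().isoformat()
        counts[(day, row["artist"])] += 1
    # Sort by day, then artist, so the output order is deterministic.
    return [
        {"day": day, "artist": artist, "plays": plays}
        for (day, artist), plays in sorted(counts.items())
    ]


rows = [
    {"artist": "A", "ts": 1700000000000},
    {"artist": "A", "ts": 1700000001000},
    {"artist": "B", "ts": 1700000002000},
    {"artist": "A", "ts": 1700086400000},  # one day later
]
gold = daily_plays_by_artist(rows)
print(gold)
```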

4️⃣ Data Warehousing & Analytics

  • The Gold Layer data is loaded into ClickHouse, enabling fast analytical queries.
  • Metabase is connected to ClickHouse to build insightful dashboards and visualizations.
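The README does not say how the Gold data is loaded into ClickHouse. One possible approach, sketched here as an assumption, is to send rows in batches using ClickHouse's `JSONEachRow` input format, which expects one JSON object per line. The helper `json_each_row_batches` and the default batch size are hypothetical; actually sending each payload (for example with an `INSERT ... FORMAT JSONEachRow` statement) is left out.

```python
import json
from itertools import islice


def json_each_row_batches(rows, batch_size=1000):
    """Yield newline-delimited JSON payloads, one per batch of rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # JSONEachRow: each line is a standalone JSON object for one row.
        yield "\n".join(json.dumps(row) for row in batch)


gold_rows = [{"day": "2023-11-14", "artist": f"artist_{i}", "plays": i} for i in range(5)]
payloads = list(json_each_row_batches(gold_rows, batch_size=2))
print(len(payloads))  # 3
```

Batching matters here because ClickHouse generally handles a few large inserts better than many small ones.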

🚀 Key Features

✔️ Real-time Data Streaming with Kafka
✔️ Scalable Data Storage using HDFS
✔️ Transformations with dbt following Star Schema
✔️ Fast Querying with ClickHouse
✔️ Intuitive Data Visualizations with Metabase

Together, these components cover the full path from ingestion to analytics: events stream in through Kafka, are stored and refined across the HDFS layers, and are served from ClickHouse to Metabase dashboards.

About

Spotify BigData Streaming is a real-time data streaming and analytics pipeline that processes event data using Kafka, Spark, and Hadoop HDFS. It follows a star schema approach, transforming raw data into structured tables with dbt and loading business-ready data into ClickHouse. Finally, Metabase provides interactive visualizations for analytics.
