Spotify Big Data Management is a data pipeline project that ingests, transforms, and analyzes simulated event data using big data tools. The transformed data is modeled as a star schema (fact and dimension tables), which keeps analytical queries simple and efficient.

The pipeline is built from the following components:

- Event Sim → Generates simulated event data and produces it to Kafka.
- Kafka → Acts as a message broker to handle real-time event streaming.
- Apache Spark → Consumes Kafka messages, processes them, and stores raw data in HDFS.
- Hadoop HDFS → Stores raw and transformed data across different layers (Bronze, Silver, Gold).
- dbt (Data Build Tool) → Transforms data in HDFS using the Spark adapter.
- ClickHouse → A high-performance columnar database for data warehousing.
- Metabase → A business intelligence tool to create visualizations and charts.
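
As a rough illustration of the first component, the sketch below builds a simulated event and serializes it into the key/value byte pair that a Kafka producer would send. The event types, field names, and the helpers `make_event` and `to_kafka_record` are hypothetical; the actual Event Sim schema is not specified here and may differ.

```python
import json
import random
from datetime import datetime, timezone
from typing import Tuple

# Hypothetical event types and fields; the real Event Sim schema may differ.
EVENT_TYPES = ("play", "pause", "skip")


def make_event(rng: random.Random, user_count: int = 100, song_count: int = 1000) -> dict:
    """Build one simulated listening event as a plain dict."""
    return {
        "event_type": rng.choice(EVENT_TYPES),
        "user_id": rng.randrange(1, user_count + 1),
        "song_id": rng.randrange(1, song_count + 1),
        "ts": datetime.now(timezone.utc).isoformat(),
    }


def to_kafka_record(event: dict) -> Tuple[bytes, bytes]:
    """Serialize an event into a (key, value) byte pair for a Kafka producer.

    Using user_id as the key means Kafka's default partitioner sends all
    events of one user to the same partition, which preserves their order.
    """
    key = str(event["user_id"]).encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return key, value


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs pick the same ids
    for _ in range(3):
        key, value = to_kafka_record(make_event(rng))
        print(key, value)
```

The record pair could then be handed to any Kafka client's send call; the client library is left out so the sketch runs without a broker.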

Data flows through the pipeline in the following steps:

- The pipeline starts with Kafka, which receives the raw event data generated by Event Sim.
- Apache Spark consumes this data from Kafka and stores it in HDFS (Bronze Layer) in Parquet format.
- Using dbt, the raw data is cleaned, transformed, and structured into Fact and Dimension tables based on a Star Schema.
- This processed data is stored in the HDFS Silver Layer.
- Further transformations are applied using dbt to create aggregated, business-ready data in the HDFS Gold Layer.
- The Gold Layer data is loaded into ClickHouse, enabling fast analytical queries.
- Metabase is connected to ClickHouse to build insightful dashboards and visualizations.
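
The Bronze → Silver → Gold steps above can be sketched in plain Python. This is only an illustration of the idea: in the project itself these transformations are dbt models running on Spark, and the field names, sample rows, and the functions `build_silver` and `build_gold` below are assumptions made for the example.

```python
from collections import Counter

# Bronze: raw events as they arrive (hypothetical fields, for illustration only).
bronze = [
    {"user_id": 1, "user_name": "ana", "song_id": 10, "song_title": "Intro", "duration_s": 30},
    {"user_id": 2, "user_name": "ben", "song_id": 10, "song_title": "Intro", "duration_s": 45},
    {"user_id": 1, "user_name": "ana", "song_id": 11, "song_title": "Outro", "duration_s": 60},
    {"user_id": None, "user_name": None, "song_id": 11, "song_title": "Outro", "duration_s": 5},
]


def build_silver(events):
    """Clean raw events and split them into one fact and two dimension tables."""
    # Cleaning step: drop rows that are missing the user key.
    clean = [e for e in events if e["user_id"] is not None]

    # Dimension tables: one row per distinct user / song, keyed by id.
    dim_user = {e["user_id"]: {"user_id": e["user_id"], "user_name": e["user_name"]}
                for e in clean}
    dim_song = {e["song_id"]: {"song_id": e["song_id"], "song_title": e["song_title"]}
                for e in clean}

    # Fact table: one row per event, holding only foreign keys and measures.
    fact_plays = [{"user_id": e["user_id"], "song_id": e["song_id"],
                   "duration_s": e["duration_s"]} for e in clean]
    return fact_plays, dim_user, dim_song


def build_gold(fact_plays, dim_song):
    """Aggregate the fact table into total listening seconds per song title."""
    totals = Counter()
    for row in fact_plays:
        title = dim_song[row["song_id"]]["song_title"]  # join fact -> dimension
        totals[title] += row["duration_s"]
    return dict(totals)


if __name__ == "__main__":
    fact_plays, dim_user, dim_song = build_silver(bronze)
    print(build_gold(fact_plays, dim_song))  # {'Intro': 75, 'Outro': 60}
```

The fact table keeps only keys and measures, while descriptive attributes live in the dimension tables, so the Gold aggregate is a single join plus a group-by. That is the kind of query ClickHouse would then serve to Metabase.
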
✔️ Real-time Data Streaming with Kafka
✔️ Scalable Data Storage using HDFS
✔️ dbt Transformations Following a Star Schema
✔️ Fast Querying with ClickHouse
✔️ Intuitive Data Visualizations with Metabase
Together, these components cover the full data path, from ingestion through storage and transformation to analytics.
