A data engineering project showcasing a data pipeline built using Apache Kafka and Apache Spark, implemented in Scala.
This project demonstrates a real-time data pipeline that ingests streaming data from Apache Kafka and processes it with Apache Spark. The pipeline is written in Scala and uses Spark Streaming for real-time processing.
The following diagram illustrates the architecture of the data pipeline:
```mermaid
graph LR
    A[Kafka Producer] -- Produces data --> B((Kafka))
    B -- Consumes data --> C[Spark Streaming]
    C -- Processes data --> D((Output Destination))
```
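
As a rough illustration of that flow, the sketch below shows how a Scala application can subscribe to a Kafka topic with Spark's Kafka source and echo the raw records to the console. This is a minimal sketch, not the code in `spark/StreamingApp.scala`; the broker address (`localhost:9092`) and topic name (`input-topic`) are placeholders, and Structured Streaming is used here purely for brevity.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: read from Kafka and print records to the console.
// Broker address and topic name below are placeholders, not values from this repository.
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Spark-Kafka-Data-Pipeline")
      .master("local[*]")
      .getOrCreate()

    // Subscribe to the input topic; Kafka records arrive with binary key/value columns.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .load()

    // Cast the Kafka value bytes to a string for downstream processing.
    val lines = raw.selectExpr("CAST(value AS STRING) AS value")

    // Echo the stream to the console as a quick sanity check.
    val query = lines.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

Running a job like this requires the `spark-sql-kafka-0-10` connector on the classpath in addition to Spark itself.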
- Ingests streaming data from Apache Kafka.
- Real-time data processing using Apache Spark Streaming.
- Flexible and scalable pipeline architecture.
- Data processing operations like filtering, aggregation, and more (see the sketch after this list).
- Easily customizable for different use cases.
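
For instance, filtering and aggregation on a streaming DataFrame might look like the hypothetical snippet below. The column names (`eventType`, `amount`) and the threshold are illustrative assumptions, not taken from this repository.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical transformation: keep high-value events and count/sum them per event type.
// Column names are placeholders; adapt them to your actual schema.
def process(events: DataFrame): DataFrame =
  events
    .filter(col("amount") > 100)          // filtering
    .groupBy(col("eventType"))            // aggregation key
    .agg(
      count(lit(1)).as("eventCount"),     // how many events per type
      sum(col("amount")).as("totalAmount") // running total per type
    )
```

Note that streaming aggregations like this require the `update` or `complete` output mode when the result is written out.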
To run this data pipeline locally, follow these steps:
- Prerequisites:
  - Apache Kafka: Install and set up Kafka.
  - Apache Spark: Install and set up Spark.
- Clone the repository:

  ```
  git clone https://github.com/AnthonyByansi/Spark-Kafka-Data-Pipeline.git
  cd Spark-Kafka-Data-Pipeline
  ```

- Start the Kafka Producer:
  - Update the Kafka producer code (`kafka/Producer.scala`) with your desired data generation logic; a minimal producer sketch appears after these steps.
  - Follow the instructions in the `kafka/README.md` file to run the Kafka producer.
- Run the Spark Streaming application:
  - Update the Spark Streaming application code (`spark/StreamingApp.scala`) as per your data processing requirements.
  - Follow the instructions in the `spark/README.md` file to run the Spark Streaming application.
- Observe the data processing:
  - Monitor the output destination (e.g., file system, database, or another Kafka topic) to see the processed data; sketches of a simple producer and a Kafka output sink follow these steps.
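
For reference, a bare-bones Kafka producer in Scala might look like the following. It uses the standard `kafka-clients` API; the broker address, topic name, and generated messages are placeholders rather than the contents of `kafka/Producer.scala`.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

// Minimal producer sketch; replace the loop with your own data generation logic.
object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Send a handful of demo messages to the (placeholder) input topic.
      (1 to 10).foreach { i =>
        producer.send(new ProducerRecord[String, String]("input-topic", s"key-$i", s"message-$i"))
      }
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```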
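
If the output destination is another Kafka topic, the processed stream can be written back with Spark's Kafka sink, roughly as sketched below. Here `processed` stands for whatever streaming DataFrame your application builds (the column names match the hypothetical aggregation earlier in this README), and the topic name and checkpoint path are placeholders.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical sink: write a processed streaming DataFrame back to another Kafka topic.
// Broker address, topic name, and checkpoint path are placeholders.
def writeToKafka(processed: DataFrame): StreamingQuery =
  processed
    .selectExpr("CAST(eventType AS STRING) AS key", "CAST(totalAmount AS STRING) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/pipeline-checkpoints")
    .outputMode("update")
    .start()
```

You can then inspect the results on the output topic with the standard Kafka console consumer tool.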
Contributions are welcome! If you encounter any issues or have ideas for enhancements, feel free to open an issue or submit a pull request. Please follow the project's code of conduct.
