A data engineering project showcasing a data pipeline built using Apache Kafka and Apache Spark, implemented in Scala.
This project demonstrates a real-time data pipeline that ingests streaming data from Apache Kafka and processes it with Apache Spark. The pipeline is written in Scala and uses Spark Streaming for real-time processing.
The following diagram illustrates the architecture of the data pipeline:
```mermaid
graph LR
    A[Kafka Producer] -- Produces data --> B((Kafka))
    B -- Consumes data --> C[Spark Streaming]
    C -- Processes data --> D((Output Destination))
```
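
As a rough illustration of that flow, the sketch below shows how a Scala application can subscribe to a Kafka topic with Spark's Kafka source and echo the raw records to the console. This is a minimal sketch, not the code in `spark/StreamingApp.scala`; the broker address (`localhost:9092`) and topic name (`input-topic`) are placeholders, and Structured Streaming is used here purely for brevity.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: read from Kafka and print records to the console.
// Broker address and topic name below are placeholders, not values from this repository.
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Spark-Kafka-Data-Pipeline")
      .master("local[*]")
      .getOrCreate()

    // Subscribe to the input topic; Kafka records arrive with binary key/value columns.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .load()

    // Cast the Kafka value bytes to a string for downstream processing.
    val lines = raw.selectExpr("CAST(value AS STRING) AS value")

    // Echo the stream to the console as a quick sanity check.
    val query = lines.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

Running a job like this requires the `spark-sql-kafka-0-10` connector on the classpath in addition to Spark itself.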
- Ingests streaming data from Apache Kafka.
- Real-time data processing using Apache Spark Streaming.
- Flexible and scalable pipeline architecture.
- Data processing operations like filtering, aggregation, and more (see the sketch after this list).
- Easily customizable for different use cases.
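
For instance, filtering and aggregation on a streaming DataFrame might look like the hypothetical snippet below. The column names (`eventType`, `amount`) and the threshold are illustrative assumptions, not taken from this repository.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical transformation: keep high-value events and count/sum them per event type.
// Column names are placeholders; adapt them to your actual schema.
def process(events: DataFrame): DataFrame =
  events
    .filter(col("amount") > 100)          // filtering
    .groupBy(col("eventType"))            // aggregation key
    .agg(
      count(lit(1)).as("eventCount"),     // how many events per type
      sum(col("amount")).as("totalAmount") // running total per type
    )
```

Note that streaming aggregations like this require the `update` or `complete` output mode when the result is written out.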
To run this data pipeline locally, follow these steps:
- Prerequisites:
  - Apache Kafka: Install and set up Kafka.
  - Apache Spark: Install and set up Spark.
- Clone the repository:

  ```
  git clone https://github.com/AnthonyByansi/Spark-Kafka-Data-Pipeline.git
  cd Spark-Kafka-Data-Pipeline
  ```

- Start the Kafka Producer:
  - Update the Kafka producer code (`kafka/Producer.scala`) with your desired data generation logic; a minimal producer sketch appears after these steps.
  - Follow the instructions in the `kafka/README.md` file to run the Kafka producer.
- Run the Spark Streaming application:
  - Update the Spark Streaming application code (`spark/StreamingApp.scala`) as per your data processing requirements.
  - Follow the instructions in the `spark/README.md` file to run the Spark Streaming application.
- Observe the data processing:
  - Monitor the output destination (e.g., file system, database, or another Kafka topic) to see the processed data; sketches of a simple producer and a Kafka output sink follow these steps.
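
For reference, a bare-bones Kafka producer in Scala might look like the following. It uses the standard `kafka-clients` API; the broker address, topic name, and generated messages are placeholders rather than the contents of `kafka/Producer.scala`.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

// Minimal producer sketch; replace the loop with your own data generation logic.
object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Send a handful of demo messages to the (placeholder) input topic.
      (1 to 10).foreach { i =>
        producer.send(new ProducerRecord[String, String]("input-topic", s"key-$i", s"message-$i"))
      }
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```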
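
If the output destination is another Kafka topic, the processed stream can be written back with Spark's Kafka sink, roughly as sketched below. Here `processed` stands for whatever streaming DataFrame your application builds (the column names match the hypothetical aggregation earlier in this README), and the topic name and checkpoint path are placeholders.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical sink: write a processed streaming DataFrame back to another Kafka topic.
// Broker address, topic name, and checkpoint path are placeholders.
def writeToKafka(processed: DataFrame): StreamingQuery =
  processed
    .selectExpr("CAST(eventType AS STRING) AS key", "CAST(totalAmount AS STRING) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/pipeline-checkpoints")
    .outputMode("update")
    .start()
```

You can then inspect the results on the output topic with the standard Kafka console consumer tool.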
Contributions are welcome! If you encounter any issues or have ideas for enhancements, feel free to open an issue or submit a pull request. Please follow the project's code of conduct.
