A set of examples for streaming data processing using Apache Spark, a Twitter source, and Apache Kafka as the data producer. ElasticSearch and Kibana are used to store and display the data, respectively.
The project consists of 2 parts:
- Legacy streaming
A simple data flow from the data producer (Kafka) to the consumer (Spark). The data format is a plain String, which is then stored in ES (see the sketch after this section).
Implementation plans:
- Replace the Kafka producer String format with JSON format for the legacy Spark streaming case
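
A rough sketch of the legacy flow, assuming the Kafka topic is named `tweets`, the broker and ES node run locally, and the ES resource is `tweets/raw` (all placeholder names, not necessarily the project's actual ones):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.elasticsearch.spark.rdd.EsSpark

object LegacyStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("legacy-streaming-sketch")
      .set("es.nodes", "localhost:9200") // placeholder ES node for elasticsearch-hadoop

    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092", // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "legacy-streaming-sketch",
      "auto.offset.reset"  -> "latest"
    )

    // Plain String records coming from the Kafka producer
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("tweets"), kafkaParams))

    // Wrap each String in a document and index it into the placeholder "tweets/raw" resource
    stream
      .map(record => Map("text" -> record.value()))
      .foreachRDD(rdd => EsSpark.saveToEs(rdd, "tweets/raw"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```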
- Structured streaming
The same flow from Twitter to Kafka and Spark. The ELK stack is not used; the data is displayed on the console (for now). Areas to be explored (see the sketch after this list):
- Filtering and aggregations on data
- Windowed aggregations
- Watermarking
- Data deduplication (IN PROGRESS)
- Metrics gathering
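
A minimal Structured Streaming sketch of the Kafka-to-console flow described above; the topic name, column names, and the window/watermark durations are illustrative assumptions rather than the project's actual values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .getOrCreate()
    import spark.implicits._

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "tweets")                       // placeholder topic
      .load()

    // Treat the Kafka value as text and keep the record timestamp for windowing
    val tweets = raw.selectExpr("CAST(value AS STRING) AS text", "timestamp")

    // Filtering + windowed aggregation, with a watermark to bound late data
    val counts = tweets
      .filter($"text".isNotNull)
      .withWatermark("timestamp", "2 minutes")
      .groupBy(window($"timestamp", "1 minute"), $"text")
      .count()

    // Display the results on the console (update mode emits only changed windows)
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```

Deduplication can be expressed on the same watermarked stream with `dropDuplicates` before the aggregation.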
Future plans:
- Upgrade Spark to 2.4.0
- Multiple streams joins
- ELK stack introduction for Spark structured streaming case
- Deploy the project with Kubernetes
Spark 2.4.0 is used with the current Scala release (2.12.8).
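
For reference, a hedged `build.sbt` sketch matching the stated Spark 2.4.0 / Scala 2.12.8 combination; the exact module list is an assumption, and the ES sink would additionally need the elasticsearch-hadoop Spark connector:

```scala
scalaVersion := "2.12.8"

val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"                  % sparkVersion,
  "org.apache.spark" %% "spark-streaming"            % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion, // legacy (DStream) Kafka source
  "org.apache.spark" %% "spark-sql-kafka-0-10"       % sparkVersion  // Structured Streaming Kafka source
)
```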