This repo is a hands-on lab for streaming data from Kafka on Confluent Cloud into BigQuery, using Apache Spark Structured Streaming on Dataproc Serverless Spark. It aims to demystify the products showcased rather than to build a perfect streaming application. It features a minimum viable example that joins a stream from Kafka with a static source in BigQuery and writes the result to BigQuery.
Data engineers
- Access to Google Cloud and Confluent Kafka
- Basic knowledge of the Google Cloud services featured in the lab, Kafka, and Spark is helpful
1 hour from start to completion
< $100
- Just enough knowledge of Confluent Kafka on GCP for streaming
- Just enough knowledge of Dataproc Serverless for Spark
- Just enough Terraform that can be repurposed for your use case
- Quickstart code that can be repurposed for your use case
About Dataproc Serverless Spark Batches:
Fully managed, autoscalable, secure Spark jobs as a service that eliminates administration overhead and resource contention, simplifies development and accelerates speed to production. Learn more about the service here.
- Find templates that accelerate speed to production here
- Want Google Cloud to train you on Serverless Spark for free? Reach out to us here
- Try out our other Serverless Spark centric hands-on labs here
Note: The above notebook environment is not covered in this lab, but is showcased in our Spark MLOps lab.
The use case is a basic sales and marketing scenario centered on campaigns and promotions. Assume that users log on to a website, their data is streamed to Kafka, and they are automatically entered into a promotion (a lotto) for a trip.
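
To make the use case concrete, here is a minimal plain-Python sketch of the join logic: login events arrive as JSON messages, and each one is matched against a static lookup table. This is not the lab's Spark code. The field names (`user_id`, `login_ts`, `name`, `region`) and the sample rows are hypothetical, and a dictionary stands in for the static BigQuery table.

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical static reference rows, standing in for the static BigQuery table.
STATIC_USERS: Dict[str, dict] = {
    "u1": {"name": "Ada", "region": "EMEA"},
    "u2": {"name": "Lin", "region": "APAC"},
}


def enrich(messages: Iterable[bytes], static_rows: Dict[str, dict]) -> Iterator[dict]:
    """Parse each Kafka-style message value and inner-join it on user_id."""
    for raw in messages:
        event = json.loads(raw.decode("utf-8"))
        match = static_rows.get(event["user_id"])
        if match is None:
            continue  # inner join: events without a static match are dropped
        yield {**event, **match}


# Hypothetical login events, as they might appear in Kafka message values.
sample_stream = [
    b'{"user_id": "u1", "login_ts": "2023-01-01T10:00:00Z"}',
    b'{"user_id": "u9", "login_ts": "2023-01-01T10:00:05Z"}',
]

for row in enrich(sample_stream, STATIC_USERS):
    print(row)
```

In Spark Structured Streaming the same idea is expressed as a join between a streaming DataFrame and a static DataFrame; the sketch only illustrates the per-event semantics.
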
# | Module |
---|---|
Module 1 | Provision the Google Cloud environment with Terraform |
Module 2 | Provision the Confluent Cloud environment |
Module 3 | Publish events to Kafka |
Module 4 | Spark Structured Streaming Kafka consumer - basic |
Module 5 | Spark Structured Streaming Kafka consumer - join with static data |
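
Modules 4 and 5 consume from Kafka with Spark Structured Streaming. As a rough sketch of what that typically involves, the helper below assembles the options that Spark's Kafka source generally needs to reach a SASL_SSL-secured cluster such as Confluent Cloud. The function name, its parameters, and the placeholder values are hypothetical and are not taken from the lab's code; the option keys are standard Spark Kafka source and Kafka client settings.

```python
from typing import Dict


def confluent_kafka_options(
    bootstrap_servers: str, api_key: str, api_secret: str, topic: str
) -> Dict[str, str]:
    """Build Spark Kafka source options for a SASL_SSL / PLAIN secured cluster."""
    # JAAS entry for the Kafka client's PLAIN login module.
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{api_key}" password="{api_secret}";'
    )
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "subscribe": topic,
        # Assumption: read only new events; use "earliest" to replay the topic.
        "startingOffsets": "latest",
    }


# Placeholder values only; substitute your own cluster endpoint and credentials.
options = confluent_kafka_options(
    "<BOOTSTRAP_HOST>:9092", "<API_KEY>", "<API_SECRET>", "<TOPIC>"
)

# With an active SparkSession named `spark`, the options could be applied as:
#   events = spark.readStream.format("kafka").options(**options).load()
```

Keeping the credentials as function arguments rather than literals makes it easier to load them from a secret store instead of committing them to the repo.
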
Shut down/delete resources when done to avoid unnecessary billing.
# | Collaborators | Company | Contribution |
---|---|---|---|
1. | Anagha Khanolkar | Google Cloud | Author of Spark application |
2. | Elena Cuevas | Confluent | Lab vision & Kafka producer code |
Community contributions to improve the lab are very much appreciated.
If you have any questions or find any problems with this repository, please report them through GitHub issues.