Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service

GoogleCloudPlatform/serverless-spark-workshop

Serverless Spark Solution Accelerators

Apache Spark is often used for interactive queries, machine learning, and real-time workloads.

Spark developers typically spend only 40% of their time writing code and the remaining 60% tuning infrastructure and managing clusters. There's a better way.

Google Cloud customers have used our auto-scaling, serverless Spark to boost productivity and reduce infrastructure costs.

This repository contains Serverless Spark on GCP solution accelerators built around common use cases - helping data engineers and data scientists with Apache Spark experience ramp up faster on Serverless Spark on GCP.

Feedback From Serverless Spark Users

  • "Our use case is to optimize retail assortment from 500M+ items. Serverless Spark enables us to only use the compute resources when we need them and all with a single click."
    ~ Dataproc customer who set up these production pipelines in one week

  • “Job that took 90 minutes on a manually tuned cluster took 19 minutes to finish with Serverless Spark. Every new data pipeline will start on Serverless Spark.”
    ~ Principal Architect at multinational retail corporation

What's Covered?

| # | Solution Accelerators | Focus | Feature | Contributed By |
| --- | --- | --- | --- | --- |
| 0 | Serverless Spark Project Setup | Prerequisites | Terraform | TEKsystems |
| 1 | Telco Anomaly Detection | Data Engineering | Rules based processing to detect defective cell towers requiring maintenance via Serverless Spark Batch + BigLake to create GCS external tables in PARQUET and CSV + dbt to implement a data pipeline + Terraform to deploy required cloud infrastructure | TEKsystems and Anagha Khanolkar, then refactored by Luis Velasco to include BigLake, dbt, and Terraform |
| 2 | Retail Store Analytics | Data Analysis | Analysis of retail data to identify product sales, and recommend product aisles and inventory via Serverless Spark Batch from CLI with Cloud Composer orchestration and Dataproc Metastore | TEKsystems |
| 3 | Pandemic Economic Impact | Data Analysis | Vertex AI notebooks with Serverless Spark session | TEKsystems |
| 4 | Time Series Forecasting of Sales | Data Analysis | Vertex AI notebooks with Serverless Spark session | TEKsystems |
| 5 | Real-Time Streaming of Customer Invoices | Spark Streaming | Serverless Spark Dataproc Batches | TEKsystems |
| 6 | Malware Detection | Data Analysis | Serverless Spark Batch from CLI with Cloud Composer orchestration | TEKsystems |
| 7 | Social Media Data Analytics | Data Analysis | Vertex AI notebooks with Serverless Spark session | TEKsystems |
| 8 | Telco Anomaly Detection (with Dataproc UI Instructions) | Data Engineering | Serverless Spark Batch from CLI with Cloud Composer orchestration, Dataproc UI instructions, and the Persistent History Server (for viewing completed and running Spark jobs) | TEKsystems |
| 9 | Pandemic Economic Impact (Batches) | Data Engineering | Serverless Spark Dataproc Batches | TEKsystems |
| 10 | Retail Store Analytics - Spark SQL | SQL Data Analysis | Spark SQL run on Serverless Spark Batch with Dataproc Metastore | TEKsystems |
| 11 | Telco Customer Churn Prediction | ML Ops | Powered by Dataproc Serverless, showcasing integration with Vertex AI Workbench | Anagha Khanolkar |
| 12 | Sales and Marketing Campaign and Promotion Streaming Application | Streaming Analytics | Streaming from Kafka into BigQuery, with Apache Spark Structured Streaming powered by Dataproc Serverless | Anagha Khanolkar |
| 13 | Telco Anomaly Detection (with row level security) | Data Engineering | Identifying defective cell towers for maintenance: using Terraform to deploy GCP components, using BigLake to create GCS external tables in PARQUET and CSV file formats and to unify row access policies from BigQuery and Serverless Spark, and doing ELT, ML, data governance, and orchestration with BigQuery integrations (Dataform, BQML, BI Engine, Dataplex) | Luis Velasco |
| 14 | Spark MLOps Pipeline | Data Scientist | Spark MLlib based scalable machine learning on Google Cloud, powered by Dataproc Serverless Spark and showcasing integration with the Vertex AI AI/ML platform (Dataproc, BigQuery, Vertex AI, Google Cloud Storage, Cloud Composer, Cloud Functions, Cloud Scheduler) | Anagha Khanolkar and TEKsystems |
| 15 | Daily Covid Data Analysis | Data Engineering | Serverless Spark Dataproc Batches | TEKsystems |
| 16 | Customer Churn Prediction using Vertex AI | Data Engineering & Data Scientist | Serverless Spark Interactive Sessions through Vertex AI | TEKsystems |
| 17 | Loan Data Analysis | Data Engineering | Using Delta Lake with Dataproc Serverless Spark on GCP via Jupyter notebooks on Vertex AI Workbench managed notebooks | Anagha Khanolkar |
| 18 | Pandemic Economic Impact (Scala) | Data Engineering | Serverless Spark Dataproc Batches | TEKsystems |
| 19 | BigQuery Shakespeare Word Count | Data Engineering | Apache Spark Stored Procedures in BigQuery | TEKsystems |
| 20 | Wikipedia Page Views Analysis demonstrating auto-scaling | Data Analytics | Serverless Spark Dataproc Batches, BigQuery | Anagha Khanolkar and TEKsystems |
| 21 | Game of Thrones Graph Dataset Analysis | Data Engineering | Serverless Spark Dataproc Batches, BigQuery | TEKsystems |
| 22 | Customer Churn Rate Prediction using BigLake tables | Data Engineering & Data Scientist | Serverless Spark Dataproc Batches, BigQuery, BigLake | TEKsystems |
| 23 | Game of Thrones Graph Dataset Analysis using R | Data Engineering | Serverless Spark Dataproc Batches, BigQuery | TEKsystems |
| 24 | Spark dataframe analysis on Vertex Generative AI | GenAI | Integrates LLM text generation capabilities with Apache Spark, powered by Vertex AI on Google Cloud | Luis Velasco |
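
Several of the accelerators listed above run as Serverless Spark batches submitted from the CLI. As a rough illustration of what such a submission looks like, the sketch below assembles a `gcloud dataproc batches submit pyspark` command in Python. The helper name `build_batch_submit_command` and all project, bucket, cluster, and file names are hypothetical placeholders, not values taken from this repository; each accelerator's own instructions remain the reference for the exact commands it uses.

```python
import shlex
from typing import List, Optional, Sequence


def build_batch_submit_command(
    main_python_file: str,
    project: str,
    region: str,
    deps_bucket: Optional[str] = None,
    history_server_cluster: Optional[str] = None,
    job_args: Sequence[str] = (),
) -> List[str]:
    """Assemble (but do not run) a Serverless Spark batch submission command.

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    cmd = [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        main_python_file,
        f"--project={project}",
        f"--region={region}",
    ]
    if deps_bucket:
        # Cloud Storage bucket used to stage local dependencies.
        cmd.append(f"--deps-bucket={deps_bucket}")
    if history_server_cluster:
        # Resource name of a Persistent History Server cluster.
        cmd.append(f"--history-server-cluster={history_server_cluster}")
    if job_args:
        # Everything after "--" is passed through to the PySpark script.
        cmd.append("--")
        cmd.extend(job_args)
    return cmd


if __name__ == "__main__":
    # All values below are placeholders (assumptions), not real resources.
    command = build_batch_submit_command(
        main_python_file="gs://my-code-bucket/jobs/example_job.py",
        project="my-gcp-project",
        region="us-central1",
        deps_bucket="gs://my-staging-bucket",
        history_server_cluster=(
            "projects/my-gcp-project/regions/us-central1/clusters/my-phs"
        ),
        job_args=["--input", "gs://my-data-bucket/input/"],
    )
    # Print a shell-safe command line that can be reviewed before running it.
    print(shlex.join(command))
```

Building the command as a list and printing it with `shlex.join` keeps the example side-effect free: nothing is submitted until the printed command is reviewed and run in an environment with the Google Cloud CLI configured.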

Contributing

See the contributing instructions to start contributing.

License

All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.

Disclaimer

This repository and its contents are not an official Google Product.

Serverless Spark Templates

Check out this repository for Dataproc Serverless ready-to-use, config driven Spark templates for solving simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations.

Serverless Spark Environment Provisioning, Configuring, and Automation

Check out this repository for how to use Terraform to provision, configure, and automate Data Analytics services on GCP.

Contact

Interested in a free, guided, and hands-on Spark Workshop to run these solution accelerators in your GCP environment? Please fill out this form.