A production-ready, end-to-end real-time streaming analytics platform running on Kubernetes with Kafka, Spark, Trino, and Apache Iceberg.
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Data Sources   │───▶│  Kafka Cluster  │───▶│ Spark Streaming │
│   (Producers)   │    │    (Strimzi)    │    │   (K8s Jobs)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Trino Query   │◀───│ Apache Iceberg  │◀───│  Spark Output   │
│     Engine      │    │   (S3/MinIO)    │    │     (Spark)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
Prerequisites:

- Google Cloud Platform account
- kubectl configured
- Helm 3.x installed
- Terraform installed
Deployment steps:

- Deploy Infrastructure:

  ```bash
  cd terraform
  cp terraform.tfvars.example terraform.tfvars
  # Edit terraform.tfvars with your project_id
  terraform init && terraform apply
  ```
- Deploy Kafka:

  ```bash
  cd ../helm
  helm repo add strimzi https://strimzi.io/charts/   # add the Strimzi chart repo if missing
  helm install strimzi strimzi/strimzi-kafka-operator --namespace kafka --create-namespace
  helm install kafka ./kafka --namespace kafka
  kubectl apply -f kafka-topics.yaml
  ```
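  For reference, topics in kafka-topics.yaml are declared through Strimzi's KafkaTopic CRD. A minimal sketch, assuming a topic named `events` and a Kafka cluster named `kafka` (match `strimzi.io/cluster` to your cluster's name):

  ```yaml
  # Hypothetical topic definition; name, sizing, and retention are assumptions.
  apiVersion: kafka.strimzi.io/v1beta2
  kind: KafkaTopic
  metadata:
    name: events                  # assumed topic name
    namespace: kafka
    labels:
      strimzi.io/cluster: kafka   # must match the Kafka CR's metadata.name
  spec:
    partitions: 12                # allows up to 12 parallel consumers
    replicas: 3                   # tolerates a single-broker failure
    config:
      retention.ms: 86400000      # keep 24 hours of data
  ```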
- Deploy Producer:

  ```bash
  cd ../kafka-producer
  docker build -t streaming-producer .
  # Push the image to a registry your cluster can pull from before applying.
  kubectl apply -f k8s/
  ```
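  The manifests under k8s/ are not reproduced here; a minimal Deployment sketch, assuming the image tag from the build step and a KAFKA_BOOTSTRAP_SERVERS variable read by the producer:

  ```yaml
  # Hypothetical producer Deployment; image, env var name, and sizing are assumptions.
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: streaming-producer
    namespace: kafka
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: streaming-producer
    template:
      metadata:
        labels:
          app: streaming-producer
      spec:
        containers:
          - name: producer
            image: streaming-producer:latest   # use your registry's path
            env:
              - name: KAFKA_BOOTSTRAP_SERVERS
                # Strimzi exposes a bootstrap service named <cluster>-kafka-bootstrap
                value: kafka-kafka-bootstrap.kafka.svc:9092
            resources:
              requests:
                cpu: 250m
                memory: 256Mi
  ```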
- Deploy Spark:

  ```bash
  cd ../spark-jobs
  docker build -t streaming-spark .
  # As above, push the image to a reachable registry first.
  kubectl apply -f spark-application.yaml
  ```
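  If spark-application.yaml targets the Kubernetes Spark Operator (a plausible reading of the filename), it might look roughly like this; the main file path, Spark version, and sizing are assumptions:

  ```yaml
  # Hypothetical SparkApplication; adjust image, paths, and resources to your job.
  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: streaming-job
    namespace: spark
  spec:
    type: Python
    mode: cluster
    image: streaming-spark:latest                           # image built above
    mainApplicationFile: local:///opt/app/streaming_job.py  # assumed path in the image
    sparkVersion: "3.5.0"
    restartPolicy:
      type: Always               # keep the long-running streaming job alive
    driver:
      cores: 1
      memory: 2g
      serviceAccount: spark
    executor:
      instances: 3
      cores: 2
      memory: 4g
  ```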
- Deploy Trino:

  ```bash
  cd ../trino
  helm install trino ./trino --namespace trino --create-namespace
  ```
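  The chart's values are not shown here; if it follows the community Trino chart's `catalogs` convention, an Iceberg catalog backed by MinIO might be wired up like this (metastore URI, endpoint, and credentials are assumptions; don't commit real secrets):

  ```yaml
  # Hypothetical Helm values snippet defining an Iceberg catalog.
  catalogs:
    iceberg: |
      connector.name=iceberg
      iceberg.catalog.type=hive_metastore
      hive.metastore.uri=thrift://hive-metastore.metastore.svc:9083
      hive.s3.endpoint=http://minio.minio.svc:9000
      hive.s3.path-style-access=true
      hive.s3.aws-access-key=minio
      hive.s3.aws-secret-key=minio123
  ```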
- Deploy Monitoring:

  ```bash
  cd ../observability
  kubectl apply -f prometheus/
  kubectl apply -f grafana/
  kubectl apply -f alerts/
  ```
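  As an example of what alerts/ might contain, here is a consumer-lag rule via the Prometheus Operator's PrometheusRule CRD; the metric assumes a Kafka exporter publishing kafka_consumergroup_lag, and the threshold is arbitrary:

  ```yaml
  # Hypothetical alert rule; requires Prometheus Operator CRDs and a Kafka exporter.
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: streaming-alerts
    namespace: monitoring
  spec:
    groups:
      - name: kafka
        rules:
          - alert: KafkaConsumerLagHigh
            expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 100000
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Consumer lag above 100k on {{ $labels.topic }} for 5 minutes"
  ```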
Components:

- Terraform: Infrastructure as Code for the GKE cluster
- Kafka: Distributed streaming platform, managed by the Strimzi operator
- Spark: Real-time stream processing with PySpark
- Trino: Interactive SQL query engine
- Iceberg: Open table format for the data lake
- Prometheus: Metrics collection and monitoring
- Grafana: Visualization and dashboards
Performance targets:

- Throughput: 100k+ events/second
- Latency: < 30 seconds end-to-end
- Scalability: auto-scaling based on load
- Reliability: fault-tolerant processing with Spark checkpointing
MIT License