Complete end-to-end real-time streaming analytics platform with Kafka, Spark, Trino, and Iceberg on Kubernetes

suhasramanand/streaming-analytics-platform

Real-Time Streaming Analytics Platform

A production-ready, end-to-end real-time streaming analytics platform running on Kubernetes with Kafka, Spark, Trino, and Apache Iceberg.

Architecture Overview

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Sources  │───▶│   Kafka Cluster │───▶│  Spark Streaming│
│   (Producers)   │    │   (Strimzi)     │    │   (K8s Jobs)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Trino Query   │◀───│  Apache Iceberg │◀───│   Spark Output  │
│   Engine        │    │   (S3/MinIO)    │    │   (Spark)       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
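
The Kafka → Spark → Iceberg leg of the diagram can be sketched in PySpark roughly as follows. This is a minimal sketch and not the repository's actual job: the bootstrap address, topic, table name, and checkpoint path are all hypothetical placeholders. It also assumes the Spark image ships the Kafka and Iceberg runtime jars and that the target Iceberg table already exists.

```python
from typing import Dict

# Hypothetical values for illustration; none of them come from the repository.
KAFKA_BOOTSTRAP = "kafka-kafka-bootstrap.kafka.svc:9092"  # assumed Strimzi bootstrap service
TOPIC = "events"                                          # assumed topic name
ICEBERG_TABLE = "lake.analytics.events"                   # assumed catalog.database.table
CHECKPOINT_DIR = "s3a://checkpoints/events"               # assumed checkpoint location


def kafka_source_options(bootstrap: str, topic: str) -> Dict[str, str]:
    """Build the options passed to Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def run_stream() -> None:
    """Read from Kafka and append to an Iceberg table (requires pyspark)."""
    # Imported lazily so the module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-iceberg-sketch").getOrCreate()
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(KAFKA_BOOTSTRAP, TOPIC))
        .load()
    )
    # Kafka delivers the value as binary; cast it to a string column.
    events = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    query = (
        events.writeStream.format("iceberg")
        .outputMode("append")
        # The checkpoint lets a restarted job resume from its last committed offsets.
        .option("checkpointLocation", CHECKPOINT_DIR)
        .toTable(ICEBERG_TABLE)
    )
    query.awaitTermination()
```

Calling `run_stream()` inside the Spark driver would start the query. Trino can then read the same Iceberg table once its catalog points at the same metadata.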

Quick Start

Prerequisites

  • Google Cloud Platform account
  • kubectl configured
  • Helm 3.x installed
  • Terraform installed
  • Docker installed (used to build the producer and Spark images)
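
Before starting, it can help to confirm that the command-line tools are on your PATH. The script below is a small, hypothetical preflight check and is not part of the repository. It looks only for the executables the deployment steps invoke (`kubectl`, `helm`, `terraform`, `docker`) and does not verify versions or cluster access.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executables invoked by the deployment steps in this README.
REQUIRED_TOOLS = ["kubectl", "helm", "terraform", "docker"]


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, in input order."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```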

Deployment Steps

  1. Deploy Infrastructure:

    cd terraform
    cp terraform.tfvars.example terraform.tfvars
    # Edit terraform.tfvars with your project_id
    terraform init && terraform apply
  2. Deploy Kafka:

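    cd ../helm
    # Register the Strimzi chart repository so the strimzi/ chart reference resolves
    helm repo add strimzi https://strimzi.io/charts/
    helm repo update
    helm install strimzi strimzi/strimzi-kafka-operator --namespace kafka --create-namespace
    helm install kafka ./kafka --namespace kafka
    kubectl apply -f kafka-topics.yaml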
  3. Deploy Producer:

    cd ../kafka-producer
    docker build -t streaming-producer .
    # Push the image to a registry the cluster can pull from before applying
    kubectl apply -f k8s/
  4. Deploy Spark:

    cd ../spark-jobs
    docker build -t streaming-spark .
    # Push the image to a registry the cluster can pull from before applying
    kubectl apply -f spark-application.yaml
  5. Deploy Trino:

    cd ../trino
    helm install trino ./trino --namespace trino --create-namespace
  6. Deploy Monitoring:

    cd ../observability
    kubectl apply -f prometheus/
    kubectl apply -f grafana/
    kubectl apply -f alerts/

Components

  • Terraform: Infrastructure as Code for GKE cluster
  • Kafka: Distributed streaming platform with Strimzi operator
  • Spark: Real-time stream processing with PySpark
  • Trino: Interactive SQL query engine
  • Iceberg: Table format for data lake
  • Prometheus: Metrics collection and monitoring
  • Grafana: Visualization and dashboards

Performance

  • Throughput: 100k+ events/second
  • Latency: < 30 seconds end-to-end
  • Scalability: Auto-scaling based on load
  • Reliability: Fault-tolerant with checkpointing
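
To put the throughput and latency figures in context, here is a back-of-the-envelope calculation. The event rate and latency bound are the README's numbers. The average event size is an assumption made only for illustration, so treat the storage estimate as a sketch rather than a measured result.

```python
EVENTS_PER_SECOND = 100_000   # throughput figure from this README
MAX_LATENCY_SECONDS = 30      # end-to-end latency bound from this README
AVG_EVENT_BYTES = 500         # ASSUMPTION: event size is not stated in the README
SECONDS_PER_DAY = 86_400


def events_in_flight(rate: int, latency_seconds: int) -> int:
    """Little's law (L = rate * wait): events produced but not yet queryable."""
    return rate * latency_seconds


def daily_events(rate: int) -> int:
    """Events ingested per day at a sustained rate."""
    return rate * SECONDS_PER_DAY


def daily_bytes(rate: int, event_bytes: int) -> int:
    """Raw (uncompressed) bytes ingested per day."""
    return daily_events(rate) * event_bytes


if __name__ == "__main__":
    print(events_in_flight(EVENTS_PER_SECOND, MAX_LATENCY_SECONDS))  # 3000000
    print(daily_events(EVENTS_PER_SECOND))                           # 8640000000
    print(daily_bytes(EVENTS_PER_SECOND, AVG_EVENT_BYTES) / 1e12)    # 4.32 (TB)
```

At the stated rate, up to about 3 million events can sit between producer and queryable table at any moment, and a sustained day of traffic is 8.64 billion events.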

License

MIT License
