glassflow/clickhouse-etl

GlassFlow for ClickHouse Streaming ETL

GlassFlow is an open-source ETL tool that enables real-time data processing from Kafka to ClickHouse. GlassFlow pipelines can perform the following operations:

  • Deduplicate: Remove duplicate records based on configurable keys and time windows - use when you need to ensure data uniqueness
  • Join: Perform temporal joins between multiple Kafka topics - use when combining related data streams with time-based matching
  • Deduplicate & Join: Combine both deduplication and joining in a single pipeline
  • Ingest only: Direct data transfer from Kafka to ClickHouse without transformations
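As a rough illustration of deduplication "based on configurable keys and time windows", the following minimal sketch drops any event whose key was already seen within the window. It is not GlassFlow's implementation: the function name `deduplicate`, the field names `event_id` and `ts`, and the assumption that events arrive in timestamp order are all hypothetical choices made for this example.

```python
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def deduplicate(
    events: Iterable[Event],
    key_field: str,
    ts_field: str,
    window_seconds: float,
) -> Iterator[Event]:
    """Yield events whose key was not already kept within `window_seconds`.

    Assumes events arrive ordered by timestamp (an assumption of this sketch).
    """
    # key -> timestamp of the most recent event that was kept for that key
    kept_at: Dict[Any, float] = {}
    for event in events:
        key, ts = event[key_field], event[ts_field]
        previous = kept_at.get(key)
        if previous is not None and ts - previous < window_seconds:
            continue  # duplicate inside the window: drop it
        kept_at[key] = ts
        yield event


# Hypothetical sample data: timestamps are plain seconds.
sample_events = [
    {"event_id": "a", "ts": 0},
    {"event_id": "b", "ts": 1},
    {"event_id": "a", "ts": 5},   # same key, 5 s after the kept event: dropped
    {"event_id": "a", "ts": 12},  # same key, 12 s after the kept event: kept
]
unique_events = list(deduplicate(sample_events, "event_id", "ts", window_seconds=10))
print([e["ts"] for e in unique_events])
```

The dictionary here grows without bound; a production system would typically expire old keys once they fall outside the window.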

⚡️ Quick Start

This guide walks you through a local installation using Docker Compose — perfect for development, testing, or trying out GlassFlow on your machine.

Explore more demos and learn how to build pipelines via the UI in our docs. To start creating your own pipelines, follow the Usage Guide.

  1. Clone the repository:
git clone https://github.com/glassflow/clickhouse-etl.git
cd clickhouse-etl
  2. Go to the demos folder and start the services:
cd demos
docker compose up -d

This starts GlassFlow, Kafka, and ClickHouse in Docker containers.

  3. Once the services are up, run the demo script, which creates a Kafka topic, creates a ClickHouse table, and sets up a GlassFlow pipeline. The script is written in Python, so you need Python installed along with the required dependencies:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python demo_deduplication.py --num-records 10000 --duplication-rate 0.1

This sends 10000 records to the Kafka topic (with 10% duplicates).
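To make the effect of a duplication rate concrete, here is a small sketch that builds a list of records in which roughly the given fraction are repeats of earlier records. This is not the code of `demo_deduplication.py`; the function `generate_events` and the fields `event_id` and `value` are hypothetical names chosen for illustration.

```python
import random
from typing import Dict, List, Optional


def generate_events(
    num_records: int,
    duplication_rate: float,
    seed: Optional[int] = None,
) -> List[Dict[str, object]]:
    """Return `num_records` events; each one is a repeat with probability `duplication_rate`."""
    rng = random.Random(seed)
    events: List[Dict[str, object]] = []
    next_id = 0
    for _ in range(num_records):
        if events and rng.random() < duplication_rate:
            # Resend a copy of a randomly chosen earlier event.
            events.append(dict(rng.choice(events)))
        else:
            # Emit a new event with a fresh identifier.
            events.append({"event_id": f"event-{next_id}", "value": rng.randint(0, 1000)})
            next_id += 1
    return events


events = generate_events(num_records=10000, duplication_rate=0.1, seed=42)
unique_ids = {e["event_id"] for e in events}
print(f"total={len(events)} unique={len(unique_ids)}")
```

With a rate of 0.1, about 9000 of the 10000 records carry distinct identifiers, which is roughly the row count you would expect to remain after deduplication.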

  4. Access the web interface at http://localhost:8080 to view the demo pipeline.

  5. View the logs:

# Follow logs in real-time for all containers
docker compose logs -f

# Follow logs for the backend API
docker compose logs -f api

# Follow logs for the UI
docker compose logs -f ui

🧭 Installation Options

GlassFlow is open source and can be self-hosted on Kubernetes. It works with any managed Kubernetes service, such as AWS EKS, GKE, and AKS. For local testing or a small proof of concept, you can also run GlassFlow on your own machine with Docker and Docker Compose.

Method | Use Case | Docs Link
☸️ Kubernetes with Helm | Kubernetes deployment | Kubernetes Helm Guide
🐳 Local with Docker Compose | Quick evaluation and local testing | Local Docker Guide
☁️ AWS EC2 with Docker Compose | Lightweight cloud deployment for testing | AWS EC2 Guide

🎥 Demo

Live Preview

Log in and see a working demo of GlassFlow running on a GCP cluster at demo.glassflow.dev. You will see a Grafana dashboard and the setup that we used.

GlassFlow Pipeline Data Flow

GlassFlow Pipeline showing real-time streaming from Kafka through GlassFlow to ClickHouse

Demo Video

GlassFlow Overview Video

📚 Documentation

For detailed documentation, visit docs.glassflow.dev.

🗺️ Roadmap

Check out our public roadmap to see what's coming next in GlassFlow. We're actively working on new features and improvements based on community feedback.

Want to suggest a feature? We'd love to hear from you! Please use our GitHub Discussions to share your ideas and help shape the future of GlassFlow.

✨ Features

  • Streaming deduplication and joins over windows of up to 7 days through a built-in state store
  • ClickHouse sink with a native protocol for high performance
  • Built-in Kafka connector with support for SASL, SSL, and other options, compatible with nearly all Kafka providers
  • Dead-Letter Queue for handling failed events
  • Field mapping from your Kafka topic fields to ClickHouse table columns
  • Prometheus metrics and OpenTelemetry logs for comprehensive observability
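The idea of a time-bounded join backed by a state store can be sketched as follows. This is only an illustration under stated assumptions, not GlassFlow's implementation: the class `TemporalJoiner`, its methods, and the field names `order_id`, `ts`, `item`, and `status` are hypothetical, state is kept in a plain dictionary, and right-side events are assumed to arrive in timestamp order after the left-side events they match.

```python
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]


class TemporalJoiner:
    """Buffer left-side events per key and match right-side events within a time window."""

    def __init__(self, key_field: str, ts_field: str, window_seconds: float) -> None:
        self.key_field = key_field
        self.ts_field = ts_field
        self.window_seconds = window_seconds
        self._left: Dict[Any, List[Event]] = {}  # in-memory stand-in for a state store

    def add_left(self, event: Event) -> None:
        """Remember a left-side event until it falls outside the window."""
        self._left.setdefault(event[self.key_field], []).append(event)

    def match_right(self, event: Event) -> List[Tuple[Event, Event]]:
        """Return (left, right) pairs whose timestamps differ by at most the window."""
        key, now = event[self.key_field], event[self.ts_field]
        # Keep only buffered left events that are still inside the window.
        fresh = [
            left for left in self._left.get(key, [])
            if now - left[self.ts_field] <= self.window_seconds
        ]
        if fresh:
            self._left[key] = fresh
        else:
            self._left.pop(key, None)  # evict expired state for this key
        return [(left, event) for left in fresh]


joiner = TemporalJoiner("order_id", "ts", window_seconds=60)
joiner.add_left({"order_id": 1, "ts": 0, "item": "book"})
joiner.add_left({"order_id": 2, "ts": 5, "item": "pen"})

pairs = joiner.match_right({"order_id": 1, "ts": 30, "status": "paid"})
late = joiner.match_right({"order_id": 2, "ts": 120, "status": "paid"})
print(len(pairs), len(late))
```

In this sketch a 7-day window would correspond to `window_seconds=604800`; the longer the window, the more state has to be retained per key.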

🆘 Support

⚖️ License

This project is licensed under the Apache License 2.0.