This project is a Change Data Capture (CDC) System designed to demonstrate real-time data processing and analytics capabilities. It leverages streaming technologies to monitor database changes, transform the data, and provide insights in real time.
The system highlights data engineering concepts such as event-driven architecture, fault tolerance, and latency management using watermarks, applied to a realistic end-to-end scenario.
- 📡 Real-Time Data Capture: Monitors changes in a PostgreSQL database and streams them as events.
- ⚙️ Event Processing with Flink: Processes and analyzes data streams, applying transformations and basic analytics.
- 🛡️ Fault Tolerance: Uses a three-broker Kafka setup to ensure high availability.
- 🔍 Custom Health Checks and Scheduling: Enhances Docker Compose orchestration reliability (see the sketch after this list).
- 📊 Real-Time Dashboards: Streams processed data back to the database for visualization.
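
To make the health-check idea concrete, a dependent container can be gated on Kafka Connect's REST API becoming reachable. The snippet below is a minimal sketch, assuming Connect listens on `kafka-connect:8083`; the host name and timeout are placeholders, not this project's actual configuration.

```python
"""Minimal readiness probe: succeed once Kafka Connect answers over REST."""
import sys
import urllib.request

CONNECT_URL = "http://kafka-connect:8083/connectors"  # hypothetical host/port


def connect_is_ready(url: str = CONNECT_URL, timeout: float = 5.0) -> bool:
    """Return True when the Kafka Connect REST API responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, HTTP error, timeout
        return False


if __name__ == "__main__":
    sys.exit(0 if connect_is_ready() else 1)
```

A Docker Compose `healthcheck` can invoke such a script (or an equivalent `curl` test) so that containers depending on Kafka Connect start only after it reports healthy.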
- Data Ingestion: On startup, a Python worker ingests data into a PostgreSQL database and configures Kafka Connect (see the registration sketch after this list).
- Change Data Capture: Debezium, integrated with Kafka Connect, converts database changes into Kafka events.
- Stream Processing: Apache Flink processes Kafka events, applying transformations and analytics.
- Output Pipelines: The processed data is sent back to Kafka topics and sunk into a database for dashboard consumption.
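
As a sketch of the worker's startup step, a Debezium Postgres source connector can be registered through Kafka Connect's REST API. The connector name, hosts, credentials, and table list below are illustrative placeholders rather than this project's real settings; only the `connector.class` and the general registration flow follow Debezium's documented usage.

```python
"""Sketch: register a Debezium Postgres source connector at worker startup."""
import json
import urllib.request

CONNECT_URL = "http://kafka-connect:8083/connectors"  # hypothetical endpoint

# Illustrative connector config; host names, credentials, and the table
# list are placeholders, not this project's real settings.
connector = {
    "name": "taxi-postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # Postgres logical decoding plugin
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "taxi",
        "topic.prefix": "cdc",                # topics become cdc.<schema>.<table>
        "table.include.list": "public.yellow_trips",
    },
}

request = urllib.request.Request(
    CONNECT_URL,
    data=json.dumps(connector).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as resp:
    print(resp.status, resp.read().decode())  # 201 Created on success
```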
- PostgreSQL: Initial and final data storage.
- Kafka & Kafka Connect: Core of the event streaming platform.
- Debezium: Change Data Capture integration (an example change event follows this list).
- Apache Flink: Stream processing and real-time analytics.
- Docker Compose: Orchestration of all system components.
- Python: Custom scripts for data ingestion and configuration.
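
For reference, a Debezium change event wraps each row change in a before/after envelope. The sketch below shows the rough payload shape for an UPDATE on a hypothetical `yellow_trips` table; column names and values are illustrative, and the full message also carries a `schema` section when the JSON converter runs with schemas enabled.

```python
# Rough shape of a Debezium change-event payload (illustrative values).
change_event_payload = {
    "before": {"ride_id": 42, "fare_amount": 12.5},  # row state before the change
    "after": {"ride_id": 42, "fare_amount": 14.0},   # row state after the change
    "source": {                                      # provenance metadata
        "connector": "postgresql",
        "db": "taxi",
        "table": "yellow_trips",
        "ts_ms": 1700000000000,                      # when the change happened in the DB
    },
    "op": "u",               # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1700000000123,  # when Debezium processed the event
}
```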
- Fork/clone this repository. The smallest GitHub Codespace is enough to handle this project.

- Create a copy of `sample.env` as `.env` in the root folder. Adjust the setup in the newly created file if needed.

  ```bash
  cp sample.env .env
  ```

- Download the Debezium connector files (Postgres source and JDBC sink):

  ```bash
  ARCHIVE=debezium-connector.tar
  OUTPUT_FOLDER=$(pwd)/jars
  mkdir -p "$OUTPUT_FOLDER"  # make sure the target folder exists

  URL=https://repo1.maven.org/maven2/io/debezium/debezium-connector-jdbc/3.0.2.Final/debezium-connector-jdbc-3.0.2.Final-plugin.tar.gz
  curl -o "$ARCHIVE" "$URL"
  tar -xzvf "$ARCHIVE" -C "$OUTPUT_FOLDER" && rm "$ARCHIVE"

  URL=https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/3.0.2.Final/debezium-connector-postgres-3.0.2.Final-plugin.tar.gz
  curl -o "$ARCHIVE" "$URL"
  tar -xzvf "$ARCHIVE" -C "$OUTPUT_FOLDER" && rm "$ARCHIVE"
  ```

- Download Flink's Upsert Kafka SQL connector. It ships as a plain JAR, so it only needs to be saved into the jars folder (no extraction required):

  ```bash
  URL=https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.3.0-1.20/flink-sql-connector-kafka-3.3.0-1.20.jar
  curl -o "$OUTPUT_FOLDER/flink-sql-connector-kafka-3.3.0-1.20.jar" "$URL"
  ```

- Download a sample dataset to stream and join. The application was prepared for taxi data, specifically yellow taxi trip records combined with taxi zones (one possible loading approach is sketched below).
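
  One way to fetch a month of data, as a hedged Python sketch: the URL follows the public NYC TLC trip-record naming pattern, but the exact month and target path are illustrative assumptions rather than this project's pinned dataset.

  ```python
  """Sketch: download one month of yellow taxi trip data for local streaming."""
  import os
  import urllib.request

  # Illustrative URL (NYC TLC naming pattern); pick the month you need.
  URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"
  DEST = "data/yellow_tripdata_2024-01.parquet"  # hypothetical target path

  os.makedirs(os.path.dirname(DEST), exist_ok=True)
  urllib.request.urlretrieve(URL, DEST)
  print(f"saved {DEST}")
  ```
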
- Build the Python worker in the root folder:

  ```bash
  docker build -t worker -f src/worker/Dockerfile .
  ```

- Run the Docker Compose file:

  ```bash
  cd /workspaces/local-data-streaming/docker
  docker-compose --env-file=../.env up -d
  ```

  The first run will take significantly longer due to image pulling. Subsequent runs take approximately 3 minutes to complete because of the Kafka Connect startup.

- Access services through port forwarding:

  | Port | Service         |
  | ---- | --------------- |
  | 80   | pgAdmin         |
  | 8080 | KafkaUI         |
  | 8081 | Flink Dashboard |
  | 8083 | Kafka Connect   |

- If any modification of the Flink sequence is required, the Flink client can be started with this snippet:

  ```bash
  cd /workspaces/local-data-streaming/docker
  docker compose run flink-client
  ```

- 📈 Showcase Dashboards: Examples of real-time insights and data visualizations.
- 🧮 Enhanced Analytics: Incorporating advanced analytical features for deeper insights.
- ⚙️ Managing Docker Compose for complex systems required custom health checks and container scheduling for reliability.
- 🕒 Working with streaming data highlighted the importance of watermarks and event-time handling for accurate results.
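
To illustrate the watermark point, here is a minimal PyFlink sketch of an event-time window over a CDC-fed Kafka topic. The topic name, columns, and interval sizes are illustrative assumptions; only the `WATERMARK` clause and windowing syntax follow Flink's documented SQL.

```python
"""Sketch: event-time windowing with a watermark in Flink SQL (via PyFlink)."""
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table over a (hypothetical) Kafka topic; the watermark tolerates
# events arriving up to 30 seconds late before windows close.
t_env.execute_sql("""
    CREATE TABLE trips (
        ride_id STRING,
        fare_amount DOUBLE,
        pickup_ts TIMESTAMP(3),
        WATERMARK FOR pickup_ts AS pickup_ts - INTERVAL '30' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'cdc.public.yellow_trips',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling windows keyed on event time, not arrival time.
t_env.execute_sql("""
    SELECT window_start, COUNT(*) AS rides, SUM(fare_amount) AS fares
    FROM TABLE(TUMBLE(TABLE trips, DESCRIPTOR(pickup_ts), INTERVAL '1' MINUTES))
    GROUP BY window_start, window_end
""").print()
```

With the watermark in place, each one-minute window closes only after event time has advanced past its end plus the 30-second allowance, so moderately late events still land in the correct window.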
