This project is a Change Data Capture (CDC) System designed to demonstrate real-time data processing and analytics capabilities. It leverages streaming technologies to monitor database changes, transform the data, and provide insights in real time.
The system highlights data engineering concepts such as event-driven architecture, fault tolerance, and latency management using watermarks, applied to a realistic end-to-end scenario.
- 📡 Real-Time Data Capture: Monitors changes in a PostgreSQL database and streams them as events.
- ⚙️ Event Processing with Flink: Processes and analyzes data streams, applying transformations and basic analytics.
- 🛡️ Fault Tolerance: Uses a three-broker Kafka setup to ensure high availability.
- 🔍 Custom Health Checks and Scheduling: Enhances Docker Compose orchestration reliability (see the sketch after this list).
- 📊 Real-Time Dashboards: Streams processed data back to the database for visualization.
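
To make the health-check idea concrete, a dependent container can be gated on Kafka Connect's REST API becoming reachable. The snippet below is a minimal sketch, assuming Connect listens on `kafka-connect:8083`; the host name and timeout are placeholders, not this project's actual configuration.

```python
"""Minimal readiness probe: succeed once Kafka Connect answers over REST."""
import sys
import urllib.request

CONNECT_URL = "http://kafka-connect:8083/connectors"  # hypothetical host/port


def connect_is_ready(url: str = CONNECT_URL, timeout: float = 5.0) -> bool:
    """Return True when the Kafka Connect REST API responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, HTTP error, timeout
        return False


if __name__ == "__main__":
    sys.exit(0 if connect_is_ready() else 1)
```

A Docker Compose `healthcheck` can invoke such a script (or an equivalent `curl` test) so that containers depending on Kafka Connect start only after it reports healthy.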
- Data Ingestion: On startup, a Python worker ingests data into a PostgreSQL database and configures Kafka Connect (see the registration sketch after this list).
- Change Data Capture: Debezium, integrated with Kafka Connect, converts database changes into Kafka events.
- Stream Processing: Apache Flink processes Kafka events, applying transformations and analytics.
- Output Pipelines: The processed data is sent back to Kafka topics and sunk into a database for dashboard consumption.
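
As a sketch of the worker's startup step, a Debezium Postgres source connector can be registered through Kafka Connect's REST API. The connector name, hosts, credentials, and table list below are illustrative placeholders rather than this project's real settings; only the `connector.class` and the general registration flow follow Debezium's documented usage.

```python
"""Sketch: register a Debezium Postgres source connector at worker startup."""
import json
import urllib.request

CONNECT_URL = "http://kafka-connect:8083/connectors"  # hypothetical endpoint

# Illustrative connector config; host names, credentials, and the table
# list are placeholders, not this project's real settings.
connector = {
    "name": "taxi-postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # Postgres logical decoding plugin
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "taxi",
        "topic.prefix": "cdc",                # topics become cdc.<schema>.<table>
        "table.include.list": "public.yellow_trips",
    },
}

request = urllib.request.Request(
    CONNECT_URL,
    data=json.dumps(connector).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as resp:
    print(resp.status, resp.read().decode())  # 201 Created on success
```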
- PostgreSQL: Initial and final data storage.
- Kafka & Kafka Connect: Core of the event streaming platform.
- Debezium: Change Data Capture integration (an example change event follows this list).
- Apache Flink: Stream processing and real-time analytics.
- Docker Compose: Orchestration of all system components.
- Python: Custom scripts for data ingestion and configuration.
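
For reference, a Debezium change event wraps each row change in a before/after envelope. The sketch below shows the rough payload shape for an UPDATE on a hypothetical `yellow_trips` table; column names and values are illustrative, and the full message also carries a `schema` section when the JSON converter runs with schemas enabled.

```python
# Rough shape of a Debezium change-event payload (illustrative values).
change_event_payload = {
    "before": {"ride_id": 42, "fare_amount": 12.5},  # row state before the change
    "after": {"ride_id": 42, "fare_amount": 14.0},   # row state after the change
    "source": {                                      # provenance metadata
        "connector": "postgresql",
        "db": "taxi",
        "table": "yellow_trips",
        "ts_ms": 1700000000000,                      # when the change happened in the DB
    },
    "op": "u",               # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1700000000123,  # when Debezium processed the event
}
```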
- Fork/clone this repository. The smallest GitHub Codespace is enough to handle this project.

- Create a copy of `sample.env` as `.env` in the root folder. Adjust the setup in the newly created file if needed.

  ```bash
  cp sample.env .env
  ```

- Download the Debezium connector files (Postgres source and JDBC sink):

  ```bash
  ARCHIVE=debezium-connector.tar
  OUTPUT_FOLDER=$(pwd)/jars
  mkdir -p "$OUTPUT_FOLDER"  # make sure the target folder exists

  URL=https://repo1.maven.org/maven2/io/debezium/debezium-connector-jdbc/3.0.2.Final/debezium-connector-jdbc-3.0.2.Final-plugin.tar.gz
  curl -o "$ARCHIVE" "$URL"
  tar -xzvf "$ARCHIVE" -C "$OUTPUT_FOLDER" && rm "$ARCHIVE"

  URL=https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/3.0.2.Final/debezium-connector-postgres-3.0.2.Final-plugin.tar.gz
  curl -o "$ARCHIVE" "$URL"
  tar -xzvf "$ARCHIVE" -C "$OUTPUT_FOLDER" && rm "$ARCHIVE"
  ```

- Download Flink's Upsert Kafka SQL connector. It ships as a plain JAR, so it only needs to be saved into the jars folder (no extraction required):

  ```bash
  URL=https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.3.0-1.20/flink-sql-connector-kafka-3.3.0-1.20.jar
  curl -o "$OUTPUT_FOLDER/flink-sql-connector-kafka-3.3.0-1.20.jar" "$URL"
  ```

- Download a sample dataset to stream and join. The application was prepared for taxi data, specifically yellow taxi trip records combined with taxi zones (one possible loading approach is sketched below).
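
  One way to fetch a month of data, as a hedged Python sketch: the URL follows the public NYC TLC trip-record naming pattern, but the exact month and target path are illustrative assumptions rather than this project's pinned dataset.

  ```python
  """Sketch: download one month of yellow taxi trip data for local streaming."""
  import os
  import urllib.request

  # Illustrative URL (NYC TLC naming pattern); pick the month you need.
  URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet"
  DEST = "data/yellow_tripdata_2024-01.parquet"  # hypothetical target path

  os.makedirs(os.path.dirname(DEST), exist_ok=True)
  urllib.request.urlretrieve(URL, DEST)
  print(f"saved {DEST}")
  ```
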
- Build the Python worker in the root folder:

  ```bash
  docker build -t worker -f src/worker/Dockerfile .
  ```

- Run the Docker Compose file:

  ```bash
  cd /workspaces/local-data-streaming/docker
  docker-compose --env-file=../.env up -d
  ```

  The first run will take significantly longer due to image pulling. Subsequent runs take approximately 3 minutes to complete because of the Kafka Connect startup.

- Access services through port forwarding:

  | Port | Service         |
  | ---- | --------------- |
  | 80   | pgAdmin         |
  | 8080 | KafkaUI         |
  | 8081 | Flink Dashboard |
  | 8083 | Kafka Connect   |

- If any modification of the Flink sequence is required, the Flink client can be started with this snippet:

  ```bash
  cd /workspaces/local-data-streaming/docker
  docker compose run flink-client
  ```

- 📈 Showcase Dashboards: Examples of real-time insights and data visualizations.
- 🧮 Enhanced Analytics: Incorporating advanced analytical features for deeper insights.
- ⚙️ Managing Docker Compose for complex systems required custom health checks and container scheduling for reliability.
- 🕒 Working with streaming data highlighted the importance of watermarks and event-time handling for accurate results.
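
To illustrate the watermark point, here is a minimal PyFlink sketch of an event-time window over a CDC-fed Kafka topic. The topic name, columns, and interval sizes are illustrative assumptions; only the `WATERMARK` clause and windowing syntax follow Flink's documented SQL.

```python
"""Sketch: event-time windowing with a watermark in Flink SQL (via PyFlink)."""
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table over a (hypothetical) Kafka topic; the watermark tolerates
# events arriving up to 30 seconds late before windows close.
t_env.execute_sql("""
    CREATE TABLE trips (
        ride_id STRING,
        fare_amount DOUBLE,
        pickup_ts TIMESTAMP(3),
        WATERMARK FOR pickup_ts AS pickup_ts - INTERVAL '30' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'cdc.public.yellow_trips',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling windows keyed on event time, not arrival time.
t_env.execute_sql("""
    SELECT window_start, COUNT(*) AS rides, SUM(fare_amount) AS fares
    FROM TABLE(TUMBLE(TABLE trips, DESCRIPTOR(pickup_ts), INTERVAL '1' MINUTES))
    GROUP BY window_start, window_end
""").print()
```

With the watermark in place, each one-minute window closes only after event time has advanced past its end plus the 30-second allowance, so moderately late events still land in the correct window.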
