RealTimeIncrementalSync is a real-time data streaming system for the e-commerce platform BuyOnline. It streams product updates from a MySQL database to Kafka and, on the consumer side, deserializes the data into separate JSON files. The project leverages Kafka, MySQL, Python, and Avro serialization for efficient, scalable, real-time data streaming.
The project is designed to track product updates in real time and stream the changes to downstream systems for analysis, monitoring, and reporting. The architecture consists of a Kafka producer that fetches incremental product data from a MySQL database, serializes it into Avro format, and streams it to a Kafka topic. Kafka consumers then deserialize the Avro data and append it to separate JSON files.
- Incremental Data Fetching: Fetches new and updated records from the MySQL database based on the last update timestamp (see the query sketch after this list).
- Avro Serialization: Product data is serialized using Avro to ensure efficient, compact, and schema-based data storage.
- Kafka Streaming: Utilizes Kafka to handle multi-partitioned topics for scalable and real-time data streaming.
- JSON File Storage: Consumers deserialize Avro data and append product updates to separate JSON files.
- Real-Time Product Monitoring: The system streams product updates as they happen, supporting faster analytics and business intelligence.
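As a rough illustration of the incremental fetch, the producer can persist the timestamp of the last successful run and query only rows changed since then. The sketch below uses the `product` table and `last_updated` column described in the setup section; the checkpoint file and connection details are illustrative assumptions, not necessarily how `producer.py` tracks state.

```python
import os

import mysql.connector
import pandas as pd

CHECKPOINT_FILE = "last_run.txt"  # hypothetical location for the last-seen timestamp


def load_last_timestamp():
    # Fall back to the epoch on the first run so every existing row is fetched once.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return f.read().strip()
    return "1970-01-01 00:00:00"


def fetch_incremental(conn):
    query = (
        "SELECT id, name, category, price, last_updated "
        "FROM product WHERE last_updated > %s ORDER BY last_updated"
    )
    df = pd.read_sql(query, conn, params=[load_last_timestamp()])
    if not df.empty:
        # Persist the newest timestamp so the next run only sees newer rows.
        with open(CHECKPOINT_FILE, "w") as f:
            f.write(str(df["last_updated"].max()))
    return df


conn = mysql.connector.connect(
    host="localhost", user="root", password="password", database="project"
)
print(fetch_incremental(conn))
```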
- Kafka: Real-time message broker for streaming product data.
- MySQL: Database storing product information such as ID, name, category, price, and last updated timestamp.
- Avro: Serialization format used for efficient data transmission between the producer and consumer (a sample schema is sketched after this list).
- Python: The primary programming language used for implementing the producer, consumer, and MySQL integration.
- Confluent Kafka: Kafka library used for producing and consuming messages.
- JSON: Format for storing product update records on the consumer side.
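For reference, an Avro schema covering the product columns above could be declared as a schema string like the one below; the actual field names and types used by the project may differ, so treat this as an assumption.

```python
# Hypothetical Avro schema for a product update record; producer.py and
# consumer.py may define the real schema differently.
PRODUCT_SCHEMA_STR = """
{
  "type": "record",
  "name": "ProductUpdate",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "category", "type": "string"},
    {"name": "price", "type": "double"},
    {"name": "last_updated", "type": "string"}
  ]
}
"""
```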
- Kafka Cluster: Set up a Kafka cluster (either locally or using Confluent Cloud).
- MySQL Database: Set up a MySQL database with a product table containing columns for id, name, category, price, and last_updated.
- Python: Ensure Python 3.x is installed.
- confluent-kafka library installed: `pip install confluent-kafka`
- pandas library installed: `pip install pandas`
- mysql-connector-python library installed: `pip install mysql-connector-python`
- Install Kafka locally or set up a Kafka cluster on Confluent Cloud.
- Create a topic in Kafka (e.g., `product_updates`).
- Update the topic name and Kafka server configurations in `producer.py` and `consumer.py`.
- Update MySQL credentials in `producer.py` for your database.
- Set up a MySQL database (e.g., `project`).
- Create a table (e.g., `product`) and populate it with sample data (a setup sketch follows this list).
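A minimal sketch of that database setup with mysql-connector-python is shown below; the database name (`project`), table definition, credentials, and sample rows are illustrative and should match whatever you configure in `producer.py`.

```python
import mysql.connector

# Placeholder credentials; replace with the values used in producer.py.
conn = mysql.connector.connect(host="localhost", user="root", password="password")
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS project")
cur.execute("USE project")
cur.execute(
    """
    CREATE TABLE IF NOT EXISTS product (
        id INT PRIMARY KEY,
        name VARCHAR(255),
        category VARCHAR(100),
        price DECIMAL(10, 2),
        last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
    )
    """
)

# A couple of sample rows so the first producer run has something to stream.
cur.executemany(
    "INSERT INTO product (id, name, category, price) VALUES (%s, %s, %s, %s)",
    [(1, "Laptop", "Electronics", 999.99), (2, "Desk Chair", "Furniture", 149.50)],
)
conn.commit()
```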
Execute the producer to start streaming data to the Kafka topic:
python producer.py
Start one or more consumers to subscribe to the topic (e.g., `product_updates`) and consume data:
python consumer.py
- The Kafka producer queries the MySQL database for new or updated product records.
- The product data is serialized using Avro and sent to a Kafka topic (`product_updates`).
- The producer tracks the `last_updated` timestamp so subsequent runs fetch only new or updated records (a producer sketch follows this list).
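Below is a condensed sketch of how such a producer could be wired up with confluent-kafka's Avro serializer. It assumes a Schema Registry endpoint (required by `AvroSerializer`) and reuses the hypothetical `PRODUCT_SCHEMA_STR` schema sketched earlier; the actual `producer.py` may serialize and batch records differently.

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

# Assumes a Schema Registry is running; PRODUCT_SCHEMA_STR is the hypothetical
# schema string sketched in the tech-stack section above.
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(schema_registry, PRODUCT_SCHEMA_STR)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": avro_serializer,
})

# In the real producer these rows would come from the incremental MySQL query.
rows = [
    {"id": 1, "name": "Laptop", "category": "Electronics",
     "price": 999.99, "last_updated": "2024-01-01 10:00:00"},
]

for record in rows:
    # Keying by product id keeps updates for the same product on one partition.
    producer.produce(topic="product_updates", key=str(record["id"]), value=record)

producer.flush()
```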
- The Kafka consumers subscribe to the `product_updates` Kafka topic.
- Upon receiving a message, the consumer deserializes the Avro data and processes the product information.
- Each update is appended to a JSON file, creating a log of product changes (a consumer sketch follows this list).
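A matching consumer sketch, again assuming a Schema Registry and the hypothetical `PRODUCT_SCHEMA_STR` schema from above, is shown below; the actual `consumer.py` may name and split its output JSON files differently.

```python
import json

from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import StringDeserializer

schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_deserializer = AvroDeserializer(schema_registry, PRODUCT_SCHEMA_STR)

consumer = DeserializingConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "product-updates-json",
    "auto.offset.reset": "earliest",
    "key.deserializer": StringDeserializer("utf_8"),
    "value.deserializer": avro_deserializer,
})
consumer.subscribe(["product_updates"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        product = msg.value()  # a plain dict after Avro deserialization
        # Append each update as one JSON line; the real consumer may instead
        # write a separate file per consumer or per product.
        with open("product_updates.json", "a") as f:
            f.write(json.dumps(product) + "\n")
finally:
    consumer.close()
```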