Pinterest Data Processing with Kafka and AWS RDS
A project that focuses on processing Pinterest data, using Kafka to stream various data types (posts, geolocation, and user data), and storing the data in AWS RDS for further analysis.
The project sets up a Kafka producer that streams each data type (posts, geolocation, and user data) to its own Kafka topic. The streamed records are then processed and written to an AWS RDS database, which provides persistent storage for further analysis.
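The producer side of this flow can be sketched as follows. This is a minimal illustration rather than the project's actual script: the `pinterest` topic prefix, the record-type keys, and the helper names are all assumptions, and JSON is assumed as the wire format. The `kafka-python` import is deferred so the pure helpers can be tried without a broker.

```python
import json
from datetime import datetime

# Hypothetical topic prefix; the suffixes mirror the three streams
# (posts, geolocation, user data) described above.
TOPIC_PREFIX = "pinterest"
TOPIC_SUFFIXES = {"post": "pin", "geolocation": "geo", "user": "user"}


def topic_for(record_type, prefix=TOPIC_PREFIX):
    """Map a record type to its Kafka topic name, e.g. 'post' -> 'pinterest.pin'."""
    return f"{prefix}.{TOPIC_SUFFIXES[record_type]}"


def serialize(record):
    """Encode a record as UTF-8 JSON bytes; datetimes become ISO 8601 strings."""
    def default(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(record, default=default).encode("utf-8")


def make_producer(bootstrap_servers):
    """Create a kafka-python producer that serializes values with serialize()."""
    # Imported lazily so the helpers above work without kafka-python installed.
    from kafka import KafkaProducer

    return KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=serialize,
    )


def send_record(producer, record_type, record):
    """Send one record to the topic that matches its type."""
    return producer.send(topic_for(record_type), value=record)
```

With a reachable broker, `send_record(make_producer("<bootstrap server string>"), "geolocation", row)` would publish `row` to `pinterest.geo`. Calling `producer.flush()` before exiting ensures buffered messages are delivered.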
- To understand the basics of stream processing with Kafka.
- To set up Kafka topics for different types of data (posts, geolocation, user).
- To connect to AWS RDS and perform SQL operations to retrieve and process data.
- To work with AWS MSK (Managed Streaming for Apache Kafka) for Kafka deployment.
- Setting up Kafka with AWS MSK for managed Kafka services.
- Working with Kafka topics to manage different types of data.
- Integrating Python with SQLAlchemy to interact with a MySQL database.
- Using AWS services to deploy and manage infrastructure for the project.
- Managing data streams with Kafka, simulating the processing of Pinterest data.
To run this project locally, follow these steps:
You must either have Kafka and Zookeeper running locally or use AWS MSK for a managed Kafka cluster.
- Follow this guide to install Kafka on your local machine.
- Alternatively, use the Zookeeper connection string from AWS MSK if using managed Kafka.
You will need Python and some libraries for interacting with the database and Kafka.
- Install Python (if not already installed): Download Python
- Install the required Python libraries:
pip install kafka-python sqlalchemy pymysql requests boto3
Ensure you have an AWS RDS MySQL instance running, with tables for pinterest_data, geolocation_data, and user_data.
- Update the connection details in the Python script as per your RDS credentials.
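As a rough sketch of what those connection details might look like in Python, the helpers below build a SQLAlchemy URL for the PyMySQL driver and read a single row from one of the three tables. The function names, the default port 3306, and the row-at-offset query are illustrative assumptions, not the project's actual code. The row conversion assumes SQLAlchemy 1.4 or later.

```python
from urllib.parse import quote_plus

# Table names taken from the setup step above. They act as an allow-list
# because table names cannot be passed as bound SQL parameters.
TABLES = ("pinterest_data", "geolocation_data", "user_data")


def build_rds_url(user, password, host, database, port=3306):
    """Build a SQLAlchemy URL for MySQL using the PyMySQL driver."""
    # quote_plus escapes characters such as '@' or '/' in the credentials.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def fetch_row(engine, table, offset):
    """Return the row at the given offset as a dict, or None if there is none."""
    if table not in TABLES:
        raise ValueError(f"Unknown table: {table}")
    # Imported lazily so build_rds_url works without SQLAlchemy installed.
    from sqlalchemy import text

    query = text(f"SELECT * FROM {table} LIMIT :offset, 1")
    with engine.connect() as connection:
        row = connection.execute(query, {"offset": offset}).first()
    return dict(row._mapping) if row is not None else None
```

An engine would then be created with `sqlalchemy.create_engine(build_rds_url(...))`, passing your own RDS endpoint and credentials. Keeping credentials out of the script itself (for example, in environment variables) avoids committing them to version control.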
If you're using AWS MSK, you'll need the Zookeeper connection string and Kafka bootstrap server string from the AWS MSK Console.
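To create Kafka topics for the project, use the following commands, replacing <zookeeper file text> with the Zookeeper connection details for your cluster and <topic_name> with your own topic prefix: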
- Pinterest Posts Data topic:
./kafka-topics.sh --create <zookeeper file text> --replication-factor 3 --partitions 1 --topic <topic_name>.pin
- Pinterest Posts geo topic:
./kafka-topics.sh --create <zookeeper file text> --replication-factor 3 --partitions 1 --topic <topic_name>.geo
- Pinterest Posts User topic:
./kafka-topics.sh --create <zookeeper file text> --replication-factor 3 --partitions 1 --topic <topic_name>.user
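As an alternative to the shell commands, the same three topics could be created from Python with the admin client in kafka-python. This is a sketch under assumptions: it connects through the Kafka bootstrap server string rather than Zookeeper, the helper names are hypothetical, and the partition and replication settings simply mirror the commands above.

```python
def topic_names(prefix):
    """Return the three topic names for a prefix: posts, geolocation, user."""
    return [f"{prefix}.{suffix}" for suffix in ("pin", "geo", "user")]


def create_topics(bootstrap_servers, prefix, partitions=1, replication_factor=3):
    """Create the project's three topics on the given cluster."""
    # Imported lazily so topic_names works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        new_topics = [
            NewTopic(
                name=name,
                num_partitions=partitions,
                replication_factor=replication_factor,
            )
            for name in topic_names(prefix)
        ]
        admin.create_topics(new_topics=new_topics)
    finally:
        # Always release the admin client's network connections.
        admin.close()
```

A replication factor of 3 requires at least three brokers, so a single-broker local setup would need `replication_factor=1`.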