🚀 Extract, Transform, and Load (ETL) real-time Reddit data, analyze sentiment, and store it in a star-schema data warehouse.
This project extracts trending Reddit posts from a subreddit, performs sentiment analysis, and stores structured data in a PostgreSQL data warehouse using a star schema.
✅ ETL Pipeline: Extracts, cleans, and loads real-time Reddit data.
✅ Sentiment Analysis: Uses TextBlob to classify each post as Positive, Neutral, or Negative.
✅ Data Warehouse: Stores posts in a structured star schema for optimized querying.
✅ SQL Optimization: Enables fast queries for insights into top trends and discussions.
🔹 Python – For data extraction, transformation, and loading
🔹 Reddit API (PRAW) – Fetches live Reddit data
🔹 PostgreSQL – Stores structured data
🔹 SQLAlchemy – Connects Python to PostgreSQL
🔹 TextBlob – Performs sentiment analysis (see the sketch below)
🔹 Pandas – Data transformation
🔹 Git – Version control
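
The sentiment step boils down to mapping a TextBlob polarity score onto the three categories. A minimal sketch — the ±0.05 thresholds are an illustrative assumption, not necessarily the project's exact cut-offs:

```python
# Bucket a post title into Positive / Neutral / Negative using TextBlob.
# The +/-0.05 thresholds are illustrative assumptions.
from textblob import TextBlob

def label_sentiment(text: str) -> str:
    polarity = TextBlob(text).sentiment.polarity  # -1.0 (negative) .. 1.0 (positive)
    if polarity > 0.05:
        return "Positive"
    if polarity < -0.05:
        return "Negative"
    return "Neutral"

print(label_sentiment("This new Python release is fantastic"))  # likely "Positive"
```
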
```
┌──────────────────────────┐
│    Reddit API (PRAW)     │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│    Python ETL Script     │
│  - Extract: Fetch posts  │
│  - Transform: Clean data │
│  - Sentiment Analysis    │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ PostgreSQL (Star Schema) │
│  - fact_posts            │
│  - dim_authors           │
│  - dim_subreddit         │
│  - dim_date              │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│  SQL Queries & Insights  │
│  - Top trending posts    │
│  - Sentiment breakdown   │
│  - Most active users     │
└──────────────────────────┘
```
```bash
git clone https://github.com/yourusername/reddit-trend-tracker.git
cd reddit-trend-tracker
```

Ensure you have Python 3.8+ installed, then install the required libraries:

```bash
pip install praw pandas textblob sqlalchemy psycopg2
```

Make sure to set up the PRAW configuration as described in PRAW's docs.
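
For example, a minimal client setup — the credential values are placeholders you create under Reddit's app preferences (https://www.reddit.com/prefs/apps), and the same keys can instead live in a praw.ini file:

```python
# Minimal PRAW client setup; all credential values are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-trend-tracker by u/your_username",
)
print(reddit.read_only)  # True -- read-only access is enough for this pipeline
```
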
- Open PostgreSQL and create a new database:

  ```sql
  CREATE DATABASE reddit_dw;
  ```

- Run the SQL schema script to set up the tables:

  ```bash
  psql -U your_user -d reddit_dw -f setup_schema.sql
  ```

**fact_posts**

| Column | Type | Description |
|---|---|---|
| post_id | VARCHAR(20) | Reddit post ID |
| author_id | INT | Foreign key to dim_authors |
| subreddit_id | INT | Foreign key to dim_subreddit |
| score | INT | Upvotes for the post |
| num_comments | INT | Number of comments on the post |
| sentiment | VARCHAR(20) | Positive, Negative, or Neutral |
| date_id | INT | Foreign key to dim_date |

**dim_authors**

| Column | Type | Description |
|---|---|---|
| id | SERIAL (PK) | Unique author ID |
| author_name | VARCHAR(255) | Reddit username |

**dim_subreddit**

| Column | Type | Description |
|---|---|---|
| id | SERIAL (PK) | Unique subreddit ID |
| subreddit_name | VARCHAR(255) | Subreddit name |

**dim_date**

| Column | Type | Description |
|---|---|---|
| date_id | SERIAL (PK) | Unique date ID |
| date_value | DATE | Actual date |
| year | INT | Year of post |
| month | INT | Month of post |
| day_of_week | INT | Day of the week |
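
setup_schema.sql itself is not reproduced here, but the tables above imply DDL roughly like the following — shown as a Python/SQLAlchemy snippet so it can double as a programmatic setup. The UNIQUE constraints, the post_id primary key, and the connection string are assumptions:

```python
# One plausible shape of setup_schema.sql, run via SQLAlchemy for illustration.
# UNIQUE constraints, the post_id primary key, and the connection string are assumptions.
from sqlalchemy import create_engine, text

DDL = """
CREATE TABLE IF NOT EXISTS dim_authors (
    id           SERIAL PRIMARY KEY,
    author_name  VARCHAR(255) UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS dim_subreddit (
    id              SERIAL PRIMARY KEY,
    subreddit_name  VARCHAR(255) UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS dim_date (
    date_id      SERIAL PRIMARY KEY,
    date_value   DATE UNIQUE NOT NULL,
    year         INT,
    month        INT,
    day_of_week  INT
);
CREATE TABLE IF NOT EXISTS fact_posts (
    post_id       VARCHAR(20) PRIMARY KEY,
    author_id     INT REFERENCES dim_authors (id),
    subreddit_id  INT REFERENCES dim_subreddit (id),
    score         INT,
    num_comments  INT,
    sentiment     VARCHAR(20),
    date_id       INT REFERENCES dim_date (date_id)
);
"""

engine = create_engine("postgresql+psycopg2://your_user:your_password@localhost:5432/reddit_dw")
with engine.begin() as conn:
    for statement in DDL.split(";"):
        if statement.strip():
            conn.execute(text(statement))
```
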
Run the Python script to fetch and process Reddit posts:

```bash
python extract_transform.py
```
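
As a rough illustration of what the extract/transform step does, here is a minimal sketch — the subreddit name, post limit, output filename, and sentiment thresholds are illustrative assumptions, not the script's actual settings:

```python
# Sketch of the extract/transform step: fetch hot posts with PRAW, score
# sentiment with TextBlob, and write a flat CSV for the load step.
# Subreddit, limit, thresholds, and the output filename are assumptions.
import pandas as pd
import praw
from textblob import TextBlob

reddit = praw.Reddit()  # credentials from praw.ini, or pass client_id/client_secret/user_agent

rows = []
for post in reddit.subreddit("dataengineering").hot(limit=100):
    polarity = TextBlob(post.title).sentiment.polarity
    sentiment = "Positive" if polarity > 0.05 else "Negative" if polarity < -0.05 else "Neutral"
    rows.append({
        "post_id": post.id,
        "author": post.author.name if post.author else "[deleted]",
        "subreddit": post.subreddit.display_name,
        "score": post.score,
        "num_comments": post.num_comments,
        "sentiment": sentiment,
        "created_utc": post.created_utc,
    })

pd.DataFrame(rows).to_csv("reddit_posts.csv", index=False)
```
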
After extracting data, load it into your database:

```bash
python load_to_postgres.py
```
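
Likewise, a rough sketch of the kind of logic the load step needs — look up (or insert) each dimension row, then insert the fact row. The CSV filename, connection string, helper function, and the ON CONFLICT clause (which assumes post_id is the primary key, as in the schema sketch above) are illustrative assumptions:

```python
# Sketch of the load step: upsert dimension rows, then insert fact rows.
# Filenames, the connection string, and the helper are illustrative assumptions.
from datetime import datetime, timezone

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://your_user:your_password@localhost:5432/reddit_dw")
df = pd.read_csv("reddit_posts.csv")  # output of the extract sketch above

def get_or_create(conn, select_sql, select_params, insert_sql, insert_params):
    """Return an existing dimension key, inserting the row first if it is missing."""
    key = conn.execute(text(select_sql), select_params).scalar()
    if key is None:
        key = conn.execute(text(insert_sql), insert_params).scalar()
    return key

with engine.begin() as conn:
    for row in df.itertuples(index=False):
        author_id = get_or_create(
            conn,
            "SELECT id FROM dim_authors WHERE author_name = :name", {"name": row.author},
            "INSERT INTO dim_authors (author_name) VALUES (:name) RETURNING id", {"name": row.author},
        )
        subreddit_id = get_or_create(
            conn,
            "SELECT id FROM dim_subreddit WHERE subreddit_name = :name", {"name": row.subreddit},
            "INSERT INTO dim_subreddit (subreddit_name) VALUES (:name) RETURNING id", {"name": row.subreddit},
        )
        day = datetime.fromtimestamp(row.created_utc, tz=timezone.utc).date()
        date_id = get_or_create(
            conn,
            "SELECT date_id FROM dim_date WHERE date_value = :d", {"d": day},
            "INSERT INTO dim_date (date_value, year, month, day_of_week) "
            "VALUES (:d, :y, :m, :dow) RETURNING date_id",
            {"d": day, "y": day.year, "m": day.month, "dow": day.weekday()},
        )
        conn.execute(
            text(
                "INSERT INTO fact_posts "
                "(post_id, author_id, subreddit_id, score, num_comments, sentiment, date_id) "
                "VALUES (:pid, :aid, :sid, :score, :nc, :sent, :did) "
                "ON CONFLICT (post_id) DO NOTHING"  # assumes post_id is the primary key
            ),
            {
                "pid": row.post_id, "aid": author_id, "sid": subreddit_id,
                "score": int(row.score), "nc": int(row.num_comments),
                "sent": row.sentiment, "did": date_id,
            },
        )
```
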

With data loaded, query the warehouse for insights.

**Top trending posts**

```sql
SELECT p.post_id, a.author_name, p.score
FROM fact_posts p
JOIN dim_authors a ON p.author_id = a.id
ORDER BY p.score DESC
LIMIT 5;
```

**Sentiment breakdown**

```sql
SELECT sentiment, COUNT(*) AS post_count
FROM fact_posts
GROUP BY sentiment
ORDER BY post_count DESC;
```

**Most active users**

```sql
SELECT a.author_name, COUNT(*) AS post_count
FROM fact_posts p
JOIN dim_authors a ON p.author_id = a.id
GROUP BY a.author_name
ORDER BY post_count DESC
LIMIT 10;
```

Possible future improvements:

- Enhance Sentiment Analysis: Use NLTK or a machine learning model instead of TextBlob.
- Streamline with Apache Airflow: Automate ETL pipeline scheduling (see the sketch after this list).
- Deploy on AWS: Store processed data in Amazon S3, run queries using AWS Athena.
- Create a Web Dashboard: Use Tableau or Power BI to visualize Reddit trends.
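
For the Airflow idea, a minimal sketch of how the two scripts could be chained into a scheduled DAG — assumes Airflow 2.4+; the dag_id, schedule, and script paths are placeholders:

```python
# Minimal Airflow sketch: run extract_transform.py, then load_to_postgres.py, hourly.
# dag_id, schedule, and the script paths are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="reddit_trend_tracker",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    extract_transform = BashOperator(
        task_id="extract_transform",
        bash_command="python /opt/reddit-trend-tracker/extract_transform.py",
    )
    load = BashOperator(
        task_id="load_to_postgres",
        bash_command="python /opt/reddit-trend-tracker/load_to_postgres.py",
    )
    extract_transform >> load
```
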
🌟 Ready to dive into the data? Clone, run, and explore the trends!
📌 GitHub Repository: 🔗 Link to Repo