
πŸ” CommentLens

Multi-Platform Social Media Comment Intelligence & AI-Powered Sentiment Analysis


🔒 Source code is private (client project at Varahe Analytics)
This repo serves as a showcase & technical documentation of the system architecture and capabilities.

Features · Architecture · Pipeline · Platforms · Tech Stack · Output · Author


📖 Overview

CommentLens is an enterprise-grade intelligence platform that extracts comments from social media posts across 4 major platforms — Facebook, Instagram, Twitter/X, and YouTube — and then performs AI-powered sentiment analysis using Google's Gemini model.

The system is designed for scale and reliability: it reads post URLs from a centralized Google Sheet, extracts comments in parallel using platform-specific actors, analyzes each comment's sentiment via LLM, generates thematic summaries, and writes all outputs back to both local CSV files and Google Sheets — with atomic writes and graceful error handling throughout.

🎯 Built for: Social media intelligence teams, political campaign analysts, brand monitoring, and engagement analysis at scale.


✨ Features

🌐 Multi-Platform Extraction

  • Facebook — Post & Reel comments via Apify
  • Instagram — Reel & Post comments via Apify
  • Twitter/X — Conversation replies via Apify
  • YouTube — Video comments via official Data API v3

🧠 AI-Powered Analysis

  • Sentiment Classification — Positive, Negative, Neutral, Mixed, Unclear
  • Confidence Scoring — 0.0 to 1.0 per comment
  • Thematic Summarization — AI-generated themes per sentiment group
  • Powered by Google Gemini (configurable model)
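The per-comment classification step could be sketched as below. This is an illustrative reconstruction, not the private implementation: the JSON reply shape and the `parse_sentiment_response` helper are assumptions; only the label set and the 0.0–1.0 confidence range come from this README.

```python
import json
import re

# Labels from the README: Positive, Negative, Neutral, Mixed, Unclear
VALID_SENTIMENTS = {"positive", "negative", "neutral", "mixed", "unclear"}

def parse_sentiment_response(raw: str) -> tuple[str, float]:
    """Parse a model reply assumed to contain JSON like
    {"sentiment": "positive", "confidence": 0.92}.
    Falls back to ("unclear", 0.0) on anything malformed."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate prose around the JSON
    if not match:
        return "unclear", 0.0
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return "unclear", 0.0
    sentiment = str(data.get("sentiment", "unclear")).lower()
    if sentiment not in VALID_SENTIMENTS:
        sentiment = "unclear"
    confidence = min(max(float(data.get("confidence", 0.0)), 0.0), 1.0)
    return sentiment, confidence
```

Defensive parsing like this matters with LLM output: replies occasionally wrap the JSON in prose or emit an off-schema label, and a hard failure on one comment should not abort a 50,000-comment run.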

⚡ Performance & Reliability

  • Parallel extraction with ThreadPoolExecutor
  • Configurable worker pools per platform
  • Atomic CSV writes — no partial files on crash
  • Retry logic with exponential backoff for API calls
  • Graceful error isolation — one platform failing doesn't crash others

📊 Smart I/O

  • Google Sheets as input — read post URLs & metadata
  • Google Sheets as output — push results back automatically
  • Local CSV backup — timestamped output directories
  • Top-N engagement filtering — only process the most engaging posts
  • Status tracking — per-URL success/failure/record counts
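The input side of this I/O could be sketched with gspread plus a simple engagement filter. Everything here is illustrative — the real sheet schema is private, and the `engagement` field and `top_n_by_engagement` helper are assumptions:

```python
# Hypothetical Top-N filter, assuming each sheet row carries a numeric
# "engagement" field (the real column names are not public).
def top_n_by_engagement(rows: list[dict], n: int) -> list[dict]:
    """Return the n rows with the highest engagement, descending."""
    return sorted(rows, key=lambda r: r.get("engagement", 0), reverse=True)[:n]

# Reading one input tab might then look like this with gspread
# (requires a service-account keyfile; names are illustrative):
#   import gspread
#   gc = gspread.service_account(filename="credentials.json")
#   ws = gc.open_by_key(sheet_id).worksheet("FB")
#   top_posts = top_n_by_engagement(ws.get_all_records(), n=50)
```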

πŸ— Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                   CommentLens — Pipeline Architecture                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌──────────────┐     ┌────────────────────────────────────────────┐   │
│   │              │     │        PARALLEL EXTRACTION ENGINE          │   │
│   │  Google      │     │                                            │   │
│   │  Sheets      │────▶│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────────┐   │   │
│   │  (Input)     │     │  │  FB  │ │  IG  │ │  YT  │ │  X / TW  │   │   │
│   │              │     │  │Apify │ │Apify │ │API v3│ │  Apify   │   │   │
│   └──────────────┘     │  └──┬───┘ └──┬───┘ └──┬───┘ └────┬─────┘   │   │
│                        │     │        │        │          │         │   │
│                        └─────┼────────┼────────┼──────────┼─────────┘   │
│                              │        │        │          │             │
│                              ▼        ▼        ▼          ▼             │
│                        ┌────────────────────────────────────────────┐   │
│                        │         UNIFIED COMMENT DATAFRAME          │   │
│                        │  [inputUrl, postCaption, profileName,      │   │
│                        │   comment_text]                            │   │
│                        └─────────────────────┬──────────────────────┘   │
│                                              │                          │
│                                              ▼                          │
│                        ┌────────────────────────────────────────────┐   │
│                        │         GEMINI AI ANALYSIS ENGINE          │   │
│                        │                                            │   │
│                        │  ┌──────────────┐  ┌───────────────────┐   │   │
│                        │  │  Sentiment   │  │    Thematic       │   │   │
│                        │  │  Per Comment │  │    Summaries      │   │   │
│                        │  │  (10 threads)│  │    Per Sentiment  │   │   │
│                        │  └──────────────┘  └───────────────────┘   │   │
│                        └─────────────────────┬──────────────────────┘   │
│                                              │                          │
│                               ┌──────────────┴───────────┐              │
│                               ▼                          ▼              │
│                    ┌──────────────────┐      ┌──────────────────┐       │
│                    │   Local CSVs     │      │  Google Sheets   │       │
│                    │  (Atomic Write)  │      │   (Auto-push)    │       │
│                    └──────────────────┘      └──────────────────┘       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

🔄 How the Pipeline Works

```mermaid
flowchart TD
    A["📋 Google Sheet\n(Post URLs + Metadata)"] -->|Parallel Read| B["📥 Load & Filter\nTop-N by Engagement"]

    B --> C{"Platform\nRouter"}

    C -->|Facebook| D["🔵 Apify Actor\nFB Comment Extraction"]
    C -->|Instagram| E["📸 Apify Actor\nIG Comment Extraction"]
    C -->|YouTube| F["🔴 YouTube Data API v3\nComment Threads"]
    C -->|Twitter/X| G["🐦 Apify Actor\nX Conversation Extraction"]

    D --> H["🔀 Merge into\nUnified DataFrame"]
    E --> H
    F --> H
    G --> H

    H --> I["🧠 Gemini AI\nSentiment Analysis\n(Multi-threaded)"]

    I --> J["📊 Sentiment Results\n+ Confidence Scores"]

    J --> K["📝 Thematic Summary\nGeneration per Sentiment"]

    J --> L["💾 Output"]
    K --> L

    L --> M["📁 CSV Files\n(Timestamped Dir)"]
    L --> N["📊 Google Sheets\n(Auto-pushed)"]

    style A fill:#4285F4,color:#fff
    style I fill:#EA4335,color:#fff
    style K fill:#FBBC04,color:#000
    style M fill:#34A853,color:#fff
    style N fill:#34A853,color:#fff
```

Step-by-Step Breakdown

| Step | Action | Details |
|------|--------|---------|
| 1 | 📥 Ingest | Read post URLs from Google Sheets (4 tabs: FB, IG, YT, X) in parallel |
| 2 | 🏆 Filter | Rank posts by engagement metrics, take top-N per platform |
| 3 | 🔀 Route | Send each platform's URLs to its dedicated extraction engine |
| 4 | 💬 Extract | Pull comments using Apify actors (FB/IG/X) or the YouTube API |
| 5 | 📊 Normalize | Unify all comments into a standard schema: `[inputUrl, postCaption, profileName, comment_text]` |
| 6 | 🧠 Analyze | Send each comment to Gemini AI for sentiment classification + confidence scoring |
| 7 | 📝 Summarize | Group by sentiment → generate thematic summaries using Gemini |
| 8 | 💾 Export | Atomic CSV writes + Google Sheets push with retry logic |
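The normalization step (5) could be sketched with pandas. The per-platform raw column names below are assumptions — only the unified schema comes from this README:

```python
import pandas as pd

# Unified schema from the README
UNIFIED_COLUMNS = ["inputUrl", "postCaption", "profileName", "comment_text"]

def normalize(frames: dict[str, pd.DataFrame],
              column_maps: dict[str, dict[str, str]]) -> pd.DataFrame:
    """Rename each platform's raw columns into the unified schema, then stack.
    `column_maps` maps {platform: {raw_column: unified_column}} (illustrative)."""
    parts = []
    for platform, df in frames.items():
        renamed = df.rename(columns=column_maps.get(platform, {}))
        # reindex drops extra columns and inserts missing ones as NaN
        parts.append(renamed.reindex(columns=UNIFIED_COLUMNS))
    return pd.concat(parts, ignore_index=True)
```

Reindexing to a fixed column list keeps the downstream analysis code platform-agnostic: whatever extra metadata an Apify actor returns, the Gemini stage only ever sees the four unified columns.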

🌍 Supported Platforms

🔵 Facebook

Apify Actor
Posts, Reels, Videos
Comment text + Profile names
Configurable limits

📸 Instagram

Apify Actor
Posts, Reels, Stories
Comment text + Full names
User metadata extraction

🔴 YouTube

Official Data API v3
Video comments
Relevance-sorted
Pagination support
Quota-based — no scraping rate-limit issues

🐦 Twitter / X

Apify Actor
Conversation replies
Tweet-type filtering
Author name extraction
Post ID-based lookup
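The YouTube path relies on pagination: the Data API returns comment threads in pages linked by `nextPageToken`. A generic page-follower could be sketched like this (the function name is illustrative; only the token-chaining convention is YouTube's):

```python
def fetch_all_pages(fetch_page, max_items=500):
    """Follow `nextPageToken` until exhausted or `max_items` is reached.
    `fetch_page(page_token)` must return a dict shaped like a YouTube
    commentThreads.list response: {"items": [...], "nextPageToken": "..."}."""
    items, token = [], None
    while len(items) < max_items:
        page = fetch_page(token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:  # last page has no token
            break
    return items[:max_items]
```

Capping at `max_items` is what lets the `max_comment: 500` setting bound API quota usage per video.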


βš™οΈ Configuration

The entire pipeline is driven by a single params.yaml file:

```yaml
# API Keys (redacted)
google_bigquery: "path/to/service-account.json"
apify_api: "apify_api_xxxxx"
youtube_api: "AIzaSyXXXXXX"
gemini_api: "your-gemini-key"
model: "gemini-flash-latest"

# Google Sheet with post URLs
sheet_id: "1uqeFBDrZocHUFwSOnHZzgOiD..."
fb_tab_name: FB
ig_tab_name: IG
x_tab_name: X
yt_tab_name: YT

# Per-platform settings
facebook:
  required: true          # Enable/disable platform
  top_post_taken: 50      # Top-N posts by engagement
  max_comment: 500        # Max comments per post
  workers: 20             # Concurrent extraction threads

instagram:
  required: true
  top_post_taken: 50
  max_comment: 500
  workers: 20

twitter:
  required: true
  top_post_taken: 50
  max_comment: 200
  workers: 20

youtube:
  required: true
  top_post_taken: 50
  max_comment: 500
  workers: 1              # YouTube API is sequential
```
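Loading and using this file takes only a few lines with PyYAML; the `run_facebook` call below is a hypothetical stand-in for the platform entry points in the private source:

```python
import yaml

def load_params(path="params.yaml"):
    """Read the YAML configuration that drives the whole pipeline."""
    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)

# Each platform block then gates and sizes its extractor, e.g.:
#   cfg = load_params()
#   if cfg["facebook"]["required"]:
#       run_facebook(workers=cfg["facebook"]["workers"],
#                    max_comments=cfg["facebook"]["max_comment"])
```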

🎛 Key Configuration Options

| Parameter | Description | Example |
|-----------|-------------|---------|
| `required` | Toggle platform on/off without code changes | `true` / `false` |
| `top_post_taken` | Number of top posts (by engagement) to process | `50` |
| `max_comment` | Maximum comments to extract per post | `500` |
| `workers` | Thread pool size for parallel extraction | `20` |
| `model` | Gemini model variant for sentiment analysis | `gemini-flash-latest` |

🗂 Project Structure

```
CommentLens/
│
├── main.py              # 🚀 Entry point — orchestrates the full pipeline
├── components.py        # 🧩 All extraction, I/O, and analysis functions
├── params.yaml          # ⚙️ Configuration (API keys, platform settings)
├── check.ipynb          # 🧪 Development notebook for testing individual extractors
│
├── download/            # 📁 Output directory (timestamped subdirectories)
│   └── MM_DD_HH_MM/     #    Auto-created per run
│       ├── fb_data.csv
│       ├── fb_status.csv
│       ├── fb_data_analysis.csv
│       ├── fb_data_summary.csv
│       ├── ig_data.csv
│       ├── ig_status.csv
│       ├── ...
│       └── yt_data_summary.csv
│
└── credentials.json     # 🔑 Google Service Account (not in repo)
```

📤 Output Format

Comment Data (*_data.csv)

| Column | Description |
|--------|-------------|
| `inputUrl` | Original post URL |
| `postCaption` | Post caption/text |
| `profileName` | Commenter's display name |
| `comment_text` | Raw comment text |

Sentiment Analysis (*_data_analysis.csv)

| Column | Description |
|--------|-------------|
| `inputUrl` | Original post URL |
| `postCaption` | Post caption/text |
| `profileName` | Commenter's display name |
| `comment_text` | Raw comment text |
| `sentiment_towards_bjp` | positive / negative / neutral / mixed / unclear |
| `sentiment_confidence` | Confidence score (0.0 – 1.0) |

Thematic Summary (*_data_summary.csv)

| Column | Description |
|--------|-------------|
| `sentiment` | Sentiment category |
| `count` | Number of comments |
| `percentage` | % of total comments |
| `thematic_summary` | AI-generated summary of themes in that sentiment group |

Status Report (*_status.csv)

| Column | Description |
|--------|-------------|
| `Url` | Post URL that was processed |
| `Status` | Done N / No comments / Error ... / Invalid Link |
| `Records` | Number of comments extracted |

🛡 Reliability Features

🔄 Atomic CSV Writes

All CSV outputs use a write-to-temp → atomic-rename pattern. If the process crashes mid-write, you'll never end up with a corrupt or partial CSV file.

```python
import os

# Write to a .tmp sibling first, then atomically rename into place
tmp = path.with_suffix(path.suffix + ".tmp")
df.to_csv(tmp, index=False)
os.replace(tmp, path)  # atomic on the same filesystem
```

🔁 Google Sheets Retry Logic

Sheet writes include automatic retry with configurable attempts and delay. If Sheets API fails after all retries, the CSV backup is already saved β€” no data loss.
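A retry wrapper with exponential backoff could look like the sketch below; the function name and parameters are illustrative, not the private implementation:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call `fn`, retrying on failure with exponential backoff
    (base_delay, 2*base_delay, 4*base_delay, ...). Re-raises after
    the final attempt so the caller can fall back to the CSV copy."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

A Sheets push would then be wrapped as, e.g., `with_retries(lambda: push_to_sheet(df), attempts=3)` — with `push_to_sheet` standing in for whatever gspread-based writer the pipeline uses.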

🧱 Platform Isolation

Each platform runs in its own thread via ThreadPoolExecutor. If Instagram's Apify actor fails, Facebook/YouTube/Twitter continue unaffected. Errors are captured and reported in the status CSV.
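This isolation pattern could be sketched as follows — a minimal version in which each platform's result or error is captured per key (`run_isolated` and the error-string format are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def run_isolated(extractors: dict):
    """Run one extractor per platform in its own thread; a failure in one
    is captured as an error string instead of aborting the others."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(extractors), 1)) as pool:
        futures = {name: pool.submit(fn) for name, fn in extractors.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = f"Error {exc}"  # surfaced in the status CSV
    return results
```

The key design point is that `future.result()` re-raises the worker's exception in the coordinating thread, where it can be caught per platform rather than crashing the run.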

📋 Per-URL Status Tracking

Every URL gets a status entry: Done 47, No comments, Invalid Link, or Error <message>. This makes it easy to audit which posts succeeded and which need re-processing.

βš™οΈ YAML-Driven Configuration

No code changes needed to enable/disable platforms, adjust comment limits, or switch AI models. Everything is controlled through params.yaml.


πŸ› οΈ Tech Stack

Layer Technology Purpose
Language Python 3.10+ Core runtime
Extraction Apify Client FB, IG, X comment extraction
YouTube Google YouTube Data API v3 YT comment extraction
AI/LLM Google Gemini (generativeai) Sentiment analysis + summarization
Data Pandas DataFrame processing & manipulation
Concurrency ThreadPoolExecutor Parallel extraction & analysis
I/O gspread + gspread_dataframe Google Sheets read/write
Auth oauth2client Google Service Account auth
Config PyYAML YAML-based configuration
Progress tqdm Progress bars for analysis

📈 Scale & Performance

| Metric | Typical Value |
|--------|---------------|
| Platforms processed simultaneously | Up to 4 |
| Comments per run | 1,000 – 50,000+ |
| Sentiment analysis throughput | ~10 comments/sec (10 threads) |
| End-to-end pipeline time | 5–30 minutes (depending on volume) |
| Output destinations | CSV + Google Sheets (simultaneous) |

πŸ” Security & Privacy

  • πŸ”‘ All API keys stored in params.yaml (excluded from version control)
  • πŸ”’ Google Service Account credentials via JSON keyfile
  • 🚫 No hardcoded secrets in source code
  • πŸ“ Credentials file never committed to repository

πŸ‘¨β€πŸ’» Author

Shanskar Bansal
Senior Consultant (Data Scientist) at Varahe Analytics

GitHub Portfolio LinkedIn


Built with ❤️ at Varahe Analytics Pvt. Ltd.

⭐ If you found this interesting, give it a star!
