
πŸ” CommentLens

Multi-Platform Social Media Comment Intelligence & AI-Powered Sentiment Analysis


🔒 Source code is private (client project at Varahe Analytics)
This repo serves as a showcase & technical documentation of the system architecture and capabilities.

Features · Architecture · Pipeline · Platforms · Tech Stack · Output · Author


📖 Overview

CommentLens is an enterprise-grade intelligence platform that extracts comments from social media posts across 4 major platforms — Facebook, Instagram, Twitter/X, and YouTube — and then performs AI-powered sentiment analysis using Google's Gemini model.

The system is designed for scale and reliability: it reads post URLs from a centralized Google Sheet, extracts comments in parallel using platform-specific actors, analyzes each comment's sentiment via LLM, generates thematic summaries, and writes all outputs back to both local CSV files and Google Sheets — with atomic writes and graceful error handling throughout.

🎯 Built for: Social media intelligence teams, political campaign analysts, brand monitoring, and engagement analysis at scale.


✨ Features

🌐 Multi-Platform Extraction

  • Facebook — Post & Reel comments via Apify
  • Instagram — Reel & Post comments via Apify
  • Twitter/X — Conversation replies via Apify
  • YouTube — Video comments via official Data API v3

🧠 AI-Powered Analysis

  • Sentiment Classification — Positive, Negative, Neutral, Mixed, Unclear
  • Confidence Scoring — 0.0 to 1.0 per comment
  • Thematic Summarization — AI-generated themes per sentiment group
  • Powered by Google Gemini (configurable model)
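The per-comment classification step could be sketched as below. This is an illustrative reconstruction, not the private implementation: the JSON reply shape and the `parse_sentiment_response` helper are assumptions; only the label set and the 0.0–1.0 confidence range come from this README.

```python
import json
import re

# Labels from the README: Positive, Negative, Neutral, Mixed, Unclear
VALID_SENTIMENTS = {"positive", "negative", "neutral", "mixed", "unclear"}

def parse_sentiment_response(raw: str) -> tuple[str, float]:
    """Parse a model reply assumed to contain JSON like
    {"sentiment": "positive", "confidence": 0.92}.
    Falls back to ("unclear", 0.0) on anything malformed."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate prose around the JSON
    if not match:
        return "unclear", 0.0
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return "unclear", 0.0
    sentiment = str(data.get("sentiment", "unclear")).lower()
    if sentiment not in VALID_SENTIMENTS:
        sentiment = "unclear"
    confidence = min(max(float(data.get("confidence", 0.0)), 0.0), 1.0)
    return sentiment, confidence
```

Defensive parsing like this matters with LLM output: replies occasionally wrap the JSON in prose or emit an off-schema label, and a hard failure on one comment should not abort a 50,000-comment run.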

⚡ Performance & Reliability

  • Parallel extraction with ThreadPoolExecutor
  • Configurable worker pools per platform
  • Atomic CSV writes — no partial files on crash
  • Retry logic with exponential backoff for API calls
  • Graceful error isolation — one platform failing doesn't crash others

📊 Smart I/O

  • Google Sheets as input — read post URLs & metadata
  • Google Sheets as output — push results back automatically
  • Local CSV backup — timestamped output directories
  • Top-N engagement filtering — only process the most engaging posts
  • Status tracking — per-URL success/failure/record counts
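The input side of this I/O could be sketched with gspread plus a simple engagement filter. Everything here is illustrative — the real sheet schema is private, and the `engagement` field and `top_n_by_engagement` helper are assumptions:

```python
# Hypothetical Top-N filter, assuming each sheet row carries a numeric
# "engagement" field (the real column names are not public).
def top_n_by_engagement(rows: list[dict], n: int) -> list[dict]:
    """Return the n rows with the highest engagement, descending."""
    return sorted(rows, key=lambda r: r.get("engagement", 0), reverse=True)[:n]

# Reading one input tab might then look like this with gspread
# (requires a service-account keyfile; names are illustrative):
#   import gspread
#   gc = gspread.service_account(filename="credentials.json")
#   ws = gc.open_by_key(sheet_id).worksheet("FB")
#   top_posts = top_n_by_engagement(ws.get_all_records(), n=50)
```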

πŸ— Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                   CommentLens — Pipeline Architecture                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌──────────────┐     ┌────────────────────────────────────────────┐   │
│   │              │     │        PARALLEL EXTRACTION ENGINE          │   │
│   │  Google      │     │                                            │   │
│   │  Sheets      │────▶│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────────┐   │   │
│   │  (Input)     │     │  │  FB  │ │  IG  │ │  YT  │ │  X / TW  │   │   │
│   │              │     │  │Apify │ │Apify │ │API v3│ │  Apify   │   │   │
│   └──────────────┘     │  └──┬───┘ └──┬───┘ └──┬───┘ └────┬─────┘   │   │
│                        │     │        │        │          │         │   │
│                        └─────┼────────┼────────┼──────────┼─────────┘   │
│                              │        │        │          │             │
│                              ▼        ▼        ▼          ▼             │
│                        ┌────────────────────────────────────────────┐   │
│                        │         UNIFIED COMMENT DATAFRAME          │   │
│                        │  [inputUrl, postCaption, profileName,      │   │
│                        │   comment_text]                            │   │
│                        └─────────────────────┬──────────────────────┘   │
│                                              │                          │
│                                              ▼                          │
│                        ┌────────────────────────────────────────────┐   │
│                        │         GEMINI AI ANALYSIS ENGINE          │   │
│                        │                                            │   │
│                        │  ┌──────────────┐  ┌───────────────────┐   │   │
│                        │  │  Sentiment   │  │    Thematic       │   │   │
│                        │  │  Per Comment │  │    Summaries      │   │   │
│                        │  │  (10 threads)│  │    Per Sentiment  │   │   │
│                        │  └──────────────┘  └───────────────────┘   │   │
│                        └─────────────────────┬──────────────────────┘   │
│                                              │                          │
│                               ┌──────────────┴───────────┐              │
│                               ▼                          ▼              │
│                    ┌──────────────────┐      ┌──────────────────┐       │
│                    │   Local CSVs     │      │  Google Sheets   │       │
│                    │  (Atomic Write)  │      │   (Auto-push)    │       │
│                    └──────────────────┘      └──────────────────┘       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

🔄 How the Pipeline Works

```mermaid
flowchart TD
    A["📋 Google Sheet\n(Post URLs + Metadata)"] -->|Parallel Read| B["📥 Load & Filter\nTop-N by Engagement"]

    B --> C{"Platform\nRouter"}

    C -->|Facebook| D["🔵 Apify Actor\nFB Comment Extraction"]
    C -->|Instagram| E["📸 Apify Actor\nIG Comment Extraction"]
    C -->|YouTube| F["🔴 YouTube Data API v3\nComment Threads"]
    C -->|Twitter/X| G["🐦 Apify Actor\nX Conversation Extraction"]

    D --> H["🔀 Merge into\nUnified DataFrame"]
    E --> H
    F --> H
    G --> H

    H --> I["🧠 Gemini AI\nSentiment Analysis\n(Multi-threaded)"]

    I --> J["📊 Sentiment Results\n+ Confidence Scores"]

    J --> K["📝 Thematic Summary\nGeneration per Sentiment"]

    J --> L["💾 Output"]
    K --> L

    L --> M["📁 CSV Files\n(Timestamped Dir)"]
    L --> N["📊 Google Sheets\n(Auto-pushed)"]

    style A fill:#4285F4,color:#fff
    style I fill:#EA4335,color:#fff
    style K fill:#FBBC04,color:#000
    style M fill:#34A853,color:#fff
    style N fill:#34A853,color:#fff
```

Step-by-Step Breakdown

| Step | Action | Details |
|------|--------|---------|
| 1 | 📥 Ingest | Read post URLs from Google Sheets (4 tabs: FB, IG, YT, X) in parallel |
| 2 | 🏆 Filter | Rank posts by engagement metrics, take top-N per platform |
| 3 | 🔀 Route | Send each platform's URLs to its dedicated extraction engine |
| 4 | 💬 Extract | Pull comments using Apify actors (FB/IG/X) or the YouTube API |
| 5 | 📊 Normalize | Unify all comments into a standard schema: `[inputUrl, postCaption, profileName, comment_text]` |
| 6 | 🧠 Analyze | Send each comment to Gemini AI for sentiment classification + confidence scoring |
| 7 | 📝 Summarize | Group by sentiment → generate thematic summaries using Gemini |
| 8 | 💾 Export | Atomic CSV writes + Google Sheets push with retry logic |
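The normalization step (5) could be sketched with pandas. The per-platform raw column names below are assumptions — only the unified schema comes from this README:

```python
import pandas as pd

# Unified schema from the README
UNIFIED_COLUMNS = ["inputUrl", "postCaption", "profileName", "comment_text"]

def normalize(frames: dict[str, pd.DataFrame],
              column_maps: dict[str, dict[str, str]]) -> pd.DataFrame:
    """Rename each platform's raw columns into the unified schema, then stack.
    `column_maps` maps {platform: {raw_column: unified_column}} (illustrative)."""
    parts = []
    for platform, df in frames.items():
        renamed = df.rename(columns=column_maps.get(platform, {}))
        # reindex drops extra columns and inserts missing ones as NaN
        parts.append(renamed.reindex(columns=UNIFIED_COLUMNS))
    return pd.concat(parts, ignore_index=True)
```

Reindexing to a fixed column list keeps the downstream analysis code platform-agnostic: whatever extra metadata an Apify actor returns, the Gemini stage only ever sees the four unified columns.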

🌍 Supported Platforms

🔵 Facebook

Apify Actor
Posts, Reels, Videos
Comment text + Profile names
Configurable limits

📸 Instagram

Apify Actor
Posts, Reels, Stories
Comment text + Full names
User metadata extraction

🔴 YouTube

Official Data API v3
Video comments
Relevance-sorted
Pagination support
Quota-based — no scraping rate-limit issues

🐦 Twitter / X

Apify Actor
Conversation replies
Tweet-type filtering
Author name extraction
Post ID-based lookup
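The YouTube path relies on pagination: the Data API returns comment threads in pages linked by `nextPageToken`. A generic page-follower could be sketched like this (the function name is illustrative; only the token-chaining convention is YouTube's):

```python
def fetch_all_pages(fetch_page, max_items=500):
    """Follow `nextPageToken` until exhausted or `max_items` is reached.
    `fetch_page(page_token)` must return a dict shaped like a YouTube
    commentThreads.list response: {"items": [...], "nextPageToken": "..."}."""
    items, token = [], None
    while len(items) < max_items:
        page = fetch_page(token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:  # last page has no token
            break
    return items[:max_items]
```

Capping at `max_items` is what lets the `max_comment: 500` setting bound API quota usage per video.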


βš™οΈ Configuration

The entire pipeline is driven by a single params.yaml file:

```yaml
# API Keys (redacted)
google_bigquery: "path/to/service-account.json"
apify_api: "apify_api_xxxxx"
youtube_api: "AIzaSyXXXXXX"
gemini_api: "your-gemini-key"
model: "gemini-flash-latest"

# Google Sheet with post URLs
sheet_id: "1uqeFBDrZocHUFwSOnHZzgOiD..."
fb_tab_name: FB
ig_tab_name: IG
x_tab_name: X
yt_tab_name: YT

# Per-platform settings
facebook:
  required: true          # Enable/disable platform
  top_post_taken: 50      # Top-N posts by engagement
  max_comment: 500        # Max comments per post
  workers: 20             # Concurrent extraction threads

instagram:
  required: true
  top_post_taken: 50
  max_comment: 500
  workers: 20

twitter:
  required: true
  top_post_taken: 50
  max_comment: 200
  workers: 20

youtube:
  required: true
  top_post_taken: 50
  max_comment: 500
  workers: 1              # YouTube API is sequential
```
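Loading and using this file takes only a few lines with PyYAML; the `run_facebook` call below is a hypothetical stand-in for the platform entry points in the private source:

```python
import yaml

def load_params(path="params.yaml"):
    """Read the YAML configuration that drives the whole pipeline."""
    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)

# Each platform block then gates and sizes its extractor, e.g.:
#   cfg = load_params()
#   if cfg["facebook"]["required"]:
#       run_facebook(workers=cfg["facebook"]["workers"],
#                    max_comments=cfg["facebook"]["max_comment"])
```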

🎛 Key Configuration Options

| Parameter | Description | Example |
|-----------|-------------|---------|
| `required` | Toggle platform on/off without code changes | `true` / `false` |
| `top_post_taken` | Number of top posts (by engagement) to process | `50` |
| `max_comment` | Maximum comments to extract per post | `500` |
| `workers` | Thread pool size for parallel extraction | `20` |
| `model` | Gemini model variant for sentiment analysis | `gemini-flash-latest` |

🗂 Project Structure

```
CommentLens/
│
├── main.py              # 🚀 Entry point — orchestrates the full pipeline
├── components.py        # 🧩 All extraction, I/O, and analysis functions
├── params.yaml          # ⚙️ Configuration (API keys, platform settings)
├── check.ipynb          # 🧪 Development notebook for testing individual extractors
│
├── download/            # 📁 Output directory (timestamped subdirectories)
│   └── MM_DD_HH_MM/     #    Auto-created per run
│       ├── fb_data.csv
│       ├── fb_status.csv
│       ├── fb_data_analysis.csv
│       ├── fb_data_summary.csv
│       ├── ig_data.csv
│       ├── ig_status.csv
│       ├── ...
│       └── yt_data_summary.csv
│
└── credentials.json     # 🔑 Google Service Account (not in repo)
```

📤 Output Format

Comment Data (*_data.csv)

| Column | Description |
|--------|-------------|
| `inputUrl` | Original post URL |
| `postCaption` | Post caption/text |
| `profileName` | Commenter's display name |
| `comment_text` | Raw comment text |

Sentiment Analysis (*_data_analysis.csv)

| Column | Description |
|--------|-------------|
| `inputUrl` | Original post URL |
| `postCaption` | Post caption/text |
| `profileName` | Commenter's display name |
| `comment_text` | Raw comment text |
| `sentiment_towards_bjp` | positive / negative / neutral / mixed / unclear |
| `sentiment_confidence` | Confidence score (0.0 – 1.0) |

Thematic Summary (*_data_summary.csv)

| Column | Description |
|--------|-------------|
| `sentiment` | Sentiment category |
| `count` | Number of comments |
| `percentage` | % of total comments |
| `thematic_summary` | AI-generated summary of themes in that sentiment group |

Status Report (*_status.csv)

| Column | Description |
|--------|-------------|
| `Url` | Post URL that was processed |
| `Status` | Done N / No comments / Error ... / Invalid Link |
| `Records` | Number of comments extracted |

🛡 Reliability Features

🔄 Atomic CSV Writes

All CSV outputs use a write-to-temp → atomic-rename pattern. If the process crashes mid-write, you'll never end up with a corrupt or partial CSV file.

```python
import os

# Write to a .tmp sibling first, then atomically rename into place
tmp = path.with_suffix(path.suffix + ".tmp")
df.to_csv(tmp, index=False)
os.replace(tmp, path)  # atomic on the same filesystem
```

🔁 Google Sheets Retry Logic

Sheet writes include automatic retry with configurable attempts and delay. If Sheets API fails after all retries, the CSV backup is already saved β€” no data loss.
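A retry wrapper with exponential backoff could look like the sketch below; the function name and parameters are illustrative, not the private implementation:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call `fn`, retrying on failure with exponential backoff
    (base_delay, 2*base_delay, 4*base_delay, ...). Re-raises after
    the final attempt so the caller can fall back to the CSV copy."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

A Sheets push would then be wrapped as, e.g., `with_retries(lambda: push_to_sheet(df), attempts=3)` — with `push_to_sheet` standing in for whatever gspread-based writer the pipeline uses.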

🧱 Platform Isolation

Each platform runs in its own thread via ThreadPoolExecutor. If Instagram's Apify actor fails, Facebook/YouTube/Twitter continue unaffected. Errors are captured and reported in the status CSV.
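This isolation pattern could be sketched as follows — a minimal version in which each platform's result or error is captured per key (`run_isolated` and the error-string format are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def run_isolated(extractors: dict):
    """Run one extractor per platform in its own thread; a failure in one
    is captured as an error string instead of aborting the others."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(extractors), 1)) as pool:
        futures = {name: pool.submit(fn) for name, fn in extractors.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = f"Error {exc}"  # surfaced in the status CSV
    return results
```

The key design point is that `future.result()` re-raises the worker's exception in the coordinating thread, where it can be caught per platform rather than crashing the run.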

📋 Per-URL Status Tracking

Every URL gets a status entry: Done 47, No comments, Invalid Link, or Error <message>. This makes it easy to audit which posts succeeded and which need re-processing.

βš™οΈ YAML-Driven Configuration

No code changes needed to enable/disable platforms, adjust comment limits, or switch AI models. Everything is controlled through params.yaml.


πŸ› οΈ Tech Stack

Layer Technology Purpose
Language Python 3.10+ Core runtime
Extraction Apify Client FB, IG, X comment extraction
YouTube Google YouTube Data API v3 YT comment extraction
AI/LLM Google Gemini (generativeai) Sentiment analysis + summarization
Data Pandas DataFrame processing & manipulation
Concurrency ThreadPoolExecutor Parallel extraction & analysis
I/O gspread + gspread_dataframe Google Sheets read/write
Auth oauth2client Google Service Account auth
Config PyYAML YAML-based configuration
Progress tqdm Progress bars for analysis

📈 Scale & Performance

| Metric | Typical Value |
|--------|---------------|
| Platforms processed simultaneously | Up to 4 |
| Comments per run | 1,000 – 50,000+ |
| Sentiment analysis throughput | ~10 comments/sec (10 threads) |
| End-to-end pipeline time | 5–30 minutes (depending on volume) |
| Output destinations | CSV + Google Sheets (simultaneous) |

πŸ” Security & Privacy

  • πŸ”‘ All API keys stored in params.yaml (excluded from version control)
  • πŸ”’ Google Service Account credentials via JSON keyfile
  • 🚫 No hardcoded secrets in source code
  • πŸ“ Credentials file never committed to repository

πŸ‘¨β€πŸ’» Author

Shanskar Bansal
Senior Consultant (Data Scientist) at Varahe Analytics

GitHub Portfolio LinkedIn


Built with ❤️ at Varahe Analytics Pvt. Ltd.

⭐ If you found this interesting, give it a star!
