Source code is private (client project at Varahe Analytics).
This repo serves as a showcase and technical documentation of the system's architecture and capabilities.
Features · Architecture · Pipeline · Platforms · Tech Stack · Output · Author
CommentLens is an enterprise-grade intelligence platform that extracts comments from social media posts across four major platforms (Facebook, Instagram, Twitter/X, and YouTube) and then performs AI-powered sentiment analysis using Google's Gemini model.
The system is designed for scale and reliability: it reads post URLs from a centralized Google Sheet, extracts comments in parallel using platform-specific actors, analyzes each comment's sentiment via LLM, generates thematic summaries, and writes all outputs back to both local CSV files and Google Sheets, with atomic writes and graceful error handling throughout.
Built for: social media intelligence teams, political campaign analysts, brand monitoring, and engagement analysis at scale.
```
┌────────────────────────────────────────────────────────────────────────┐
│                  CommentLens – Pipeline Architecture                   │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌──────────────┐     ┌─────────────────────────────────────────────┐  │
│  │    Google    │     │          PARALLEL EXTRACTION ENGINE         │  │
│  │    Sheets    │────►│                                             │  │
│  │   (Input)    │     │  ┌──────┐  ┌──────┐  ┌──────┐  ┌─────────┐  │  │
│  └──────────────┘     │  │  FB  │  │  IG  │  │  YT  │  │ X / TW  │  │  │
│                       │  │Apify │  │Apify │  │API v3│  │  Apify  │  │  │
│                       │  └──┬───┘  └──┬───┘  └──┬───┘  └───┬─────┘  │  │
│                       └─────┼─────────┼─────────┼──────────┼────────┘  │
│                             ▼         ▼         ▼          ▼           │
│                       ┌─────────────────────────────────────────────┐  │
│                       │          UNIFIED COMMENT DATAFRAME          │  │
│                       │  [inputUrl, postCaption, profileName,       │  │
│                       │   comment_text]                             │  │
│                       └──────────────────────┬──────────────────────┘  │
│                                              ▼                         │
│                       ┌─────────────────────────────────────────────┐  │
│                       │          GEMINI AI ANALYSIS ENGINE          │  │
│                       │                                             │  │
│                       │  ┌──────────────┐   ┌───────────────────┐   │  │
│                       │  │  Sentiment   │   │     Thematic      │   │  │
│                       │  │ Per Comment  │   │     Summaries     │   │  │
│                       │  │ (10 threads) │   │   Per Sentiment   │   │  │
│                       │  └──────────────┘   └───────────────────┘   │  │
│                       └──────────────────────┬──────────────────────┘  │
│                             ┌────────────────┴─────────────┐           │
│                             ▼                              ▼           │
│                   ┌──────────────────┐           ┌──────────────────┐  │
│                   │    Local CSVs    │           │  Google Sheets   │  │
│                   │  (Atomic Write)  │           │   (Auto-push)    │  │
│                   └──────────────────┘           └──────────────────┘  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
```
```mermaid
flowchart TD
    A["Google Sheet\n(Post URLs + Metadata)"] -->|Parallel Read| B["Load & Filter\nTop-N by Engagement"]
    B --> C{"Platform\nRouter"}
    C -->|Facebook| D["Apify Actor\nFB Comment Extraction"]
    C -->|Instagram| E["Apify Actor\nIG Comment Extraction"]
    C -->|YouTube| F["YouTube Data API v3\nComment Threads"]
    C -->|Twitter/X| G["Apify Actor\nX Conversation Extraction"]
    D --> H["Merge into\nUnified DataFrame"]
    E --> H
    F --> H
    G --> H
    H --> I["Gemini AI\nSentiment Analysis\n(Multi-threaded)"]
    I --> J["Sentiment Results\n+ Confidence Scores"]
    J --> K["Thematic Summary\nGeneration per Sentiment"]
    J --> L["Output"]
    K --> L
    L --> M["CSV Files\n(Timestamped Dir)"]
    L --> N["Google Sheets\n(Auto-pushed)"]
    style A fill:#4285F4,color:#fff
    style I fill:#EA4335,color:#fff
    style K fill:#FBBC04,color:#000
    style M fill:#34A853,color:#fff
    style N fill:#34A853,color:#fff
```
| Step | Action | Details |
|---|---|---|
| 1 | Ingest | Read post URLs from Google Sheets (4 tabs: FB, IG, YT, X) in parallel |
| 2 | Filter | Rank posts by engagement metrics and take the top N per platform |
| 3 | Route | Send each platform's URLs to its dedicated extraction engine |
| 4 | Extract | Pull comments using Apify actors (FB/IG/X) or the YouTube API |
| 5 | Normalize | Unify all comments into a standard schema: `[inputUrl, postCaption, profileName, comment_text]` |
| 6 | Analyze | Send each comment to Gemini for sentiment classification and confidence scoring |
| 7 | Summarize | Group by sentiment and generate thematic summaries using Gemini |
| 8 | Export | Atomic CSV writes plus a Google Sheets push with retry logic |
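The filtering step (step 2) can be sketched as follows. This is a minimal illustration, not the project's actual code; the `likes`/`comments`/`shares` fields and the simple additive engagement score are assumptions.

```python
def top_n_by_engagement(posts, n=50):
    """Rank posts by a simple engagement score and keep the top N.

    `posts` is a list of dicts with hypothetical `likes`, `comments`,
    and `shares` fields; the real pipeline reads these from Google Sheets.
    """
    def engagement(post):
        return post.get("likes", 0) + post.get("comments", 0) + post.get("shares", 0)

    return sorted(posts, key=engagement, reverse=True)[:n]


posts = [
    {"url": "https://example.com/p1", "likes": 10, "comments": 2, "shares": 1},
    {"url": "https://example.com/p2", "likes": 500, "comments": 40, "shares": 9},
    {"url": "https://example.com/p3", "likes": 120, "comments": 15, "shares": 3},
]
top = top_n_by_engagement(posts, n=2)
```

The same idea scales to a pandas `DataFrame` with `nlargest` once the sheet has been loaded.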
| Platform | Extraction Method |
|---|---|
| Facebook | Apify Actor |
| Instagram | Apify Actor |
| YouTube | Official API v3 |
| Twitter/X | Apify Actor |
The entire pipeline is driven by a single `params.yaml` file:
```yaml
# API keys (redacted)
google_bigquery: "path/to/service-account.json"
apify_api: "apify_api_xxxxx"
youtube_api: "AIzaSyXXXXXX"
gemini_api: "your-gemini-key"
model: "gemini-flash-latest"

# Google Sheet with post URLs
sheet_id: "1uqeFBDrZocHUFwSOnHZzgOiD..."
fb_tab_name: FB
ig_tab_name: IG
x_tab_name: X
yt_tab_name: YT

# Per-platform settings
facebook:
  required: true        # Enable/disable platform
  top_post_taken: 50    # Top-N posts by engagement
  max_comment: 500      # Max comments per post
  workers: 20           # Concurrent extraction threads

instagram:
  required: true
  top_post_taken: 50
  max_comment: 500
  workers: 20

twitter:
  required: true
  top_post_taken: 50
  max_comment: 200
  workers: 20

youtube:
  required: true
  top_post_taken: 50
  max_comment: 500
  workers: 1            # YouTube API is sequential
```

| Parameter | Description | Example |
|---|---|---|
| `required` | Toggle platform on/off without code changes | `true` / `false` |
| `top_post_taken` | Number of top posts (by engagement) to process | `50` |
| `max_comment` | Maximum comments to extract per post | `500` |
| `workers` | Thread pool size for parallel extraction | `20` |
| `model` | Gemini model variant for sentiment analysis | `gemini-flash-latest` |
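The `required` flags let a run cover only a subset of platforms. A rough sketch of how such a config might drive platform selection (the dict mirrors the per-platform section of `params.yaml` above; the helper itself is an illustration, not the project's code):

```python
# Parsed equivalent of the per-platform section of params.yaml.
config = {
    "facebook":  {"required": True,  "top_post_taken": 50, "max_comment": 500, "workers": 20},
    "instagram": {"required": False, "top_post_taken": 50, "max_comment": 500, "workers": 20},
    "twitter":   {"required": True,  "top_post_taken": 50, "max_comment": 200, "workers": 20},
    "youtube":   {"required": True,  "top_post_taken": 50, "max_comment": 500, "workers": 1},
}

def enabled_platforms(cfg):
    """Return the names of platforms whose `required` flag is on."""
    return [name for name, settings in cfg.items() if settings.get("required")]
```

In the real pipeline this dict would come from `yaml.safe_load` on `params.yaml`.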
```
CommentLens/
│
├── main.py              # Entry point – orchestrates the full pipeline
├── components.py        # All extraction, I/O, and analysis functions
├── params.yaml          # Configuration (API keys, platform settings)
├── check.ipynb          # Development notebook for testing individual extractors
│
├── download/            # Output directory (timestamped subdirectories)
│   └── MM_DD_HH_MM/     # Auto-created per run
│       ├── fb_data.csv
│       ├── fb_status.csv
│       ├── fb_data_analysis.csv
│       ├── fb_data_summary.csv
│       ├── ig_data.csv
│       ├── ig_status.csv
│       ├── ...
│       └── yt_data_summary.csv
│
└── credentials.json     # Google Service Account (not in repo)
```
**Raw comment data (`*_data.csv`)**

| Column | Description |
|---|---|
| `inputUrl` | Original post URL |
| `postCaption` | Post caption/text |
| `profileName` | Commenter's display name |
| `comment_text` | Raw comment text |
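Each platform's extractor returns differently named fields, so the merge step renames them into this shared schema. A sketch of that normalization (the platform-specific field names here are invented for illustration; the real actors' output fields may differ):

```python
# Hypothetical mapping from each platform's raw field names to the
# unified schema; the real Apify/YouTube payloads may use other names.
FIELD_MAPS = {
    "youtube": {"videoUrl": "inputUrl", "videoTitle": "postCaption",
                "author": "profileName", "text": "comment_text"},
    "facebook": {"inputUrl": "inputUrl", "postCaption": "postCaption",
                 "profileName": "profileName", "text": "comment_text"},
}

def normalize(raw_row, platform):
    """Rename one raw row's platform-specific keys into the unified schema."""
    mapping = FIELD_MAPS[platform]
    return {unified: raw_row.get(raw, "") for raw, unified in mapping.items()}
```

Normalized rows from every platform can then be concatenated into one DataFrame.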
**Sentiment analysis output (`*_data_analysis.csv`)**

| Column | Description |
|---|---|
| `inputUrl` | Original post URL |
| `postCaption` | Post caption/text |
| `profileName` | Commenter's display name |
| `comment_text` | Raw comment text |
| `sentiment_towards_bjp` | `positive` / `negative` / `neutral` / `mixed` / `unclear` |
| `sentiment_confidence` | Confidence score (0.0 – 1.0) |
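The label and confidence come back from the model, so the pipeline has to parse free-form LLM output defensively. A sketch of such a parser, assuming (an assumption about the implementation, not something the repo documents) that Gemini is prompted to reply in JSON:

```python
import json

VALID_LABELS = {"positive", "negative", "neutral", "mixed", "unclear"}

def parse_sentiment(reply: str):
    """Parse a model reply like '{"sentiment": "negative", "confidence": 0.9}'.

    Falls back to ("unclear", 0.0) if the reply is not valid JSON, the label
    is outside the expected set, or the confidence is out of range.
    """
    try:
        data = json.loads(reply)
        label = str(data.get("sentiment", "unclear")).lower()
        confidence = float(data.get("confidence", 0.0))
    except (json.JSONDecodeError, AttributeError, TypeError, ValueError):
        return "unclear", 0.0
    if label not in VALID_LABELS or not 0.0 <= confidence <= 1.0:
        return "unclear", 0.0
    return label, confidence
```

Routing malformed replies to `unclear` rather than crashing keeps one bad model response from stalling a 50,000-comment run.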
**Thematic summary (`*_data_summary.csv`)**

| Column | Description |
|---|---|
| `sentiment` | Sentiment category |
| `count` | Number of comments |
| `percentage` | % of total comments |
| `thematic_summary` | AI-generated summary of themes in that sentiment group |
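The `count` and `percentage` columns are plain aggregation over the per-comment labels; only `thematic_summary` requires another Gemini call. A minimal stdlib sketch of the aggregation (illustrative, not the project's code):

```python
from collections import Counter

def sentiment_distribution(labels):
    """Count sentiment labels and compute each group's share of the total."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: {"count": n, "percentage": round(100 * n / total, 1)}
        for label, n in counts.items()
    }
```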
**Per-URL status (`*_status.csv`)**

| Column | Description |
|---|---|
| `Url` | Post URL that was processed |
| `Status` | `Done N` / `No comments` / `Error ...` / `Invalid Link` |
| `Records` | Number of comments extracted |
**Atomic CSV Writes**

All CSV outputs use a write-to-temp, then atomic-rename pattern. If the process crashes mid-write, you'll never end up with a corrupt or partial CSV file.

```python
import os

# `path` is a pathlib.Path and `df` a pandas DataFrame (defined elsewhere).
# Write to a .tmp file first, then atomically rename it into place.
tmp = path.with_suffix(path.suffix + ".tmp")
df.to_csv(tmp, index=False)
os.replace(tmp, path)  # atomic on the same filesystem
```

**Google Sheets Retry Logic**
Sheet writes include automatic retries with a configurable attempt count and delay. If the Sheets API still fails after all retries, the CSV backup has already been saved, so no data is lost.
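A generic retry helper of the kind described might look like this (a sketch with a configurable attempt count and fixed delay; the project's actual implementation and error classes are not shown in this repo):

```python
import time

def with_retry(fn, attempts=3, delay=1.0):
    """Call `fn`, retrying up to `attempts` times with a fixed delay between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # in practice, catch the Sheets API error class
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Wrapping the Sheets push in `with_retry` keeps transient API errors from aborting the run, while the final re-raise still surfaces persistent failures.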
**Platform Isolation**

Each platform runs in its own thread via `ThreadPoolExecutor`. If Instagram's Apify actor fails, Facebook/YouTube/Twitter continue unaffected. Errors are captured and reported in the status CSV.
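That isolation can be sketched like this: each extractor runs in its own thread, and a failure in one is captured per platform instead of propagating (the extractor functions here are stubs, not the project's real ones):

```python
from concurrent.futures import ThreadPoolExecutor

def run_platforms(extractors):
    """Run each platform extractor in its own thread, capturing errors per platform."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = {name: pool.submit(fn) for name, fn in extractors.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = f"Error {exc}"  # later reported in the status CSV
    return results, errors
```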
**Per-URL Status Tracking**

Every URL gets a status entry: `Done 47`, `No comments`, `Invalid Link`, or `Error <message>`. This makes it easy to audit which posts succeeded and which need re-processing.
**YAML-Driven Configuration**

No code changes are needed to enable or disable platforms, adjust comment limits, or switch AI models. Everything is controlled through `params.yaml`.
| Layer | Technology | Purpose |
|---|---|---|
| Language | Python 3.10+ | Core runtime |
| Extraction | Apify Client | FB, IG, X comment extraction |
| YouTube | Google YouTube Data API v3 | YT comment extraction |
| AI/LLM | Google Gemini (generativeai) | Sentiment analysis + summarization |
| Data | Pandas | DataFrame processing & manipulation |
| Concurrency | ThreadPoolExecutor | Parallel extraction & analysis |
| I/O | gspread + gspread_dataframe | Google Sheets read/write |
| Auth | oauth2client | Google Service Account auth |
| Config | PyYAML | YAML-based configuration |
| Progress | tqdm | Progress bars for analysis |
| Metric | Typical Value |
|---|---|
| Platforms processed simultaneously | Up to 4 |
| Comments per run | 1,000 – 50,000+ |
| Sentiment analysis throughput | ~10 comments/sec (10 threads) |
| End-to-end pipeline time | 5–30 minutes (depending on volume) |
| Output destinations | CSV + Google Sheets (simultaneous) |
- All API keys stored in `params.yaml` (excluded from version control)
- Google Service Account credentials via JSON keyfile
- No hardcoded secrets in source code
- Credentials file never committed to the repository
Built with ❤️ at Varahe Analytics Pvt. Ltd.

⭐ If you found this interesting, give it a star!