Automated data collection and analysis of YouTube trending videos across 27 countries
- 🎯 Project Overview
- 🌍 Countries Analyzed
- ⚡ Quick Start
- 🔄 Data Pipeline
- 📊 Data Structure
- 🤖 Automation
- 📈 Analytics Features
- 🛠️ Technical Details
- 📁 Project Structure
YouTube Trending Analytics is an automated data collection and analysis system that tracks trending videos across 27 countries through daily scheduled runs. The system captures video metadata, engagement metrics, and channel information to support content performance analysis.
| Objective | Status | Description |
|---|---|---|
| 📊 Data Collection | ✅ Active | Daily automated collection from 27 countries |
| 🗄️ Historical Database | ✅ Active | 999-day retention with comprehensive metrics |
| 🔍 Performance Analysis | ✅ Ready | 65+ fields per video for ML analysis |
| 🌐 Multi-Language Support | ✅ Active | Spanish & English speaking countries |
| 🚀 Real-time Processing | ✅ Active | GitHub Actions automation |
Our system monitors trending videos across 27 countries organized into language groups:
Spanish-speaking countries (19)
| 🇦🇷 Argentina | 🇧🇴 Bolivia | 🇨🇱 Chile | 🇨🇴 Colombia | 🇨🇷 Costa Rica |
| 🇩🇴 Dominican Rep. | 🇪🇨 Ecuador | 🇪🇸 Spain | 🇬🇹 Guatemala | 🇭🇳 Honduras |
| 🇲🇽 Mexico | 🇳🇮 Nicaragua | 🇵🇦 Panama | 🇵🇪 Peru | 🇵🇷 Puerto Rico |
| 🇵🇾 Paraguay | 🇸🇻 El Salvador | 🇺🇾 Uruguay | 🇻🇪 Venezuela |
English-speaking countries (8)
| 🇦🇺 Australia | 🇨🇦 Canada | 🇬🇧 United Kingdom | 🇮🇪 Ireland |
| 🇯🇲 Jamaica | 🇳🇿 New Zealand | 🇸🇬 Singapore | 🇺🇸 United States |
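The grouping above can be expressed as a small lookup. This is a minimal sketch under stated assumptions, not the repository's actual code: `LANGUAGE_GROUPS`, `group_for`, and `ALL_COUNTRIES` are hypothetical names, and the country codes are the ones listed in `config.json`.

```python
# Hypothetical mapping of language groups to country codes (codes as in config.json).
LANGUAGE_GROUPS = {
    "ES": ["AR", "BO", "CL", "CO", "CR", "DO", "EC", "ES", "GT", "HN",
           "MX", "NI", "PA", "PE", "PR", "PY", "SV", "UY", "VE"],
    "EN": ["AU", "CA", "GB", "IE", "JM", "NZ", "SG", "US"],
}


def group_for(country_code: str) -> str:
    """Return the language group ("ES" or "EN") that contains a country code."""
    for group, codes in LANGUAGE_GROUPS.items():
        if country_code in codes:
            return group
    raise KeyError(f"Unknown country code: {country_code}")


# The worldwide set is the union of both language groups.
ALL_COUNTRIES = sorted(code for codes in LANGUAGE_GROUPS.values() for code in codes)
```

A lookup like this keeps the 19 + 8 = 27 split in one place, so the consolidation step and any analysis code can agree on which country belongs to which group.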
| Data Type | Location | Update Frequency |
|---|---|---|
| 🎬 Trending Videos | `assets/meta/trending/` | Daily at 1 AM PST |
| 📈 Video Statistics | `assets/meta/video_stats/` | Daily (after trending) |
| 📋 Consolidated CSV | `db/ods/trending_videos.csv` | Daily |
| 🌍 Worldwide Data | `assets/meta/trending/languages/www/` | Daily |
# Clone the repository
git clone https://github.com/Root-FTW/YT_DB_Trending.git
cd YT_DB_Trending
# Install dependencies
pip install -r src/requirements.txt
# Set up YouTube API key
export YOUTUBE_API_KEY="your_api_key_here"
# Run data collection (manual)
python src/collection/trending.py
python src/collection/video_stats.py

Our automated data collection pipeline runs daily and processes data through multiple stages:
graph TD
A[🎬 trending.py<br/>Fetch Trending Videos<br/>5 API Parts] -->|27 Country JSONs| B[🌐 trending_consolidator.py<br/>Language Consolidation]
A -->|Daily Files| C[📊 trending_db.py<br/>Aggregate & Clean Data]
B -->|Spanish Group| D[🇪🇸 Spanish Consolidated<br/>19 Countries]
B -->|English Group| E[🇬🇧 English Consolidated<br/>8 Countries]
B -->|Global| F[🌍 Worldwide Consolidated<br/>All 27 Countries]
C -->|Unified CSV| G[📈 video_stats.py<br/>Detailed Video Analytics<br/>10 API Parts + Channel Data]
G -->|65+ Fields| H[📋 Daily Video Stats JSON<br/>Ready for ML Analysis]
style A fill:#ff6b6b,color:#fff
style B fill:#4ecdc4,color:#fff
style C fill:#45b7d1,color:#fff
style G fill:#96ceb4,color:#fff
style H fill:#feca57,color:#000
style F fill:#ff9ff3,color:#000
| Stage | Script | Input | Output | Frequency |
|---|---|---|---|---|
| 1️⃣ Collection | `trending.py` | YouTube API | 27 country JSON files | Daily 1 AM PST |
| 2️⃣ Consolidation | `trending_consolidator.py` | Country JSONs | Language group files | After stage 1 |
| 3️⃣ Aggregation | `trending_db.py` | All JSONs | Unified CSV (999 days) | After stage 2 |
| 4️⃣ Enhancement | `video_stats.py` | CSV video IDs | Detailed stats JSON | After stage 3 |
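For a manual run, the four stages can be chained in the order shown in the table. The runner below is a hypothetical sketch, not part of the repository: `PIPELINE_STAGES`, `build_commands`, and `run_pipeline` are illustrative names, and the script paths are taken from the project structure. It assumes it is started from the repository root with `YOUTUBE_API_KEY` already exported.

```python
import subprocess
import sys

# Stage order follows the pipeline table; paths follow the project structure.
PIPELINE_STAGES = [
    ("collection", "src/collection/trending.py"),
    ("consolidation", "src/collection/trending_consolidator.py"),
    ("aggregation", "src/processing/trending_db.py"),
    ("enhancement", "src/collection/video_stats.py"),
]


def build_commands(python=sys.executable):
    """Return one command per stage, in execution order."""
    return [[python, script] for _, script in PIPELINE_STAGES]


def run_pipeline():
    """Run each stage in order; check=True raises on the first failing stage."""
    for command in build_commands():
        subprocess.run(command, check=True)
```

Stopping at the first failure matters here because each stage reads the previous stage's output; running `video_stats.py` against a stale CSV would silently produce statistics for the wrong day.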
Our system captures comprehensive data at multiple levels:
YT_DB_Trending/
├── 📂 assets/meta/trending/
│ ├── 📂 countries/ # Individual country data
│ │ ├── 📂 AR/ # Argentina files
│ │ ├── 📂 US/ # United States files
│ │ └── ... # (27 countries total)
│ └── 📂 languages/ # Consolidated data
│ ├── 📂 ES/ # Spanish-speaking consolidation
│ ├── 📂 EN/ # English-speaking consolidation
│ └── 📂 www/ # Worldwide consolidation
├── 📂 assets/meta/video_stats/ # Detailed video analytics
├── 📂 db/ods/ # Processed datasets
│ └── 📄 trending_videos.csv # Unified trending data
└── 📂 src/ # Source code
├── 📂 collection/ # Data collection scripts
└── 📂 processing/ # Data processing scripts
🎬 Trending Video Data (per country)
| Field | Type | Description |
|---|---|---|
| `id` | String | Unique YouTube video ID |
| `trending_position` | Integer | Position in trending list (1-50) |
| `collection_date` | Date | When data was collected |
| `country_code` | String | Country code (AR, US, etc.) |
| `title` | String | Video title |
| `channelTitle` | String | Channel name |
| `viewCount` | Integer | Total views |
| `likeCount` | Integer | Total likes |
| `commentCount` | Integer | Total comments |
| `categoryId` | String | YouTube category |
| `publishedAt` | DateTime | Video publication date |
| `thumbnail_url` | String | High-quality thumbnail URL |
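The fields above map naturally onto a typed record. The following is a sketch under assumptions: `TrendingRecord` and `parse_record` are hypothetical names, dates are kept as plain strings, and counts are coerced to integers because the YouTube Data API returns statistics as strings. How the repository's JSON files actually store these values is not stated here.

```python
from dataclasses import dataclass


@dataclass
class TrendingRecord:
    id: str
    trending_position: int
    collection_date: str
    country_code: str
    title: str
    channelTitle: str
    viewCount: int
    likeCount: int
    commentCount: int
    categoryId: str
    publishedAt: str
    thumbnail_url: str


INTEGER_FIELDS = ("trending_position", "viewCount", "likeCount", "commentCount")


def parse_record(raw: dict) -> TrendingRecord:
    """Build a TrendingRecord, coercing integer fields (missing or empty becomes 0).

    Keys outside the twelve documented fields raise TypeError.
    """
    values = dict(raw)
    for name in INTEGER_FIELDS:
        values[name] = int(values.get(name) or 0)
    return TrendingRecord(**values)
```

Defaulting a missing count to 0 is a design choice made for this sketch; a video with hidden likes would be indistinguishable from one with zero likes, so an analysis that cares about the difference may prefer `None`.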
📈 Enhanced Video Statistics (65+ fields)
Basic Metrics:
- Views, likes, comments, favorites
- Duration, resolution, category
- Publication date, language
Channel Intelligence:
- Subscriber count, total videos
- Channel country, keywords
- Topic categories, status
Calculated Metrics:
- `engagement_rate`: (likes + comments) / views
- `views_to_subscribers_ratio`: views / subscriber_count
- `likes_to_views_ratio`: likes / views
- `comments_to_views_ratio`: comments / views
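The four calculated metrics are plain ratios, so the main implementation detail is what to do when a denominator is zero or missing (for example, a channel that hides its subscriber count). The sketch below assumes such cases yield `None`; `safe_ratio` and `calculated_metrics` are hypothetical helper names, not functions from the repository.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None when the denominator is zero or missing."""
    if not denominator:
        return None
    return numerator / denominator


def calculated_metrics(views, likes, comments, subscriber_count):
    """Compute the four calculated metrics using the formulas listed above."""
    return {
        "engagement_rate": safe_ratio(likes + comments, views),
        "views_to_subscribers_ratio": safe_ratio(views, subscriber_count),
        "likes_to_views_ratio": safe_ratio(likes, views),
        "comments_to_views_ratio": safe_ratio(comments, views),
    }
```

For a video with 1,000 views, 40 likes, 10 comments, and 500 subscribers, this gives an engagement rate of 0.05 (5%) and a views-to-subscribers ratio of 2.0.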
Technical Details:
- File size, container format
- Video/audio streams, bitrate
- Processing status, quality indicators
Our system runs automatically using GitHub Actions:
| Workflow | Trigger | Schedule | Duration |
|---|---|---|---|
| 🎬 Trending Collection | Daily | 1:00 AM PST | ~5 minutes |
| 📈 Video Stats Collection | After trending | Dependent | ~10 minutes |
| 🧪 Code Quality | Push/PR | On-demand | ~2 minutes |
📊 Daily YouTube Data Pipeline
Trigger: `0 9 * * *` (09:00 UTC, which is 1 AM PST, daily)
Steps:
- 🔄 Checkout repository
- 🐍 Setup Python 3.9 environment
- 📦 Install dependencies
- 🔑 Configure YouTube API key
- 🎬 Run `trending.py` (fetch trending videos)
- 🌐 Run `trending_consolidator.py` (language consolidation)
- 📊 Run `trending_db.py` (aggregate data)
- 💾 Commit and push changes
Output Files:
- 27 country JSON files
- 2 language consolidation files (ES, EN)
- 1 worldwide consolidation file
- 1 unified CSV file
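GitHub Actions evaluates `schedule` cron expressions in UTC, which is why the trigger for this workflow uses hour 9 for a 1 AM PST run. The conversion can be checked with a fixed UTC-8 offset; `PST` and `cron_hour_in_pst` are illustrative names, and the date used inside the function is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset; deliberately ignores daylight saving time.
PST = timezone(timedelta(hours=-8), "PST")


def cron_hour_in_pst(utc_hour: int) -> int:
    """Convert an hour of the day in UTC to the corresponding hour in PST."""
    moment = datetime(2024, 1, 1, utc_hour, tzinfo=timezone.utc)
    return moment.astimezone(PST).hour
```

Because the cron expression is fixed in UTC, the run lands at 2 AM local time while Pacific Daylight Time (UTC-7) is in effect.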
📈 Daily Video Statistics Collector
Trigger: After "Daily YouTube Data Pipeline" completes successfully
Steps:
- 🔄 Checkout repository
- 🐍 Setup Python 3.9 environment
- 📦 Install dependencies
- 🔑 Configure YouTube API key
- 📈 Run `video_stats.py` (detailed analytics)
- 💾 Commit and push changes
Output Files:
- Daily video statistics JSON (65+ fields per video)
| Data Type | Retention Period | Storage Location |
|---|---|---|
| Trending JSONs | 999 days (~2.7 years) | assets/meta/trending/ |
| Video Stats JSONs | 999 days (~2.7 years) | assets/meta/video_stats/ |
| Consolidated CSV | All historical data | db/ods/trending_videos.csv |
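A 999-day retention window amounts to comparing each file's collection date with a cutoff. The sketch below shows only that comparison. It is an assumption about how pruning could work, not the repository's implementation: `expired_dates` is a hypothetical helper, and treating a file exactly 999 days old as still retained is a choice made for this example.

```python
from datetime import date, timedelta

RETENTION_DAYS = 999  # retention period from the table above


def expired_dates(collection_dates, today):
    """Return the collection dates that fall outside the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [d for d in collection_dates if d < cutoff]
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check deterministic and easy to test.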
Our dataset enables advanced content performance analysis through comprehensive metrics:
| Metric Category | Available Metrics | Use Case |
|---|---|---|
| 📊 Engagement | Likes, comments, engagement rate | Audience interaction analysis |
| 👥 Audience Reach | Views-to-subscribers ratio | Content reach analysis |
| 🌍 Geographic Spread | Trending positions across countries | Global appeal measurement |
| 📺 Channel Context | Subscriber count, channel history | Relative performance analysis |
| ⚡ Content Quality | Technical specs, processing status | Content optimization insights |
| Indicator | Formula | Interpretation |
|---|---|---|
| Engagement Rate | `(likes + comments) / views` | Higher = More engaging content |
| Audience Reach | `views / subscriber_count` | Higher = Greater content reach |
| Geographic Appeal | `countries_trending / 27` | Higher = Global relevance |
| Trending Velocity | `average_position` | Lower = Better performance |
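The last two indicators in the table aggregate over a video's appearances across countries. A minimal sketch, assuming the input is simply the list of country codes and trending positions observed for one video; `geographic_appeal` and `average_position` are hypothetical function names.

```python
TOTAL_COUNTRIES = 27  # number of countries tracked


def geographic_appeal(country_codes):
    """Fraction of the tracked countries in which a video trended."""
    return len(set(country_codes)) / TOTAL_COUNTRIES


def average_position(positions):
    """Mean trending position (lower is better); None when there is no data."""
    positions = list(positions)
    if not positions:
        return None
    return sum(positions) / len(positions)
```

Deduplicating country codes with `set` means a video that trends in the same country on several days still counts that country once.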
High-Performance Video Detection:
- Video with 10K subscribers getting 2M views → High performance potential
- Video trending in 15+ countries → Global appeal
- Engagement rate > 5% → Highly engaging content
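These rules of thumb can be turned into simple boolean flags. In the sketch below, the 5% and 15-country thresholds come from the list above, while `REACH_THRESHOLD` is a hypothetical cut-off: the source gives only the 10K-subscriber, 2M-view example (a ratio of 200) and no explicit threshold. `performance_flags` is an illustrative name.

```python
ENGAGEMENT_THRESHOLD = 0.05    # "engagement rate > 5%"
GLOBAL_COUNTRY_THRESHOLD = 15  # "trending in 15+ countries"
REACH_THRESHOLD = 100          # hypothetical; not specified in the source


def performance_flags(views, subscriber_count, engagement_rate, countries_trending):
    """Flag a video against the three high-performance heuristics."""
    reach = views / subscriber_count if subscriber_count else None
    return {
        "high_reach": reach is not None and reach >= REACH_THRESHOLD,
        "global_appeal": countries_trending >= GLOBAL_COUNTRY_THRESHOLD,
        "highly_engaging": engagement_rate > ENGAGEMENT_THRESHOLD,
    }
```

Keeping the thresholds as module-level constants makes them easy to tune once real distributions from the dataset are available.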
Channel Performance:
- Compare views-to-subscribers across similar channels
- Analyze trending frequency by country/language
- Track engagement patterns over time
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Language | Python | 3.9 | Core development |
| API | YouTube Data API | v3 | Data collection |
| Automation | GitHub Actions | Latest | Workflow orchestration |
| Data Processing | pandas | Latest | Data manipulation |
| Configuration | JSON | - | Settings management |
YouTube Data API Parts Used:
🎬 Trending Collection (5 parts)
- `snippet` - Basic video information
- `statistics` - View, like, comment counts
- `contentDetails` - Duration, resolution
- `status` - Privacy settings
- `topicDetails` - Content categorization
📈 Video Statistics (10 parts)
- All trending parts, plus:
- `fileDetails` - Technical file information
- `processingDetails` - Processing status
- `suggestions` - Quality recommendations
- `localizations` - Multi-language content
- `liveStreamingDetails` - Live streaming data
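To show how the `part` list feeds into a request, the sketch below builds a `videos.list` URL for the most popular videos in one region, using the five trending parts. It only constructs the URL and makes no network call. `trending_request_url` is a hypothetical helper, and the repository's scripts may well use a client library instead of raw URLs.

```python
from urllib.parse import urlencode

VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"
TRENDING_PARTS = ["snippet", "statistics", "contentDetails", "status", "topicDetails"]


def trending_request_url(region_code, api_key, max_results=50):
    """Build a videos.list URL for the mostPopular chart of one region."""
    params = {
        "part": ",".join(TRENDING_PARTS),  # comma-separated list of parts
        "chart": "mostPopular",
        "regionCode": region_code,
        "maxResults": max_results,         # the API accepts values from 1 to 50
        "key": api_key,
    }
    return f"{VIDEOS_ENDPOINT}?{urlencode(params)}"
```

Calling `trending_request_url("AR", api_key)` for each code in `TRENDING_COUNTRY_CODES` would yield one request per country, matching the 27 country JSON files the pipeline produces.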
The system is configured via config.json:
{
"TRENDING_METADATA_LOC": "assets/meta/trending",
"TRENDING_ODS_DIR": "db/ods/",
"TRENDING_COUNTRY_CODES": [
"AR", "BO", "CL", "CO", "CR", "DO", "EC", "ES",
"GT", "HN", "MX", "NI", "PA", "PE", "PR", "PY",
"SV", "UY", "VE", "AU", "CA", "GB", "IE", "JM",
"NZ", "SG", "US"
],
"VIDEO_STATS_METADATA_LOC": "assets/meta/video_stats"
}

YT_DB_Trending/
├── 📄 README.md # This documentation
├── 📄 LICENSE # MIT License
├── 📄 config.json # Configuration settings
├── 📂 .github/workflows/ # GitHub Actions
│ ├── 📄 tube_data_collection_pipeline.yml
│ ├── 📄 daily_video_stats_collector.yml
│ └── 📄 python-app.yml
├── 📂 src/ # Source code
│ ├── 📄 requirements.txt # Python dependencies
│ ├── 📂 collection/ # Data collection scripts
│ │ ├── 📄 trending.py # Fetch trending videos
│ │ ├── 📄 trending_consolidator.py # Language consolidation
│ │ └── 📄 video_stats.py # Detailed video analytics
│ └── 📂 processing/ # Data processing scripts
│ └── 📄 trending_db.py # Data aggregation & cleaning
├── 📂 assets/meta/ # Generated data
│ ├── 📂 trending/ # Trending video data
│ │ ├── 📂 countries/ # Per-country files
│ │ └── 📂 languages/ # Consolidated files
│ └── 📂 video_stats/ # Detailed analytics
├── 📂 db/ods/ # Processed datasets
│ └── 📄 trending_videos.csv # Unified trending data
└── 📂 analysis/ # Analysis notebooks
├── 📄 README.md
└── 📄 YouTube Performance Predictor.ipynb
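To connect this layout back to `config.json`, the sketch below loads the settings and derives the per-country directory for each code under `assets/meta/trending/countries/`. `load_config` and `country_dirs` are hypothetical helpers; the `countries/` subdirectory name is taken from the tree above, and how the actual scripts resolve paths is an assumption.

```python
import json
from pathlib import Path


def load_config(path="config.json") -> dict:
    """Read the JSON configuration file from the repository root."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def country_dirs(config: dict) -> dict:
    """Map each country code to <TRENDING_METADATA_LOC>/countries/<code>."""
    base = Path(config["TRENDING_METADATA_LOC"]) / "countries"
    return {code: base / code for code in config["TRENDING_COUNTRY_CODES"]}
```

With the configuration shown earlier, `country_dirs(load_config())["AR"]` resolves to `assets/meta/trending/countries/AR`.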