A comprehensive system for monitoring and analyzing tourism data from multiple social media platforms for Vietnamese provinces.
This system collects and analyzes comments, posts, and reviews about tourist attractions from:
- YouTube (via YouTube Data API v3)
- Google Reviews (via Google Places API)
- Facebook (via Apify scraper)
- TikTok (via Apify scraper)
Target provinces: Lâm Đồng, Đà Nẵng, Bình Thuận
- Python 3.9+
- PostgreSQL or SQLite
- API keys (see setup below)
# Clone repository
git clone <repository-url>
cd tourism_data_monitor
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Copy the example env file:
cp .env.example .env
Edit .env and add your API keys:
# Required APIs
YOUTUBE_API_KEY=your_youtube_key_here
GOOGLE_MAPS_API_KEY=your_google_maps_key_here
APIFY_API_TOKEN=apify_api_your_token_here
# Database
DATABASE_URL=sqlite:///./tourism.db
Get API keys:
- YouTube & Google Maps: docs/API_CREDENTIALS_SETUP.md
- Apify: docs/APIFY_SETUP.md or docs/APIFY_QUICKSTART.md
python test/test_credentials.py
Expected output:
✅ YouTube WORKING
✅ Google Maps WORKING
✅ Apify (FB/TikTok) WORKING
All credentials are working!
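Before running the credential test, a pre-flight check along these lines can catch a key that was never set. This is a minimal sketch, not the contents of test/test_credentials.py: the function name `missing_credentials` is hypothetical, and only the three variable names come from the .env example above. It checks that the values are present, not that the services accept them.

```python
import os

# Variable names taken from the .env example; everything else is illustrative.
REQUIRED_KEYS = ("YOUTUBE_API_KEY", "GOOGLE_MAPS_API_KEY", "APIFY_API_TOKEN")


def missing_credentials(env=None):
    """Return the required keys that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials: " + ", ".join(missing))
    else:
        print("All required credentials are set.")
```

Passing a plain dict instead of the process environment makes the check easy to exercise in isolation.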
# Create database and seed initial data
python scripts/recreate_db.py
# Collect for specific provinces (recommended for testing)
python scripts/collect_data.py --provinces "Bình Thuận,Đà Nẵng,Lâm Đồng" --limit 3
# Collect for all active attractions
python scripts/collect_data.py --all
# Check database statistics
python scripts/check_data.py
# Just verify connection
python scripts/check_data.py --verify
python run.py
# or
uvicorn app.main:app --reload
Visit: http://localhost:8000/docs for API documentation
- QUICKSTART.md - Quick setup guide
- APIFY_QUICKSTART.md - 5-minute Apify setup (START HERE!)
- API_CREDENTIALS_SETUP.md - YouTube & Google Maps setup
- SCHEDULER_INTEGRATION.md - Scheduler integrated with FastAPI
- WINDOWS_DEV_QUICKSTART.md - Windows development mode
- SCHEDULER_CONFIG_SUMMARY.md - Scheduler configuration options
- SCHEDULING_GUIDE.md - Compare scheduling methods
- deployment/DEPLOYMENT_GUIDE.md - Production deployment (Linux)
- APIFY_SETUP.md - Complete Apify setup guide
- COLLECTOR_CHANGES.md - Platform changes and migration
- APIFY_INTEGRATION_SUMMARY.md - Technical summary
- Backend: FastAPI, SQLAlchemy, Pydantic v2
- Database: PostgreSQL or SQLite
- Collectors: YouTube API, Google Places API, Apify scrapers
- Scheduling: APScheduler
- NLP: PhoBERT (planned)
tourism_data_monitor/
├── app/
│   ├── api/                                # API endpoints
│   │   ├── routes.py
│   │   └── endpoints/                      # Province, attraction, collection endpoints
│   ├── collectors/                         # Data collection modules
│   │   ├── base_collector.py               # Base class with dict mapping
│   │   ├── data_pipeline.py                # Multi-platform orchestrator
│   │   ├── facebook_apify_collector.py     # Facebook via Apify
│   │   ├── tiktok_apify_collector.py       # TikTok via Apify
│   │   ├── youtube_collector.py            # YouTube API
│   │   ├── google_maps_apify_collector.py  # Google Maps via Apify
│   │   ├── relevance_filter.py             # Content filtering
│   │   └── scheduler.py                    # Automated scheduling
│   ├── models/                             # SQLAlchemy ORM models
│   ├── schemas/                            # Pydantic v2 schemas
│   ├── services/                           # Business logic
│   ├── core/                               # Configuration
│   │   ├── config.py                       # Main settings
│   │   └── facebook_best_pages.py          # Facebook pages config
│   └── database/                           # Database connection
├── scripts/                                # Utility scripts
│   ├── collect_data.py                     # Main data collection script
│   ├── check_data.py                       # Database verification
│   ├── recreate_db.py                      # Database setup/reset
│   └── README.md                           # Scripts documentation
├── docs/                                   # Documentation
├── test/                                   # Tests
├── run.py                                  # API server entry point
└── requirements.txt
provinces
└── id, name, code
tourist_attractions
├── id, name, province_id
└── description, location
social_posts
├── id, platform, platform_post_id
├── attraction_id, content, author
└── post_date, engagement metrics
comments
├── id, platform, platform_comment_id
├── post_id, attraction_id, content
└── author, comment_date, sentiment
analysis_logs
├── id, attraction_id, analysis_type
└── results, created_at
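The `platform` and `platform_comment_id` columns in the `comments` table are what make constraint-based duplicate detection possible. The sketch below illustrates the idea with the standard-library `sqlite3` module rather than the project's SQLAlchemy models. The helper names `create_comments_table` and `save_comment` are hypothetical, and the choice of `(platform, platform_comment_id)` as the unique pair is an assumption, not something confirmed by the schema listing.

```python
import sqlite3


def create_comments_table(conn):
    """Create a reduced version of the comments table for illustration."""
    conn.execute(
        """
        CREATE TABLE comments (
            id INTEGER PRIMARY KEY,
            platform TEXT NOT NULL,
            platform_comment_id TEXT NOT NULL,
            post_id INTEGER,
            attraction_id INTEGER,
            content TEXT,
            UNIQUE (platform, platform_comment_id)  -- assumed constraint columns
        )
        """
    )


def save_comment(conn, platform, platform_comment_id, content):
    """Insert a comment; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO comments (platform, platform_comment_id, content) "
        "VALUES (?, ?, ?)",
        (platform, platform_comment_id, content),
    )
    # rowcount is 0 when the unique constraint caused the row to be skipped.
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_comments_table(conn)
    print(save_comment(conn, "youtube", "abc123", "Great view"))  # first insert
    print(save_comment(conn, "youtube", "abc123", "Great view"))  # same ID again
```

Because the database enforces uniqueness, re-running a collection over the same posts does not need a separate lookup before each insert.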
# Collect data for specific provinces
python scripts/collect_data.py --provinces "Bình Thuận,Đà Nẵng" --limit 5
# Collect for all attractions
python scripts/collect_data.py --all
# Check database statistics
python scripts/check_data.py
# Verify connection only
python scripts/check_data.py --verify
import asyncio

from app.collectors.data_pipeline import create_data_pipeline


async def main():
    # Initialize pipeline
    pipeline = create_data_pipeline()
    # Collect from one platform for an attraction
    await pipeline.collect_for_attraction(
        attraction_id=1,
        platform='google_maps',  # or 'facebook', 'youtube', 'tiktok'
        max_posts=8,
        max_comments=20,
    )


# `await` is only valid inside a coroutine, so run it through asyncio
asyncio.run(main())
# List provinces
GET /api/v1/provinces
# List attractions by province
GET /api/v1/attractions?province_id=1
# Trigger collection
POST /api/v1/collection/collect
{
  "attraction_id": 1,
  "platforms": ["facebook", "google_maps"],
  "limit_per_platform": 50
}
# Get collection status
GET /api/v1/collection/status/{task_id}
- Multi-platform data collection (Facebook, Google Maps, TikTok, YouTube)
- Dict mapping strategy for comment collection on existing posts
- Automatic duplicate detection (unique constraints)
- Rate limiting and delay management
- Platform priority-based collection
- Target-based stopping (40 comments per attraction)
- Comprehensive logging and progress reporting
- Database models with proper relationships
- Pydantic schemas for validation
- FastAPI REST API
- Duplicate detection (UniqueConstraint)
- Automated scheduling support
- Comprehensive documentation
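Priority-based collection and target-based stopping from the list above can be combined in one loop. The following is a minimal sketch under stated assumptions, not the code in data_pipeline.py: the function `collect_until_target`, the collector callables, and the priority order are all hypothetical. Only the target of 40 comments per attraction comes from the feature list.

```python
from typing import Callable, Dict, List

# Assumed priority order; the project's real ordering is not documented here.
PLATFORM_PRIORITY = ["google_maps", "facebook", "youtube", "tiktok"]
TARGET_COMMENTS = 40  # per-attraction target from the feature list


def collect_until_target(
    collectors: Dict[str, Callable[[int], List[str]]],
    target: int = TARGET_COMMENTS,
    priority: List[str] = PLATFORM_PRIORITY,
) -> Dict[str, int]:
    """Call collectors in priority order, stopping once `target` is reached.

    Each collector receives the number of comments still needed and returns
    a list of comments; anything beyond that number is discarded.
    """
    collected: Dict[str, int] = {}
    total = 0
    for platform in priority:
        if total >= target:
            break  # target met, skip lower-priority platforms
        collector = collectors.get(platform)
        if collector is None:
            continue  # no collector configured for this platform
        remaining = target - total
        comments = collector(remaining)[:remaining]
        collected[platform] = len(comments)
        total += len(comments)
    return collected
```

With this shape, lower-priority platforms are only queried when the higher-priority ones fall short, which keeps API and scraper usage down.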
Collector Upgrades:
- Facebook: Direct page URLs only (keyword search blocked)
- Google Maps: Increase coverage to 20-30 places
- TikTok: Fix comment collection (currently 0 comments)
- YouTube: Complete testing and optimization
- Add best page fallback strategies
- Implement scraping multiple related pages
Data Quality:
- NLP-based relevance filtering
- Sentiment analysis integration
- Spam/bot detection
- Comment length filtering
- Duplicate content detection across platforms
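The comment length filter and the cross-platform duplicate check listed above could take roughly the following shape. This is one possible approach, sketched under assumptions: the function names, the 10-character threshold, and the normalization rule are all hypothetical, and comments are assumed to be dicts with a `content` key.

```python
import hashlib
import re
from typing import Dict, Iterable, List

MIN_LENGTH = 10  # hypothetical threshold, in characters


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_comments(
    comments: Iterable[Dict[str, str]], min_length: int = MIN_LENGTH
) -> List[Dict[str, str]]:
    """Drop comments that are too short or repeat earlier content."""
    seen = set()
    kept = []
    for comment in comments:
        text = normalize(comment["content"])
        if len(text) < min_length:
            continue  # too short to carry useful signal
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept, possibly from another platform
        seen.add(digest)
        kept.append(comment)
    return kept
```

Hashing the normalized text catches exact reposts across platforms; near-duplicates with small wording changes would need a fuzzier comparison.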
Analytics:
- PhoBERT integration for Vietnamese NLP
- Web dashboard for visualization
- Real-time monitoring
- Automated report generation
- Trend analysis
Latest Collection Results:
- Attractions processed: 7/9 (2 duplicates in DB)
- Total posts: 54
- Total comments: 374
- Average comments/attraction: 53.4
- Target achievement: 7/7 attractions ≥30 comments ✅
Platform Performance:
- Google Maps: Excellent (165 comments from 13 places for one attraction)
- Facebook: Very good (60+ comments with Best Pages strategy)
- TikTok: Posts only (0 comments - needs fixing)
- YouTube: Not yet tested in production
# Application
DEBUG=True
HOST=0.0.0.0
PORT=8000
# Database
DATABASE_URL=postgresql://user:pass@localhost/tourism_db
# APIs (all required)
YOUTUBE_API_KEY=your_key
GOOGLE_MAPS_API_KEY=your_key
APIFY_API_TOKEN=your_token
# Scheduler (optional)
SCHEDULER_ENABLED=False
DAILY_COLLECTION_HOUR=2
DAILY_COLLECTION_MINUTE=0
# Collection Limits
DEFAULT_POSTS_LIMIT=50
DEFAULT_COMMENTS_LIMIT=100
[Add your license here]
- Apify - for a reliable web scraping platform
- Google - for the YouTube and Maps APIs
- FastAPI - for an excellent web framework
- SQLAlchemy - for a powerful ORM