A comprehensive system for monitoring and analyzing tourism data from multiple social media platforms for Vietnamese provinces.

TechmoNoway/tourism-data-monitor

Tourism Data Monitor 🏖️

A comprehensive system for monitoring and analyzing tourism data from multiple social media platforms for Vietnamese provinces.

🎯 Overview

This system collects and analyzes comments, posts, and reviews about tourist attractions from:

  • ✅ YouTube (via YouTube Data API v3)
  • ✅ Google Reviews (via Google Places API)
  • ✅ Facebook (via Apify scraper)
  • ✅ TikTok (via Apify scraper)

Target provinces: Lâm Đồng, Đà Nẵng, Bình Thuận


🚀 Quick Start

1. Prerequisites

  • Python 3.9+
  • PostgreSQL (optional; SQLite also works, as in the sample DATABASE_URL in step 3)
  • API keys (see setup below)

2. Installation

# Clone repository
git clone <repository-url>
cd tourism-data-monitor  # git clone names the directory after the repository

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

3. Setup Credentials

Copy the example env file:

cp .env.example .env

Edit .env and add your API keys:

# Required APIs
YOUTUBE_API_KEY=your_youtube_key_here
GOOGLE_MAPS_API_KEY=your_google_maps_key_here
APIFY_API_TOKEN=apify_api_your_token_here

# Database
DATABASE_URL=sqlite:///./tourism.db

Get API keys:

4. Test Your Setup

python test/test_credentials.py

Expected output:

✅ YouTube                WORKING
✅ Google Maps            WORKING
✅ Apify (FB/TikTok)      WORKING

🎉 All credentials are working!

5. Initialize Database

# Create database and seed initial data
python scripts/recreate_db.py

6. Collect Data

# Collect for specific provinces (recommended for testing)
python scripts/collect_data.py --provinces "Bình Thuận,Đà Nẵng,Lâm Đồng" --limit 3

# Collect for all active attractions
python scripts/collect_data.py --all

7. Verify Data

# Check database statistics
python scripts/check_data.py

# Just verify connection
python scripts/check_data.py --verify

8. Run API Server (Optional)

python run.py
# or
uvicorn app.main:app --reload

Visit http://localhost:8000/docs for the interactive API documentation.


📚 Documentation

Getting Started

Scheduling & Deployment

Technical Documentation


🏗️ Architecture

Tech Stack

  • Backend: FastAPI, SQLAlchemy, Pydantic v2
  • Database: PostgreSQL or SQLite
  • Collectors: YouTube API, Google Places API, Apify scrapers
  • Scheduling: APScheduler
  • NLP: PhoBERT (planned)

Project Structure

tourism_data_monitor/
├── app/
│   ├── api/              # API endpoints
│   │   ├── routes.py
│   │   └── endpoints/    # Province, attraction, collection endpoints
│   ├── collectors/       # Data collection modules
│   │   ├── base_collector.py           # Base class with dict mapping
│   │   ├── data_pipeline.py            # Multi-platform orchestrator
│   │   ├── facebook_apify_collector.py # Facebook via Apify
│   │   ├── tiktok_apify_collector.py   # TikTok via Apify
│   │   ├── youtube_collector.py        # YouTube API
│   │   ├── google_maps_apify_collector.py # Google Maps via Apify
│   │   ├── relevance_filter.py         # Content filtering
│   │   └── scheduler.py                # Automated scheduling
│   ├── models/           # SQLAlchemy ORM models
│   ├── schemas/          # Pydantic v2 schemas
│   ├── services/         # Business logic
│   ├── core/             # Configuration
│   │   ├── config.py                # Main settings
│   │   └── facebook_best_pages.py   # Facebook pages config
│   └── database/         # Database connection
├── scripts/              # Utility scripts
│   ├── collect_data.py   # Main data collection script
│   ├── check_data.py     # Database verification
│   ├── recreate_db.py    # Database setup/reset
│   └── README.md         # Scripts documentation
├── docs/                 # Documentation
├── test/                 # Tests
├── run.py                # API server entry point
└── requirements.txt

Database Schema

provinces
├── id, name, code

tourist_attractions
├── id, name, province_id
├── description, location

social_posts
├── id, platform, platform_post_id
├── attraction_id, content, author
├── post_date, engagement metrics

comments
├── id, platform, platform_comment_id
├── post_id, attraction_id, content
├── author, comment_date, sentiment

analysis_logs
├── id, attraction_id, analysis_type
├── results, created_at
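The unique-constraint idea behind this schema can be tried out with a minimal sketch using Python's built-in sqlite3 module. The column types, the `UNIQUE (platform, platform_post_id)` key, the `insert_post` helper, and the sample rows are assumptions for illustration; the project's actual SQLAlchemy models in app/models/ may differ.

```python
import sqlite3

# Assumed DDL for three of the tables listed above (types are guesses).
SCHEMA = """
CREATE TABLE provinces (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    code TEXT NOT NULL
);
CREATE TABLE tourist_attractions (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    province_id INTEGER NOT NULL REFERENCES provinces(id),
    description TEXT,
    location TEXT
);
CREATE TABLE social_posts (
    id INTEGER PRIMARY KEY,
    platform TEXT NOT NULL,
    platform_post_id TEXT NOT NULL,
    attraction_id INTEGER NOT NULL REFERENCES tourist_attractions(id),
    content TEXT,
    author TEXT,
    post_date TEXT,
    UNIQUE (platform, platform_post_id)  -- assumed duplicate-detection key
);
"""


def insert_post(conn, platform, platform_post_id, attraction_id, content):
    """Insert a post; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO social_posts "
        "(platform, platform_post_id, attraction_id, content) "
        "VALUES (?, ?, ?, ?)",
        (platform, platform_post_id, attraction_id, content),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder rows; the province code and attraction name are made up.
conn.execute("INSERT INTO provinces (id, name, code) VALUES (1, 'Lâm Đồng', 'XX')")
conn.execute(
    "INSERT INTO tourist_attractions (id, name, province_id) "
    "VALUES (1, 'Example attraction', 1)"
)

first = insert_post(conn, "youtube", "abc123", 1, "Great view")
second = insert_post(conn, "youtube", "abc123", 1, "Great view")
stored = conn.execute("SELECT COUNT(*) FROM social_posts").fetchone()[0]
```

With a key like this, re-running a collection over the same posts leaves the table unchanged, so repeated runs stay idempotent.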

💻 Usage Examples

Using Command Line Scripts

# Collect data for specific provinces
python scripts/collect_data.py --provinces "Bình Thuận,Đà Nẵng" --limit 5

# Collect for all attractions
python scripts/collect_data.py --all

# Check database statistics
python scripts/check_data.py

# Verify connection only
python scripts/check_data.py --verify

Using Python API

import asyncio

from app.collectors.data_pipeline import create_data_pipeline


async def main():
    # Initialize pipeline
    pipeline = create_data_pipeline()

    # Collect from a single platform for an attraction
    await pipeline.collect_for_attraction(
        attraction_id=1,
        platform='google_maps',  # or 'facebook', 'youtube', 'tiktok'
        max_posts=8,
        max_comments=20
    )


# collect_for_attraction is a coroutine, so it must run inside an event loop
asyncio.run(main())

API Endpoints

# List provinces
GET /api/v1/provinces

# List attractions by province
GET /api/v1/attractions?province_id=1

# Trigger collection
POST /api/v1/collection/collect
{
  "attraction_id": 1,
  "platforms": ["facebook", "google_maps"],
  "limit_per_platform": 50
}

# Get collection status
GET /api/v1/collection/status/{task_id}
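As a rough illustration of calling the trigger endpoint from Python, the sketch below builds the POST request with the standard library only. The `BASE_URL` value assumes the local server from the Quick Start, `build_collect_request` is a hypothetical helper name, and the shape of the server's response is not documented here, so the send step is left commented out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: local server started via run.py


def build_collect_request(attraction_id, platforms, limit_per_platform=50):
    """Build (but do not send) a POST request for /api/v1/collection/collect."""
    payload = {
        "attraction_id": attraction_id,
        "platforms": list(platforms),
        "limit_per_platform": limit_per_platform,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/collection/collect",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_collect_request(1, ["facebook", "google_maps"])

# To send it (requires the API server to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```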

🎯 Features

✅ Implemented

  • Multi-platform data collection (Facebook, Google Maps, TikTok, YouTube)
  • Dict mapping strategy for comment collection on existing posts
  • Automatic duplicate detection (database UniqueConstraint)
  • Rate limiting and delay management
  • Platform priority-based collection
  • Target-based stopping (40 comments per attraction)
  • Comprehensive logging and progress reporting
  • Database models with proper relationships
  • Pydantic schemas for validation
  • FastAPI REST API
  • Automated scheduling support
  • Comprehensive documentation
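Platform priority and target-based stopping, as listed above, could combine roughly as in the following sketch. The function name `collect_until_target`, the fetcher signature, the priority order, and the canned data are all assumptions for illustration and are not taken from the project's data_pipeline.py.

```python
from typing import Callable, Dict, List


def collect_until_target(
    fetchers: Dict[str, Callable[[int], List[str]]],
    priority: List[str],
    target: int = 40,
) -> Dict[str, List[str]]:
    """Query platforms in priority order until `target` comments are gathered."""
    collected: Dict[str, List[str]] = {}
    total = 0
    for platform in priority:
        remaining = target - total
        if remaining <= 0:
            break  # target reached; lower-priority platforms are skipped
        # Ask only for what is still needed, and trim in case more comes back.
        comments = fetchers[platform](remaining)[:remaining]
        collected[platform] = comments
        total += len(comments)
    return collected


# Stand-in fetchers returning canned comments instead of calling real APIs.
fake_fetchers = {
    "google_maps": lambda limit: [f"gm-{i}" for i in range(min(limit, 30))],
    "facebook": lambda limit: [f"fb-{i}" for i in range(min(limit, 25))],
    "tiktok": lambda limit: [f"tt-{i}" for i in range(min(limit, 10))],
}

result = collect_until_target(
    fake_fetchers, priority=["google_maps", "facebook", "tiktok"]
)
```

Stopping as soon as the target is met avoids spending API quota or scraper credits on lower-priority platforms once an attraction already has enough comments.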

🔜 Planned Improvements

Collector Upgrades:

  • Facebook: Direct page URLs only (keyword search blocked)
  • Google Maps: Increase coverage to 20-30 places
  • TikTok: Fix comment collection (currently 0 comments)
  • YouTube: Complete testing and optimization
  • Add best page fallback strategies
  • Implement scraping multiple related pages

Data Quality:

  • NLP-based relevance filtering
  • Sentiment analysis integration
  • Spam/bot detection
  • Comment length filtering
  • Duplicate content detection across platforms

Analytics:

  • PhoBERT integration for Vietnamese NLP
  • Web dashboard for visualization
  • Real-time monitoring
  • Automated report generation
  • Trend analysis

📊 Current Performance

Latest Collection Results:

  • Attractions processed: 7 of 9 (the remaining 2 are duplicate entries in the database)
  • Total posts: 54
  • Total comments: 374
  • Average comments per attraction: 53.4
  • Target achievement: 7/7 attractions with ≥30 comments ✅

Platform Performance:

  • Google Maps: Excellent (165 comments from 13 places for one attraction)
  • Facebook: Very good (60+ comments with Best Pages strategy)
  • TikTok: Posts only (0 comments - needs fixing)
  • YouTube: Not yet tested in production

🔧 Configuration

Environment Variables

# Application
DEBUG=True
HOST=0.0.0.0
PORT=8000

# Database
DATABASE_URL=postgresql://user:pass@localhost/tourism_db

# APIs (all required)
YOUTUBE_API_KEY=your_key
GOOGLE_MAPS_API_KEY=your_key
APIFY_API_TOKEN=your_token

# Scheduler (optional)
SCHEDULER_ENABLED=False
DAILY_COLLECTION_HOUR=2
DAILY_COLLECTION_MINUTE=0

# Collection Limits
DEFAULT_POSTS_LIMIT=50
DEFAULT_COMMENTS_LIMIT=100
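To show how these variables might be read into typed values, here is a minimal standard-library sketch. The `Settings` dataclass, the `load_settings` helper, and the fallback defaults are assumptions chosen to mirror the sample values above; the project's own app/core/config.py may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'True', '1', or 'yes'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    debug: bool
    host: str
    port: int
    database_url: str
    scheduler_enabled: bool
    daily_collection_hour: int
    daily_collection_minute: int
    default_posts_limit: int
    default_comments_limit: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables (assumed defaults)."""
    hour = int(env.get("DAILY_COLLECTION_HOUR", "2"))
    minute = int(env.get("DAILY_COLLECTION_MINUTE", "0"))
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("daily collection time must be a valid HH:MM")
    return Settings(
        debug=_as_bool(env.get("DEBUG", "False")),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        database_url=env.get("DATABASE_URL", "sqlite:///./tourism.db"),
        scheduler_enabled=_as_bool(env.get("SCHEDULER_ENABLED", "False")),
        daily_collection_hour=hour,
        daily_collection_minute=minute,
        default_posts_limit=int(env.get("DEFAULT_POSTS_LIMIT", "50")),
        default_comments_limit=int(env.get("DEFAULT_COMMENTS_LIMIT", "100")),
    )


# Example with an explicit mapping instead of the real process environment.
settings = load_settings({"PORT": "9000", "DEBUG": "True"})
```

Passing the environment in as a mapping keeps the parsing logic easy to test without touching real environment variables.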

📄 License

[Add your license here]


🙏 Acknowledgments

  • Apify - For reliable web scraping platform
  • Google - For YouTube and Maps APIs
  • FastAPI - For excellent web framework
  • SQLAlchemy - For powerful ORM
