Data Warehouse, ETL, and Analytics for OpenStreetMap Notes
This repository contains the analytics and data warehouse components for the OSM Notes profiling system. It provides ETL (Extract, Transform, Load) processes, a star schema data warehouse, and datamarts for analyzing OSM notes data.
- Star Schema Data Warehouse: Comprehensive dimensional model for notes analysis
- Enhanced ETL Process: Robust ETL with recovery, validation, and monitoring
- Partitioned Facts Table: Automatic partitioning by year (2013-2025+)
- Country Datamart: Pre-computed analytics by country (70+ metrics)
- User Datamart: Pre-computed analytics by user (70+ metrics)
- Profile Generator: Generate country and user profiles
- Advanced Dimensions:
  - Timezones for local time analysis
  - Seasons based on date and latitude
  - Continents for geographical grouping
  - Application versions tracking
  - SCD2 for username changes
  - Automation level detection
  - User experience levels
  - Hashtags (unlimited via bridge table)
- Enhanced Metrics (October 2025):
  - Resolution metrics: avg/median days to resolution, resolution rate
  - Application statistics: mobile/desktop apps count, most used app
  - Content quality: comment length, URL/mention detection, engagement metrics
  - Community health: active notes, backlog, age distribution, recent activity
- Comprehensive Testing: 168+ automated tests (90%+ function coverage)
- PostgreSQL 12 or higher
- PostGIS 3.0 or higher
- Bash 4.0 or higher
- OSM Notes Ingestion Database: This analytics system reads from the base notes tables populated by the OSM-Notes-Ingestion system
This system uses a shared database approach with separate schemas:
Database: osm_notes
├── Schema: public # Base tables (managed by Ingestion repo)
│ ├── notes
│ ├── note_comments
│ ├── note_comments_text
│ ├── users
│ └── countries
└── Schema: dwh # Data Warehouse (managed by this repo)
├── facts # Fact table
├── dimension_* # Dimension tables
└── datamart_* # Datamart tables
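Because both schemas live in the same database, a single psql session can read the base tables and the warehouse tables side by side. A minimal sanity check, using only names from the layout above:
-- Compare base notes (public schema) with fact rows (dwh schema)
SELECT (SELECT COUNT(*) FROM public.notes) AS base_notes,
       (SELECT COUNT(*) FROM dwh.facts)    AS dwh_fact_rows;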
This guide walks you through the complete process from scratch to having exportable JSON datamarts.
1. Base Data (notes) → 2. ETL/DWH (facts, dims) → 3. Datamarts (aggregations) → 4. JSON Export (web viewer)
git clone https://github.com/OSMLatam/OSM-Notes-Analytics.git
cd OSM-Notes-Analytics
Edit etc/properties.sh with your database credentials:
# Database configuration
DBNAME="osm_notes" # Same database as Ingestion
DB_USER="myuser"
First, ensure you have the base OSM notes data:
psql -d osm_notes -c "SELECT COUNT(*) FROM notes"
psql -d osm_notes -c "SELECT COUNT(*) FROM note_comments"
psql -d osm_notes -c "SELECT COUNT(*) FROM note_comments_text"
If these tables are empty or don't exist, you need to load the OSM notes data first using the OSM-Notes-Ingestion system.
The ETL creates the data warehouse (schema dwh) with:
- Fact tables (partitioned by year for optimal performance)
- Dimension tables (users, countries, dates, etc.)
- All necessary transformations
- Automatically updates datamarts
Initial Load (first time, complete data):
cd bin/dwh
./ETL.sh
Incremental Update (regular operations, new data only):
cd bin/dwh
./ETL.sh --incremental
This process can take several hours for the initial load and 15-30 minutes for incremental updates.
What the ETL does automatically:
- Creates schema dwh with all tables
- Creates automatic partitions for the facts table (2013-2025+)
- Populates dimension tables
- Loads facts from note_comments
- Creates indexes and constraints
- Updates datamarts (countries and users)
- Creates specialized views for hashtag analytics
- Calculates automation levels for users
- Updates experience levels for users
- Creates note activity metrics (comment counts, reopenings)
Check that the data warehouse was created:
# Check schema exists
psql -d osm_notes -c "SELECT schema_name FROM information_schema.schemata WHERE schema_name = 'dwh'"
# Check tables exist
psql -d osm_notes -c "SELECT tablename FROM pg_tables WHERE schemaname = 'dwh' ORDER BY tablename"
# Check fact counts
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.facts"
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.dimension_users"
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.dimension_countries"
✅ Datamarts are automatically updated during ETL execution. No manual step needed!
The datamarts aggregate data for quick access and are automatically populated after ETL completes.
Manual Update (only if needed):
# Update users datamart
cd bin/dwh/datamartUsers
./datamartUsers.sh
# Update countries datamart
cd bin/dwh/datamartCountries
./datamartCountries.sh
Note: Datamarts process incrementally (only modified entities) for optimal performance.
# Check datamart tables exist
psql -d osm_notes -c "SELECT tablename FROM pg_tables WHERE schemaname = 'dwh' AND tablename LIKE 'datamart%'"
# Check counts
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.datamartusers"
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.datamartcountries"
# View sample user
psql -d osm_notes -c "SELECT user_id, username, history_whole_open, history_whole_closed FROM dwh.datamartusers LIMIT 5"
# View sample country
psql -d osm_notes -c "SELECT country_id, country_name_en, history_whole_open, history_whole_closed FROM dwh.datamartcountries LIMIT 5"
Once datamarts are populated, export to JSON for the web viewer:
cd bin/dwh
./exportDatamartsToJSON.sh
This creates JSON files in ./output/json/:
- Individual files per user: users/{user_id}.json
- Individual files per country: countries/{country_id}.json
- Index files: indexes/users.json, indexes/countries.json
- Metadata: metadata.json
Export Features:
- Atomic writes: Files generated in temporary directory, validated, then moved atomically
- Schema validation: Each JSON file validated against schemas before export
- Fail-safe: On validation failure, keeps existing files and logs error
- No partial updates: Either all files are valid and moved, or none
See JSON Export Documentation and Atomic Validation Export for complete details.
For ongoing updates, run these in sequence:
# 1. Update base data (your OSM notes import process)
# 2. Update DWH
cd bin/dwh
./ETL.sh --incremental
# 3. Update datamarts
cd datamartUsers
./datamartUsers.sh
cd ../datamartCountries
./datamartCountries.sh
# 4. Export JSON (optional)
cd ..
./exportDatamartsToJSON.sh
# Note: The export script validates all JSON files before moving them to the final destination.
# If validation fails, it keeps existing files and exits with an error, ensuring data integrity.For automated analytics updates:
# Update ETL every hour (after ingestion completes)
0 * * * * ~/OSM-Notes-Analytics/bin/dwh/ETL.sh --incremental
# Update country datamart daily
0 2 * * * ~/OSM-Notes-Analytics/bin/dwh/datamartCountries/datamartCountries.sh
# Update user datamart daily (processes 500 users per run)
30 2 * * * ~/OSM-Notes-Analytics/bin/dwh/datamartUsers/datamartUsers.sh
# Export to JSON and push to GitHub Pages (daily at 02:45, after the datamarts update)
# This script exports JSON files and automatically deploys them to GitHub Pages
45 2 * * * ~/OSM-Notes-Analytics/bin/dwh/exportAndPushToGitHub.sh
For a complete automated pipeline that includes JSON export with validation:
# Create wrapper script: /opt/osm-analytics/update-and-export.sh
#!/bin/bash
cd /opt/osm-analytics/OSM-Notes-Analytics
# ETL incremental update
./bin/dwh/ETL.sh --incremental || exit 1
# Update datamarts
./bin/dwh/datamartUsers/datamartUsers.sh || exit 1
./bin/dwh/datamartCountries/datamartCountries.sh || exit 1
# Export to JSON and push to GitHub Pages
# The script exports JSON files and automatically deploys them to GitHub Pages
./bin/dwh/exportAndPushToGitHub.sh || exit 1
# If we get here, all files are valid and exported
echo "SUCCESS: All exports validated and moved to destination"Then schedule this wrapper:
# Run complete pipeline every 15 minutes
*/15 * * * * /opt/osm-analytics/update-and-export.sh >> /var/log/osm-analytics.log 2>&1
Key features of JSON export:
- ✅ Atomic writes: Files are generated in temporary directory first
- ✅ Schema validation: Each JSON file is validated before final export
- ✅ Fail-safe: On validation failure, keeps existing files and exits with error
- ✅ No partial updates: Either all files are valid and moved, or none
OSM-Notes-Analytics/
├── bin/ # Executable scripts
│ ├── dwh/ # ETL and datamart scripts
│ │ ├── ETL.sh # Main ETL process
│ │ ├── profile.sh # Profile generator
│ │ ├── cleanupDWH.sh # Data warehouse cleanup script
│ │ ├── README.md # DWH scripts documentation
│ │ ├── datamartCountries/
│ │ │ └── datamartCountries.sh
│ │ └── datamartUsers/
│ │ └── datamartUsers.sh
│ └── README.md # Scripts documentation
├── etc/ # Configuration files
│ ├── properties.sh # Database configuration
│ ├── etl.properties # ETL configuration
│ └── README.md # Configuration documentation
├── sql/ # SQL scripts
│ ├── dwh/ # DWH DDL and procedures
│ │ ├── ETL_*.sql # ETL scripts
│ │ ├── Staging_*.sql # Staging procedures
│ │ ├── datamartCountries/ # Country datamart SQL
│ │ └── datamartUsers/ # User datamart SQL
│ └── README.md # SQL documentation
├── scripts/ # Utility scripts
│ ├── install-hooks.sh # Git hooks installer
│ ├── setup_analytics.sh # Initial setup script
│ ├── validate-all.sh # Validation script
│ └── README.md # Scripts documentation
├── tests/ # Test suites
│ ├── unit/ # Unit tests
│ │ ├── bash/ # Bash script tests
│ │ └── sql/ # SQL tests
│ ├── integration/ # Integration tests
│ ├── run_all_tests.sh # Run all tests
│ ├── run_dwh_tests.sh # Run DWH tests
│ ├── run_quality_tests.sh # Run quality tests
│ └── README.md # Testing documentation
├── docs/ # Documentation
│ ├── DWH_Star_Schema_ERD.md # Star schema diagrams
│ ├── DWH_Star_Schema_Data_Dictionary.md # Data dictionary
│ ├── ETL_Enhanced_Features.md # ETL features
│ ├── CI_CD_Guide.md # CI/CD workflows
│ └── README.md # Documentation index
└── lib/ # Shared libraries
├── osm-common/ # Common OSM utilities
│ ├── bash_logger.sh
│ ├── commonFunctions.sh
│ ├── validationFunctions.sh
│ ├── errorHandlingFunctions.sh
│ └── consolidatedValidationFunctions.sh
└── README.md # Library documentation
./bin/dwh/ETL.sh --create
Creates the complete data warehouse from scratch, including all dimensions and facts.
./bin/dwh/ETL.sh --incremental
Processes only new data since the last ETL run. Use this for scheduled updates.
- dwh.facts: Central fact table containing note actions and metrics
  - Partitioned by year (action_at) for optimal performance
  - Automatic partition creation for current and future years
  - Each year stored in a separate partition (e.g., facts_2024, facts_2025)
  - 10-50x faster queries when filtering by date
- dimension_users: User information with SCD2 support (see the lookup sketch after this list)
- dimension_countries: Countries with ISO codes and regions
- dimension_regions: Geographic regions
- dimension_continents: Continental grouping
- dimension_days: Date dimension with enhanced attributes
- dimension_time_of_week: Hour of week with period of day
- dimension_applications: Applications used to create notes
- dimension_application_versions: Application version tracking
- dimension_hashtags: Hashtags found in notes
- dimension_timezones: Timezone information
- dimension_seasons: Seasons based on date and latitude
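Since dimension_users is kept as SCD2, a username change adds a new row instead of overwriting the old one. A hedged point-in-time lookup sketch follows; the valid_from/valid_to (and user_id/username) column names are assumptions for illustration only — see the Data Dictionary for the actual SCD2 columns:
-- Hedged sketch: which username row was current on a given date?
-- valid_from / valid_to are assumed column names, not confirmed by this repository.
SELECT u.username
FROM dwh.dimension_users u
WHERE u.user_id = 12345
  AND u.valid_from <= DATE '2024-06-01'
  AND (u.valid_to IS NULL OR u.valid_to > DATE '2024-06-01');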
- datamart_countries: Pre-computed country analytics (70+ metrics) — see the example query after this list
  - Historical metrics: notes opened/closed by country
  - Resolution metrics: avg/median days to resolution, resolution rate
  - Application statistics: mobile/desktop app usage, most used app
  - Content quality: comment length, URLs, mentions, engagement
  - Community health: active notes, backlog size, age distribution
  - Hashtag analysis: top hashtags, usage patterns
- datamart_users: Pre-computed user analytics (70+ metrics)
  - Historical metrics: notes opened/closed by user
  - Resolution metrics: avg/median days to resolution, resolution rate
  - Application statistics: mobile/desktop app usage
  - Content quality: comment length, URLs, mentions, engagement
  - Community health: active notes, recent activity
  - Automation level: human/automated detection
  - Experience level: beginner to legendary contributor
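Because the datamarts are pre-computed and flat, typical dashboard questions become single-table queries. A minimal example using columns already shown earlier in this guide (country_name_en, history_whole_open, history_whole_closed); the close ratio is derived here purely for illustration:
-- Countries with the most closed notes, plus a simple open-to-closed ratio
SELECT country_name_en,
       history_whole_open,
       history_whole_closed,
       ROUND(history_whole_closed::numeric / NULLIF(history_whole_open, 0), 2) AS close_ratio
FROM dwh.datamartcountries
ORDER BY history_whole_closed DESC
LIMIT 10;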
The dwh.facts table is partitioned by year using the action_at column:
- Automatic partition management: The ETL automatically creates partitions for:
- Current year (always verified)
- Next year (to prevent failures on year transition)
- One additional year ahead (buffer)
- Zero maintenance: No manual intervention needed when the year changes
- Performance benefits:
- 10-50x faster queries when filtering by date
- Only scans relevant year partitions (PostgreSQL partition pruning)
- Faster VACUUM and maintenance operations per partition
- Easy archival: Old year partitions can be detached/archived independently
See docs/partitioning_strategy.md for complete details.
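To confirm that pruning actually happens on your installation, run EXPLAIN on a date-filtered query; the plan should mention only the partitions for the requested year (names follow the facts_<year> pattern described above):
-- The plan should scan facts_2024 only, not every year partition
EXPLAIN
SELECT COUNT(*)
FROM dwh.facts
WHERE action_at >= DATE '2024-01-01'
  AND action_at <  DATE '2025-01-01';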
- ETL Initial Load: ~30 hours (parallel by year since 2013)
- Country Datamart: ~20 minutes
- User Datamart: ~5 days (500 users per run, asynchronous)
- Memory: 4GB+ recommended for ETL
- Disk Space: Depends on notes volume (GB scale)
- CPU: Multi-core recommended for parallel processing
# Quality tests (fast, no database required)
./tests/run_quality_tests.sh
# DWH tests (requires database 'dwh')
./tests/run_dwh_tests.sh
# All tests
./tests/run_all_tests.sh
This project includes comprehensive CI/CD with:
- ✅ GitHub Actions workflows for automated testing
- ✅ Pre-commit hooks for code quality
- ✅ Pre-push hooks for full validation
- ✅ Automated dependency checking
Install git hooks:
./scripts/install-hooks.sh
Full validation:
./scripts/validate-all.sh
See CI/CD Guide for complete documentation.
The ETL process creates detailed logs:
# Follow ETL progress
tail -40f $(ls -1rtd /tmp/ETL_* | tail -1)/ETL.log
# Set log level
export LOG_LEVEL=DEBUG
./bin/dwh/ETL.sh --incremental
Available log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL
The project includes a cleanup script for maintenance and troubleshooting:
# Safe operations (no data loss):
./bin/dwh/cleanupDWH.sh --remove-temp-files # Clean temporary files
./bin/dwh/cleanupDWH.sh --dry-run # Preview operations
# Destructive operations (require confirmation):
./bin/dwh/cleanupDWH.sh # Full cleanup
./bin/dwh/cleanupDWH.sh --remove-all-data # DWH objects only
When to use:
- Development: Clean temporary files regularly
- Testing: Reset environment between test runs
- Troubleshooting: Remove corrupted objects
- Maintenance: Free disk space
Tip: Always preview destructive operations with --dry-run first.
For detailed maintenance procedures, see DWH Maintenance Guide.
Solution: Run ./bin/dwh/ETL.sh --create first to create the data warehouse.
Solution: Run the datamart scripts:
bin/dwh/datamartUsers/datamartUsers.sh
bin/dwh/datamartCountries/datamartCountries.sh
The ETL processes data by year in parallel. Adjust parallelism in etc/properties.sh:
MAX_THREADS=8 # Increase for more cores
Datamarts process entities incrementally (500 at a time). Run the script multiple times:
# Keep running until it says "0 users processed"
while true; do
./datamartUsers.sh
sleep 5
done
Or check the modified flag:
SELECT COUNT(*) FROM dwh.dimension_users WHERE modified = TRUE;
When it returns 0, all users are processed.
Ensure datamarts have data:
SELECT COUNT(*) FROM dwh.datamartusers;
SELECT COUNT(*) FROM dwh.datamartcountries;
If counts are 0, re-run the datamart population scripts.
Check that:
- Database connection is configured correctly
- Base tables exist (populated by Ingestion system)
- PostgreSQL is running and accessible
- Increase ETL_MAX_PARALLEL_JOBS in etc/etl.properties
- Adjust ETL_BATCH_SIZE for better throughput
- Run VACUUM ANALYZE on base tables (see the example below)
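For the last item, a quick maintenance pass over the base tables looks like this (run as a role with sufficient privileges):
-- Reclaim dead tuples and refresh planner statistics on the base tables
VACUUM ANALYZE public.notes;
VACUUM ANALYZE public.note_comments;
VACUUM ANALYZE public.note_comments_text;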
Check for orphaned facts (example query):
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.facts f
LEFT JOIN dwh.dimension_countries c ON f.dimension_id_country = c.dimension_country_id
WHERE c.dimension_country_id IS NULL;"
This analytics system depends on the OSM-Notes-Ingestion system:
- Ingestion (OSM-Notes-Ingestion)
  - Downloads notes from OSM Planet and API
  - Populates base tables: notes, note_comments, note_comments_text
  - Manages WMS layer publication
- Analytics (this repository)
  - Reads from base tables
  - Transforms data into star schema
  - Generates datamarts and profiles
Deployment Order:
- Deploy and run Ingestion system first
- Wait for base tables to be populated
- Deploy and run Analytics system
- DWH Star Schema ERD: Entity-relationship diagram
- Data Dictionary: Complete schema documentation
- ETL Enhanced Features: Advanced ETL capabilities
- CI/CD Guide: CI/CD workflows and git hooks
- Testing Guide: Complete testing documentation
# Database configuration
DBNAME="osm_notes"
DB_USER="myuser"
# Processing configuration
LOOP_SIZE="10000"
MAX_THREADS="4"# Performance
ETL_BATCH_SIZE=1000
ETL_PARALLEL_ENABLED=true
ETL_MAX_PARALLEL_JOBS=4
# Resource Control
MAX_MEMORY_USAGE=80
MAX_DISK_USAGE=90
ETL_TIMEOUT=7200
# Recovery
ETL_RECOVERY_ENABLED=true
# Validation
ETL_VALIDATE_INTEGRITY=true
See CONTRIBUTING.md for guidelines on contributing to this project.
See LICENSE for license information.
- Andres Gomez (@AngocA): Main developer
- Jose Luis Ceron Sarria: Architecture design and infrastructure
- OSM Community: For the valuable notes data
This analytics project is part of a larger ecosystem for OSM Notes analysis:
OSMLatam/
├── OSM-Notes-Ingestion/ # Data ingestion from OSM API/Planet
├── OSM-Notes-Analytics/ # Data Warehouse & ETL (this repository)
├── OSM-Notes-Viewer/ # Web frontend visualization
└── osm-common/ # Shared Bash libraries
┌─────────────────────────────────────┐
│ OSM-Notes-Ingestion │
│ - Downloads OSM notes │
│ - Populates base tables │
│ - Publishes WMS layer │
└──────────────┬────────────────────────┘
│ feeds data
▼
┌─────────────────────────────────────┐
│ OSM-Notes-Analytics │
│ - ETL process (this repo) │
│ - Star schema data warehouse │
│ - Datamarts for analytics │
│ - Export to JSON │
└──────────────┬────────────────────────┘
│ exports JSON data
▼
┌─────────────────────────────────────┐
│ OSM-Notes-Viewer │
│ - Web dashboard │
│ - Interactive visualizations │
│ - User & country profiles │
└──────────────────────────────────────┘
┌─────────────────────┐
│ osm-common/ │
│ - Shared libraries │
│ - bash_logger.sh │
│ - validation.sh │
└─────────────────────┘
↑ used by all
- OSM-Notes-Ingestion:
  - Downloads notes from OSM Planet and API
  - Populates base tables: notes, note_comments, users, countries
  - Publishes WMS layer for mapping applications
  - Deploy this FIRST - analytics needs base data
- OSM-Notes-Viewer (not yet publicly available):
  - Web frontend for visualizing analytics
  - Interactive dashboards
  - User and country profiles
  - Reads JSON exports from this analytics system
- osm-common (shared):
  - Common Bash libraries
  - Used by all three repositories
  - Logging, validation, error handling
- OpenStreetMap: https://www.openstreetmap.org
- Create an issue in this repository
- Check the documentation in the docs/ directory
- Review logs for error messages
The following major enhancements have been implemented:
- 21 new metrics added to both datamartCountries and datamartUsers
- Resolution tracking: Days to resolution, resolution rates
- Application analytics: Mobile vs desktop usage, most popular apps
- Content quality: Comment analysis, URL/mention detection
- Community health: Active notes, backlog, temporal patterns
- 88+ new automated tests added to validate all new metrics
- Automation detection: Identifies bot/automated notes vs human
- Experience levels: Classifies users from newcomer to legendary
- Note activity metrics: Tracks accumulated comments and reopenings
- Hashtag bridge table: Supports unlimited hashtags per note
- Partitioned facts table: 10-50x faster date-based queries
- Specialized indexes: Optimized for common query patterns
- Automated maintenance: VACUUM and ANALYZE on partitions
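As an illustration of the hashtag bridge table mentioned above, a many-to-many join lets a single note carry any number of hashtags. This is only a hedged sketch: dwh.dimension_hashtags is documented in this repository, but the bridge table name (fact_hashtags) and all column names below are assumptions — check the Data Dictionary for the real ones:
-- Hedged sketch: top hashtags by number of associated fact rows.
-- fact_hashtags and every column name here are assumed for illustration.
SELECT h.hashtag_name,
       COUNT(*) AS occurrences
FROM dwh.fact_hashtags fh
JOIN dwh.dimension_hashtags h
  ON fh.dimension_hashtag_id = h.dimension_hashtag_id
GROUP BY h.hashtag_name
ORDER BY occurrences DESC
LIMIT 10;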
See docs/DASHBOARD_ANALYSIS.md for complete details on available metrics.
Current Version: 2025-10-26