OSM-Notes-Analytics


Data Warehouse, ETL, and Analytics for OpenStreetMap Notes

Overview

This repository contains the analytics and data warehouse components for the OSM Notes profiling system. It provides ETL (Extract, Transform, Load) processes, a star schema data warehouse, and datamarts for analyzing OSM notes data.

Features

  • Star Schema Data Warehouse: Comprehensive dimensional model for notes analysis
  • Enhanced ETL Process: Robust ETL with recovery, validation, and monitoring
  • Partitioned Facts Table: Automatic partitioning by year (2013-2025+)
  • Country Datamart: Pre-computed analytics by country (70+ metrics)
  • User Datamart: Pre-computed analytics by user (70+ metrics)
  • Profile Generator: Generate country and user profiles
  • Advanced Dimensions:
    • Timezones for local time analysis
    • Seasons based on date and latitude
    • Continents for geographical grouping
    • Application versions tracking
    • SCD2 for username changes (see the example after this list)
    • Automation level detection
    • User experience levels
    • Hashtags (unlimited via bridge table)
  • Enhanced Metrics (October 2025):
    • Resolution metrics: avg/median days to resolution, resolution rate
    • Application statistics: mobile/desktop apps count, most used app
    • Content quality: comment length, URL/mention detection, engagement metrics
    • Community health: active notes, backlog, age distribution, recent activity
  • Comprehensive Testing: 168+ automated tests (90%+ function coverage)
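
The SCD2 entry above (slowly changing dimension, type 2) means dimension_users keeps one row per username period instead of overwriting names. A conceptual lookup might look like the following; the valid_from/valid_to column names and the user id are illustrative assumptions, so check docs/DWH_Star_Schema_Data_Dictionary.md for the actual definitions:

# Illustrative SCD2 lookup: all historical usernames for one OSM user
# (12345 is a placeholder user id; column names may differ)
psql -d osm_notes -c "SELECT username, valid_from, valid_to
  FROM dwh.dimension_users
  WHERE user_id = 12345
  ORDER BY valid_from;"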

Prerequisites

  • PostgreSQL 12 or higher
  • PostGIS 3.0 or higher
  • Bash 4.0 or higher
  • OSM Notes Ingestion Database: This analytics system reads from the base notes tables populated by the OSM-Notes-Ingestion system
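
A quick, optional way to confirm the installed versions (the PostGIS check assumes the extension is already installed in the target database):

psql --version
bash --version | head -1
psql -d osm_notes -c "SELECT postgis_version();"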

Database Architecture

This system uses a shared database approach with separate schemas:

Database: osm_notes
├── Schema: public          # Base tables (managed by Ingestion repo)
│   ├── notes
│   ├── note_comments
│   ├── note_comments_text
│   ├── users
│   └── countries
└── Schema: dwh             # Data Warehouse (managed by this repo)
    ├── facts               # Fact table
    ├── dimension_*         # Dimension tables
    └── datamart_*          # Datamart tables
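
To see this shared-database layout in practice, the following illustrative query (using the osm_notes database name shown above) lists the tables of both schemas side by side:

psql -d osm_notes -c "SELECT schemaname, tablename FROM pg_tables
  WHERE schemaname IN ('public', 'dwh') ORDER BY 1, 2;"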

Quick Start

This guide walks you through the complete process from scratch to having exportable JSON datamarts.

Process Overview

1. Base Data       → 2. ETL/DWH        → 3. Datamarts      → 4. JSON Export
   (notes)            (facts, dims)       (aggregations)      (web viewer)

Step-by-Step Instructions

Step 1: Clone the Repository

git clone https://github.com/OSMLatam/OSM-Notes-Analytics.git
cd OSM-Notes-Analytics

Step 2: Configure Database Connection

Edit etc/properties.sh with your database credentials:

# Database configuration
DBNAME="osm_notes"          # Same database as Ingestion
DB_USER="myuser"
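
Before moving on, you can verify that these credentials actually connect. This sketch assumes etc/properties.sh only sets shell variables and that authentication (for example .pgpass or peer auth) is already configured:

# Quick connectivity check using the configured values
source etc/properties.sh
psql -d "${DBNAME}" -U "${DB_USER}" -c "SELECT current_database(), current_user;"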

Step 3: Verify Base Tables

First, ensure you have the base OSM notes data:

psql -d notes -c "SELECT COUNT(*) FROM notes"
psql -d notes -c "SELECT COUNT(*) FROM note_comments"
psql -d notes -c "SELECT COUNT(*) FROM note_comments_text"

If these tables are empty or don't exist, you need to load the OSM notes data first using the OSM-Notes-Ingestion system.

Step 4: Run the ETL Process

The ETL creates the data warehouse (schema dwh) with:

  • Fact tables (partitioned by year for optimal performance)
  • Dimension tables (users, countries, dates, etc.)
  • All necessary transformations
  • Automatically updates datamarts

Initial Load (first time, complete data):

cd bin/dwh
./ETL.sh --create

Incremental Update (regular operations, new data only):

cd bin/dwh
./ETL.sh --incremental

The initial load can take a long time (roughly 30 hours; see Initial Load Times below), while incremental updates typically take 15-30 minutes.

What the ETL does automatically:

  • Creates schema dwh with all tables
  • Creates automatic partitions for the facts table (2013-2025+; see the check below)
  • Populates dimension tables
  • Loads facts from note_comments
  • Creates indexes and constraints
  • Updates datamarts (countries and users)
  • Creates specialized views for hashtag analytics
  • Calculates automation levels for users
  • Updates experience levels for users
  • Creates note activity metrics (comment counts, reopenings)
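
Once the ETL has run, you can confirm that the yearly partitions exist with a standard catalog query (illustrative; it assumes the partitioned table is dwh.facts as described above):

psql -d osm_notes -c "SELECT inhrelid::regclass AS partition
  FROM pg_inherits
  WHERE inhparent = 'dwh.facts'::regclass
  ORDER BY 1;"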

Step 5: Verify DWH Creation

Check that the data warehouse was created:

# Check that the schema exists
psql -d osm_notes -c "SELECT schema_name FROM information_schema.schemata WHERE schema_name = 'dwh'"

# Check that the tables exist
psql -d osm_notes -c "SELECT tablename FROM pg_tables WHERE schemaname = 'dwh' ORDER BY tablename"

# Check row counts
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.facts"
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.dimension_users"
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.dimension_countries"

Step 6: Create and Populate Datamarts

✅ Datamarts are automatically updated during ETL execution. No manual step needed!

The datamarts aggregate data for quick access and are automatically populated after ETL completes.

Manual Update (only if needed):

# Update users datamart
cd bin/dwh/datamartUsers
./datamartUsers.sh

# Update countries datamart
cd bin/dwh/datamartCountries
./datamartCountries.sh

Note: Datamarts process incrementally (only modified entities) for optimal performance.

Step 7: Verify Datamart Creation

# Check that the datamart tables exist
psql -d osm_notes -c "SELECT tablename FROM pg_tables WHERE schemaname = 'dwh' AND tablename LIKE 'datamart%'"

# Check counts
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.datamartusers"
psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.datamartcountries"

# View sample users
psql -d osm_notes -c "SELECT user_id, username, history_whole_open, history_whole_closed FROM dwh.datamartusers LIMIT 5"

# View sample countries
psql -d osm_notes -c "SELECT country_id, country_name_en, history_whole_open, history_whole_closed FROM dwh.datamartcountries LIMIT 5"

Step 8: Export to JSON (Optional)

Once datamarts are populated, export to JSON for the web viewer:

cd bin/dwh
./exportDatamartsToJSON.sh

This creates JSON files in ./output/json/:

  • Individual files per user: users/{user_id}.json
  • Individual files per country: countries/{country_id}.json
  • Index files: indexes/users.json, indexes/countries.json
  • Metadata: metadata.json

Export Features:

  • Atomic writes: Files generated in temporary directory, validated, then moved atomically
  • Schema validation: Each JSON file validated against schemas before export
  • Fail-safe: On validation failure, keeps existing files and logs error
  • No partial updates: Either all files are valid and moved, or none

See JSON Export Documentation and Atomic Validation Export for complete details.
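
The pattern behind these guarantees can be sketched roughly as follows. This is a simplified illustration, not the actual exportDatamartsToJSON.sh: generate_json_files is a hypothetical placeholder, and jq empty only checks well-formedness, whereas the real script validates against JSON schemas.

#!/bin/bash
# Sketch of the "generate, validate, then swap atomically" pattern described above.
set -euo pipefail
shopt -s globstar nullglob

TMP_DIR=$(mktemp -d)
OUT_DIR="./output/json"

generate_json_files "${TMP_DIR}"   # hypothetical: writes users/, countries/, indexes/, metadata.json

# Validate every generated file; abort (keeping the existing export) on the first failure
for f in "${TMP_DIR}"/**/*.json; do
  jq empty "${f}" || { echo "Validation failed: ${f}" >&2; exit 1; }
done

# Swap the whole directory in one move so readers never see a partial export
mkdir -p "$(dirname "${OUT_DIR}")"
rm -rf "${OUT_DIR}.old"
if [ -d "${OUT_DIR}" ]; then
  mv "${OUT_DIR}" "${OUT_DIR}.old"
fi
mv "${TMP_DIR}" "${OUT_DIR}"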

Incremental Updates

For ongoing updates, run these in sequence:

# 1. Update base data (your OSM notes import process)

# 2. Update DWH
cd bin/dwh
./ETL.sh --incremental

# 3. Update datamarts
cd datamartUsers
./datamartUsers.sh
cd ../datamartCountries
./datamartCountries.sh

# 4. Export JSON (optional)
cd ..
./exportDatamartsToJSON.sh

# Note: The export script validates all JSON files before moving them to the final destination.
# If validation fails, it keeps existing files and exits with an error, ensuring data integrity.

Scheduling with Cron

For automated analytics updates:

# Update ETL every hour (after ingestion completes)
0 * * * * ~/OSM-Notes-Analytics/bin/dwh/ETL.sh --incremental

# Update country datamart daily
0 2 * * * ~/OSM-Notes-Analytics/bin/dwh/datamartCountries/datamartCountries.sh

# Update user datamart daily (processes 500 users per run)
30 2 * * * ~/OSM-Notes-Analytics/bin/dwh/datamartUsers/datamartUsers.sh

# Export to JSON and push to GitHub Pages daily at 02:45, after the datamarts update
# This script exports JSON files and automatically deploys them to GitHub Pages
45 2 * * * ~/OSM-Notes-Analytics/bin/dwh/exportAndPushToGitHub.sh

Complete Workflow with JSON Export

For a complete automated pipeline that includes JSON export with validation:

# Create wrapper script: /opt/osm-analytics/update-and-export.sh
#!/bin/bash
cd /opt/osm-analytics/OSM-Notes-Analytics

# ETL incremental update
./bin/dwh/ETL.sh --incremental || exit 1

# Update datamarts
./bin/dwh/datamartUsers/datamartUsers.sh || exit 1
./bin/dwh/datamartCountries/datamartCountries.sh || exit 1

# Export to JSON and push to GitHub Pages
# The script exports JSON files and automatically deploys them to GitHub Pages
./bin/dwh/exportAndPushToGitHub.sh || exit 1

# If we get here, all files are valid and exported
echo "SUCCESS: All exports validated and moved to destination"

Then schedule this wrapper:

# Run complete pipeline every 15 minutes
*/15 * * * * /opt/osm-analytics/update-and-export.sh >> /var/log/osm-analytics.log 2>&1

Key features of JSON export:

  • ✅ Atomic writes: Files are generated in temporary directory first
  • ✅ Schema validation: Each JSON file is validated before final export
  • ✅ Fail-safe: On validation failure, keeps existing files and exits with error
  • ✅ No partial updates: Either all files are valid and moved, or none

Directory Structure

OSM-Notes-Analytics/
├── bin/                    # Executable scripts
│   ├── dwh/               # ETL and datamart scripts
│   │   ├── ETL.sh         # Main ETL process
│   │   ├── profile.sh     # Profile generator
│   │   ├── cleanupDWH.sh  # Data warehouse cleanup script
│   │   ├── README.md      # DWH scripts documentation
│   │   ├── datamartCountries/
│   │   │   └── datamartCountries.sh
│   │   └── datamartUsers/
│   │       └── datamartUsers.sh
│   └── README.md          # Scripts documentation
├── etc/                    # Configuration files
│   ├── properties.sh      # Database configuration
│   ├── etl.properties     # ETL configuration
│   └── README.md          # Configuration documentation
├── sql/                    # SQL scripts
│   ├── dwh/               # DWH DDL and procedures
│   │   ├── ETL_*.sql      # ETL scripts
│   │   ├── Staging_*.sql  # Staging procedures
│   │   ├── datamartCountries/  # Country datamart SQL
│   │   └── datamartUsers/      # User datamart SQL
│   └── README.md          # SQL documentation
├── scripts/                # Utility scripts
│   ├── install-hooks.sh   # Git hooks installer
│   ├── setup_analytics.sh # Initial setup script
│   ├── validate-all.sh    # Validation script
│   └── README.md          # Scripts documentation
├── tests/                  # Test suites
│   ├── unit/              # Unit tests
│   │   ├── bash/          # Bash script tests
│   │   └── sql/           # SQL tests
│   ├── integration/       # Integration tests
│   ├── run_all_tests.sh   # Run all tests
│   ├── run_dwh_tests.sh   # Run DWH tests
│   ├── run_quality_tests.sh  # Run quality tests
│   └── README.md          # Testing documentation
├── docs/                   # Documentation
│   ├── DWH_Star_Schema_ERD.md           # Star schema diagrams
│   ├── DWH_Star_Schema_Data_Dictionary.md  # Data dictionary
│   ├── ETL_Enhanced_Features.md         # ETL features
│   ├── CI_CD_Guide.md                   # CI/CD workflows
│   └── README.md                        # Documentation index
└── lib/                    # Shared libraries
    ├── osm-common/         # Common OSM utilities
    │   ├── bash_logger.sh
    │   ├── commonFunctions.sh
    │   ├── validationFunctions.sh
    │   ├── errorHandlingFunctions.sh
    │   └── consolidatedValidationFunctions.sh
    └── README.md          # Library documentation

ETL Execution Modes

Create Mode (Initial Setup)

./bin/dwh/ETL.sh --create

Creates the complete data warehouse from scratch, including all dimensions and facts.

Incremental Mode (Regular Updates)

./bin/dwh/ETL.sh --incremental

Processes only new data since the last ETL run. Use this for scheduled updates.

Data Warehouse Schema

Fact Table

  • dwh.facts: Central fact table containing note actions and metrics
    • Partitioned by year (action_at) for optimal performance
    • Automatic partition creation for current and future years
    • Each year stored in separate partition (e.g., facts_2024, facts_2025)
    • 10-50x faster queries when filtering by date

Dimension Tables

  • dimension_users: User information with SCD2 support
  • dimension_countries: Countries with ISO codes and regions
  • dimension_regions: Geographic regions
  • dimension_continents: Continental grouping
  • dimension_days: Date dimension with enhanced attributes
  • dimension_time_of_week: Hour of week with period of day
  • dimension_applications: Applications used to create notes
  • dimension_application_versions: Application version tracking
  • dimension_hashtags: Hashtags found in notes
  • dimension_timezones: Timezone information
  • dimension_seasons: Seasons based on date and latitude
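
To illustrate how the facts table is typically joined to these dimensions, here is a sketch of a yearly per-country activity query. The join columns follow the names used elsewhere in this README (dimension_id_country, dimension_country_id); country_name_en is assumed to exist on the dimension as it does on the datamart, so verify the names against the data dictionary:

psql -d osm_notes -c "SELECT c.country_name_en,
       EXTRACT(YEAR FROM f.action_at) AS year,
       COUNT(*) AS note_actions
  FROM dwh.facts f
  JOIN dwh.dimension_countries c
    ON f.dimension_id_country = c.dimension_country_id
 GROUP BY 1, 2
 ORDER BY note_actions DESC
 LIMIT 10;"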

Datamart Tables

  • datamart_countries: Pre-computed country analytics (70+ metrics)

    • Historical metrics: notes opened/closed by country
    • Resolution metrics: avg/median days to resolution, resolution rate
    • Application statistics: mobile/desktop app usage, most used app
    • Content quality: comment length, URLs, mentions, engagement
    • Community health: active notes, backlog size, age distribution
    • Hashtag analysis: top hashtags, usage patterns
  • datamart_users: Pre-computed user analytics (70+ metrics)

    • Historical metrics: notes opened/closed by user
    • Resolution metrics: avg/median days to resolution, resolution rate
    • Application statistics: mobile/desktop app usage
    • Content quality: comment length, URLs, mentions, engagement
    • Community health: active notes, recent activity
    • Automation level: human/automated detection
    • Experience level: beginner to legendary contributor

Performance Considerations

Table Partitioning

The dwh.facts table is partitioned by year using the action_at column:

  • Automatic partition management: The ETL automatically creates partitions for:
    • Current year (always verified)
    • Next year (to prevent failures on year transition)
    • One additional year ahead (buffer)
  • Zero maintenance: No manual intervention needed when the year changes
  • Performance benefits:
    • 10-50x faster queries when filtering by date
    • Only scans relevant year partitions (PostgreSQL partition pruning)
    • Faster VACUUM and maintenance operations per partition
  • Easy archival: Old year partitions can be detached/archived independently

See docs/partitioning_strategy.md for complete details.
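
To see partition pruning in action, an EXPLAIN on a date-filtered query should list only the matching yearly partition (for example facts_2024); this is an illustrative check, not a required step:

psql -d osm_notes -c "EXPLAIN (COSTS OFF)
  SELECT COUNT(*) FROM dwh.facts
  WHERE action_at >= '2024-01-01' AND action_at < '2025-01-01';"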

Initial Load Times

  • ETL Initial Load: ~30 hours (parallel by year since 2013)
  • Country Datamart: ~20 minutes
  • User Datamart: ~5 days (500 users per run, asynchronous)

Resource Requirements

  • Memory: 4GB+ recommended for ETL
  • Disk Space: Depends on notes volume (GB scale)
  • CPU: Multi-core recommended for parallel processing

Testing

Quick Start Testing

# Quality tests (fast, no database required)
./tests/run_quality_tests.sh

# DWH tests (requires database 'dwh')
./tests/run_dwh_tests.sh

# All tests
./tests/run_all_tests.sh

CI/CD Integration

This project includes comprehensive CI/CD with:

  • ✅ GitHub Actions workflows for automated testing
  • ✅ Pre-commit hooks for code quality
  • ✅ Pre-push hooks for full validation
  • ✅ Automated dependency checking

Install git hooks:

./scripts/install-hooks.sh

Full validation:

./scripts/validate-all.sh

See CI/CD Guide for complete documentation.

Logging

The ETL process creates detailed logs:

# Follow ETL progress
tail -40f $(ls -1rtd /tmp/ETL_* | tail -1)/ETL.log

# Set log level
export LOG_LEVEL=DEBUG
./bin/dwh/ETL.sh --incremental

Available log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL

Maintenance and Cleanup

Data Warehouse Cleanup Script

The project includes a cleanup script for maintenance and troubleshooting:

# Safe operations (no data loss):
./bin/dwh/cleanupDWH.sh --remove-temp-files    # Clean temporary files
./bin/dwh/cleanupDWH.sh --dry-run              # Preview operations

# Destructive operations (require confirmation):
./bin/dwh/cleanupDWH.sh                        # Full cleanup
./bin/dwh/cleanupDWH.sh --remove-all-data      # DWH objects only

When to use:

  • Development: Clean temporary files regularly
  • Testing: Reset environment between test runs
  • Troubleshooting: Remove corrupted objects
  • Maintenance: Free disk space

⚠️ Warning: Destructive operations permanently delete data! Always use --dry-run first.

For detailed maintenance procedures, see DWH Maintenance Guide.

Troubleshooting

Common Issues

"Schema 'dwh' does not exist"

Solution: Run ./bin/dwh/ETL.sh --create first to create the data warehouse.

"Table 'dwh.datamartusers' does not exist"

Solution: Run the datamart scripts:

  • bin/dwh/datamartUsers/datamartUsers.sh
  • bin/dwh/datamartCountries/datamartCountries.sh

ETL takes too long

The ETL processes data by year in parallel. Adjust parallelism in etc/properties.sh:

MAX_THREADS=8  # Increase for more cores

Datamart not fully populated

Datamarts process entities incrementally (500 at a time). Run the script multiple times:

# Re-run until the script reports "0 users processed", then stop it (Ctrl-C)
while true; do
  ./datamartUsers.sh
  sleep 5
done
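
If you prefer the loop to stop on its own, a variant like the following works, assuming the script prints a line containing "0 users processed" as implied above (adjust the grep pattern to the script's actual message):

while true; do
  OUTPUT=$(./datamartUsers.sh)
  echo "${OUTPUT}"
  # Stop once the run reports that nothing was left to process
  echo "${OUTPUT}" | grep -q "0 users processed" && break
  sleep 5
done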

Or check the modified flag:

SELECT COUNT(*) FROM dwh.dimension_users WHERE modified = TRUE;

When it returns 0, all users are processed.

JSON export is empty

Ensure datamarts have data:

SELECT COUNT(*) FROM dwh.datamartusers;
SELECT COUNT(*) FROM dwh.datamartcountries;

If counts are 0, re-run the datamart population scripts.

ETL Fails to Start

Check that:

  • Database connection is configured correctly
  • Base tables exist (populated by Ingestion system)
  • PostgreSQL is running and accessible

Performance Issues

  • Increase ETL_MAX_PARALLEL_JOBS in etc/etl.properties
  • Adjust ETL_BATCH_SIZE for better throughput
  • Run VACUUM ANALYZE on base tables (example below)
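
For example, a maintenance pass on the base tables listed in the Database Architecture section (adjust the table list to your setup) could look like this:

psql -d osm_notes -c "VACUUM ANALYZE notes;"
psql -d osm_notes -c "VACUUM ANALYZE note_comments;"
psql -d osm_notes -c "VACUUM ANALYZE note_comments_text;"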

Data Integrity Issues

Check for orphaned facts (example query):

psql -d osm_notes -c "SELECT COUNT(*) FROM dwh.facts f
LEFT JOIN dwh.dimension_countries c ON f.dimension_id_country = c.dimension_country_id
WHERE c.dimension_country_id IS NULL;"

Integration with Ingestion System

This analytics system depends on the OSM-Notes-Ingestion system:

  1. Ingestion (OSM-Notes-Ingestion)

    • Downloads notes from OSM Planet and API
    • Populates base tables: notes, note_comments, note_comments_text
    • Manages WMS layer publication
  2. Analytics (this repository)

    • Reads from base tables
    • Transforms data into star schema
    • Generates datamarts and profiles

Deployment Order:

  1. Deploy and run Ingestion system first
  2. Wait for base tables to be populated
  3. Deploy and run Analytics system

Documentation

Detailed guides are available in the docs/ directory (see Directory Structure above).

Configuration

Database Configuration (etc/properties.sh)

# Database configuration
DBNAME="osm_notes"
DB_USER="myuser"

# Processing configuration
LOOP_SIZE="10000"
MAX_THREADS="4"

ETL Configuration (etc/etl.properties)

# Performance
ETL_BATCH_SIZE=1000
ETL_PARALLEL_ENABLED=true
ETL_MAX_PARALLEL_JOBS=4

# Resource Control
MAX_MEMORY_USAGE=80
MAX_DISK_USAGE=90
ETL_TIMEOUT=7200

# Recovery
ETL_RECOVERY_ENABLED=true

# Validation
ETL_VALIDATE_INTEGRITY=true

Contributing

See CONTRIBUTING.md for guidelines on contributing to this project.

License

See LICENSE for license information.

Acknowledgments

  • Andres Gomez (@AngocA): Main developer
  • Jose Luis Ceron Sarria: Architecture design and infrastructure
  • OSM Community: For the valuable notes data

Project Ecosystem

This analytics project is part of a larger ecosystem for OSM Notes analysis:

Repository Structure

OSMLatam/
├── OSM-Notes-Ingestion/     # Data ingestion from OSM API/Planet
├── OSM-Notes-Analytics/     # Data Warehouse & ETL (this repository)
├── OSM-Notes-Viewer/        # Web frontend visualization
└── osm-common/              # Shared Bash libraries

How Projects Work Together

┌─────────────────────────────────────┐
│  OSM-Notes-Ingestion                │
│  - Downloads OSM notes              │
│  - Populates base tables            │
│  - Publishes WMS layer              │
└──────────────┬──────────────────────┘
               │ feeds data
               ▼
┌─────────────────────────────────────┐
│  OSM-Notes-Analytics                │
│  - ETL process (this repo)          │
│  - Star schema data warehouse       │
│  - Datamarts for analytics          │
│  - Export to JSON                   │
└──────────────┬──────────────────────┘
               │ exports JSON data
               ▼
┌─────────────────────────────────────┐
│  OSM-Notes-Viewer                   │
│  - Web dashboard                    │
│  - Interactive visualizations       │
│  - User & country profiles          │
└─────────────────────────────────────┘

       ┌─────────────────────┐
       │  osm-common/        │
       │  - Shared libraries │
       │  - bash_logger.sh   │
       │  - validation.sh    │
       └─────────────────────┘
         ↑ used by all

Related Projects

  • OSM-Notes-Ingestion:

    • Downloads notes from OSM Planet and API
    • Populates base tables: notes, note_comments, users, countries
    • Publishes WMS layer for mapping applications
    • Deploy this FIRST - analytics needs base data
  • OSM-Notes-Viewer (not yet publicly available):

    • Web frontend for visualizing analytics
    • Interactive dashboards
    • User and country profiles
    • Reads JSON exports from this analytics system
  • osm-common (shared):

    • Common Bash libraries
    • Used by all three repositories
    • Logging, validation, error handling
  • OpenStreetMap: https://www.openstreetmap.org

Support

  • Create an issue in this repository
  • Check the documentation in the docs/ directory
  • Review logs for error messages

Recent Enhancements (October 2025)

The following major enhancements have been implemented:

Datamart Enhancements

  • 21 new metrics added to both datamartCountries and datamartUsers
  • Resolution tracking: Days to resolution, resolution rates
  • Application analytics: Mobile vs desktop usage, most popular apps
  • Content quality: Comment analysis, URL/mention detection
  • Community health: Active notes, backlog, temporal patterns
  • 88+ new automated tests added to validate all new metrics

Enhanced Dimensions

  • Automation detection: Identifies bot/automated notes vs human
  • Experience levels: Classifies users from newcomer to legendary
  • Note activity metrics: Tracks accumulated comments and reopenings
  • Hashtag bridge table: Supports unlimited hashtags per note

Performance

  • Partitioned facts table: 10-50x faster date-based queries
  • Specialized indexes: Optimized for common query patterns
  • Automated maintenance: VACUUM and ANALYZE on partitions

See docs/DASHBOARD_ANALYSIS.md for complete details on available metrics.

Version

Current Version: 2025-10-26
