Traceroute Data Analyzer and Anomaly Detector

This project provides a comprehensive framework for analyzing RIPE Atlas traceroute data to detect network performance and routing anomalies. It can process large datasets from local files or a ClickHouse database, establish performance baselines, and compare current data against those baselines to identify significant deviations.

Purpose

The main goal of this project is to provide a powerful and flexible tool for network operators, researchers, and enthusiasts to gain insights from RIPE Atlas traceroute data. By analyzing large-scale measurements, users can:

  • Monitor Network Health: Track key performance indicators (KPIs) like Round-Trip Time (RTT), path length, and packet loss over time.
  • Detect Anomalies: Automatically identify significant changes in network performance or routing behavior that could indicate outages, congestion, or rerouting events.
  • Understand Routing Behavior: Analyze dominant paths and common network segments to understand how traffic flows across the internet.
  • Establish Baselines: Create a statistical snapshot of "normal" network behavior to use as a benchmark for future comparisons.

How It Works

The analysis pipeline follows these high-level steps:

  1. Data Ingestion: Reads traceroute data from local JSON Lines files (plain or compressed) or streams it directly from a ClickHouse database.
  2. Filtering: Narrows down the dataset based on a rich set of user-defined criteria (e.g., source/destination country, IP range, probe tags).
  3. Parsing & Cleaning: Parses the raw data in parallel, cleans it, and structures it for analysis.
  4. Analysis:
    • In baseline mode, it calculates a comprehensive statistical summary of the data to define "normal" behavior.
    • In analyze mode, it compares a new dataset against a previously generated baseline, using statistical tests and thresholding to find anomalies.
  5. Reporting: Generates a detailed JSON report containing all metadata, statistics, and a list of detected anomalies.
  6. Visualization: Creates a variety of plots (histograms, time-series, boxplots) to help visualize the data and the detected anomalies.
  7. Notification: Sends a summary of the run, along with the results file and plots, to a configured Matrix chat room.

Features

  • Flexible Data Ingestion: Process traceroute data from JSON Lines files (.json, .bz2) or stream directly from a ClickHouse database.
  • Two-Phase Analysis:
    • baseline mode: Establishes a statistical baseline of "normal" network behavior from a historical dataset.
    • analyze mode: Compares a current dataset against a previously generated baseline to detect anomalies.
  • Advanced Anomaly Detection:
    • Individual Indicators: Detects changes in performance metrics (RTT, path length, jitter), success/timeout rates, and network topology (path changes, core segment changes).
    • Statistical Distribution Analysis: Uses Kolmogorov-Smirnov and Anderson-Darling tests to identify subtle shifts in the RTT distribution profile (see the sketch after this list).
    • Composite Events: Correlates individual indicators to diagnose higher-level events like "Major Rerouting Event," "Path Instability," and "Performance Profile Shift."
  • Powerful Filtering: Filter measurements by a wide range of source or destination criteria, including country, ASN, IP range, RIPE Atlas probe tags, and geographic location.
  • Comprehensive Output: Generates detailed JSON reports, statistical plots (histograms, scatter plots, etc.), and a full log of the analysis run.
  • Automated Notifications: Integrates with Matrix to send detailed success or failure notifications upon task completion, including attached results and plots.
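
To illustrate the distribution comparison mentioned under Statistical Distribution Analysis above, the following stand-alone sketch runs the same two tests with scipy.stats on synthetic RTT samples. The sample data and the 0.05 significance threshold are purely illustrative; this is not the project's own code.

# Illustrative two-sample distribution comparison, conceptually similar to the
# analyzer's K-S / A-D checks. Sample data and thresholds are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline_rtt = rng.normal(loc=30.0, scale=5.0, size=5000)  # baseline RTT sample (ms)
current_rtt = rng.normal(loc=38.0, scale=7.0, size=1000)   # shifted RTT profile

# Kolmogorov-Smirnov: compares the empirical CDFs of the two samples.
ks_stat, ks_p = stats.ks_2samp(baseline_rtt, current_rtt)

# Anderson-Darling (k-sample variant): more sensitive to differences in the tails.
ad_result = stats.anderson_ksamp([baseline_rtt, current_rtt])

if ks_p < 0.05:  # illustrative significance threshold
    print(f"RTT distribution shift detected (KS={ks_stat:.3f}, p={ks_p:.2e})")
print(f"Anderson-Darling statistic: {ad_result.statistic:.3f}")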

Installation

  1. Clone the repository:

    git clone https://github.com/Gozzim/RIPE-Atlas-Traceroute-Analysis.git
    cd RIPE-Atlas-Traceroute-Analysis
  2. Create a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate
  3. Install the required dependencies:

    pip install -r requirements.txt

    The requirements.txt file contains all necessary Python packages. Key dependencies include:

    • pandas and pyarrow for data manipulation.
    • matplotlib and seaborn for plotting.
    • clickhouse-driver for connecting to ClickHouse.
    • tqdm for progress bars.
    • orjson for fast JSON processing.
    • coloredlogs for enhanced console logging.

Configuration

Matrix Bot Notifications (Optional)

To enable notifications, you need a dedicated bot account on a Matrix homeserver.

  1. Copy the configuration template:
    cp config.ini.example config.ini
  2. Edit config.ini and fill in your bot's details:
    [matrix_bot]
    homeserver = https://matrix-client.matrix.org
    user_id = @my-bot:matrix.org
    password = your_bot_password_here
    room_id = !yourRoomId:matrix.org

ClickHouse Connection (Optional)

The scripts can be configured to connect to a ClickHouse database using environment variables:

export CLICKHOUSE_HOST='127.0.0.1'
export CLICKHOUSE_PORT='9000'
export CLICKHOUSE_DB='atlas'
export CLICKHOUSE_USER='default'
export CLICKHOUSE_PASSWORD='password123'

Alternatively, you can provide these details as command-line arguments.
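
If you want to confirm that the connection details are picked up correctly before starting a long analysis, a minimal check with the clickhouse-driver package (already listed in requirements.txt) could look like this; the fallback defaults simply mirror the example above and are assumptions, not project behaviour:

# Minimal connectivity check using the same environment variables (illustrative only).
import os
from clickhouse_driver import Client

client = Client(
    host=os.environ.get("CLICKHOUSE_HOST", "127.0.0.1"),
    port=int(os.environ.get("CLICKHOUSE_PORT", "9000")),
    database=os.environ.get("CLICKHOUSE_DB", "atlas"),
    user=os.environ.get("CLICKHOUSE_USER", "default"),
    password=os.environ.get("CLICKHOUSE_PASSWORD", ""),
)

# Any cheap query works here; version() avoids depending on the schema.
print(client.execute("SELECT version()"))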

Usage

The project contains three main executable scripts: main.py, scripts/import.py, and scripts/create_schema.py.

1. main.py - The Analyzer

This is the primary script for running an analysis.

Step 1: Create a Baseline

First, run the script in baseline mode on a large, representative dataset to establish a reference for later comparisons. A good baseline covers a period of "normal", stable network behavior.

Example (from files):

# Create a baseline for ICMP traffic to the US, calculating path statistics.
python main.py --mode baseline \
  --dest-country US \
  --protocol ICMP \
  --path-stats --analyze-core-paths \
  data/archive/january-week-1/*.json.bz2

This will create an output directory (by default in out/) containing:

  • *_results_baseline.json: The detailed statistical baseline.
  • *_baseline_data.parquet: A Parquet file with the raw data used, kept for later distribution comparisons (see the inspection sketch after this list).
  • *.log: A log file.
  • Optional plots, if plotting options were supplied.
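
The Parquet file can be inspected directly with pandas if you want to see exactly what went into the baseline (a minimal sketch; the path is an example and the available columns depend on the options used):

# Inspect the raw data behind a baseline (path is an example from a previous run).
import pandas as pd

df = pd.read_parquet("out/example/example_baseline_data.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.describe(include="all"))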

Step 2: Run an Analysis

Next, run the script in analyze mode on a new dataset, pointing it to the baseline file you just created. The new data is compared against the established baseline and any detected anomalies are reported.

Example (from ClickHouse):

# Analyze the last hour of data from ClickHouse against the baseline.
python main.py --mode analyze \
  --baseline-file out/example/example_results_baseline.json \
  --clickhouse \
  --ch-where-clause "timestamp >= now() - interval '1 hour'" \
  --dest-country US \
  --protocol ICMP \
  --path-stats --analyze-core-paths \
  --plot-metrics rtt observed_pathlen \
  --plot-aggregations daily hourly

This will generate a new output directory containing:

  • *_results_analyze.json: A full report including any detected individual indicators and composite events (see the loading sketch after this list).
  • *.log: A log file.
  • Optional plots comparing the current data to the baseline.
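
The report is plain JSON, so it can be post-processed with the standard library. In the sketch below, the top-level key name is a hypothetical placeholder; inspect an actual report for the real structure.

# Load an analyze-mode report for further processing (path is an example).
# NOTE: "anomalies" is a hypothetical key name used for illustration only;
# check a real report to learn the actual structure.
import json

with open("out/example/example_results_analyze.json", encoding="utf-8") as fh:
    report = json.load(fh)

print("Top-level sections:", list(report.keys()))
for anomaly in report.get("anomalies", []):
    print(anomaly)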

2. scripts/create_schema.py - Database Setup

This script sets up the required database, tables, and materialized views in ClickHouse. It only needs to be run once.

Example:

python scripts/create_schema.py --host 127.0.0.1 --db atlas

3. scripts/import.py - Data Importer

Use this script to parse traceroute JSON files and import them into a ClickHouse database.

Prerequisite: You must first create the database schema using create_schema.py.

Example:

# Import all traceroutes from July 2025 destined for Germany or France.
python scripts/import.py \
  --host your-ch-host \
  --db atlas \
  --workers 8 \
  --optimize-final \
  --dest-country DE FR \
  data/archive/2025-07-*.json.bz2

This command will parse all files matching the pattern, filter them, and import the data into the atlas database using 8 parallel processes.

Command-Line Reference

main.py

Data Source Options

Argument Description
INPUT_FILENAME Positional argument. Paths or patterns for input JSON Lines files. Required if not using --clickhouse.
--clickhouse Load data from ClickHouse instead of files.

ClickHouse Connection Options

Argument Description
--ch-host ClickHouse server host.
--ch-port ClickHouse server port.
--ch-database ClickHouse database name.
--ch-user ClickHouse username.
--ch-password ClickHouse password.
--ch-where-clause Optional custom SQL WHERE clause for ClickHouse queries.

Input/Output Options

Argument Description
-o, --output-dir Directory to save all output files. Default: Auto-generated in the out/ folder.
--log-file Path to log file. Default: <output_dir>/<base>.log.

Mode Options

Argument Description
-m, --mode Required. Operation mode: baseline to generate stats, analyze to compare against a baseline.
-i, --baseline-file Path to the baseline JSON file. Required for analyze mode.

Analysis Options

Argument Description
-n, --limit Process only the first N measurements/records.
--probe-stats Enable calculation and reporting of per-probe statistics.
--path-stats Enable calculation of path statistics (dominant path, unique count).
-p, --protocol Filter measurements by protocols (ICMP, UDP, TCP).
--include-private-ips Include measurements to/through private/special IP addresses.

Filtering Options (Source & Destination)

These arguments filter the data based on probe properties. Most of them rely on probe metadata fetched from the RIPE Atlas API.

Argument Description
--source-country, --dest-country Filter by ISO country codes.
--source-id, --dest-id Filter by specific probe IDs.
--source-ip-range, --dest-ip-range Filter by IP address ranges in CIDR notation.
--source-tags, --dest-tags Filter by probe tags.
--source-type, --dest-type Filter by probe type (probe, anchor, software).
--source-lat-range, --dest-lat-range Filter by latitude range (MIN MAX).
--source-lon-range, --dest-lon-range Filter by longitude range (MIN MAX).
--source-radius, --dest-radius Filter within a radius: "LAT,LON:DIST_KM".
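
For reference, the radius specification can be read as a simple great-circle distance check. The sketch below shows the idea using the haversine formula; the parsing details and the example coordinates are assumptions for illustration, not the project's implementation.

# Illustration of a radius filter ("LAT,LON:DIST_KM"): keep probes whose
# coordinates lie within DIST_KM of the given centre. Not the project's code.
import math

def within_radius(lat, lon, spec="48.1351,11.5820:100"):
    """Return True if (lat, lon) is inside the radius described by spec."""
    centre, dist_km = spec.split(":")
    clat, clon = (float(x) for x in centre.split(","))
    # Haversine great-circle distance on a sphere of radius 6371 km.
    phi1, phi2 = math.radians(clat), math.radians(lat)
    dphi = math.radians(lat - clat)
    dlmb = math.radians(lon - clon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a)) <= float(dist_km)

print(within_radius(48.3538, 11.7861))  # probe near Munich airport, ~30 km from centre -> True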

Path Analysis Options

Argument Description
--ignore-start-hops Ignore the first N hops in path analysis.
--ignore-end-hops Ignore the last N hops in path analysis.
--analyze-core-paths Enable analysis of common core path segments (see the sketch after this table).
--core-path-min-len Minimum length of a segment to be a core path candidate.
--core-path-max-len Maximum length of a segment to be a core path candidate.
--core-path-min-support-abs Minimum absolute number of traceroutes a segment must appear in.
--core-path-min-support-rel Minimum relative frequency a segment must appear in.
--core-path-top-n Report statistics for the top K most frequent core path segments.
--core-path-uninformative-threshold Maximum ratio of * or PRIVATE hops allowed in a path for it to be used in core analysis.
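
To make the core-path terminology more concrete, the following stand-alone sketch counts contiguous hop segments across a handful of example paths and keeps those meeting a minimum support, roughly mirroring what the options above control (an illustration of the idea, not the project's implementation):

# Illustration of "core path segment" counting: contiguous sub-sequences of hops
# that recur across many traceroutes. Not the project's actual implementation.
from collections import Counter

paths = [
    ("10.0.0.1", "198.51.100.1", "203.0.113.9", "192.0.2.7"),
    ("10.0.0.2", "198.51.100.1", "203.0.113.9", "192.0.2.7"),
    ("10.0.0.3", "198.51.100.1", "203.0.113.9", "192.0.2.99"),
]

MIN_LEN, MAX_LEN = 2, 3   # --core-path-min-len / --core-path-max-len
MIN_SUPPORT_ABS = 2       # --core-path-min-support-abs
TOP_N = 5                 # --core-path-top-n

segment_counts = Counter()
for path in paths:
    seen_in_path = set()
    for length in range(MIN_LEN, MAX_LEN + 1):
        for start in range(len(path) - length + 1):
            seen_in_path.add(path[start:start + length])
    segment_counts.update(seen_in_path)  # count each segment once per traceroute

for segment, support in segment_counts.most_common(TOP_N):
    if support >= MIN_SUPPORT_ABS:
        print(f"{support}x  {' -> '.join(segment)}")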

Plotting Options

Argument Description
--plot-metrics Required for plotting. One or more metrics to plot. Choices: rtt, rtt_std, first_hop_rtt, observed_pathlen, successful_pathlen, success_rate, timeout_rate.
--plot-aggregations Aggregation levels for plots. Default: none. Choices: daily, hourly, dayofweek, hourofday.
--plot-types Plot types to generate. Choices: hist, scatter, rolling, boxplot, path_perf.
--formats Output image formats for plots (png, pdf, svg, ...).
--highlight-outliers Highlight RTT outliers on the scatter plot.

Performance & Logging Options

Argument Description
-w, --workers Number of worker processes for processing.
--chunk-size Number of records per processing chunk.
--no-line-count Disable initial line count for progress bar estimation.
-l, --log-level Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
-q, --quiet Suppress console log output.

scripts/import.py

Argument Description
FILE_PATTERN Positional argument. Paths or patterns for input JSON files.
--host, --port, --db, --user, --password ClickHouse connection details.
--batch-size Number of rows per ClickHouse insert batch.
-w, --workers Number of worker processes for parsing files.
--db-insert-workers Number of worker threads for database inserts.
--optimize-final Run OPTIMIZE TABLE FINAL after the import completes.
--limit Process only the first N input lines in total.
--imap-chunksize Chunksize for the multiprocessing pool.
--source-*, --dest-*, --protocol All filtering options from main.py are also available here.

scripts/create_schema.py

Argument Description
--host, --port, --user, --password ClickHouse connection details.
--db The name of the database to create and/or apply the schema to.
--schema Path to the .sql schema file to execute.

Output Description

  • JSON Results File (*_results_<mode>.json): A comprehensive JSON file containing all metadata, processing summaries, statistical aggregations, path analyses, and a list of detected anomalies.
  • Parquet Data File (*_baseline_data.parquet): Created in baseline mode, this file stores the processed DataFrame. It is used by the analyze mode to perform the K-S and A-D distribution tests.
  • Plots: Visualizations of key metrics, saved as .png files (or other specified formats).
  • Log File (.log): A detailed log of the entire run, useful for debugging.
  • Matrix Notifications: Real-time alerts sent to your configured chat room, summarizing the run and attaching the results file and any plots.

Troubleshooting

ClickHouse 'Max query size exceeded' Error

Problem: When running an analysis from ClickHouse with a very large number of filter values (e.g., thousands of probe IDs or IP ranges), you may encounter an error similar to this:

DB::Exception: Max query size exceeded (can be increased with the `max_query_size` setting)

This happens because the script constructs a single, very long SQL query string that exceeds the default safety limit on the ClickHouse server.

Solution: The recommended solution is to increase this limit on the ClickHouse server for the user profile you are connecting with.

Instructions:

  1. Locate your ClickHouse user configuration file. This is typically /etc/clickhouse-server/users.xml or a file inside /etc/clickhouse-server/users.d/.

  2. Edit the file (e.g., sudo nano /etc/clickhouse-server/users.xml).

  3. Inside the profile of your user, add or modify the max_query_size setting.

    <!-- Inside /etc/clickhouse-server/users.xml -->
    <clickhouse>
        <profiles>
            <default>
                <!-- ... -->
    
                <!-- Increase the max query size from default (262144=256KiB) -->
                <!-- This value (2097152) is 2 MiB -->
                <max_query_size>2097152</max_query_size>
            </default>
        </profiles>
        <!-- ... -->
    </clickhouse>
  4. Save the file and restart the ClickHouse server to apply the changes:

    sudo systemctl restart clickhouse-server

Project Structure

.
├── analyzer_lib/           # Core analysis library
│   ├── analysis/           # Main analysis logic
│   ├── common/             # Shared components (constants, utils, bot)
│   └── data_source/        # Data readers (files, ClickHouse)
├── config.ini.example      # Template for Matrix bot configuration
├── data/                   # Example data files
├── main.py                 # Main analysis script
├── out/                    # Default directory for all outputs
├── requirements.txt        # Project dependencies
├── schema/                 # SQL schema for ClickHouse
└── scripts/                # Helper scripts for import and setup

Acknowledgments

  • RIPE Atlas: This work relies on the open and extensive data provided by the RIPE Atlas global measurement network.
  • Google Gemini: Provided significant assistance with analysis and code.

License

This project is licensed under the GNU AGPL-3.0 license.
