Traceroute Data Analyzer and Anomaly Detector

This project provides a comprehensive framework for analyzing RIPE Atlas traceroute data to detect network performance and routing anomalies. It can process large datasets from local files or a ClickHouse database, establish performance baselines, and compare current data against those baselines to identify significant deviations.

Purpose

The main goal of this project is to provide a powerful and flexible tool for network operators, researchers, and enthusiasts to gain insights from RIPE Atlas traceroute data. By analyzing large-scale measurements, users can:

  • Monitor Network Health: Track key performance indicators (KPIs) like Round-Trip Time (RTT), path length, and packet loss over time.
  • Detect Anomalies: Automatically identify significant changes in network performance or routing behavior that could indicate outages, congestion, or rerouting events.
  • Understand Routing Behavior: Analyze dominant paths and common network segments to understand how traffic flows across the internet.
  • Establish Baselines: Create a statistical snapshot of "normal" network behavior to use as a benchmark for future comparisons.

How It Works

The analysis pipeline follows these high-level steps:

  1. Data Ingestion: Reads traceroute data from local JSON Lines files (plain or compressed) or streams it directly from a ClickHouse database.
  2. Filtering: Narrows down the dataset based on a rich set of user-defined criteria (e.g., source/destination country, IP range, probe tags).
  3. Parsing & Cleaning: Parses the raw data in parallel, cleans it, and structures it for analysis.
  4. Analysis:
    • In baseline mode, it calculates a comprehensive statistical summary of the data to define "normal" behavior.
    • In analyze mode, it compares a new dataset against a previously generated baseline, using statistical tests and thresholding to find anomalies.
  5. Reporting: Generates a detailed JSON report containing all metadata, statistics, and a list of detected anomalies.
  6. Visualization: Creates a variety of plots (histograms, time-series, boxplots) to help visualize the data and the detected anomalies.
  7. Notification: Sends a summary of the run, along with the results file and plots, to a configured Matrix chat room.

Features

  • Flexible Data Ingestion: Process traceroute data from JSON Lines files (.json, .bz2) or stream directly from a ClickHouse database.
  • Two-Phase Analysis:
    • baseline mode: Establishes a statistical baseline of "normal" network behavior from a historical dataset.
    • analyze mode: Compares a current dataset against a previously generated baseline to detect anomalies.
  • Advanced Anomaly Detection:
    • Individual Indicators: Detects changes in performance metrics (RTT, path length, jitter), success/timeout rates, and network topology (path changes, core segment changes).
    • Statistical Distribution Analysis: Uses Kolmogorov-Smirnov and Anderson-Darling tests to identify subtle shifts in the RTT distribution profile (see the sketch after this list).
    • Composite Events: Correlates individual indicators to diagnose higher-level events like "Major Rerouting Event," "Path Instability," and "Performance Profile Shift."
  • Powerful Filtering: Filter measurements by a wide range of source or destination criteria, including country, ASN, IP range, RIPE Atlas probe tags, and geographic location.
  • Comprehensive Output: Generates detailed JSON reports, statistical plots (histograms, scatter plots, etc.), and a full log of the analysis run.
  • Automated Notifications: Integrates with Matrix to send detailed success or failure notifications upon task completion, including attached results and plots.
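
To illustrate the distribution comparison mentioned under Statistical Distribution Analysis above, the following stand-alone sketch runs the same two tests with scipy.stats on synthetic RTT samples. The sample data and the 0.05 significance threshold are purely illustrative; this is not the project's own code.

# Illustrative two-sample distribution comparison, conceptually similar to the
# analyzer's K-S / A-D checks. Sample data and thresholds are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline_rtt = rng.normal(loc=30.0, scale=5.0, size=5000)  # baseline RTT sample (ms)
current_rtt = rng.normal(loc=38.0, scale=7.0, size=1000)   # shifted RTT profile

# Kolmogorov-Smirnov: compares the empirical CDFs of the two samples.
ks_stat, ks_p = stats.ks_2samp(baseline_rtt, current_rtt)

# Anderson-Darling (k-sample variant): more sensitive to differences in the tails.
ad_result = stats.anderson_ksamp([baseline_rtt, current_rtt])

if ks_p < 0.05:  # illustrative significance threshold
    print(f"RTT distribution shift detected (KS={ks_stat:.3f}, p={ks_p:.2e})")
print(f"Anderson-Darling statistic: {ad_result.statistic:.3f}")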

Installation

  1. Clone the repository:

    git clone https://github.com/Gozzim/RIPE-Atlas-Traceroute-Analysis.git
    cd RIPE-Atlas-Traceroute-Analysis
  2. Create a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate
  3. Install the required dependencies:

    pip install -r requirements.txt

    The requirements.txt file contains all necessary Python packages. Key dependencies include:

    • pandas and pyarrow for data manipulation.
    • matplotlib and seaborn for plotting.
    • clickhouse-driver for connecting to ClickHouse.
    • tqdm for progress bars.
    • orjson for fast JSON processing.
    • coloredlogs for enhanced console logging.

Configuration

Matrix Bot Notifications (Optional)

To enable notifications, you need a dedicated bot account on a Matrix homeserver.

  1. Copy the configuration template:
    cp config.ini.example config.ini
  2. Edit config.ini and fill in your bot's details:
    [matrix_bot]
    homeserver = https://matrix-client.matrix.org
    user_id = @my-bot:matrix.org
    password = your_bot_password_here
    room_id = !yourRoomId:matrix.org

ClickHouse Connection (Optional)

The scripts can be configured to connect to a ClickHouse database using environment variables:

export CLICKHOUSE_HOST='127.0.0.1'
export CLICKHOUSE_PORT='9000'
export CLICKHOUSE_DB='atlas'
export CLICKHOUSE_USER='default'
export CLICKHOUSE_PASSWORD='password123'

Alternatively, you can provide these details as command-line arguments.
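
If you want to confirm that the connection details are picked up correctly before starting a long analysis, a minimal check with the clickhouse-driver package (already listed in requirements.txt) could look like this; the fallback defaults simply mirror the example above and are assumptions, not project behaviour:

# Minimal connectivity check using the same environment variables (illustrative only).
import os
from clickhouse_driver import Client

client = Client(
    host=os.environ.get("CLICKHOUSE_HOST", "127.0.0.1"),
    port=int(os.environ.get("CLICKHOUSE_PORT", "9000")),
    database=os.environ.get("CLICKHOUSE_DB", "atlas"),
    user=os.environ.get("CLICKHOUSE_USER", "default"),
    password=os.environ.get("CLICKHOUSE_PASSWORD", ""),
)

# Any cheap query works here; version() avoids depending on the schema.
print(client.execute("SELECT version()"))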

Usage

The project contains three main executable scripts: main.py, scripts/import.py, and scripts/create_schema.py.

1. main.py - The Analyzer

This is the primary script for running an analysis.

Step 1: Create a Baseline

First, run the script in baseline mode on a large, representative dataset to establish a reference for later comparisons. A good baseline covers a period of "normal", stable network behavior.

Example (from files):

# Create a baseline for ICMP traffic to the US, calculating path statistics.
python main.py --mode baseline \
  --dest-country US \
  --protocol ICMP \
  --path-stats --analyze-core-paths \
  data/archive/january-week-1/*.json.bz2

This will create an output directory (by default in out/) containing:

  • *_results_baseline.json: The detailed statistical baseline.
  • *_baseline_data.parquet: A Parquet file with the raw data used, kept for later distribution comparisons (see the inspection sketch after this list).
  • *.log: A log file.
  • Optional plots, if plotting options were supplied.
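
The Parquet file can be inspected directly with pandas if you want to see exactly what went into the baseline (a minimal sketch; the path is an example and the available columns depend on the options used):

# Inspect the raw data behind a baseline (path is an example from a previous run).
import pandas as pd

df = pd.read_parquet("out/example/example_baseline_data.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.describe(include="all"))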

Step 2: Run an Analysis

Next, run the script in analyze mode on a new dataset, pointing it to the baseline file you just created. The new data is compared against the established baseline and any detected anomalies are reported.

Example (from ClickHouse):

# Analyze the last hour of data from ClickHouse against the baseline.
python main.py --mode analyze \
  --baseline-file out/example/example_results_baseline.json \
  --clickhouse \
  --ch-where-clause "timestamp >= now() - interval '1 hour'" \
  --dest-country US \
  --protocol ICMP \
  --path-stats --analyze-core-paths \
  --plot-metrics rtt observed_pathlen \
  --plot-aggregations daily hourly

This will generate a new output directory containing:

  • *_results_analyze.json: A full report including any detected individual indicators and composite events (see the loading sketch after this list).
  • *.log: A log file.
  • Optional plots comparing the current data to the baseline.
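
The report is plain JSON, so it can be post-processed with the standard library. In the sketch below, the top-level key name is a hypothetical placeholder; inspect an actual report for the real structure.

# Load an analyze-mode report for further processing (path is an example).
# NOTE: "anomalies" is a hypothetical key name used for illustration only;
# check a real report to learn the actual structure.
import json

with open("out/example/example_results_analyze.json", encoding="utf-8") as fh:
    report = json.load(fh)

print("Top-level sections:", list(report.keys()))
for anomaly in report.get("anomalies", []):
    print(anomaly)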

2. scripts/create_schema.py - Database Setup

This script sets up the required database, tables, and materialized views in ClickHouse. It only needs to be run once.

Example:

python scripts/create_schema.py --host 127.0.0.1 --db atlas

3. scripts/import.py - Data Importer

Use this script to parse traceroute JSON files and import them into a ClickHouse database.

Prerequisite: You must first create the database schema using create_schema.py.

Example:

# Import all traceroutes from July 2025 destined for Germany or France.
python scripts/import.py \
  --host your-ch-host \
  --db atlas \
  --workers 8 \
  --optimize-final \
  --dest-country DE FR \
  data/archive/2025-07-*.json.bz2

This command will parse all files matching the pattern, filter them, and import the data into the atlas database using 8 parallel processes.

Command-Line Reference

main.py

Data Source Options

Argument Description
INPUT_FILENAME Positional argument. Paths or patterns for input JSON Lines files. Required if not using --clickhouse.
--clickhouse Load data from ClickHouse instead of files.

ClickHouse Connection Options

Argument Description
--ch-host ClickHouse server host.
--ch-port ClickHouse server port.
--ch-database ClickHouse database name.
--ch-user ClickHouse username.
--ch-password ClickHouse password.
--ch-where-clause Optional custom SQL WHERE clause for ClickHouse queries.

Input/Output Options

Argument Description
-o, --output-dir Directory to save all output files. Default: Auto-generated in the out/ folder.
--log-file Path to log file. Default: <output_dir>/<base>.log.

Mode Options

Argument Description
-m, --mode Required. Operation mode: baseline to generate stats, analyze to compare against a baseline.
-i, --baseline-file Path to the baseline JSON file. Required for analyze mode.

Analysis Options

Argument Description
-n, --limit Process only the first N measurements/records.
--probe-stats Enable calculation and reporting of per-probe statistics.
--path-stats Enable calculation of path statistics (dominant path, unique count).
-p, --protocol Filter measurements by protocols (ICMP, UDP, TCP).
--include-private-ips Include measurements to/through private/special IP addresses.

Filtering Options (Source & Destination)

These arguments filter the data based on probe properties. Most of them rely on probe metadata fetched from the RIPE Atlas API.

Argument Description
--source-country, --dest-country Filter by ISO country codes.
--source-id, --dest-id Filter by specific probe IDs.
--source-ip-range, --dest-ip-range Filter by IP address ranges in CIDR notation.
--source-tags, --dest-tags Filter by probe tags.
--source-type, --dest-type Filter by probe type (probe, anchor, software).
--source-lat-range, --dest-lat-range Filter by latitude range (MIN MAX).
--source-lon-range, --dest-lon-range Filter by longitude range (MIN MAX).
--source-radius, --dest-radius Filter within a radius: "LAT,LON:DIST_KM".
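
For reference, the radius specification can be read as a simple great-circle distance check. The sketch below shows the idea using the haversine formula; the parsing details and the example coordinates are assumptions for illustration, not the project's implementation.

# Illustration of a radius filter ("LAT,LON:DIST_KM"): keep probes whose
# coordinates lie within DIST_KM of the given centre. Not the project's code.
import math

def within_radius(lat, lon, spec="48.1351,11.5820:100"):
    """Return True if (lat, lon) is inside the radius described by spec."""
    centre, dist_km = spec.split(":")
    clat, clon = (float(x) for x in centre.split(","))
    # Haversine great-circle distance on a sphere of radius 6371 km.
    phi1, phi2 = math.radians(clat), math.radians(lat)
    dphi = math.radians(lat - clat)
    dlmb = math.radians(lon - clon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a)) <= float(dist_km)

print(within_radius(48.3538, 11.7861))  # probe near Munich airport, ~30 km from centre -> True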

Path Analysis Options

Argument Description
--ignore-start-hops Ignore the first N hops in path analysis.
--ignore-end-hops Ignore the last N hops in path analysis.
--analyze-core-paths Enable analysis of common core path segments (see the sketch after this table).
--core-path-min-len Minimum length of a segment to be a core path candidate.
--core-path-max-len Maximum length of a segment to be a core path candidate.
--core-path-min-support-abs Minimum absolute number of traceroutes a segment must appear in.
--core-path-min-support-rel Minimum relative frequency a segment must appear in.
--core-path-top-n Report statistics for the top K most frequent core path segments.
--core-path-uninformative-threshold Maximum ratio of * or PRIVATE hops allowed in a path for it to be used in core analysis.
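
To make the core-path terminology more concrete, the following stand-alone sketch counts contiguous hop segments across a handful of example paths and keeps those meeting a minimum support, roughly mirroring what the options above control (an illustration of the idea, not the project's implementation):

# Illustration of "core path segment" counting: contiguous sub-sequences of hops
# that recur across many traceroutes. Not the project's actual implementation.
from collections import Counter

paths = [
    ("10.0.0.1", "198.51.100.1", "203.0.113.9", "192.0.2.7"),
    ("10.0.0.2", "198.51.100.1", "203.0.113.9", "192.0.2.7"),
    ("10.0.0.3", "198.51.100.1", "203.0.113.9", "192.0.2.99"),
]

MIN_LEN, MAX_LEN = 2, 3   # --core-path-min-len / --core-path-max-len
MIN_SUPPORT_ABS = 2       # --core-path-min-support-abs
TOP_N = 5                 # --core-path-top-n

segment_counts = Counter()
for path in paths:
    seen_in_path = set()
    for length in range(MIN_LEN, MAX_LEN + 1):
        for start in range(len(path) - length + 1):
            seen_in_path.add(path[start:start + length])
    segment_counts.update(seen_in_path)  # count each segment once per traceroute

for segment, support in segment_counts.most_common(TOP_N):
    if support >= MIN_SUPPORT_ABS:
        print(f"{support}x  {' -> '.join(segment)}")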

Plotting Options

Argument Description
--plot-metrics Required for plotting. One or more metrics to plot. Choices: rtt, rtt_std, first_hop_rtt, observed_pathlen, successful_pathlen, success_rate, timeout_rate.
--plot-aggregations Aggregation levels for plots. Default: none. Choices: daily, hourly, dayofweek, hourofday.
--plot-types Plot types to generate. Choices: hist, scatter, rolling, boxplot, path_perf.
--formats Output image formats for plots (png, pdf, svg, ...).
--highlight-outliers Highlight RTT outliers on the scatter plot.

Performance & Logging Options

Argument Description
-w, --workers Number of worker processes for processing.
--chunk-size Number of records per processing chunk.
--no-line-count Disable initial line count for progress bar estimation.
-l, --log-level Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
-q, --quiet Suppress console log output.

scripts/import.py

Argument Description
FILE_PATTERN Positional argument. Paths or patterns for input JSON files.
--host, --port, --db, --user, --password ClickHouse connection details.
--batch-size Number of rows per ClickHouse insert batch.
-w, --workers Number of worker processes for parsing files.
--db-insert-workers Number of worker threads for database inserts.
--optimize-final Run OPTIMIZE TABLE FINAL after the import completes.
--limit Process only the first N input lines in total.
--imap-chunksize Chunksize for the multiprocessing pool.
--source-*, --dest-*, --protocol All filtering options from main.py are also available here.

scripts/create_schema.py

Argument Description
--host, --port, --user, --password ClickHouse connection details.
--db The name of the database to create and/or apply the schema to.
--schema Path to the .sql schema file to execute.

Output Description

  • JSON Results File (*_results_<mode>.json): A comprehensive JSON file containing all metadata, processing summaries, statistical aggregations, path analyses, and a list of detected anomalies.
  • Parquet Data File (*_baseline_data.parquet): Created in baseline mode, this file stores the processed DataFrame. It is used by the analyze mode to perform the K-S and A-D distribution tests.
  • Plots: Visualizations of key metrics, saved as .png files (or other specified formats).
  • Log File (.log): A detailed log of the entire run, useful for debugging.
  • Matrix Notifications: Real-time alerts sent to your configured chat room, summarizing the run and attaching the results file and any plots.

Troubleshooting

ClickHouse 'Max query size exceeded' Error

Problem: When running an analysis from ClickHouse with a very large number of filter values (e.g., thousands of probe IDs or IP ranges), you may encounter an error similar to this:

DB::Exception: Max query size exceeded (can be increased with the `max_query_size` setting)

This happens because the script constructs a single, very long SQL query string that exceeds the default safety limit on the ClickHouse server.

Solution: The recommended solution is to increase this limit on the ClickHouse server for the user profile you are connecting with.

Instructions:

  1. Locate your ClickHouse user configuration file. This is typically /etc/clickhouse-server/users.xml or a file inside /etc/clickhouse-server/users.d/.

  2. Edit the file (e.g., sudo nano /etc/clickhouse-server/users.xml).

  3. Inside the profile of your user, add or modify the max_query_size setting.

    <!-- Inside /etc/clickhouse-server/users.xml -->
    <clickhouse>
        <profiles>
            <default>
                <!-- ... -->
    
                <!-- Increase the max query size from default (262144=256KiB) -->
                <!-- This value (2097152) is 2 MiB -->
                <max_query_size>2097152</max_query_size>
            </default>
        </profiles>
        <!-- ... -->
    </clickhouse>
  4. Save the file and restart the ClickHouse server to apply the changes:

    sudo systemctl restart clickhouse-server

Project Structure

.
├── analyzer_lib/           # Core analysis library
│   ├── analysis/           # Main analysis logic
│   ├── common/             # Shared components (constants, utils, bot)
│   └── data_source/        # Data readers (files, ClickHouse)
├── config.ini.example      # Template for Matrix bot configuration
├── data/                   # Example data files
├── main.py                 # Main analysis script
├── out/                    # Default directory for all outputs
├── requirements.txt        # Project dependencies
├── schema/                 # SQL schema for ClickHouse
└── scripts/                # Helper scripts for import and setup

Acknowledgments

  • RIPE Atlas: This work relies on the open and extensive data provided by the RIPE Atlas global measurement network.
  • Google Gemini: Provided significant assistance with analysis and code.

License

This project is licensed under the GNU AGPL-3.0 license.
