Wikipedia Storage Benchmarks

Benchmark suite comparing approaches to storing revision histories. Each storage approach packs a sequence of text revisions into a single file and is measured on packed size and random read time for old revisions.
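
To make the two metrics concrete, here is a minimal Python sketch of a naive store that compresses each revision independently with zlib, packs everything into one blob, and times random reads. It is an illustration only, not the suite's implementation: the file layout and the names `pack_revisions`, `read_revision`, and `measure` are assumptions made for this example.

```python
import random
import struct
import time
import zlib


def pack_revisions(revisions):
    """Pack revisions into one blob: an offset index followed by zlib bodies."""
    bodies = [zlib.compress(text.encode("utf-8")) for text in revisions]
    # Index: revision count, then one (offset, length) pair per revision.
    header = struct.pack("<I", len(bodies))
    offset = 0
    for body in bodies:
        header += struct.pack("<QI", offset, len(body))
        offset += len(body)
    return header + b"".join(bodies)


def read_revision(packed, index):
    """Reconstruct one revision without decompressing any of the others."""
    (count,) = struct.unpack_from("<I", packed, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    # Each index entry is 12 bytes: 8-byte offset plus 4-byte length.
    offset, length = struct.unpack_from("<QI", packed, 4 + index * 12)
    start = 4 + count * 12 + offset
    return zlib.decompress(packed[start:start + length]).decode("utf-8")


def measure(revisions, samples=100, seed=0):
    """Return (packed size in bytes, mean seconds per random read)."""
    packed = pack_revisions(revisions)
    rng = random.Random(seed)
    indices = [rng.randrange(len(revisions)) for _ in range(samples)]
    start = time.perf_counter()
    for i in indices:
        read_revision(packed, i)
    elapsed = time.perf_counter() - start
    return len(packed), elapsed / samples
```

Compressing every revision on its own keeps reads cheap but ignores the similarity between neighbouring revisions, which is the trade-off delta-based approaches try to improve on.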

Prerequisites

Rust (stable)
uv (Python package runner)

Quick start

# 1. Download all three benchmark datasets (--quick caps Wikipedia and Wayback at 1,000 revisions)
uv run download_all.py --quick

# 2. Run tests, benchmarks, and generate charts
uv run benchmark_all.py

Downloading datasets

The benchmark suite uses three datasets with different revision characteristics:

| Dataset | Source | Revisions | Characteristics |
| --- | --- | --- | --- |
| `George_W._Bush` | Wikipedia API | ~15,000 | Prose, many small edits |
| `yahoo.com` | Wayback Machine | ~13,500 | HTML, large structural changes |
| `btrfs_inode.c` | Linux kernel git | ~2,700 | C source code, steady growth |

Download all three with a single command:

uv run download_all.py

The full download can take hours because of API rate limits. The archive.org API is the slowest part, since the downloader waits conservatively between calls. All downloads support --resume, so you can interrupt and restart safely:

uv run download_all.py --resume
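
One common way to make a download resumable is to skip revisions that are already on disk and to write each file atomically. The sketch below shows that pattern under stated assumptions: `download_revisions`, the `.txt` and `.part` naming, and the caller-supplied `fetch` function are hypothetical and are not taken from the download scripts.

```python
import os
from pathlib import Path


def download_revisions(revision_ids, out_dir, fetch, resume=False):
    """Fetch each revision into out_dir; with resume=True, skip existing files.

    `fetch` is a caller-supplied function mapping a revision id to its text.
    Returns the list of ids that were actually fetched in this run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for rev_id in revision_ids:
        target = out_dir / f"{rev_id}.txt"
        if resume and target.exists():
            continue  # already downloaded in an earlier run
        # Write under a temporary name first so an interrupted run never
        # leaves a half-written file that a later resume would trust.
        tmp = target.with_name(target.name + ".part")
        tmp.write_text(fetch(rev_id), encoding="utf-8")
        os.replace(tmp, target)
        fetched.append(rev_id)
    return fetched
```

Because a revision only appears under its final name once it is complete, restarting after an interruption repeats at most the one request that was in flight.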

For a quick smoke test, use --quick, which downloads the full btrfs history but caps Wikipedia and Wayback at 1,000 revisions:

uv run download_all.py --quick

Running benchmarks

Run the full pipeline (tests, benchmarks, charts, and tables) with:

uv run benchmark_all.py

Use --parallel to pack all backends in parallel and then benchmark reads separately. The trade-off is that you lose the packfile generation time and memory use measurements:

uv run benchmark_all.py --parallel

Filter to specific approaches with --approaches (case-insensitive substring match):

uv run benchmark_all.py --approaches "revlog/lz4/fossil,naive/zlib"
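
The matching rule can be sketched in a few lines of Python. This is an illustration, not the code in benchmark_all.py: the function name `select_approaches` is hypothetical, and treating commas as pattern separators and an empty filter as "select everything" are assumptions inferred from the example above.

```python
def select_approaches(available, spec):
    """Return approaches whose name contains any pattern in spec.

    `spec` is a comma-separated list of patterns; matching is a
    case-insensitive substring test. Order follows `available`.
    """
    patterns = [p.strip().lower() for p in spec.split(",") if p.strip()]
    if not patterns:
        return list(available)  # assumption: an empty filter selects all
    return [
        name for name in available
        if any(p in name.lower() for p in patterns)
    ]


# Only the two names from the example above come from this README;
# the others are made up to show non-matching entries.
available = ["naive/zlib", "naive/lz4", "revlog/lz4/fossil", "revlog/zlib/fossil"]
selected = select_approaches(available, "REVLOG/LZ4/fossil,naive/zlib")
```

Because the test is a substring match, a shorter pattern such as `lz4` would select every approach with `lz4` anywhere in its name.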

Web server

An interactive web UI lets you browse revision histories and compare how each storage backend reconstructs any revision. It requires pre-built web cache data for each dataset.

# Prepare cached data for a single dataset (run once, or after dataset changes)
cd rust && cargo run --release --bin prepare_web_cache -- ../revisions/George_W._Bush

# Start the server (defaults to port 8080)
cd rust && cargo run --release --bin web_server -- ../revisions

For production deployment, deploy.sh builds the binaries, prepares web cache data for all datasets, uploads everything, and restarts the server.
