ender672/wikipediastorage

Wikipedia Storage Benchmarks

Benchmark suite comparing approaches to storing revision histories. Each storage approach packs a sequence of text revisions into a single file and is measured on packed size and random read time for old revisions.
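To make the two measurements concrete, here is a minimal sketch of what a very simple approach could look like: each revision is compressed independently with zlib and located through an offset index, so any old revision can be read without decoding the others. The file layout and the names `pack` and `read_revision` are assumptions for illustration only, not the format used by any backend in this repository.

```python
import struct
import zlib

# Hypothetical layout (not the repository's format):
#   [u32 count][count x (u64 offset, u32 length)][compressed blobs...]
ENTRY_SIZE = 12  # "<QI" packs to 8 + 4 bytes with no padding


def pack(revisions):
    """Pack a list of revision strings into a single bytes object."""
    blobs = [zlib.compress(text.encode("utf-8")) for text in revisions]
    header = struct.pack("<I", len(blobs))
    offset = 0
    for blob in blobs:
        header += struct.pack("<QI", offset, len(blob))
        offset += len(blob)
    return header + b"".join(blobs)


def read_revision(packed, index):
    """Random read: decompress only the requested revision."""
    (count,) = struct.unpack_from("<I", packed, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offset, length = struct.unpack_from("<QI", packed, 4 + index * ENTRY_SIZE)
    start = 4 + count * ENTRY_SIZE + offset
    return zlib.decompress(packed[start:start + length]).decode("utf-8")


if __name__ == "__main__":
    import time

    revisions = [("line of text\n" * 200) + f"edit {i}\n" for i in range(100)]
    packed = pack(revisions)
    raw_size = sum(len(r.encode("utf-8")) for r in revisions)

    started = time.perf_counter()
    text = read_revision(packed, 3)  # an old revision
    elapsed = time.perf_counter() - started

    print(f"raw bytes: {raw_size}, packed bytes: {len(packed)}")
    print(f"random read of revision 3 took {elapsed * 1e6:.1f} microseconds")
```

Compressing every revision on its own keeps reads cheap but ignores the similarity between neighbouring revisions, which is the trade-off between packed size and read time that a benchmark like this one measures.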

Prerequisites

  • Rust (stable)
  • uv (Python package runner)

Quick start

# 1. Download all three benchmark datasets (--quick caps Wikipedia and Wayback at 1,000 revisions)
uv run download_all.py --quick

# 2. Run tests, benchmarks, and generate charts
uv run benchmark_all.py

Downloading datasets

The benchmark suite uses three datasets with different revision characteristics:

Dataset          Source             Revisions   Character
George_W._Bush   Wikipedia API      ~15,000     Prose, many small edits
yahoo.com        Wayback Machine    ~13,500     HTML, large structural changes
btrfs_inode.c    Linux kernel git   ~2,700      C source code, steady growth

Download all three with a single command:

uv run download_all.py

The full download takes a long time (potentially hours) because of API rate limits. The archive.org API in particular is slow and has a conservative wait time between calls. All downloads support --resume, so you can interrupt and restart safely:

uv run download_all.py --resume
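The idea behind a resumable, rate-limited download can be sketched as follows. This is an illustration under stated assumptions, not the logic of download_all.py: the function `download_revisions`, the `fetch` callback, the one-file-per-revision layout, and the `.part` suffix are all hypothetical.

```python
import os
import tempfile
import time


def download_revisions(revision_ids, out_dir, fetch, delay_seconds=1.0, resume=False):
    """Fetch each revision once; with resume=True, skip files already on disk.

    `fetch` is a hypothetical callable that returns the text of one revision.
    Returns the number of revisions fetched during this call.
    """
    os.makedirs(out_dir, exist_ok=True)
    fetched = 0
    for rev_id in revision_ids:
        path = os.path.join(out_dir, f"{rev_id}.txt")
        if resume and os.path.exists(path):
            continue  # already downloaded in an earlier run
        text = fetch(rev_id)
        # Write to a temporary file and rename it into place, so an
        # interrupted run never leaves a partial file under the final name.
        fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".part")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)
        fetched += 1
        time.sleep(delay_seconds)  # stay under the API rate limit
    return fetched
```

Because a revision only counts as done once its final file exists, restarting after an interruption repeats at most the one request that was in flight.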

For a relatively quick smoke test, use --quick, which downloads the full btrfs history but caps the Wikipedia and Wayback datasets at 1,000 revisions:

uv run download_all.py --quick

Running benchmarks

Run the full pipeline (tests, benchmarks, charts, and tables) with:

uv run benchmark_all.py

Use --parallel to pack all backends in parallel, then benchmark reads separately. The trade-off is that packfile generation time and memory use are not measured in this mode:

uv run benchmark_all.py --parallel

Filter to specific approaches with --approaches (case-insensitive substring match):

uv run benchmark_all.py --approaches "revlog/lz4/fossil,naive/zlib"
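A case-insensitive substring filter over a comma-separated list can be sketched like this. The function name `select_approaches` and the behaviour for an empty filter are assumptions for illustration; benchmark_all.py may implement the match differently.

```python
def select_approaches(available, spec):
    """Keep the approaches whose name contains any comma-separated pattern.

    Matching is case-insensitive. An empty spec keeps everything (assumed).
    """
    patterns = [part.strip().lower() for part in spec.split(",") if part.strip()]
    if not patterns:
        return list(available)
    return [
        name
        for name in available
        if any(pattern in name.lower() for pattern in patterns)
    ]


if __name__ == "__main__":
    # "example/other" is a made-up name, included only to show a non-match.
    available = ["naive/zlib", "revlog/lz4/fossil", "example/other"]
    print(select_approaches(available, "REVLOG/lz4/fossil,naive/zlib"))
```

Since this is a substring match, a short pattern such as "zlib" would select every approach with "zlib" anywhere in its name, not just an exact match.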

Web server

An interactive web UI lets you browse revision histories and compare how each storage backend reconstructs any revision. It requires pre-built web cache data for each dataset.

# Prepare cached data for a single dataset (run once, or after dataset changes)
cd rust && cargo run --release --bin prepare_web_cache -- ../revisions/George_W._Bush

# Start the server (defaults to port 8080); like the command above, run from the repository root
cd rust && cargo run --release --bin web_server -- ../revisions

For production deployment, deploy.sh builds the binaries, prepares web cache data for all datasets, uploads everything, and restarts the server.
