A comprehensive CPU cache performance benchmarking suite that measures cache behavior across different memory access patterns. The benchmarks are implemented in C for precise memory control, with Python visualization tools for analysis.
4 Core Benchmarks:
- Sequential vs Random Access - Demonstrates spatial locality and why sequential memory accesses are faster
- Stride Access - Shows how cache lines affect performance when skipping memory locations
- Array of Structs vs Struct of Arrays - Demonstrates how data layout influences cache efficiency
- Pointer Chasing - Highlights cache misses when following linked structures
Smart Visualization
- Automatic Hardware Detection - Automatically detects your CPU model, L1/L2/L3 cache sizes, and OS
- Statistical Analysis - Calculates mean, median, and standard deviation for performance metrics
- Annotated Plots - Visualizes cache boundaries directly on performance graphs
Flexible & Portable
- Cross-Platform - Works on Windows, Linux, and macOS
- Parameterized Tests - Vary array sizes, strides, iterations, and access patterns
- Multiple Output Formats - CSV and JSON output for flexible analysis
The following plots were generated on AMD Ryzen 7 6800HS with Radeon Graphics | Debian GNU/Linux 13 (trixie) x86_64:
- Array of Structs vs Struct of Arrays - Compares the performance impact of different data layouts for structured data.
- Sequential vs Random Access - Shows the massive performance gap between sequential access (cache-friendly) and random access (cache-thrashing).
- Stride Access - Illustrates how performance degrades as the stride increases and spatial locality is lost.
- Pointer Chasing - Demonstrates the latency cost of dependent pointer dereferences, with clear performance steps at the L1, L2, and L3 cache boundaries.
# Windows (no make)
gcc -O2 -Wall -std=c11 -o benchmark.exe benchmark.c
# Linux/macOS
make
# or: gcc -O2 -Wall -std=c11 -o benchmark benchmark.c -lrt
# (omit -lrt on macOS, which has no librt)

Run the comprehensive suite to generate data across all sizes and patterns:
./benchmark --comprehensive --output results.csv

Generate plots from your results:
# Create and activate virtual environment (recommended)
# Windows
python -m venv venv
venv\Scripts\activate
# Linux/macOS
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Generate plots
# Windows
python visualize.py results.csv --output plots
# Linux/macOS
python3 visualize.py results.csv --output plots

This will create plots_sequential.png, plots_pointer_chasing.png, etc.
Usage: ./benchmark [OPTIONS]
Options:
  --benchmark <name>    Benchmark to run: sequential, stride, aos-soa, pointer-chasing, all (default: all)
  --size <bytes>        Array size in bytes (default: 1048576 = 1MB)
  --stride <n>          Stride value for stride benchmark (default: 1)
  --iterations <n>      Number of iterations (default: 1000000)
  --format <csv|json>   Output format (default: csv)
  --output <file>       Write output to file (default: stdout)
  --comprehensive       Run comprehensive test suite with multiple sizes/strides
  --help                Show this help message
Usage: python visualize.py input_file [OPTIONS]
Arguments:
  input_file            Input CSV or JSON file with benchmark results
Options:
  --benchmark <name>    Which benchmark to visualize (default: all)
  --output <file>       Output file prefix for plots (default: display interactively)
What it measures: Spatial locality and cache line utilization.
- Sequential: Accesses array elements in order (0, 1, 2, 3...). Maximizes cache line reuse.
- Random: Accesses elements in shuffled order. Causes frequent cache misses.
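The two access patterns can be sketched as follows. This is a minimal illustration, not the code in benchmark.c: the function names `make_shuffled_indices`, `sum_sequential`, and `sum_indexed` are hypothetical, and the shuffled order is assumed to come from a precomputed index array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: fill idx with 0..n-1, then shuffle it in place
   (Fisher-Yates). rand() is enough for a sketch; modulo bias is ignored. */
void make_shuffled_indices(size_t *idx, size_t n, unsigned seed)
{
    assert(idx != NULL);
    srand(seed);
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1] */
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Sequential pass: consecutive elements share cache lines. */
long long sum_sequential(const int *data, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Random pass: the same elements, visited in the order given by idx. */
long long sum_indexed(const int *data, const size_t *idx, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}
```

Both passes read every element exactly once and return the same sum, so any timing difference between them comes from the access order alone.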
What it measures: How cache efficiency drops as memory access "skips" over data.
- Accesses elements with a fixed stride (1, 2, 4, 8, 16...).
- Larger strides mean fewer useful data items are loaded per cache line fetch.
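A strided pass, and the share of each fetched cache line it actually uses, might look like the sketch below. The names `sum_with_stride` and `line_utilization` are hypothetical, and the line size is a parameter because it differs between CPUs (64 bytes is common on x86-64).

```c
#include <assert.h>
#include <stddef.h>

/* Sum every stride-th element of data. */
long long sum_with_stride(const int *data, size_t n, size_t stride)
{
    assert(stride > 0);
    long long sum = 0;
    for (size_t i = 0; i < n; i += stride)
        sum += data[i];
    return sum;
}

/* Rough fraction of each fetched cache line that a strided pass uses.
   Assumes aligned data; exact when the stride in bytes divides the line
   size, approximate otherwise. */
double line_utilization(size_t elem_bytes, size_t stride, size_t line_bytes)
{
    assert(elem_bytes > 0 && stride > 0 && line_bytes >= elem_bytes);
    size_t step_bytes = elem_bytes * stride;
    if (step_bytes >= line_bytes)
        return (double)elem_bytes / (double)line_bytes; /* one element per line */
    return 1.0 / (double)stride;   /* every stride-th element of the line */
}
```

With 4-byte elements and 64-byte lines, utilization falls from 1.0 at stride 1 to 0.25 at stride 4, and bottoms out at 0.0625 from stride 16 onward, where every access lands on a different line.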
What it measures: The impact of data structure layout on cache efficiency.
- AoS (Array of Structs): struct {int x, y, z;} points[N];
  Good for accessing all fields of one object.
- SoA (Struct of Arrays): int x[N], y[N], z[N];
  Better for SIMD and accessing single fields across many objects.
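To make the difference concrete, the sketch below sums only the `x` field under each layout. The type and function names (`struct point`, `struct points_soa`, `sum_x_aos`, `sum_x_soa`) and the `NPOINTS` constant are hypothetical and chosen for illustration; they are not taken from benchmark.c.

```c
#include <assert.h>
#include <stddef.h>

#define NPOINTS 1024   /* hypothetical element count for this sketch */

/* AoS: the x, y, z of one point sit next to each other in memory. */
struct point { int x, y, z; };

/* SoA: all x values are contiguous, then all y, then all z. */
struct points_soa { int x[NPOINTS], y[NPOINTS], z[NPOINTS]; };

/* Reads x only, but each cache line fetched also carries y and z.
   With 4-byte int, 4 of every 12 bytes loaded are useful. */
long long sum_x_aos(const struct point *pts, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pts[i].x;
    return sum;
}

/* Reads x only, and every byte fetched is an x value. */
long long sum_x_soa(const struct points_soa *pts, size_t n)
{
    assert(n <= NPOINTS);
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pts->x[i];
    return sum;
}
```

A loop that reads `x`, `y`, and `z` of each point together would reverse the comparison: AoS then uses every byte it fetches, while SoA has to stream from three separate arrays.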
What it measures: Pure latency of memory accesses (pointer walking).
- Creates a linked list with nodes scattered randomly in memory.
- Walking the list requires waiting for each memory fetch to complete before knowing the address of the next node.
- This serial dependency makes the benchmark extremely sensitive to latency: the CPU cannot overlap the loads, which practically eliminates memory-level and instruction-level parallelism.
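The dependent-load chain can be sketched with an index array standing in for the linked list: `next[i]` holds the position of the node that follows node `i`. This is a simplification rather than the benchmark's own node layout, and `build_chase_cycle` and `chase` are hypothetical names. Sattolo's algorithm is used here because it guarantees a single cycle through all nodes, so the walk cannot get stuck in a short loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Build next[] as one random cycle covering all n slots (Sattolo's
   algorithm). rand() has a limited range; a real benchmark with large
   arrays would want a wider random generator. */
void build_chase_cycle(size_t *next, size_t n, unsigned seed)
{
    assert(next != NULL && n > 0);
    srand(seed);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1], never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` hops. The address of each load is the
   result of the previous load, so the hops cannot overlap. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t pos = start;
    for (size_t s = 0; s < steps; s++)
        pos = next[pos];
    return pos;
}
```

Dividing the elapsed time of `chase` by `steps` gives an estimate of the average latency per access for a working set of `n * sizeof(size_t)` bytes.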
- Timing: Uses high-resolution platform APIs (QueryPerformanceCounter on Windows, clock_gettime on POSIX).
- Compilation: The -O2 optimization level gives realistic code generation. Variables are marked volatile where necessary so that the compiler does not optimize the memory accesses away entirely.
- Hardware Detection: The Python script uses platform and subprocess (PowerShell on Windows, sysctl on macOS, /sys on Linux) to detect the hardware specifications.
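The timing and volatile points above can be combined into a small sketch. It assumes `CLOCK_MONOTONIC` as the POSIX clock and a volatile-qualified pointer as the way to keep loads alive; the source does not say which clock or which variables benchmark.c actually uses. The names `now_seconds`, `sum_volatile`, and `time_one_pass` are hypothetical.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime under -std=c11 */
#endif

#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

/* Monotonic time in seconds, measured from an arbitrary starting point. */
double now_seconds(void)
{
#ifdef _WIN32
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    return (double)count.QuadPart / (double)freq.QuadPart;
#else
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}

/* Reading through a volatile-qualified pointer forces one real load per
   element, even at -O2. */
long long sum_volatile(const volatile int *data, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Time a single pass over data; the sum is returned through sum_out. */
double time_one_pass(const int *data, size_t n, long long *sum_out)
{
    double t0 = now_seconds();
    long long sum = sum_volatile(data, n);
    double t1 = now_seconds();
    if (sum_out != NULL)
        *sum_out = sum;
    return t1 - t0;
}
```

A single pass over a small array is usually too short to time reliably, so in practice the pass would be repeated many times and the total divided by the repetition count.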
This project is open source and provided for educational and research purposes.