47 changes: 41 additions & 6 deletions README.md
@@ -4,23 +4,58 @@

Benchmarking [OME Arrow](https://github.com/WayScience/ome-arrow) through Parquet, Vortex, LanceDB, and more.

## Available Benchmarks

### File Format Benchmarks

1. **compare_parquet_vortex_lance.py** - Wide dataset benchmark (~100k rows × 4k columns)
2. **compare_parquet_vortex_lance_ome.py** - OME-Arrow variant with image column
3. **compare_ome_arrow_only.py** - OME-Arrow-only + OME-Zarr + TIFF comparison

### PyTorch Integration Benchmark

4. **pytorch_benchmark.py** - PyTorch-focused performance testing
- Track 1: Dataset `__getitem__` microbenchmark
- Track 2: DataLoader throughput
- Track 3: End-to-end model training loop

See [PyTorch Benchmark Documentation](docs/pytorch_benchmark.md) for details.
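
To give a feel for what Track 1 measures, a `__getitem__` microbenchmark boils down to timing single-sample fetches. The sketch below is illustrative only, not the script's actual code; `time_getitem`, `dataset`, and `n_samples` are placeholder names:

```python
import time
import numpy as np

def time_getitem(dataset, n_samples=1_000):
    """Median latency and throughput of single-sample access (Track 1 style)."""
    latencies = []
    for i in range(n_samples):
        start = time.perf_counter()
        _ = dataset[i % len(dataset)]  # one sample per call, as a DataLoader would request
        latencies.append(time.perf_counter() - start)
    p50 = float(np.percentile(latencies, 50))
    return p50, 1.0 / p50  # median seconds per sample, samples per second
```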

## Running benchmarks

1. Create and sync a uv environment (includes parquet, lancedb, vortex-data, pytorch):

```bash
uv venv
uv sync
```

2. Run individual benchmarks:

```bash
# File format benchmarks
uv run python src/benchmarks/compare_parquet_vortex_lance.py
uv run python src/benchmarks/compare_parquet_vortex_lance_ome.py
uv run python src/benchmarks/compare_ome_arrow_only.py

# PyTorch benchmark
uv run python src/benchmarks/pytorch_benchmark.py
```

Or run all benchmarks at once:

```bash
poe run-benchmarks
```

## Configuration

The benchmarks default to ~100,000 rows × ~4,000 columns of `float64` data and ~50 columns of `string` data. Lower `N_ROWS`/`N_COLS` in the config section if you hit memory pressure.

For PyTorch benchmarks, adjust `N_ROWS` (default: 1,000 images) and other parameters in `src/benchmarks/pytorch_benchmark.py`. See the [documentation](docs/pytorch_benchmark.md) for configuration details.
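
Lowering these presumably amounts to editing the module-level constants in each script's config section; a hypothetical override (the values here are for illustration, not the defaults):

```python
# Shrink the synthetic dataset if you run into memory pressure.
N_ROWS = 25_000  # default is roughly 100,000
N_COLS = 1_000   # default is roughly 4,000
```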

## Output

Benchmarks generate:
- **Data files**: Results in Parquet and JSON format (`data/` directory); see the loading example after this list
- **Plots**: Visualizations of benchmark results (`images/` directory)
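
The Parquet result files are binary, so the quickest way to inspect them is through pandas; a minimal example (the path shown is just one of the generated files):

```python
import pandas as pd

# Load one of the generated result tables and show the first rows.
results = pd.read_parquet("data/pytorch_benchmark_track4.parquet")
print(results.head())
```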
141 changes: 141 additions & 0 deletions docs/TRACK4_README.md
@@ -0,0 +1,141 @@
# Track 4: Numpy vs Torch Benchmark - Quick Reference

## What It Does

Measures the performance impact of loading data into numpy arrays versus torch tensors by separating the following (see the sketch after this list):
1. **Numpy loading time** - Getting data as numpy arrays
2. **Tensor conversion time** - Converting numpy to torch
3. **Total time** - End-to-end torch tensor retrieval
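
A minimal sketch of that separation, assuming a `load_numpy()` callable that returns one sample as a numpy array (`split_timings`, `load_numpy`, and `n_iters` are hypothetical names, not the benchmark's own helpers):

```python
import time
import numpy as np
import torch

def split_timings(load_numpy, n_iters=100):
    """Time numpy loading and numpy->torch conversion separately."""
    numpy_times, conversion_times = [], []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        arr = load_numpy()                      # 1) data as a numpy array
        t1 = time.perf_counter()
        tensor = torch.from_numpy(arr).float()  # 2) conversion to a torch tensor
        t2 = time.perf_counter()
        numpy_times.append(t1 - t0)
        conversion_times.append(t2 - t1)
    numpy_p50 = float(np.percentile(numpy_times, 50))
    conv_p50 = float(np.percentile(conversion_times, 50))
    total_p50 = numpy_p50 + conv_p50            # 3) approximate end-to-end torch time
    return numpy_p50, conv_p50, 100.0 * conv_p50 / total_p50  # overhead in %
```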

## Why It Matters

Helps you understand:
- Is tensor conversion a bottleneck? (10-20% typical, 30%+ is high)
- Should I optimize I/O or conversion?
- What's the overhead of using torch vs staying in numpy?

## Quick Start

### Run the benchmark:
```bash
python src/benchmarks/pytorch_benchmark.py
```

### Test the implementation:
```bash
python test_track4.py
```

### View results:
```bash
# Results data (Parquet is binary; inspect it with pandas rather than cat)
python -c "import pandas as pd; print(pd.read_parquet('data/pytorch_benchmark_track4.parquet'))"

# Summary with all tracks
cat data/pytorch_benchmark_summary.json

# Visualization
open images/pytorch_benchmark_track4.png
```

## Understanding Output

### Example Output:
```
[Track 4] format=Parquet
Run 1/3:
Numpy: p50=0.042ms, throughput=23809.5 samples/s
Conversion: p50=0.008ms, overhead=16.0%
Torch: p50=0.050ms, throughput=20000.0 samples/s
```

### What This Means:
- **Numpy time (0.042ms)**: Time to load and prepare numpy array
- **Conversion time (0.008ms)**: Time for torch.from_numpy().float()
- **Overhead (16%)**: Conversion is 16% of total time (0.008 ms / 0.050 ms ≈ 16%)
- **Torch time (0.050ms)**: Total time for torch tensor

### Interpretation:
- **<10% overhead**: Conversion is negligible, focus on I/O
- **10-30% overhead**: Conversion is measurable but acceptable
- **>30% overhead**: Conversion is significant, consider optimization

## Key Metrics

| Metric | Description | Good/Bad |
|--------|-------------|----------|
| `numpy_p50` | Median numpy loading time | Lower is better |
| `conversion_p50` | Median conversion time | Lower is better |
| `torch_p50` | Median total time | Lower is better |
| `conversion_overhead_pct` | Conversion as % of total | <10% good, >30% high |
| `numpy_samples_per_sec` | Numpy throughput | Higher is better |
| `torch_samples_per_sec` | Torch throughput | Higher is better |

## Optimization Tips

### If overhead is HIGH (>30%):
```python
import numpy as np
import torch

# Instead of:
tensor = torch.from_numpy(arr).float()  # .float() copies when arr is not already float32

# Try:
arr = arr.astype(np.float32)  # do the dtype conversion in numpy first
tensor = torch.from_numpy(arr)  # zero-copy (shares memory with arr)
```

### If overhead is LOW (<10%):
Focus on I/O instead:
- Use faster storage (SSD)
- Increase DataLoader workers (see the sketch after this list)
- Pre-load more data
- Choose optimized formats
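
For instance, increasing DataLoader workers is usually a one-line change; a minimal example, where `dataset`, the batch size, and the worker count are placeholders to tune for your machine:

```python
from torch.utils.data import DataLoader

# Extra workers overlap data loading with training; pin_memory speeds host-to-GPU copies.
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)
```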

## Files & Documentation

- **Implementation**: `src/benchmarks/pytorch_benchmark.py`
- `OMEArrowDatasetNumpy` class (returns numpy)
- `benchmark_numpy_vs_torch()` function
- `run_track4()` orchestration
- `plot_track4_results()` visualization

- **Documentation**:
- Quick guide: `docs/pytorch_benchmark.md` (Track 4 section)
- Detailed guide: `docs/track4_implementation.md`
- This file: `docs/TRACK4_README.md`

- **Testing**: `test_track4.py`

- **Outputs**:
- Data: `data/pytorch_benchmark_track4.parquet`
- Plot: `images/pytorch_benchmark_track4.png`
- Summary: `data/pytorch_benchmark_summary.json`

## Common Questions

**Q: Why only table formats?**
A: Directory formats (TIFF, OME-Zarr) don't use OME-Arrow structures, so conversion overhead is different.

**Q: Why is conversion overhead higher for small images?**
A: Conversion time is relatively fixed, so it's a larger percentage of total time when I/O is fast.

**Q: Should I always avoid tensor conversion?**
A: No! Only optimize if overhead is high (>30%). Most of the time, I/O is the bottleneck.

**Q: Can I disable Track 4?**
A: Yes, remove the `RESULTS_TRACK4_PATH.exists()` clause from the `RUN_BENCHMARKS` check, or simply ignore its results.

## Integration

Track 4 is fully integrated:
- ✅ Runs automatically with main benchmark
- ✅ Results saved alongside other tracks
- ✅ Included in summary JSON
- ✅ Plots generated automatically
- ✅ Uses same configuration as Track 1

## See Also

- Main documentation: `docs/pytorch_benchmark.md`
- Implementation details: `docs/track4_implementation.md`
- Performance optimization: `docs/pytorch_optimization.md`
- Test script: `test_track4.py`