
Add PyTorch benchmark #23

PyTorch-Focused Benchmarking for Image-Based Profiling File Format

Context

We already maintain baseline benchmarks for the file format itself
(e.g., storage size, raw read/write throughput, sequential I/O).
This effort should focus only on PyTorch-facing performance:
how the format behaves when accessed via torch.utils.data.Dataset
and DataLoader, and how that affects real training or inference workloads.

The intent is to benchmark what users actually experience when using
the format in PyTorch-based image profiling pipelines.


Goals

Design and implement a benchmark suite that answers:

  1. How fast and stable is Dataset.__getitem__ under realistic access patterns?
  2. How does performance scale with common PyTorch DataLoader settings?
  3. Does the format reduce data-loading stalls in end-to-end model workflows?

Non-goals

  • Repeating generic file-format benchmarks (on-disk size, raw I/O MB/s)
  • Evaluating or optimizing model accuracy
  • GPU kernel or model architecture benchmarking

Benchmark Scope

Track 1 — Dataset / __getitem__ Microbenchmark

Evaluate the behavior of the Dataset implementation itself; a runnable sketch follows at the end of this track.

Access patterns

  • Random object-level access (e.g., random object IDs)
  • Grouped access (e.g., all objects from a site, well, or contiguous range)
  • Optional: paired reads per sample (two views, contrastive-style workflows)

Metrics

  • __getitem__ latency: p50 / p95 / p99
  • Samples per second in a tight loop
  • Warm-up vs steady-state behavior

Configuration dimensions

  • Crop size and shape
  • Channel selection
  • Output dtype
  • Decode path
  • Transform tier:
    • none (I/O ceiling)
    • light (normalize, resize)
    • typical (domain-relevant light augmentation)

Output

  • Structured, machine-readable results (JSON or CSV)
  • One row per run, including configuration and environment metadata
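
A minimal sketch of what this microbenchmark could look like. `SyntheticCropDataset` is a hypothetical stand-in for the format's real Dataset class so the timing harness is runnable on its own; indices, crop sizes, and warm-up counts are illustrative only.

```python
import json
import time

import numpy as np
import torch
from torch.utils.data import Dataset


class SyntheticCropDataset(Dataset):
    """Stand-in for the format's Dataset; returns random single-object crops."""

    def __init__(self, n_objects: int = 5000, crop: int = 128, channels: int = 5):
        self.n_objects, self.crop, self.channels = n_objects, crop, channels

    def __len__(self) -> int:
        return self.n_objects

    def __getitem__(self, idx: int) -> torch.Tensor:
        # The real implementation would decode this object's crop from the file format.
        return torch.rand(self.channels, self.crop, self.crop)


def bench_getitem(dataset: Dataset, indices, warmup: int = 50) -> dict:
    """Time __getitem__ over the given indices; report p50/p95/p99 latency."""
    for idx in indices[:warmup]:  # warm-up: caches, lazy initialization
        _ = dataset[idx]

    latencies = []
    start = time.perf_counter()
    for idx in indices[warmup:]:
        t0 = time.perf_counter()
        _ = dataset[idx]
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    lat = np.array(latencies)
    return {
        "n_samples": len(lat),
        "samples_per_sec": len(lat) / elapsed,
        "latency_p50_ms": float(np.percentile(lat, 50) * 1e3),
        "latency_p95_ms": float(np.percentile(lat, 95) * 1e3),
        "latency_p99_ms": float(np.percentile(lat, 99) * 1e3),
    }


if __name__ == "__main__":
    dataset = SyntheticCropDataset()
    # Random object-level access; grouped access would iterate site/well ranges instead.
    g = torch.Generator().manual_seed(0)
    indices = torch.randperm(len(dataset), generator=g)[:2000].tolist()
    print(json.dumps(bench_getitem(dataset, indices), indent=2))
```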

Track 2 — PyTorch DataLoader Throughput

Measure performance at the DataLoader output boundary; see the sketch at the end of this track.

Parameters to explore

  • num_workers
  • batch_size
  • pin_memory
  • persistent_workers
  • prefetch_factor (when applicable)

Metrics

  • Samples per second
  • Batch time distribution (p50 / p95)
  • First-batch latency (worker startup overhead)

Modes

  • I/O-only transforms
  • Typical profiling transforms
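
A sketch of the DataLoader sweep, again using a synthetic `TensorDataset` stand-in rather than the real format-backed Dataset; the parameter grid and batch counts are illustrative, not a fixed matrix.

```python
import time

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def bench_dataloader(dataset, batch_size: int, num_workers: int,
                     pin_memory: bool = False, persistent_workers: bool = False,
                     max_batches: int = 200) -> dict:
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=pin_memory,
        persistent_workers=persistent_workers and num_workers > 0,
        shuffle=True,
    )
    it = iter(loader)

    # First-batch latency captures worker startup overhead.
    t0 = time.perf_counter()
    next(it)
    first_batch_s = time.perf_counter() - t0

    batch_times = []
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            next(it)
        except StopIteration:
            break
        batch_times.append(time.perf_counter() - t0)

    bt = np.array(batch_times)
    return {
        "batch_size": batch_size,
        "num_workers": num_workers,
        "first_batch_s": round(first_batch_s, 4),
        "samples_per_sec": len(bt) * batch_size / bt.sum(),
        "batch_time_p50_ms": float(np.percentile(bt, 50) * 1e3),
        "batch_time_p95_ms": float(np.percentile(bt, 95) * 1e3),
    }


if __name__ == "__main__":
    # Synthetic stand-in: 4096 five-channel 128x128 crops with dummy labels.
    data = TensorDataset(torch.rand(4096, 5, 128, 128), torch.zeros(4096))
    for num_workers in (0, 2, 4):
        for batch_size in (16, 64):
            print(bench_dataloader(data, batch_size, num_workers))
```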

Track 3 — End-to-End Fixed-Step Model Loop

Evaluate the format in a realistic PyTorch workload; a sketch of the fixed-step loop follows this track's metric list.

Workloads

  • Embedding extraction (forward pass only), or
  • Small, standard training loop using a simple reference model

Controls

  • Fixed number of steps (not epochs)
  • Fixed random seed
  • Fixed input shape and model
  • Optional AMP (must be consistent across runs)

Metrics

  • Step time (p50 / p95)
  • Images per second
  • Fraction of time waiting on data (input stall proxy)
  • Optional: GPU utilization (best-effort)
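
A sketch of the fixed-step loop, assuming a tiny reference model and synthetic data as stand-ins (AMP omitted for brevity); the input-stall proxy is measured as the fraction of wall time spent waiting on `next(it)`.

```python
import time

import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def bench_fixed_steps(model: nn.Module, loader: DataLoader,
                      num_steps: int = 100, device: str = "cpu") -> dict:
    """Forward-only fixed-step loop reporting step time and an input-stall proxy."""
    model = model.to(device).eval()
    step_times, data_wait = [], 0.0
    it = iter(loader)

    total_start = time.perf_counter()
    with torch.no_grad():
        for _ in range(num_steps):
            t0 = time.perf_counter()
            images, _ = next(it)          # time spent waiting on data
            data_wait += time.perf_counter() - t0

            images = images.to(device, non_blocking=True)
            _ = model(images)             # embedding extraction: forward pass only
            if device == "cuda":
                torch.cuda.synchronize()
            step_times.append(time.perf_counter() - t0)
    total = time.perf_counter() - total_start

    st = np.array(step_times)
    return {
        "step_time_p50_ms": float(np.percentile(st, 50) * 1e3),
        "step_time_p95_ms": float(np.percentile(st, 95) * 1e3),
        "images_per_sec": num_steps * loader.batch_size / total,
        "data_wait_fraction": data_wait / total,
    }


if __name__ == "__main__":
    torch.manual_seed(0)  # fixed seed control
    # Synthetic stand-in data, fixed input shape, and a tiny reference model.
    data = TensorDataset(torch.rand(4096, 5, 128, 128), torch.zeros(4096))
    loader = DataLoader(data, batch_size=32, num_workers=2, shuffle=True)
    model = nn.Sequential(nn.Conv2d(5, 16, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten())
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(bench_fixed_steps(model, loader, num_steps=100, device=device))
```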

Dataset & Sampling Expectations

  • Access patterns should reflect image-based profiling use cases:
    • object-level crops
    • grouped site/well reads
  • Sampling should be deterministic when seeded
  • Support both:
    • small subsets for CI or smoke testing
    • full datasets for real benchmarking
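
A sketch of seeded, deterministic index generation for the two access patterns; `groups` is a hypothetical well/site-to-object-index mapping standing in for whatever grouping metadata the format exposes.

```python
import torch


def random_object_indices(n_objects: int, n_samples: int, seed: int) -> list:
    """Seeded random object-level access pattern."""
    g = torch.Generator().manual_seed(seed)
    return torch.randperm(n_objects, generator=g)[:n_samples].tolist()


def grouped_indices(groups: dict, seed: int) -> list:
    """Seeded grouped access: shuffle group order, read each group contiguously."""
    g = torch.Generator().manual_seed(seed)
    keys = list(groups)
    order = torch.randperm(len(keys), generator=g).tolist()
    return [idx for i in order for idx in groups[keys[i]]]


# Example: objects grouped by well; a CI subset could simply truncate the result.
wells = {"A01": [0, 1, 2], "A02": [3, 4], "B01": [5, 6, 7, 8]}
assert grouped_indices(wells, seed=0) == grouped_indices(wells, seed=0)  # deterministic
```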

Reproducibility & Reporting

Each benchmark run should capture:

  • Configuration parameters used
  • Random seed
  • Timestamp
  • Software versions (PyTorch, CUDA, Python)
  • Hardware summary (CPU, RAM, GPU if applicable)
  • Storage type if detectable

Results should be easy to aggregate and plot across runs.
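
A sketch of per-run metadata capture along these lines; field names are illustrative rather than a fixed schema.

```python
import json
import platform
import sys
from datetime import datetime, timezone

import torch


def run_metadata(config: dict, seed: int) -> dict:
    """Environment and configuration metadata recorded alongside each benchmark row."""
    cuda_available = torch.cuda.is_available()
    return {
        "config": config,
        "seed": seed,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "pytorch": torch.__version__,
        "cuda": torch.version.cuda if cuda_available else None,
        "cpu": platform.processor() or platform.machine(),
        "gpu": torch.cuda.get_device_name(0) if cuda_available else None,
    }


if __name__ == "__main__":
    # Each benchmark result row merges its metrics with this metadata.
    metrics = {"samples_per_sec": 1234.5}  # placeholder metrics from a run
    row = {**run_metadata({"batch_size": 64, "num_workers": 4}, seed=0), **metrics}
    print(json.dumps(row, indent=2))
```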


Acceptance Criteria

  • Benchmarks produce structured, machine-readable output
  • Results clearly separate:
    • Dataset-level costs
    • DataLoader-level scaling effects
    • End-to-end training/inference behavior
  • Documentation explains:
    • how to run benchmarks
    • how to interpret reported metrics
  • A minimal configuration exists for quick execution in CI or local testing
  • A summary plot is generated, comparable to the plots already produced for the existing format benchmarks
  • Benchmark code is organized and placed in the repo consistently with the existing benchmark files

Notes & Risks

  • Transform cost must be clearly separated from I/O cost
  • Warm-up effects should be measured and reported
  • Random-access performance may be affected by OS caching;
    first-pass vs steady-state behavior should be distinguished where possible
