
Conversation


Copilot AI commented Feb 4, 2026

The PyTorch benchmark measures end-to-end performance but doesn't separate numpy array loading from tensor conversion overhead. This makes it difficult to identify whether I/O or conversion is the bottleneck.

Changes

New Dataset class: OMEArrowDatasetNumpy

  • Returns np.ndarray instead of torch.Tensor
  • Enables isolated measurement of numpy loading time
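A minimal sketch of what such a numpy-returning dataset could look like; class, path, and column names here are illustrative rather than the actual OMEArrowDatasetNumpy implementation:

  import numpy as np
  import pyarrow.parquet as pq
  from torch.utils.data import Dataset


  class NumpyCropDataset(Dataset):
      """Illustrative dataset that stops at np.ndarray instead of torch.Tensor."""

      def __init__(self, parquet_path: str, image_column: str = "image"):
          # Load the whole table up front; a real implementation may stream or memory-map.
          self.table = pq.read_table(parquet_path)
          self.image_column = image_column

      def __len__(self) -> int:
          return self.table.num_rows

      def __getitem__(self, idx: int) -> np.ndarray:
          # Return the raw numpy array so conversion cost can be timed separately.
          value = self.table.column(self.image_column)[idx].as_py()
          return np.asarray(value, dtype=np.float32)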

New benchmark: benchmark_numpy_vs_torch()
Measures three operations separately:

  • Numpy array loading (Dataset → numpy)
  • Tensor conversion (torch.from_numpy().float())
  • Total time (Dataset → torch)

Metrics tracked:

  • p50/p95/p99 latencies for each operation
  • Conversion overhead as percentage of total time
  • Throughput (samples/sec) for numpy vs torch
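A rough sketch of how the timing split and these metrics could be computed (function and key names are illustrative, not the benchmark's actual API):

  import time
  import numpy as np
  import torch


  def time_numpy_vs_torch(dataset, indices):
      """Time numpy loading and tensor conversion separately for each sample."""
      numpy_times, convert_times = [], []
      for idx in indices:
          t0 = time.perf_counter()
          arr = dataset[idx]                      # Dataset -> numpy
          t1 = time.perf_counter()
          tensor = torch.from_numpy(arr).float()  # numpy -> torch
          t2 = time.perf_counter()
          numpy_times.append(t1 - t0)
          convert_times.append(t2 - t1)
      total = np.array(numpy_times) + np.array(convert_times)
      return {
          "numpy_p50_ms": float(np.percentile(numpy_times, 50) * 1e3),
          "convert_p50_ms": float(np.percentile(convert_times, 50) * 1e3),
          "overhead_pct": float(np.sum(convert_times) / np.sum(total) * 100),
          "torch_throughput": float(len(indices) / np.sum(total)),
      }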

Integration:

  • Runs as Track 4 alongside existing tracks
  • Results: data/pytorch_benchmark_track4.parquet
  • Plots: Time breakdown and conversion overhead percentage

Example output

[Track 4] format=Parquet
  Run 1/3:
    Numpy:      p50=0.042ms, throughput=23809.5 samples/s
    Conversion: p50=0.008ms, overhead=16.0%
    Torch:      p50=0.050ms, throughput=20000.0 samples/s

Expected conversion overhead: 10-20% for table formats, higher for small images (20-40%), lower for large images (5-10%).

When overhead >30%: Consider zero-copy operations, batched conversions, or staying in torch throughout the pipeline.

When overhead <10%: Focus on I/O optimization (storage, caching, workers) rather than conversion.
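For context, torch.from_numpy already shares memory with the source array; the copy usually comes from the dtype cast. A hedged sketch of two ways to shrink that cost, assuming float32 storage or batch-level conversion fits the pipeline:

  import numpy as np
  import torch

  # 1. Keep data in float32 so torch.from_numpy stays zero-copy and no cast is needed.
  arr = np.zeros((1, 128, 128), dtype=np.float32)
  tensor = torch.from_numpy(arr)          # shares memory with arr, no copy

  # 2. Convert once per batch rather than once per sample to amortize per-call overhead.
  batch = np.zeros((32, 1, 128, 128), dtype=np.uint8)
  batch_tensor = torch.from_numpy(batch).float().div_(255.0)   # one cast for the whole batch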

Original prompt

This section details the original issue you should resolve

<issue_title>Add pytorch benchmark</issue_title>
<issue_description># PyTorch-Focused Benchmarking for Image-Based Profiling File Format

Context

We already maintain baseline benchmarks for the file format itself
(e.g., storage size, raw read/write throughput, sequential I/O).
This effort should focus only on PyTorch-facing performance:
how the format behaves when accessed via torch.utils.data.Dataset
and DataLoader, and how that affects real training or inference workloads.

The intent is to benchmark what users actually experience when using
the format in PyTorch-based image profiling pipelines.


Goals

Design and implement a benchmark suite that answers:

  1. How fast and stable is Dataset.__getitem__ under realistic access patterns?
  2. How does performance scale with common PyTorch DataLoader settings?
  3. Does the format reduce data-loading stalls in end-to-end model workflows?

Non-goals

  • Repeating generic file-format benchmarks (on-disk size, raw I/O MB/s)
  • Evaluating or optimizing model accuracy
  • GPU kernel or model architecture benchmarking

Benchmark Scope

Track 1 — Dataset / __getitem__ Microbenchmark

Evaluate the behavior of the Dataset implementation itself.

Access patterns

  • Random object-level access (e.g., random object IDs)
  • Grouped access (e.g., all objects from a site, well, or contiguous range)
  • Optional: paired reads per sample (two views, contrastive-style workflows)

Metrics

  • __getitem__ latency: p50 / p95 / p99
  • Samples per second in a tight loop
  • Warm-up vs steady-state behavior

Configuration dimensions

  • Crop size and shape
  • Channel selection
  • Output dtype
  • Decode path
  • Transform tier:
    • none (I/O ceiling)
    • light (normalize, resize)
    • typical (domain-relevant light augmentation)

Output

  • Structured, machine-readable results (JSON or CSV)
  • One row per run, including configuration and environment metadata
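A minimal sketch of the kind of loop Track 1 implies, with a warm-up split, percentile reporting, and one JSON row per run (all names are placeholders, not the repo's actual code):

  import json
  import time
  import numpy as np


  def benchmark_getitem(dataset, indices, warmup=50):
      """Time __getitem__ in a tight loop, separating warm-up from steady state."""
      latencies = []
      for idx in indices:
          t0 = time.perf_counter()
          _ = dataset[idx]
          latencies.append(time.perf_counter() - t0)
      steady = np.array(latencies[warmup:])
      row = {
          "n_samples": len(indices),
          "warmup_samples": warmup,
          "p50_ms": float(np.percentile(steady, 50) * 1e3),
          "p95_ms": float(np.percentile(steady, 95) * 1e3),
          "p99_ms": float(np.percentile(steady, 99) * 1e3),
          "samples_per_sec": float(len(steady) / steady.sum()),
      }
      print(json.dumps(row))
      return row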

Track 2 — PyTorch DataLoader Throughput

Measure performance at the DataLoader output boundary.

Parameters to explore

  • num_workers
  • batch_size
  • pin_memory
  • persistent_workers
  • prefetch_factor (when applicable)

Metrics

  • Samples per second
  • Batch time distribution (p50 / p95)
  • First-batch latency (worker startup overhead)

Modes

  • I/O-only transforms
  • Typical profiling transforms
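One possible shape of the Track 2 sweep; the parameter grid, step count, and result fields are assumptions for illustration only:

  import time
  from itertools import product
  from torch.utils.data import DataLoader


  def sweep_dataloader(dataset, steps=100):
      """Measure samples/sec and first-batch latency across DataLoader settings."""
      results = []
      for num_workers, batch_size in product([0, 2, 4, 8], [16, 64]):
          loader = DataLoader(
              dataset,
              batch_size=batch_size,
              num_workers=num_workers,
              pin_memory=True,
              persistent_workers=num_workers > 0,
          )
          it = iter(loader)
          t0 = time.perf_counter()
          next(it)                               # first batch includes worker startup
          first_batch_s = time.perf_counter() - t0
          t1 = time.perf_counter()
          n = 0
          for _ in range(steps - 1):
              batch = next(it, None)
              if batch is None:
                  break
              n += batch_size
          elapsed = time.perf_counter() - t1
          results.append({
              "num_workers": num_workers,
              "batch_size": batch_size,
              "first_batch_s": first_batch_s,
              "samples_per_sec": n / elapsed if elapsed > 0 else 0.0,
          })
      return results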

Track 3 — End-to-End Fixed-Step Model Loop

Evaluate the format in a realistic PyTorch workload.

Workloads

  • Embedding extraction (forward pass only), or
  • Small, standard training loop using a simple reference model

Controls

  • Fixed number of steps (not epochs)
  • Fixed random seed
  • Fixed input shape and model
  • Optional AMP (must be consistent across runs)

Metrics

  • Step time (p50 / p95)
  • Images per second
  • Fraction of time waiting on data (input stall proxy)
  • Optional: GPU utilization (best-effort)
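A sketch of how the input-stall proxy in Track 3 could be measured; the model and loader are assumed to exist, batches are assumed to be (image, label) pairs, and the split is simply "time waiting on the loader" versus "time in the forward pass":

  import time
  import torch


  def run_fixed_steps(model, loader, n_steps=200, device="cpu"):
      """Run a fixed number of forward passes and report the data-wait fraction."""
      model.eval()
      data_time, step_time = 0.0, 0.0
      it = iter(loader)
      with torch.no_grad():
          for _ in range(n_steps):
              t0 = time.perf_counter()
              try:
                  images, _ = next(it)             # time spent waiting on data
              except StopIteration:
                  it = iter(loader)
                  images, _ = next(it)
              t1 = time.perf_counter()
              _ = model(images.to(device))         # forward pass only (embedding extraction)
              t2 = time.perf_counter()
              data_time += t1 - t0
              step_time += t2 - t1
      total = data_time + step_time
      return {"data_wait_fraction": data_time / total,
              "images_per_sec": n_steps * loader.batch_size / total}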

Dataset & Sampling Expectations

  • Access patterns should reflect image-based profiling use cases:
    • object-level crops
    • grouped site/well reads
  • Sampling should be deterministic when seeded
  • Support both:
    • small subsets for CI or smoke testing
    • full datasets for real benchmarking
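Deterministic-when-seeded sampling can lean on a seeded torch.Generator, for example (a sketch with a stand-in dataset, not the project's actual sampler):

  import torch
  from torch.utils.data import DataLoader, RandomSampler, TensorDataset

  dataset = TensorDataset(torch.arange(100))                 # stand-in dataset
  generator = torch.Generator().manual_seed(42)              # fixed seed -> reproducible order
  sampler = RandomSampler(dataset, generator=generator)      # same permutation on every run
  loader = DataLoader(dataset, batch_size=8, sampler=sampler)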

Reproducibility & Reporting

Each benchmark run should capture:

  • Configuration parameters used
  • Random seed
  • Timestamp
  • Software versions (PyTorch, CUDA, Python)
  • Hardware summary (CPU, RAM, GPU if applicable)
  • Storage type if detectable

Results should be easy to aggregate and plot across runs.
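Capturing that metadata can be as simple as the following hedged sketch; the exact fields the repo records may differ:

  import json
  import platform
  import sys
  from datetime import datetime, timezone

  import torch


  def environment_metadata(seed: int, config: dict) -> dict:
      """Collect run configuration plus software/hardware context for one result row."""
      return {
          "config": config,
          "seed": seed,
          "timestamp": datetime.now(timezone.utc).isoformat(),
          "python": sys.version.split()[0],
          "torch": torch.__version__,
          "cuda": torch.version.cuda,                        # None on CPU-only builds
          "cpu": platform.processor(),
          "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
      }


  print(json.dumps(environment_metadata(seed=0, config={"batch_size": 32})))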


Acceptance Criteria

  • Benchmarks produce structured, machine-readable output
  • Results clearly separate:
    • Dataset-level costs
    • DataLoader-level scaling effects
    • End-to-end training/inference behavior
  • Documentation explains:
    • how to run benchmarks
    • how to interpret reported metrics
  • A minimal configuration exists for quick execution in CI or local testing
  • A plot is generated comparable to those already produced for existing benchmarks
  • The code follows the format and placement conventions of comparable files in the repo

Notes & Risks

  • Transform cost must be clearly separated from I/O cost
  • Warm-up effects should be measured and reported
  • Random-access performance may be affected by OS caching;
    first-pass vs steady-state behavior should be distinguished where possible
    </issue_description>

<agent_instructions>Please work on this in alignment with other work which has occurred in this project.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits February 4, 2026 21:35
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add PyTorch benchmarking for image-based profiling" to "Add PyTorch DataLoader benchmarks for OME-Arrow format" on Feb 4, 2026
Copilot AI requested a review from d33bs February 4, 2026 21:50
Copilot AI and others added 3 commits February 4, 2026 22:10
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Copilot AI changed the title from "Add PyTorch DataLoader benchmarks for OME-Arrow format" to "Add missing formats and improve label readability in PyTorch benchmark" on Feb 4, 2026
Copilot AI and others added 2 commits February 4, 2026 22:27
…mats are generated

Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Copilot AI changed the title from "Add missing formats and improve label readability in PyTorch benchmark" to "Optimize DataLoader performance and fix missing format data generation" on Feb 4, 2026
Copilot AI and others added 4 commits February 6, 2026 21:07
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Co-authored-by: d33bs <3738008+d33bs@users.noreply.github.com>
Copilot AI changed the title from "Optimize DataLoader performance and fix missing format data generation" to "Add Track 4: Numpy vs Torch tensor conversion benchmark" on Feb 6, 2026
d33bs (Member) commented Feb 6, 2026

This was an experiment with GitHub Copilot. Closing this for now; a more careful follow-up will come later.

d33bs closed this Feb 6, 2026