Skip to content

Batch encode: coarsen rayon tasks with with_min_len#2077

Open
sebpop wants to merge 1 commit into
huggingface:mainfrom
sebpop:p2-second
Open

Batch encode: coarsen rayon tasks with with_min_len#2077
sebpop wants to merge 1 commit into
huggingface:mainfrom
sebpop:p2-second

Conversation

@sebpop
Copy link
Copy Markdown
Contributor

@sebpop sebpop commented May 28, 2026

encode_batch, encode_batch_char_offsets and encode_batch_fast ran inputs.into_maybe_par_iter().map(...).collect(). rayon's bridge_producer_consumer splits that down toward one item per task, and at high core counts the per-task wake/join signaling (an LSE CAS/LDADD on aarch64) becomes a large fraction of cycles.

Route all three through a shared run_batch helper that:

  • takes an explicit serial fast path (into_iter().map().collect()) when parallelism is disabled or only one worker is available, so single-threaded callers pay no rayon cost at all; and
  • otherwise distributes work with IndexedParallelIterator::with_min_len(min_len), which sets a floor on the items processed sequentially per task. min_len aims for ~4 tasks per worker (load balance) and is capped at 8 (avoid oversized tasks): min_len = ceil(n / (threads*4)).clamp(1, 8).

This is intentionally the smallest change that coarsens the tasks: it is one rayon method call rather than a hand-written work queue, and it keeps rayon's scheduler.

Measured on three machines, bpe_benchmark/bpe-encode/BPE GPT2 encode batch (data/big.txt, encode_batch through the post-processor), baseline = current main, patched = this change.

NVIDIA Vera (aarch64 Olympus, 88 physical / 176 logical):

  threads  baseline      patched       change
  1        3.91 MiB/s    4.63 MiB/s    +18%   (serial fast path)
  88       18.03 MiB/s   19.57 MiB/s   +8.5%
  176      17.21 MiB/s   18.67 MiB/s   +8.5%

AMD EPYC 9124 (x86_64, 16 physical / 32 logical):

  threads  baseline      patched       change
  1        3.69 MiB/s    3.89 MiB/s    +5.5%
  16       23.38 MiB/s   25.05 MiB/s   +7.1%
  32       24.11 MiB/s   25.81 MiB/s   +7.0%

Apple M3 (aarch64, 12 cores, dev host):

  threads  baseline      patched       change
  1        4.66 MiB/s    4.70 MiB/s    ~0
  6        14.97 MiB/s   14.67 MiB/s   ~0 (within noise)
  12       19.08 MiB/s   17.89 MiB/s   within thermal noise

The M3 is a thermally-limited 12-core laptop; its criterion intervals are wide (+/-5%) and vary across a long measurement session, and the result is insensitive to the min_len cap (cap 8/4/2 all land in the same band). Treat it as approximately neutral, not a regression — the reliable signal is the two isolated servers.

Atomics / bottleneck, perf record -g --call-graph fp -F 4999:

aarch64 (Vera) at 88T -- LSE atomic outlined-call share (sum of __aarch64_*):

  baseline  4.97%
  patched   0.91%   (-5.5x)

Fewer rayon tasks means fewer per-task atomic operations, which is why the LSE share drops sharply on aarch64 where each LSE op is expensive.

x86_64 (EPYC) at 16T -- the same rayon/crossbeam-epoch machinery is cheap here, so there is no comparable atomic bottleneck to remove:

  symbol                                       baseline  patched
  crossbeam_epoch::default::with_handle          1.19%    1.62%
  rayon_core::WorkerThread::wait_until_cold      0.69%    0.96%
  crossbeam_epoch::internal::Global::try_advance 0.19%    0.25%

The x86_64 gain therefore comes from reduced task-scheduling overhead generally, not from atomics -- x86_64 never had the atomic problem that aarch64 has.

Remaining ceiling: on aarch64, crossbeam-epoch dispatch (try_advance + with_handle + wait_until_cold) is ~56% of cycles in both baseline and patched -- with_min_len does not touch it. Removing that requires replacing rayon's scheduler on the hot path and is left to a follow-up.

cargo test --lib --features http
 201 passed, 0 failed.

`encode_batch`, `encode_batch_char_offsets` and `encode_batch_fast`
ran `inputs.into_maybe_par_iter().map(...).collect()`.  rayon's
`bridge_producer_consumer` splits that down toward one item per task,
and at high core counts the per-task wake/join signaling (an LSE
CAS/LDADD on aarch64) becomes a large fraction of cycles.

Route all three through a shared `run_batch` helper that:

- takes an explicit serial fast path (`into_iter().map().collect()`)
  when parallelism is disabled or only one worker is available, so
  single-threaded callers pay no rayon cost at all; and
- otherwise distributes work with
  `IndexedParallelIterator::with_min_len(min_len)`, which sets a floor
  on the items processed sequentially per task.  `min_len` aims for
  ~4 tasks per worker (load balance) and is capped at 8 (avoid
  oversized tasks): `min_len = ceil(n / (threads*4)).clamp(1, 8)`.

This is intentionally the smallest change that coarsens the tasks: it
is one rayon method call rather than a hand-written work queue, and it
keeps rayon's scheduler.

Because `run_batch` drives rayon directly instead of going through
`into_maybe_par_iter`, it must reproduce that helper's side effect of
recording parallelism use.  A new `parallelism::set_parallelism_used()`
is called whenever the parallel path is taken, so the Python
`pthread_atfork` child hook still disables rayon in forked children
(the documented multiprocessing-deadlock protection).  A regression
test, `encode_batch_marks_parallelism_used`, asserts
`has_parallelism_been_used()` after a parallel `encode_batch`.

Measured on three machines, `bpe_benchmark`/`bpe-encode/BPE GPT2
encode batch` (data/big.txt, encode_batch through the post-processor),
baseline = current main, patched = this change.

NVIDIA Vera (aarch64 Olympus, 88 physical / 176 logical):

  threads  baseline      patched       change
  1        3.91 MiB/s    4.63 MiB/s    +18%   (serial fast path)
  88       18.03 MiB/s   19.57 MiB/s   +8.5%
  176      17.21 MiB/s   18.67 MiB/s   +8.5%

AMD EPYC 9124 (x86_64, 16 physical / 32 logical):

  threads  baseline      patched       change
  1        3.69 MiB/s    3.89 MiB/s    +5.5%
  16       23.38 MiB/s   25.05 MiB/s   +7.1%
  32       24.11 MiB/s   25.81 MiB/s   +7.0%

Apple M3 (aarch64, 12 cores, dev host):

  threads  baseline      patched       change
  1        4.66 MiB/s    4.70 MiB/s    ~0
  6        14.97 MiB/s   14.67 MiB/s   ~0 (within noise)
  12       19.08 MiB/s   17.89 MiB/s   within thermal noise

The M3 is a thermally-limited 12-core laptop; its criterion intervals
are wide (+/-5%) and vary across a long measurement session, and the
result is insensitive to the `min_len` cap (cap 8/4/2 all land in the
same band).  Treat it as approximately neutral, not a regression — the
reliable signal is the two isolated servers.

Atomics / bottleneck, `perf record -g --call-graph fp -F 4999`:

aarch64 (Vera) at 88T -- LSE atomic outlined-call share
(sum of `__aarch64_*`):

  baseline  4.97%
  patched   0.91%   (-5.5x)

Fewer rayon tasks means fewer per-task atomic operations, which is why
the LSE share drops sharply on aarch64 where each LSE op is expensive.

x86_64 (EPYC) at 16T -- the same rayon/crossbeam-epoch machinery is
cheap here, so there is no comparable atomic bottleneck to remove:

  symbol                                       baseline  patched
  crossbeam_epoch::default::with_handle          1.19%    1.62%
  rayon_core::WorkerThread::wait_until_cold      0.69%    0.96%
  crossbeam_epoch::internal::Global::try_advance 0.19%    0.25%

The x86_64 gain therefore comes from reduced task-scheduling overhead
generally, not from atomics -- x86_64 never had the atomic problem
that aarch64 has.

Remaining ceiling: on aarch64, crossbeam-epoch dispatch
(`try_advance` + `with_handle` + `wait_until_cold`) is ~56% of cycles
in both baseline and patched -- with_min_len does not touch it.
Removing that requires replacing rayon's scheduler on the hot path and
is left to a follow-up.

cargo test --lib --features http: 202 passed, 0 failed.
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sebpop
Copy link
Copy Markdown
Contributor Author

sebpop commented May 29, 2026

The failing check (test_tutorial_train_from_iterators.py::TestTrainFromIterators::test_datasets, an @pytest.mark.network test) is not caused by this PR — it's a pre-existing failure on main. This PR only changes the Rust encode_batch path (run_batch/with_min_len); the test trains via train_from_iterator over datasets.load_dataset(...), which this PR doesn't touch.

Root cause: the test loads the bare dataset id "wikitext", which has been relocated on the Hub (https://huggingface.co/api/datasets/wikitext → 307 redirect). Newer huggingface_hub enforces the namespace/name form and rejects the bare id with HfUriError: Invalid HF URI 'hf://datasets/wikitext@/.huggingface.yaml'. Repository id must be 'namespace/name'. pyproject.toml pins huggingface_hub>=0.16.4,<2.0, so CI picked up a hub version that enforces this.

Fix is a one-liner — use the canonical relocated id Salesforce/wikitext (returns 200) at both call sites (lines 37 and 71); the config wikitext-103-raw-v1 is unchanged. Happy to send it as a separate small PR since it's independent of the batch-encode change, or you can fold it in — your call.

--- a/bindings/python/tests/documentation/test_tutorial_train_from_iterators.py
+++ b/bindings/python/tests/documentation/test_tutorial_train_from_iterators.py
@@ -34,7 +34,7 @@ class TestTrainFromIterators:
         # START load_dataset
         import datasets  # type: ignore[import-not-found]
 
-        dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train+test+validation")
+        dataset = datasets.load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train+test+validation")
         # END load_dataset
@@ -68,7 +68,7 @@ class TestTrainFromIterators:
         # In order to keep tests fast, we only use the first 100 examples
         os.environ["TOKENIZERS_PARALLELISM"] = "true"
-        dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train[0:100]")
+        dataset = datasets.load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train[0:100]")

@sebpop
Copy link
Copy Markdown
Contributor Author

sebpop commented May 29, 2026

CI fail should be fixed by: #2079

@vyalamar
Copy link
Copy Markdown

@sebpop could you open keep the earlier PR open? I was working on the benchmarks I will publish .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants