Batch encode: coarsen rayon tasks with with_min_len by sebpop · Pull Request #2077 · huggingface/tokenizers

sebpop · 2026-05-28T23:09:12Z

encode_batch, encode_batch_char_offsets and encode_batch_fast ran inputs.into_maybe_par_iter().map(...).collect(). rayon's bridge_producer_consumer splits that down toward one item per task, and at high core counts the per-task wake/join signaling (an LSE CAS/LDADD on aarch64) becomes a large fraction of cycles.

Route all three through a shared run_batch helper that:

takes an explicit serial fast path (into_iter().map().collect()) when parallelism is disabled or only one worker is available, so single-threaded callers pay no rayon cost at all; and
otherwise distributes work with IndexedParallelIterator::with_min_len(min_len), which sets a floor on the items processed sequentially per task. min_len aims for ~4 tasks per worker (load balance) and is capped at 8 (avoid oversized tasks): min_len = ceil(n / (threads*4)).clamp(1, 8).

This is intentionally the smallest change that coarsens the tasks: it is one rayon method call rather than a hand-written work queue, and it keeps rayon's scheduler.

Measured on three machines, bpe_benchmark/bpe-encode/BPE GPT2 encode batch (data/big.txt, encode_batch through the post-processor), baseline = current main, patched = this change.

NVIDIA Vera (aarch64 Olympus, 88 physical / 176 logical):

  threads  baseline      patched       change
  1        3.91 MiB/s    4.63 MiB/s    +18%   (serial fast path)
  88       18.03 MiB/s   19.57 MiB/s   +8.5%
  176      17.21 MiB/s   18.67 MiB/s   +8.5%

AMD EPYC 9124 (x86_64, 16 physical / 32 logical):

  threads  baseline      patched       change
  1        3.69 MiB/s    3.89 MiB/s    +5.5%
  16       23.38 MiB/s   25.05 MiB/s   +7.1%
  32       24.11 MiB/s   25.81 MiB/s   +7.0%

Apple M3 (aarch64, 12 cores, dev host):

  threads  baseline      patched       change
  1        4.66 MiB/s    4.70 MiB/s    ~0
  6        14.97 MiB/s   14.67 MiB/s   ~0 (within noise)
  12       19.08 MiB/s   17.89 MiB/s   within thermal noise

The M3 is a thermally-limited 12-core laptop; its criterion intervals are wide (+/-5%) and vary across a long measurement session, and the result is insensitive to the min_len cap (cap 8/4/2 all land in the same band). Treat it as approximately neutral, not a regression — the reliable signal is the two isolated servers.

Atomics / bottleneck, perf record -g --call-graph fp -F 4999:

aarch64 (Vera) at 88T -- LSE atomic outlined-call share (sum of __aarch64_*):

  baseline  4.97%
  patched   0.91%   (-5.5x)

Fewer rayon tasks means fewer per-task atomic operations, which is why the LSE share drops sharply on aarch64 where each LSE op is expensive.

x86_64 (EPYC) at 16T -- the same rayon/crossbeam-epoch machinery is cheap here, so there is no comparable atomic bottleneck to remove:

  symbol                                       baseline  patched
  crossbeam_epoch::default::with_handle          1.19%    1.62%
  rayon_core::WorkerThread::wait_until_cold      0.69%    0.96%
  crossbeam_epoch::internal::Global::try_advance 0.19%    0.25%

The x86_64 gain therefore comes from reduced task-scheduling overhead generally, not from atomics -- x86_64 never had the atomic problem that aarch64 has.

Remaining ceiling: on aarch64, crossbeam-epoch dispatch (try_advance + with_handle + wait_until_cold) is ~56% of cycles in both baseline and patched -- with_min_len does not touch it. Removing that requires replacing rayon's scheduler on the hot path and is left to a follow-up.

cargo test --lib --features http
 201 passed, 0 failed.

`encode_batch`, `encode_batch_char_offsets` and `encode_batch_fast` ran `inputs.into_maybe_par_iter().map(...).collect()`. rayon's `bridge_producer_consumer` splits that down toward one item per task, and at high core counts the per-task wake/join signaling (an LSE CAS/LDADD on aarch64) becomes a large fraction of cycles. Route all three through a shared `run_batch` helper that: - takes an explicit serial fast path (`into_iter().map().collect()`) when parallelism is disabled or only one worker is available, so single-threaded callers pay no rayon cost at all; and - otherwise distributes work with `IndexedParallelIterator::with_min_len(min_len)`, which sets a floor on the items processed sequentially per task. `min_len` aims for ~4 tasks per worker (load balance) and is capped at 8 (avoid oversized tasks): `min_len = ceil(n / (threads*4)).clamp(1, 8)`. This is intentionally the smallest change that coarsens the tasks: it is one rayon method call rather than a hand-written work queue, and it keeps rayon's scheduler. Because `run_batch` drives rayon directly instead of going through `into_maybe_par_iter`, it must reproduce that helper's side effect of recording parallelism use. A new `parallelism::set_parallelism_used()` is called whenever the parallel path is taken, so the Python `pthread_atfork` child hook still disables rayon in forked children (the documented multiprocessing-deadlock protection). A regression test, `encode_batch_marks_parallelism_used`, asserts `has_parallelism_been_used()` after a parallel `encode_batch`. Measured on three machines, `bpe_benchmark`/`bpe-encode/BPE GPT2 encode batch` (data/big.txt, encode_batch through the post-processor), baseline = current main, patched = this change. NVIDIA Vera (aarch64 Olympus, 88 physical / 176 logical): threads baseline patched change 1 3.91 MiB/s 4.63 MiB/s +18% (serial fast path) 88 18.03 MiB/s 19.57 MiB/s +8.5% 176 17.21 MiB/s 18.67 MiB/s +8.5% AMD EPYC 9124 (x86_64, 16 physical / 32 logical): threads baseline patched change 1 3.69 MiB/s 3.89 MiB/s +5.5% 16 23.38 MiB/s 25.05 MiB/s +7.1% 32 24.11 MiB/s 25.81 MiB/s +7.0% Apple M3 (aarch64, 12 cores, dev host): threads baseline patched change 1 4.66 MiB/s 4.70 MiB/s ~0 6 14.97 MiB/s 14.67 MiB/s ~0 (within noise) 12 19.08 MiB/s 17.89 MiB/s within thermal noise The M3 is a thermally-limited 12-core laptop; its criterion intervals are wide (+/-5%) and vary across a long measurement session, and the result is insensitive to the `min_len` cap (cap 8/4/2 all land in the same band). Treat it as approximately neutral, not a regression — the reliable signal is the two isolated servers. Atomics / bottleneck, `perf record -g --call-graph fp -F 4999`: aarch64 (Vera) at 88T -- LSE atomic outlined-call share (sum of `__aarch64_*`): baseline 4.97% patched 0.91% (-5.5x) Fewer rayon tasks means fewer per-task atomic operations, which is why the LSE share drops sharply on aarch64 where each LSE op is expensive. x86_64 (EPYC) at 16T -- the same rayon/crossbeam-epoch machinery is cheap here, so there is no comparable atomic bottleneck to remove: symbol baseline patched crossbeam_epoch::default::with_handle 1.19% 1.62% rayon_core::WorkerThread::wait_until_cold 0.69% 0.96% crossbeam_epoch::internal::Global::try_advance 0.19% 0.25% The x86_64 gain therefore comes from reduced task-scheduling overhead generally, not from atomics -- x86_64 never had the atomic problem that aarch64 has. Remaining ceiling: on aarch64, crossbeam-epoch dispatch (`try_advance` + `with_handle` + `wait_until_cold`) is ~56% of cycles in both baseline and patched -- with_min_len does not touch it. Removing that requires replacing rayon's scheduler on the hot path and is left to a follow-up. cargo test --lib --features http: 202 passed, 0 failed.

HuggingFaceDocBuilderDev · 2026-05-29T08:46:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sebpop · 2026-05-29T09:13:45Z

The failing check (test_tutorial_train_from_iterators.py::TestTrainFromIterators::test_datasets, an @pytest.mark.network test) is not caused by this PR — it's a pre-existing failure on main. This PR only changes the Rust encode_batch path (run_batch/with_min_len); the test trains via train_from_iterator over datasets.load_dataset(...), which this PR doesn't touch.

Root cause: the test loads the bare dataset id "wikitext", which has been relocated on the Hub (https://huggingface.co/api/datasets/wikitext → 307 redirect). Newer huggingface_hub enforces the namespace/name form and rejects the bare id with HfUriError: Invalid HF URI 'hf://datasets/wikitext@/.huggingface.yaml'. Repository id must be 'namespace/name'. pyproject.toml pins huggingface_hub>=0.16.4,<2.0, so CI picked up a hub version that enforces this.

Fix is a one-liner — use the canonical relocated id Salesforce/wikitext (returns 200) at both call sites (lines 37 and 71); the config wikitext-103-raw-v1 is unchanged. Happy to send it as a separate small PR since it's independent of the batch-encode change, or you can fold it in — your call.

--- a/bindings/python/tests/documentation/test_tutorial_train_from_iterators.py
+++ b/bindings/python/tests/documentation/test_tutorial_train_from_iterators.py
@@ -34,7 +34,7 @@ class TestTrainFromIterators:
         # START load_dataset
         import datasets  # type: ignore[import-not-found]
 
-        dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train+test+validation")
+        dataset = datasets.load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train+test+validation")
         # END load_dataset
@@ -68,7 +68,7 @@ class TestTrainFromIterators:
         # In order to keep tests fast, we only use the first 100 examples
         os.environ["TOKENIZERS_PARALLELISM"] = "true"
-        dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train[0:100]")
+        dataset = datasets.load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train[0:100]")

sebpop · 2026-05-29T09:30:43Z

CI fail should be fixed by: #2079

vyalamar · 2026-05-30T08:18:43Z

@sebpop could you open keep the earlier PR open? I was working on the benchmarks I will publish .

sebpop mentioned this pull request May 28, 2026

Batch encode: simple lock-free scheduler #2044

Closed

sebpop force-pushed the p2-second branch from 65a59d6 to 2781f7f Compare May 28, 2026 23:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batch encode: coarsen rayon tasks with with_min_len#2077

Batch encode: coarsen rayon tasks with with_min_len#2077
sebpop wants to merge 1 commit into
huggingface:mainfrom
sebpop:p2-second

sebpop commented May 28, 2026

Uh oh!

HuggingFaceDocBuilderDev commented May 29, 2026

Uh oh!

sebpop commented May 29, 2026

Uh oh!

sebpop commented May 29, 2026

Uh oh!

vyalamar commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sebpop commented May 28, 2026

Uh oh!

HuggingFaceDocBuilderDev commented May 29, 2026

Uh oh!

sebpop commented May 29, 2026

Uh oh!

sebpop commented May 29, 2026

Uh oh!

vyalamar commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants