Batch encode: coarsen rayon tasks with with_min_len#2077
Conversation
`encode_batch`, `encode_batch_char_offsets` and `encode_batch_fast` ran `inputs.into_maybe_par_iter().map(...).collect()`. rayon's `bridge_producer_consumer` splits that down toward one item per task, and at high core counts the per-task wake/join signaling (an LSE CAS/LDADD on aarch64) becomes a large fraction of cycles. Route all three through a shared `run_batch` helper that: - takes an explicit serial fast path (`into_iter().map().collect()`) when parallelism is disabled or only one worker is available, so single-threaded callers pay no rayon cost at all; and - otherwise distributes work with `IndexedParallelIterator::with_min_len(min_len)`, which sets a floor on the items processed sequentially per task. `min_len` aims for ~4 tasks per worker (load balance) and is capped at 8 (avoid oversized tasks): `min_len = ceil(n / (threads*4)).clamp(1, 8)`. This is intentionally the smallest change that coarsens the tasks: it is one rayon method call rather than a hand-written work queue, and it keeps rayon's scheduler. Because `run_batch` drives rayon directly instead of going through `into_maybe_par_iter`, it must reproduce that helper's side effect of recording parallelism use. A new `parallelism::set_parallelism_used()` is called whenever the parallel path is taken, so the Python `pthread_atfork` child hook still disables rayon in forked children (the documented multiprocessing-deadlock protection). A regression test, `encode_batch_marks_parallelism_used`, asserts `has_parallelism_been_used()` after a parallel `encode_batch`. Measured on three machines, `bpe_benchmark`/`bpe-encode/BPE GPT2 encode batch` (data/big.txt, encode_batch through the post-processor), baseline = current main, patched = this change. NVIDIA Vera (aarch64 Olympus, 88 physical / 176 logical): threads baseline patched change 1 3.91 MiB/s 4.63 MiB/s +18% (serial fast path) 88 18.03 MiB/s 19.57 MiB/s +8.5% 176 17.21 MiB/s 18.67 MiB/s +8.5% AMD EPYC 9124 (x86_64, 16 physical / 32 logical): threads baseline patched change 1 3.69 MiB/s 3.89 MiB/s +5.5% 16 23.38 MiB/s 25.05 MiB/s +7.1% 32 24.11 MiB/s 25.81 MiB/s +7.0% Apple M3 (aarch64, 12 cores, dev host): threads baseline patched change 1 4.66 MiB/s 4.70 MiB/s ~0 6 14.97 MiB/s 14.67 MiB/s ~0 (within noise) 12 19.08 MiB/s 17.89 MiB/s within thermal noise The M3 is a thermally-limited 12-core laptop; its criterion intervals are wide (+/-5%) and vary across a long measurement session, and the result is insensitive to the `min_len` cap (cap 8/4/2 all land in the same band). Treat it as approximately neutral, not a regression — the reliable signal is the two isolated servers. Atomics / bottleneck, `perf record -g --call-graph fp -F 4999`: aarch64 (Vera) at 88T -- LSE atomic outlined-call share (sum of `__aarch64_*`): baseline 4.97% patched 0.91% (-5.5x) Fewer rayon tasks means fewer per-task atomic operations, which is why the LSE share drops sharply on aarch64 where each LSE op is expensive. x86_64 (EPYC) at 16T -- the same rayon/crossbeam-epoch machinery is cheap here, so there is no comparable atomic bottleneck to remove: symbol baseline patched crossbeam_epoch::default::with_handle 1.19% 1.62% rayon_core::WorkerThread::wait_until_cold 0.69% 0.96% crossbeam_epoch::internal::Global::try_advance 0.19% 0.25% The x86_64 gain therefore comes from reduced task-scheduling overhead generally, not from atomics -- x86_64 never had the atomic problem that aarch64 has. Remaining ceiling: on aarch64, crossbeam-epoch dispatch (`try_advance` + `with_handle` + `wait_until_cold`) is ~56% of cycles in both baseline and patched -- with_min_len does not touch it. Removing that requires replacing rayon's scheduler on the hot path and is left to a follow-up. cargo test --lib --features http: 202 passed, 0 failed.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
The failing check (test_tutorial_train_from_iterators.py::TestTrainFromIterators::test_datasets, an @pytest.mark.network test) is not caused by this PR — it's a pre-existing failure on main. This PR only changes the Rust encode_batch path (run_batch/with_min_len); the test trains via train_from_iterator over datasets.load_dataset(...), which this PR doesn't touch. Root cause: the test loads the bare dataset id "wikitext", which has been relocated on the Hub (https://huggingface.co/api/datasets/wikitext → 307 redirect). Newer huggingface_hub enforces the namespace/name form and rejects the bare id with HfUriError: Invalid HF URI 'hf://datasets/wikitext@/.huggingface.yaml'. Repository id must be 'namespace/name'. pyproject.toml pins huggingface_hub>=0.16.4,<2.0, so CI picked up a hub version that enforces this. Fix is a one-liner — use the canonical relocated id Salesforce/wikitext (returns 200) at both call sites (lines 37 and 71); the config wikitext-103-raw-v1 is unchanged. Happy to send it as a separate small PR since it's independent of the batch-encode change, or you can fold it in — your call. |
|
CI fail should be fixed by: #2079 |
|
@sebpop could you open keep the earlier PR open? I was working on the benchmarks I will publish . |
encode_batch,encode_batch_char_offsetsandencode_batch_fastraninputs.into_maybe_par_iter().map(...).collect(). rayon'sbridge_producer_consumersplits that down toward one item per task, and at high core counts the per-task wake/join signaling (an LSE CAS/LDADD on aarch64) becomes a large fraction of cycles.Route all three through a shared
run_batchhelper that:into_iter().map().collect()) when parallelism is disabled or only one worker is available, so single-threaded callers pay no rayon cost at all; andIndexedParallelIterator::with_min_len(min_len), which sets a floor on the items processed sequentially per task.min_lenaims for ~4 tasks per worker (load balance) and is capped at 8 (avoid oversized tasks):min_len = ceil(n / (threads*4)).clamp(1, 8).This is intentionally the smallest change that coarsens the tasks: it is one rayon method call rather than a hand-written work queue, and it keeps rayon's scheduler.
Measured on three machines,
bpe_benchmark/bpe-encode/BPE GPT2 encode batch(data/big.txt, encode_batch through the post-processor), baseline = current main, patched = this change.NVIDIA Vera (aarch64 Olympus, 88 physical / 176 logical):
AMD EPYC 9124 (x86_64, 16 physical / 32 logical):
Apple M3 (aarch64, 12 cores, dev host):
The M3 is a thermally-limited 12-core laptop; its criterion intervals are wide (+/-5%) and vary across a long measurement session, and the result is insensitive to the
min_lencap (cap 8/4/2 all land in the same band). Treat it as approximately neutral, not a regression — the reliable signal is the two isolated servers.Atomics / bottleneck,
perf record -g --call-graph fp -F 4999:aarch64 (Vera) at 88T -- LSE atomic outlined-call share (sum of
__aarch64_*):Fewer rayon tasks means fewer per-task atomic operations, which is why the LSE share drops sharply on aarch64 where each LSE op is expensive.
x86_64 (EPYC) at 16T -- the same rayon/crossbeam-epoch machinery is cheap here, so there is no comparable atomic bottleneck to remove:
The x86_64 gain therefore comes from reduced task-scheduling overhead generally, not from atomics -- x86_64 never had the atomic problem that aarch64 has.
Remaining ceiling: on aarch64, crossbeam-epoch dispatch (
try_advance+with_handle+wait_until_cold) is ~56% of cycles in both baseline and patched -- with_min_len does not touch it. Removing that requires replacing rayon's scheduler on the hot path and is left to a follow-up.