I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1.
$ rustc -Vv
rustc 1.25.0-nightly (97520ccb1 2018-01-21)
binary: rustc
commit-hash: 97520ccb101609af63f29919bb0a39115269c89e
commit-date: 2018-01-21
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0
$ cargo bench --bench set_sum
Compiling [...]
Finished release [optimized] target(s) in 5.51 secs
Running target/release/deps/set_sum-833cf161cf760670
running 4 tests
test rayon_set_sum_parallel ... bench: 2,295,348 ns/iter (+/- 152,025)
test rayon_set_sum_serial ... bench: 7,730,830 ns/iter (+/- 171,552)
test std_set_sum_parallel ... bench: 10,038,209 ns/iter (+/- 188,189)
test std_set_sum_serial ... bench: 7,733,258 ns/iter (+/- 134,850)
test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
Compiling [...]
Finished release [optimized] target(s) in 6.48 secs
Running target/release/deps/set_sum-833cf161cf760670
running 4 tests
test rayon_set_sum_parallel ... bench: 1,092,732 ns/iter (+/- 105,979)
test rayon_set_sum_serial ... bench: 6,152,751 ns/iter (+/- 83,103)
test std_set_sum_parallel ... bench: 8,957,618 ns/iter (+/- 132,791)
test std_set_sum_serial ... bench: 6,144,775 ns/iter (+/- 75,377)
test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
rayon_set_sum_parallel is the showcase for this crate, and it suffers the most from multiple codegen units (CGUs), running about 2.1x slower than with -Ccodegen-units=1.
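To put numbers on that, here is a small sketch that computes the slowdown of each benchmark from the two runs above. The figures are copied from the bench output; the names `RESULTS`, `slowdown`, and `worst_regression` are made up for this illustration.

```rust
// (benchmark name, ns/iter with default codegen units, ns/iter with -Ccodegen-units=1)
// Values are copied from the two `cargo bench` runs above.
const RESULTS: [(&str, u64, u64); 4] = [
    ("rayon_set_sum_parallel", 2_295_348, 1_092_732),
    ("rayon_set_sum_serial", 7_730_830, 6_152_751),
    ("std_set_sum_parallel", 10_038_209, 8_957_618),
    ("std_set_sum_serial", 7_733_258, 6_144_775),
];

// Ratio of the default build's time to the single-CGU build's time.
fn slowdown(default_ns: u64, cgu1_ns: u64) -> f64 {
    default_ns as f64 / cgu1_ns as f64
}

// Name of the benchmark with the largest slowdown (hypothetical helper).
fn worst_regression() -> &'static str {
    let mut worst = RESULTS[0];
    for r in RESULTS.iter() {
        if slowdown(r.1, r.2) > slowdown(worst.1, worst.2) {
            worst = *r;
        }
    }
    worst.0
}

fn main() {
    for (name, default_ns, cgu1_ns) in RESULTS.iter() {
        println!("{:<24} {:.2}x", name, slowdown(*default_ns, *cgu1_ns));
    }
    println!("worst: {}", worst_regression());
}
```

The other three benchmarks come out between roughly 1.1x and 1.3x, so the parallel rayon case stands out clearly.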
From profiling and disassembly, this seems to be mostly a lost inlining opportunity. In the slower build, the profile is split 35% bridge_unindexed_producer_consumer, 34% Iterator::fold, 28% Sum::sum, and the hot loop in the first looks like:
16.72 │126d0: cmpq $0x0,(%r12,%rbx,8)
6.73 │126d5: ↓ jne 126e1 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x201>
16.65 │126d7: inc %rbx
0.00 │126da: cmp %rbp,%rbx
7.20 │126dd: ↑ jb 126d0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f0>
0.05 │126df: ↓ jmp 1272f <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x24f>
26.93 │126e1: mov 0x0(%r13,%rbx,4),%eax
4.26 │126e6: movq $0x1,0x38(%rsp)
2.27 │126ef: mov %rax,0x40(%rsp)
1.88 │126f4: mov %r14,%rdi
4.58 │126f7: → callq 15b90 <<u64 as core::iter::traits::Sum>::sum>
0.68 │126fc: movq $0x1,0x38(%rsp)
2.58 │12705: mov %r15,0x40(%rsp)
0.62 │1270a: movq $0x1,0x48(%rsp)
0.31 │12713: mov %rax,0x50(%rsp)
0.49 │12718: movb $0x0,0x58(%rsp)
2.50 │1271d: xor %esi,%esi
0.41 │1271f: mov %r14,%rdi
0.85 │12722: → callq 153f0 <<core::iter::Chain<A, B> as core::iter::iterator::Iterator>::fold>
1.30 │12727: mov %rax,%r15
2.16 │1272a: ↑ jmp 126d7 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f7>
With CGU=1, 96% of the profile is in bridge_unindexed_producer_consumer, with this hot loop:
2.28 │1426d: test %rbx,%rbx
│14270: ↓ je 14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
5.40 │14272: mov (%rbx),%ebx
2.75 │14274: add %rbx,%rax
1.47 │14277: lea (%rdx,%rsi,4),%rbx
0.21 │1427b: nopl 0x0(%rax,%rax,1)
29.56 │14280: cmp %rdi,%rsi
0.04 │14283: ↓ jae 14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
2.87 │14285: add $0x4,%rbx
20.22 │14289: cmpq $0x0,(%rcx,%rsi,8)
1.48 │1428e: lea 0x1(%rsi),%rsi
8.00 │14292: ↑ je 14280 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x130>
25.25 │14294: ↑ jmp 1426d <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x11d>
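For readers who don't want to decode the assembly, here is a rough Rust sketch of the difference between the two loops as I read them. This is not rayon's or rayon-hash's actual code: `Buckets`, `sum_one`, `add`, `sum_with_calls`, and `sum_fused` are hypothetical stand-ins, and `#[inline(never)]` is only used to mimic the two calls that were left out of line in the slower build. It assumes a table where a hash of 0 marks an empty bucket, which matches the `cmpq $0x0` check in both listings.

```rust
use std::iter;

// Hypothetical stand-in for the table layout: `hashes[i] == 0` marks an empty bucket.
struct Buckets {
    hashes: Vec<u64>,
    values: Vec<u32>,
}

// Mimics the out-of-line `Sum::sum` call on a single item.
#[inline(never)]
fn sum_one(item: u32) -> u64 {
    iter::once(u64::from(item)).sum()
}

// Mimics the out-of-line `Chain::fold` call that combines two partial sums.
#[inline(never)]
fn add(left: u64, right: u64) -> u64 {
    iter::once(left).chain(iter::once(right)).sum()
}

// Shape of the slower loop: two function calls per occupied bucket.
fn sum_with_calls(b: &Buckets) -> u64 {
    let mut acc = 0u64;
    for i in 0..b.hashes.len() {
        if b.hashes[i] != 0 {
            acc = add(acc, sum_one(b.values[i]));
        }
    }
    acc
}

// Shape of the faster loop: everything inlined down to a load and an add.
fn sum_fused(b: &Buckets) -> u64 {
    let mut acc = 0u64;
    for (hash, value) in b.hashes.iter().zip(&b.values) {
        if *hash != 0 {
            acc += u64::from(*value);
        }
    }
    acc
}

fn main() {
    let b = Buckets {
        hashes: vec![7, 0, 3, 0, 9],
        values: vec![10, 99, 20, 99, 30],
    };
    // Both versions compute the same result; only the generated code differs.
    assert_eq!(sum_with_calls(&b), sum_fused(&b));
    println!("{}", sum_fused(&b));
}
```

Both functions return the same sum; the point is only that the first pays for two calls and the surrounding stack traffic on every occupied bucket, which is what the first listing shows.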