2x benchmark loss in rayon-hash from multiple codegen-units #47665

Open
@cuviper

I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1.

$ rustc -Vv
rustc 1.25.0-nightly (97520ccb1 2018-01-21)
binary: rustc
commit-hash: 97520ccb101609af63f29919bb0a39115269c89e
commit-date: 2018-01-21
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

$ cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 5.51 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   2,295,348 ns/iter (+/- 152,025)
test rayon_set_sum_serial   ... bench:   7,730,830 ns/iter (+/- 171,552)
test std_set_sum_parallel   ... bench:  10,038,209 ns/iter (+/- 188,189)
test std_set_sum_serial     ... bench:   7,733,258 ns/iter (+/- 134,850)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.48 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   1,092,732 ns/iter (+/- 105,979)
test rayon_set_sum_serial   ... bench:   6,152,751 ns/iter (+/- 83,103)
test std_set_sum_parallel   ... bench:   8,957,618 ns/iter (+/- 132,791)
test std_set_sum_serial     ... bench:   6,144,775 ns/iter (+/- 75,377)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

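rayon_set_sum_parallel is the showcase for this crate, and it suffers the most from multiple codegen units (CGU): about 2.1x slower (2,295,348 vs. 1,092,732 ns/iter), while the other three benchmarks are roughly 12% to 26% slower.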

From profiling and disassembly, this seems to mostly be a lost inlining opportunity. In the slower build, the profile is split 35% bridge_unindexed_producer_consumer, 34% Iterator::fold, 28% Sum::sum, and the hot loop in the first looks like:

 16.72 │126d0:   cmpq   $0x0,(%r12,%rbx,8)
  6.73 │126d5: ↓ jne    126e1 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x201>
 16.65 │126d7:   inc    %rbx
  0.00 │126da:   cmp    %rbp,%rbx
  7.20 │126dd: ↑ jb     126d0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f0>
  0.05 │126df: ↓ jmp    1272f <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x24f>
 26.93 │126e1:   mov    0x0(%r13,%rbx,4),%eax
  4.26 │126e6:   movq   $0x1,0x38(%rsp)
  2.27 │126ef:   mov    %rax,0x40(%rsp)
  1.88 │126f4:   mov    %r14,%rdi
  4.58 │126f7: → callq  15b90 <<u64 as core::iter::traits::Sum>::sum>
  0.68 │126fc:   movq   $0x1,0x38(%rsp)
  2.58 │12705:   mov    %r15,0x40(%rsp)
  0.62 │1270a:   movq   $0x1,0x48(%rsp)
  0.31 │12713:   mov    %rax,0x50(%rsp)
  0.49 │12718:   movb   $0x0,0x58(%rsp)
  2.50 │1271d:   xor    %esi,%esi
  0.41 │1271f:   mov    %r14,%rdi
  0.85 │12722: → callq  153f0 <<core::iter::Chain<A, B> as core::iter::iterator::Iterator>::fold>
  1.30 │12727:   mov    %rax,%r15
  2.16 │1272a: ↑ jmp    126d7 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f7>
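For readers who don't want to decode the listing: the two calls (Sum::sum followed by Chain::fold) are what you would expect if each item is lifted into the sum type and then combined with the accumulator purely through iterator adaptors. The following is a minimal sketch of that pattern, not rayon's actual source. The names `add` and `consume` are hypothetical, and the assumption is that the real fold step has roughly this shape. When everything inlines, this collapses to a single add; when it doesn't, each item pays for two out-of-line calls.

```rust
use std::iter::{self, Sum};

// Hypothetical helper: adds two values using only the `Sum` trait,
// by chaining two one-element iterators and summing them.
fn add<T: Sum>(left: T, right: T) -> T {
    iter::once(left).chain(iter::once(right)).sum()
}

// Hypothetical fold step: lift a single item into the sum type
// (one `Sum::sum` call), then combine it with the accumulator
// (a second `sum`, which folds over the `Chain`).
fn consume<S, T>(acc: S, item: T) -> S
where
    S: Sum<T> + Sum,
{
    add(acc, iter::once(item).sum())
}

fn main() {
    let values = [1u32, 2, 3, 4];
    let total = values
        .iter()
        .fold(0u64, |acc, v| consume(acc, u64::from(*v)));
    println!("{}", total); // prints 10
}
```

Semantically this is just `acc + item`, which is why the result depends so heavily on the optimizer seeing through both calls.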

With CGU=1, 96% of the profile is in bridge_unindexed_producer_consumer, with this hot loop:

  2.28 │1426d:   test   %rbx,%rbx
       │14270: ↓ je     14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  5.40 │14272:   mov    (%rbx),%ebx
  2.75 │14274:   add    %rbx,%rax
  1.47 │14277:   lea    (%rdx,%rsi,4),%rbx
  0.21 │1427b:   nopl   0x0(%rax,%rax,1)
 29.56 │14280:   cmp    %rdi,%rsi
  0.04 │14283: ↓ jae    14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  2.87 │14285:   add    $0x4,%rbx
 20.22 │14289:   cmpq   $0x0,(%rcx,%rsi,8)
  1.48 │1428e:   lea    0x1(%rsi),%rsi
  8.00 │14292: ↑ je     14280 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x130>
 25.25 │14294: ↑ jmp    1426d <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x11d>
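Read back into source form, the CGU=1 loop looks like a plain scan over the table: skip slots whose 8-byte hash word is zero, otherwise load the 4-byte value and add it to a 64-bit accumulator. Here is a rough sketch of that shape. The function name `sum_occupied`, the two-slice layout, and the "zero hash means empty" convention are assumptions inferred from the `cmpq $0x0` and `mov (%rbx),%ebx` instructions, not taken from the rayon-hash source.

```rust
// Assumed layout: slot i is occupied when hashes[i] != 0,
// and its u32 value is stored at values[i].
fn sum_occupied(hashes: &[u64], values: &[u32]) -> u64 {
    let mut total = 0u64;
    for (hash, value) in hashes.iter().zip(values) {
        // Corresponds to `cmpq $0x0,(%rcx,%rsi,8)`: skip empty slots.
        if *hash != 0 {
            // Corresponds to `mov (%rbx),%ebx` + `add %rbx,%rax`:
            // zero-extend the u32 and accumulate.
            total += u64::from(*value);
        }
    }
    total
}

fn main() {
    let hashes = [0u64, 7, 0, 9, 3];
    let values = [10u32, 20, 30, 40, 50];
    println!("{}", sum_occupied(&hashes, &values)); // prints 110
}
```

The point of the comparison is that the fast build has no calls at all in the loop body, while the slow build makes two per occupied slot.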

Labels: A-codegen, C-enhancement, C-optimization, I-slow, T-compiler