Skip to content

35% performance regression in generated code since 1.24 #53833

Open
@ruuda

Description

@ruuda

Summary

Claxon, when compiled with Rust 1.25.0 or later, takes 1.36 times as long to run as a version compiled with Rust 1.23.0. When compiled with Rust 1.24.0, it takes 1.45 times as long to run.

It looks like there was a severe regression in generated code in 1.24.0. 1.25.0 improved a bit again, but is still significantly worse than 1.23.0.

Steps to reproduce

Prepare:

git clone https://github.com/ruuda/claxon --branch v0.4.1
cd claxon/testsamples
./populate.sh
cp p2.flac p3.flac p4.flac extra
cd ..
mkdir rust_bench

You can also copy your own files into testsamples/extra if you happen to have some lying around.

Then, with a Rust 1.23.0 toolchain, or after changing this line to use cargo +1.23.0:

tools/benchmark.sh rust_bench/23

Then, with a Rust 1.25.0 toolchain, or after updating the script to use cargo +1.25.0:

tools/benchmark.sh rust_bench/25
tools/compare_benches.r rust_bench/23_*_all.dat rust_bench/25_*_all.dat

Output in my case:

| p10 |  18.0 ± 0.3 ns    | 1.370 ± 0.033 |
| p50 |  18.9 ± 0.4 ns    | 1.363 ± 0.043 |
| p90 |  21.1 ± 2.1 ns    | 1.388 ± 0.181 |
| μ   |  19.4 ± 0.7 ns    | 1.359 ± 0.065 |
| τ   |  65.9 ± 3.6 MiB/s | 0.736 ± 0.053 |

The numbers in the rightmost column show the running time of the benchmark compiled with Rust 1.25.0 relative to the running time of the benchmark compiled with Rust 1.23.0.

For Rust 1.23.0 vs Rust 1.24.0 I get these results:

| p10 |  19.8 ± 0.2 ns    | 1.501 ± 0.028 |
| p50 |  20.5 ± 0.2 ns    | 1.473 ± 0.036 |
| p90 |  21.4 ± 0.4 ns    | 1.413 ± 0.120 |
| μ   |  20.7 ± 0.2 ns    | 1.449 ± 0.048 |
| τ   |  61.8 ± 3.3 MiB/s | 0.690 ± 0.049 |

For Rust 1.23.0 vs Rust 1.28.0 I get these results:

| p10 |  18.3 ± 0.2 ns    | 1.392 ± 0.026 |
| p50 |  18.9 ± 0.1 ns    | 1.360 ± 0.031 |
| p90 |  19.7 ± 0.4 ns    | 1.301 ± 0.111 |
| μ   |  19.1 ± 0.2 ns    | 1.339 ± 0.043 |
| τ   |  66.9 ± 3.8 MiB/s | 0.747 ± 0.055 |

For Rust 1.29.0-beta.1 and 1.30.0-nightly (3edb355 2018-08-03) I get similar results.

To show that the setup works, this is Rust 1.13.0 vs Rust 1.23.0, which shows no significant difference:

| p10 |  13.2 ± 0.2 ns    | 0.989 ± 0.022 |
| p50 |  13.9 ± 0.3 ns    | 0.993 ± 0.027 |
| p90 |  15.2 ± 1.3 ns    | 1.036 ± 0.088 |
| μ   |  14.3 ± 0.4 ns    | 1.013 ± 0.034 |
| τ   |  89.6 ± 4.2 MiB/s | 0.988 ± 0.067 |

And Rust 1.25.0 vs Rust 1.28.0 does not show a significant difference either:

| p10 |  18.3 ± 0.2 ns    | 1.016 ± 0.021 |
| p50 |  18.9 ± 0.1 ns    | 0.998 ± 0.024 |
| p90 |  19.7 ± 0.4 ns    | 0.937 ± 0.097 |
| μ   |  19.1 ± 0.2 ns    | 0.985 ± 0.037 |
| τ   |  66.9 ± 3.8 MiB/s | 1.015 ± 0.080 |

Details

Claxon is a decoder for the flac audio format; the benchmark decodes every file in testsamples/extra 5 times and collects statistics about the duration.

The benchmark is compiled with -C target_cpu=native.

I use a Skylake i7.

vendor_id	: GenuineIntel
cpu family	: 6
model		: 94
model name	: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping	: 3
microcode	: 0xc6
cpu MHz		: 3394.795
cache size	: 6144 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 5186.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-bugCategory: This is a bug.I-slowIssue: Problems and improvements with respect to performance of generated code.P-mediumMedium priorityT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.regression-from-stable-to-stablePerformance or correctness regression from one stable version to another.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions