A case of compound x86_64 performance regression caused by LLVM 20 and #124810 (#139730)

Open
@MoSal

Description

Code

I tried this code:

use std::{iter::Peekable, str::CharIndices};
use std::sync::LazyLock as Lazy;

use memchr::memmem::Finder;
use itertools::Itertools;


struct Foo<'a> {
    txt: &'a str,
    ci: Peekable<CharIndices<'a>>,
}

impl<'a> Foo<'a> {
    fn new(txt: &'a str) -> Self {
        let ci = txt.char_indices().peekable();
        Self { txt, ci }
    }

    /// Returns the byte span `(start, end)` of the next segment, where `end` is
    /// the offset of the next '\n' (or `txt.len()` if none remains), and
    /// advances the char iterator past that newline.
    fn next(&mut self) -> Option<(usize, usize)> {
        static FI: Lazy<Finder> = Lazy::new(|| Finder::new(b"\n"));

        // XXX: switching between these two lines makes a difference
        let &(start, _) = self.ci.peek()?;
        //let start = self.ci.peek().map(|&(idx, _)| idx)?;
        
        let matches = FI
            .find_iter(&self.txt.as_bytes()[start..])
            .map(|idx| idx + start);

        for next_match in matches {
            if self.txt.is_char_boundary(next_match) {
                self.ci.by_ref()
                    //.take_while(|&(idx, _)| idx < next_match)
                    .peeking_take_while(|&(idx, _)| idx <= next_match)
                    .for_each(drop);
                return Some((start, next_match))
            }
        }

        // No newline found after `start`: drain the iterator and return the tail.
        self.ci.by_ref().for_each(drop);
        Some((start, self.txt.len()))
    }
}

fn main() {
    // ~4 GB of input: 200,000 lines, each 20,000 '1's followed by '\n'.
    let s = "1".repeat(20_000) + "\n";
    let sn = s.repeat(200_000);
    let mut v = Foo::new(&sn);
    let mut p = Vec::with_capacity(200_000);
    while let Some(r) = v.next() {
        p.push(r);
    }
}

Cargo.toml

[package]
name = "tt-fluctuations"
version = "0.1.0"
edition = "2024"

[dependencies]
itertools = "0.14.0"
memchr = "2.7.4"

[profile.release]
codegen-units = 1
lto = true
strip = true
Cargo.lock

# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4

[[package]]
name = "either"
version = "1.15.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "48c757948c5ede0e46177b7add2e67155f70e33c07fea8284df6576da70b3719"

[[package]]
name = "itertools"
version = "0.14.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2b192c782037fadd9cfa75548310488aabdbf3d2da73885b31bd0abd03351285"
dependencies = [
 "either",
]

[[package]]
name = "memchr"
version = "2.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78ca9ab1a0babb1e7d5695e3530886289c18cf2f87ec19a575a0abdce112e3a3"

[[package]]
name = "tt-fluctuations"
version = "0.1.0"
dependencies = [
 "itertools",
 "memchr",
]

Commits used to test:

    2162e9d (last known good)
    ce36a966c (first regression)
    934880f586f (second regression)

Besides the two direct regressions, the fact that alternating between these two lines can cause
such a big difference in performance is curious:

let &(start, _) = self.ci.peek()?; // called without_map in bench
let start = self.ci.peek().map(|&(idx, _)| idx)?; // with_map

But maybe that's a different issue.
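
For reference, here is a minimal sketch that isolates just the two access patterns as standalone functions (the function names are illustrative and not part of the reproduction above):

use std::iter::Peekable;
use std::str::CharIndices;

// Destructures the peeked tuple directly ("without_map" in the benchmarks).
fn start_without_map(ci: &mut Peekable<CharIndices<'_>>) -> Option<usize> {
    let &(start, _) = ci.peek()?;
    Some(start)
}

// Extracts the index through a map closure before `?` ("with_map" in the benchmarks).
fn start_with_map(ci: &mut Peekable<CharIndices<'_>>) -> Option<usize> {
    ci.peek().map(|&(idx, _)| idx)
}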

Summary of hyperfine run:

Summary
  ./2162e9d_with_map ran
    1.06 ± 0.01 times faster than ./ce36a96_with_map
    1.12 ± 0.01 times faster than ./2162e9d_without_map
    1.38 ± 0.01 times faster than ./934880f586f_with_map
    1.66 ± 0.01 times faster than ./ce36a96_without_map
    1.91 ± 0.01 times faster than ./934880f586f_without_map

Full numbers

Benchmark 1: ./2162e9d_with_map
  Time (mean ± σ):      3.894 s ±  0.014 s    [User: 3.202 s, System: 0.682 s]
  Range (min … max):    3.876 s …  3.909 s    4 runs

Benchmark 2: ./2162e9d_without_map
  Time (mean ± σ):      4.380 s ±  0.034 s    [User: 3.686 s, System: 0.683 s]
  Range (min … max):    4.346 s …  4.419 s    4 runs

Benchmark 3: ./934880f586f_with_map
  Time (mean ± σ):      5.388 s ±  0.020 s    [User: 4.702 s, System: 0.671 s]
  Range (min … max):    5.369 s …  5.415 s    4 runs

Benchmark 4: ./934880f586f_without_map
  Time (mean ± σ):      7.448 s ±  0.018 s    [User: 6.758 s, System: 0.666 s]
  Range (min … max):    7.423 s …  7.465 s    4 runs

Benchmark 5: ./ce36a96_with_map
  Time (mean ± σ):      4.128 s ±  0.016 s    [User: 3.448 s, System: 0.666 s]
  Range (min … max):    4.111 s …  4.149 s    4 runs

Benchmark 6: ./ce36a96_without_map
  Time (mean ± σ):      6.448 s ±  0.007 s    [User: 5.734 s, System: 0.696 s]
  Range (min … max):    6.441 s …  6.457 s    4 runs

So building with current nightly and

let &(start, _) = self.ci.peek()?;

is almost twice as slow as building with stable and using

let start = self.ci.peek().map(|&(idx, _)| idx)?;
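
The numbers above come from hyperfine timing whole release binaries. As a rough in-process sanity check, one could instead time only the scan loop; a sketch, assuming the Foo type from the reproduction above is in scope:

use std::time::Instant;

fn main() {
    // Same ~4 GB input as the reproduction.
    let s = "1".repeat(20_000) + "\n";
    let sn = s.repeat(200_000);
    let mut v = Foo::new(&sn);
    let mut p = Vec::with_capacity(200_000);

    // Time only the line scanning, not the input construction.
    let t0 = Instant::now();
    while let Some(r) = v.next() {
        p.push(r);
    }
    println!("scanned {} spans in {:?}", p.len(), t0.elapsed());
}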

Version it worked on

It most recently worked on: 2162e9d

Version with regression

First regression

rustc --version --verbose:

rustc 1.87.0-nightly (ce36a966c 2025-02-17)
binary: rustc
commit-hash: ce36a966c79e109dabeef7a47fe68e5294c6d71e
commit-date: 2025-02-17
host: x86_64-unknown-linux-gnu
release: 1.87.0-nightly
LLVM version: 20.1.0

Second regression

rustc --version --verbose:

rustc 1.88.0-nightly (934880f58 2025-04-09)
binary: rustc
commit-hash: 934880f586f6ac1f952c7090e2a943fcd7775e7b
commit-date: 2025-04-09
host: x86_64-unknown-linux-gnu
release: 1.88.0-nightly
LLVM version: 20.1.2

Metadata

    Labels

        A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
        C-optimization: Category: An issue highlighting optimization opportunities or PRs implementing such.
        P-high: High priority.
        T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.
        T-libs: Relevant to the library team, which will review and decide on the PR/issue.
        regression-untriaged: Untriaged performance or correctness regression.
