Skip to content

Conversation

@mbutrovich
Copy link
Contributor

@mbutrovich mbutrovich commented Feb 2, 2026

This is an alternative to #3366, but has larger ramifications for Comet's published binaries. I suspect this will provide a better experience for users trying out Comet binaries in a cloud environment in 2026.

Which issue does this PR close?

N/A.

Rationale for this change

The release Docker builds (core-amd64-libs and core-arm64-libs) previously used no explicit target-cpu, relying on rustc defaults. This produces binaries targeting a minimal baseline that may miss important optimizations.

When running on a Graviton2 CPU, for example, we see significant time spent in atomics that should have been optimized by LSE instructions.

For x86-64, x86-64-v3 is the appropriate baseline - it covers AVX2/FMA which are available on all modern cloud instances (Haswell-era and newer, ~2013+).

For ARM64, neoverse-n1 is the appropriate baseline - it covers AWS Graviton2, Ampere Altra, and newer cloud ARM processors from Azure and GCP. Key features include LSE atomics which significantly improve concurrent workload performance. Code compiled for neoverse-n1 also runs on Apple Silicon since M1 is a superset. Newer neoverse targets are likely too aggressive for released binaries, but n1 (2019) is a good start.

What changes are included in this PR?

Updates Makefile to set explicit CPU targets for release builds:

  • core-amd64-libs: adds -Ctarget-cpu=x86-64-v3
  • core-arm64-libs: adds -Ctarget-cpu=neoverse-n1
  • core-amd64: updates Linux target from haswell to x86-64-v3
  • core-arm64: updates Linux target from native to neoverse-n1

Development targets (core, release, bench) continue to use native for optimal local performance. x86-64-v3 was chosen over x86-64-v4 because v4 requires AVX-512, which is unavailable on AMD Zen 1-3 (EPYC Rome/Milan) and older Intel chips. Most current cloud instances use these processors.

How are these changes tested?

These are build configuration changes. The release Docker build process (dev/release/build-release-comet.sh) will use the new targets. The binaries produced will be compatible with a wider range of cloud instances while still enabling important optimizations like AVX2 (x86) and LSE atomics (ARM). I will do a build and dig into the instructions.

… -Ctarget-feature=-prefer-256-bit which doesn't make sense for Skylake, and on native it's unclear what effect it's supposed to have.
@mbutrovich mbutrovich changed the title chore: update target-cpus to modern baselines chore: update target-cpus to common baselines Feb 2, 2026
@codecov-commenter
Copy link

codecov-commenter commented Feb 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.99%. Comparing base (f09f8af) to head (361e567).
⚠️ Report is 918 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3368      +/-   ##
============================================
+ Coverage     56.12%   59.99%   +3.86%     
- Complexity      976     1475     +499     
============================================
  Files           119      175      +56     
  Lines         11743    16165    +4422     
  Branches       2251     2681     +430     
============================================
+ Hits           6591     9698    +3107     
- Misses         4012     5114    +1102     
- Partials       1140     1353     +213     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mbutrovich
Copy link
Contributor Author

I asked Claude to summarize the net benefit of this PR based on how rustc chooses targets when not given a target-cpu argument:

Target CPU comparison

Architecture Before (default) After (this PR)
x86_64 Linux x86-64 x86-64-v3
aarch64 Linux generic neoverse-n1

Key features gained

Feature x86-64 (before) x86-64-v3 (after) Benefit
AVX Wider SIMD
AVX2 256-bit integer SIMD
FMA Fused multiply-add
BMI1/BMI2 Bit manipulation
Feature generic (before) neoverse-n1 (after) Benefit
LSE atomics Single-instruction atomics, no LL/SC loops
AES/SHA2 Hardware crypto
CRC Hardware CRC32
dotprod Dot product acceleration

@mbutrovich
Copy link
Contributor Author

mbutrovich commented Feb 3, 2026

I ran the release script on main and this PR, and ran objdump to look for relevant instructions:

# x86-64
objdump -d amd64/libcomet.so | grep -cE 'vfmadd|vfmsub|vfnmadd|vfnmsub'              # FMA
objdump -d amd64/libcomet.so | grep -cE '\bv[a-z]+ps\b|\bv[a-z]+pd\b'                # AVX2
objdump -d amd64/libcomet.so | grep -cE 'tzcnt|lzcnt|andn|bextr|bzhi|pdep|pext'      # BMI1/2

# ARM64
objdump -d aarch64/libcomet.so | grep -cE '\bldadd|\bstadd|\bswp|\bcas\b|\bldclr|\bldset'  # LSE atomics
objdump -d aarch64/libcomet.so | grep -cE 'bl.*__aarch64_'                           # Outline atomic calls
objdump -d aarch64/libcomet.so | grep -cE 'aese|aesd|aesmc|aesimc'                   # AES
objdump -d aarch64/libcomet.so | grep -cE '\bsdot\b|\budot\b'                        # Dot product

I also included the binary size:

x86-64 (baseline → x86-64-v3)

Metric Baseline (main) PR #3368
Binary size 94 MB 94 MB
FMA 2 2
AVX2 994 666,833
BMI1/2 10,295 14,439

ARM64 (baseline → neoverse-n1)

Metric Baseline (main) PR #3368
Binary size 86 MB 87 MB
LSE atomics 24 77,999
Outline atomic calls 79,533 166
AES 4,911 4,911
Dot product 0 26

We saw a handful of specialized instructions before because some libraries implement architecture specific implementations and dispatch at runtime, but now we're getting these instructions in a lot more places.

@mbutrovich mbutrovich changed the title chore: update target-cpus to common baselines chore: update target-cpus in release script to common baselines Feb 3, 2026
@mbutrovich mbutrovich changed the title chore: update target-cpus in release script to common baselines chore: update target-cpus in published release targets to common baselines Feb 3, 2026
@mbutrovich mbutrovich marked this pull request as ready for review February 3, 2026 12:05
@mbutrovich mbutrovich requested review from andygrove and parthchandra and removed request for andygrove and parthchandra February 3, 2026 12:28
@mbutrovich
Copy link
Contributor Author

mbutrovich commented Feb 3, 2026

A couple of thoughts as I mark this ready for review:

  1. This slowed down the build-release-comet.sh slightly, but not egregiously. It's still a "run it and go make lunch" script in my mind.
  2. These new targets are not tested in CI. CI runs its builds as the generic "before" scenario in my tables. I tested changing them to match these targets, and the initial Rust library build went from a few minutes to 10-15 minutes in CI. Test runs went a little bit faster. I'm actually of the mind that that is an acceptable scenario to get the instructions tested that we intend to ship as binaries, but that could be a followup PR.

# build native libs for amd64 architecture Linux/MacOS on a Linux/amd64 machine/container
core-amd64-libs:
cd native && cargo build -j 2 --release $(FEATURES_ARG)
cd native && RUSTFLAGS="-Ctarget-cpu=x86-64-v3" cargo build -j 2 --release $(FEATURES_ARG)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a target for released binaries (build-release-comet.sh).

# build native libs for arm64 architecture Linux/MacOS on a Linux/arm64 machine/container
core-arm64-libs:
cd native && cargo build -j 2 --release $(FEATURES_ARG)
cd native && RUSTFLAGS="-Ctarget-cpu=neoverse-n1" cargo build -j 2 --release $(FEATURES_ARG)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a target for released binaries (build-release-comet.sh).

core-amd64:
rustup target add x86_64-apple-darwin
cd native && RUSTFLAGS="-Ctarget-cpu=skylake -Ctarget-feature=-prefer-256-bit" CC=o64-clang CXX=o64-clang++ cargo build --target x86_64-apple-darwin --release $(FEATURES_ARG)
cd native && RUSTFLAGS="-Ctarget-cpu=skylake" CC=o64-clang CXX=o64-clang++ cargo build --target x86_64-apple-darwin --release $(FEATURES_ARG)
Copy link
Contributor Author

@mbutrovich mbutrovich Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag doesn't make sense on this target (Skylake), so I suspect it's mistakenly copy-pasted from elsewhere.

mkdir -p common/target/classes/org/apache/comet/darwin/x86_64
cp native/target/x86_64-apple-darwin/release/libcomet.dylib common/target/classes/org/apache/comet/darwin/x86_64
cd native && RUSTFLAGS="-Ctarget-cpu=haswell -Ctarget-feature=-prefer-256-bit" cargo build --release $(FEATURES_ARG)
cd native && RUSTFLAGS="-Ctarget-cpu=x86-64-v3" cargo build --release $(FEATURES_ARG)
Copy link
Contributor Author

@mbutrovich mbutrovich Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for local development I guess, but made consistent. Also this flag doesn't make sense on this target (Haswell), so I suspect it's mistakenly copy-pasted from elsewhere.

mkdir -p common/target/classes/org/apache/comet/darwin/aarch64
cp native/target/aarch64-apple-darwin/release/libcomet.dylib common/target/classes/org/apache/comet/darwin/aarch64
cd native && RUSTFLAGS="-Ctarget-cpu=native" cargo build --release $(FEATURES_ARG)
cd native && RUSTFLAGS="-Ctarget-cpu=neoverse-n1" cargo build --release $(FEATURES_ARG)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for local development I guess, but made consistent.

release-linux: clean
rustup target add aarch64-apple-darwin x86_64-apple-darwin
cd native && RUSTFLAGS="-Ctarget-cpu=apple-m1" CC=arm64-apple-darwin21.4-clang CXX=arm64-apple-darwin21.4-clang++ CARGO_FEATURE_NEON=1 cargo build --target aarch64-apple-darwin --release $(FEATURES_ARG)
cd native && RUSTFLAGS="-Ctarget-cpu=skylake -Ctarget-feature=-prefer-256-bit" CC=o64-clang CXX=o64-clang++ cargo build --target x86_64-apple-darwin --release $(FEATURES_ARG)
Copy link
Contributor Author

@mbutrovich mbutrovich Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag doesn't make sense on this target (Skylake), so I suspect it's mistakenly copy-pasted from elsewhere.

@mbutrovich mbutrovich changed the title chore: update target-cpus in published release targets to common baselines chore: update target-cpus in published binaries to new baselines Feb 3, 2026
@mbutrovich mbutrovich changed the title chore: update target-cpus in published binaries to new baselines chore: update target-cpus in published binaries to new x86-64-v3 and neoverse-n1 Feb 3, 2026
@mbutrovich mbutrovich changed the title chore: update target-cpus in published binaries to new x86-64-v3 and neoverse-n1 chore: update target-cpus in published binaries to x86-64-v3 and neoverse-n1 Feb 3, 2026
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich. This makes sense to me.

@mbutrovich
Copy link
Contributor Author

mbutrovich commented Feb 3, 2026

Note that even with x86-64-v3 target, we still get native AES intrinsics for encryption/decryption in the Ring crate because they embed multiple implementations and dispatch at runtime.

I counted them in the binary:

  • 889 AES-NI instructions (aesenc, aesdec, etc.)
  • 99 pclmulqdq instructions (for AES-GCM)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants