fitzgen commented Nov 7, 2025

`select`s cannot be speculated through on some of our targets (e.g. x64), so strongly prefer not choosing them.

Using a target-specific cost function, so that we only pessimize `select` when it makes sense for the target, is left for follow-up work and is tracked in #12005. FWIW, no CPUs on the market do value speculation today, as far as I am aware, so this particular case is somewhat hypothetical (but the larger goal of target-specific cost functions would still be useful).

fitzgen requested a review from a team as a code owner November 7, 2025 18:08
fitzgen requested review from alexcrichton and removed request for a team November 7, 2025 18:08
cfallin commented Nov 7, 2025

> `select`s cannot be speculated through on some of our targets (e.g. x64) so strongly prefer not choosing them.

FWIW, I think this is losing a little nuance and the reality doesn't merit a cost of 50 (!). In more detail, `select` (cmove) works like any other multi-input operator on modern out-of-order CPUs; the main difference is that it has three inputs (flags/condition, source for the CMOVcc, old register value for the CMOVcc). When all three inputs are ready, the instruction will execute.

When folks say that "select isn't speculated through", what they mean is that the CPU doesn't have a condition predictor (like the branch predictor) that would allow the instruction to execute speculatively before the condition input is ready. It doesn't mean, however, that the instruction blocks all speculation or serves as a pipeline barrier/flush. That would be far worse!

Supporting source: Agner Fog's instruction latency tables show that on a reasonable x86-64 baseline (Skylake, circa 2015), the CMOVcc reg/reg form is 1 uop and has a latency of 1 cycle and a reciprocal throughput of 0.5 (so two CMOVccs can complete per cycle).
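
To make the "three inputs, then execute" point concrete, here is a toy timing model. It is only a sketch under simplifying assumptions: the function name `select_ready_cycle` and its parameters are made up for illustration, and it ignores execution ports, rename, and everything else a real out-of-order core does.

```rust
/// Toy model (hypothetical, for illustration only): the cycle at which a
/// CMOVcc-style select produces its result, given the cycles at which its
/// three inputs (condition flags, new source, old register value) become
/// ready. The instruction simply waits for the slowest input and then takes
/// `latency` cycles; there is no predictor that lets it start early.
fn select_ready_cycle(cond_ready: u32, src_ready: u32, old_ready: u32, latency: u32) -> u32 {
    cond_ready.max(src_ready).max(old_ready) + latency
}
```

With the 1-cycle latency quoted above, a select whose condition is the last input to arrive finishes one cycle after that condition; instructions that do not depend on its result are unaffected in this model.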

alexcrichton requested review from cfallin and removed request for alexcrichton November 7, 2025 18:20
fitzgen commented Nov 7, 2025

@cfallin I was indeed not thinking in a nuanced way about this, thanks for the clarifications/reality check.

I removed the select special case, so it will hit the default cost, which will be slightly greater than the cost of adds/etc.

I left imul at cost 10 though.
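
For readers following along, the resulting shape of the cost function can be sketched roughly as below. This is an illustrative sketch, not the code from this PR: the `Op` enum and the `op_cost` function are hypothetical, and every number other than `imul`'s 10 is an assumption chosen only so that the default lands slightly above simple adds.

```rust
/// Hypothetical stand-in for a handful of Cranelift opcodes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Op {
    Iconst,
    Iadd,
    Imul,
    Select,
}

/// Illustrative per-opcode cost; lower costs are preferred.
/// Only `Imul => 10` comes from the discussion above; the other
/// values are assumed for the sake of the example.
fn op_cost(op: Op) -> u32 {
    match op {
        Op::Iconst => 1,
        Op::Iadd => 3,
        Op::Imul => 10,
        // No special case for `Select`: it falls through to the default,
        // which is slightly more expensive than simple arithmetic.
        _ => 4,
    }
}
```

Under these assumed numbers, `select` is only mildly disfavored relative to `iadd`, while `imul` remains clearly more expensive than both.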

fitzgen changed the title from "Cranelift: Tweak cost function to pessimize select opcodes" to "Cranelift: Tweak cost function to make imul more expensive" Nov 7, 2025
cfallin left a comment

Thanks; LGTM!

cfallin added this pull request to the merge queue Nov 7, 2025
Merged via the queue into bytecodealliance:main with commit d5ef528 Nov 7, 2025
45 checks passed