## Summary
Provide a limited set of instructions for 128-bit integer arithmetic.
## Motivation
Two known use cases for 128-bit arithmetic are:
- Arbitrary-precision integer arithmetic with 64-bit limbs. Used among other places in the implementation of public-key cryptography such as RSA.
- Efficiently checking for overflow on 64-bit arithmetic.
Some benchmarks are included below, but as compiled to wasm today these use cases are 2-7x slower than native code. In native code, 128-bit operations rely on tight control over the flags register to achieve much better performance than wasm is able to express today.
The major goal of this proposal is to close the performance gap between native and wasm in these use cases where 128-bit arithmetic is in a hot loop.
## Proposal
Add the following new instructions, each with a type signature of `[i64 i64 i64 i64] -> [i64 i64]`. Each instruction represents 128-bit integers as a pair of `i64` values but is otherwise identical to the corresponding narrower operation (a reference sketch of the semantics follows the list):
- `i64.add128`, extended from `i64.add`
- `i64.sub128`, extended from `i64.sub`
- `i64.mul128`, extended from `i64.mul`
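As a rough reference for the intended behavior, the Rust sketch below models `i64.add128` as a wrapping 128-bit add over (low, high) pairs of `u64` limbs; the low-before-high operand and result ordering is an assumption here, consistent with the example in the next section. `i64.sub128` and `i64.mul128` would follow the same shape with `wrapping_sub` and `wrapping_mul`.

```rust
// Sketch of the intended `i64.add128` semantics, modeled with Rust's `u128`.
// Assumption: each 128-bit value is passed and returned as (low limb, high limb).
fn add128(lhs_lo: u64, lhs_hi: u64, rhs_lo: u64, rhs_hi: u64) -> (u64, u64) {
    let lhs = ((lhs_hi as u128) << 64) | (lhs_lo as u128);
    let rhs = ((rhs_hi as u128) << 64) | (rhs_lo as u128);
    let result = lhs.wrapping_add(rhs);
    // Split the wrapping 128-bit sum back into (low, high) limbs.
    (result as u64, (result >> 64) as u64)
}
```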
### Example
This function performs a 64-bit integer add and returns both the 64-bit result and the unsigned carry, roughly equivalent to Rust's `u64::overflowing_add`:
```wat
(module
  (func (param i64 i64) (result i64 i64)
    local.get 0
    i64.const 0
    local.get 1
    i64.const 0
    i64.add128
  )
)
```
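For comparison, a minimal Rust sketch of the same operation is below: the low half of the 128-bit sum is the wrapping 64-bit result and the high half carries the overflow bit. With this proposal, a compiler could in principle lower such source code to the short `i64.add128` sequence above.

```rust
// Sketch of the source-level operation the wasm above corresponds to: the
// second return value (the carry) matches the high half produced by
// `i64.add128` when both high operand halves are zero.
fn overflowing_add_u64(a: u64, b: u64) -> (u64, u64) {
    let (sum, carried) = a.overflowing_add(b);
    (sum, carried as u64)
}
```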
## Rationale and Alternatives
We evaluated several alternatives and so far this one appears to be the simplest to specify and implement, while also giving the best real-world performance in our initial benchmarks.
We first considered adding an "overflow flag" output to the 64-bit arithmetic instructions, such as `i64.{add,sub,mul}_overflow_{u,s}` of type `[i64 i64] -> [i64 i32]`, where the `i32` result is a 0/1 boolean indicating whether the operation overflowed. This style of instruction is what native platforms often provide for implementing these 128-bit operations, so it seemed like a natural candidate to expose directly to wasm.
We also considered a "carry flag" on both input and output, such as `i64.{add,sub}_with_carry_{u,s}` of type `[i64 i64 i32] -> [i64 i32]`. This corresponds, for example, to the `adc` instruction on x64, which is how 128-bit addition is implemented there.
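The Rust sketch below is only an illustration (not from the proposal text) of the carry-chaining semantics such an instruction would expose: each step consumes and produces a 0/1 carry so limb additions can be chained.

```rust
// Sketch of `i64.add_with_carry_u`-style semantics ([i64 i64 i32] -> [i64 i32]):
// add two limbs plus an incoming carry and return the sum limb plus the
// outgoing carry (always 0 or 1).
fn add_with_carry_u(a: u64, b: u64, carry_in: u32) -> (u64, u32) {
    let sum = a as u128 + b as u128 + carry_in as u128;
    (sum as u64, (sum >> 64) as u32)
}
```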
We found it nontrivial, and in places quite difficult, to generate good code from these instructions, as shown in the benchmark results below.
- A naive implementation of the instructions repeatedly moved the carry flag between a general-purpose register and the flags register. Avoiding this movement would require relatively advanced instruction fusing. Even then, loops typically require instructions that clobber flags, so such optimizations would only apply to straightline code, which we suspect is not the common case for these operations. Our experiments were on x86 but we suspect this observation applies to most current CPU instruction sets.
- Another difficulty is that these instructions produce multiple results, which we have observed are difficult to integrate into many optimization frameworks. The benchmarks below show that these sorts of optimizations are likely to be crucial for performance, and when they are not implemented the new instructions are actually detrimental to performance.
In contrast, with 128-bit operations the need to fuse with other instructions is much lower, and it seems that "simply having support" is a pretty good sweet spot in terms of both performance and complexity.
In addition, we considered a "wide multiply" `i64.mul_wide_{u,s}` with type `[i64 i64] -> [i64 i64]` that takes two `i64` inputs and produces both the low and high halves of the 128-bit result. This is the minimal operation needed for arbitrary-precision multiplication. It is straightforward to build a full 128-bit multiply from such an instruction (sketched below), and the naive implementation of this instruction performed better than `i64.mul128` in the benchmarks below. This maps closely to the x64 `mul` instruction as well as to equivalent instructions on AArch64, for example. While this "primitive" worked for multiplication, the corresponding primitive for addition/subtraction did not work out (the `*_overflow_*` instructions above), so we opted to add `i64.mul128` instead of this instruction.
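To make the "straightforward to build" claim concrete, here is a sketch (not taken from the proposal text) of assembling a wrapping 128-bit multiply from a 64x64 -> 128 widening multiply, which is the operation `i64.mul_wide_u` would expose directly.

```rust
// Widening 64x64 -> 128 multiply, the primitive `i64.mul_wide_u` corresponds to.
fn mul_wide_u(a: u64, b: u64) -> (u64, u64) {
    let product = a as u128 * b as u128;
    (product as u64, (product >> 64) as u64)
}

// Wrapping 128-bit multiply over (low, high) limb pairs, built from the
// widening multiply. Only the low 128 bits of the full 256-bit product are
// kept, matching what a wrapping `i64.mul128` would compute.
fn mul128(lhs_lo: u64, lhs_hi: u64, rhs_lo: u64, rhs_hi: u64) -> (u64, u64) {
    // lo * lo contributes to both halves of the result.
    let (lo, mut hi) = mul_wide_u(lhs_lo, rhs_lo);
    // The cross terms only affect the high half; their own high halves wrap away.
    hi = hi
        .wrapping_add(lhs_lo.wrapping_mul(rhs_hi))
        .wrapping_add(lhs_hi.wrapping_mul(rhs_lo));
    (lo, hi)
}
```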
Another possible alternative is to add a native `i128` type to WebAssembly. Given the relatively niche nature of 128-bit integers, however, coupled with the pretty good code compilers already emit for 128-bit operations in source languages, we see this alternative as far too heavyweight for the benefits it could bring. By just adding a few specialized instructions for the slow use cases, it's possible to get much closer to native performance than wasm achieves today.
## Benchmark Results
This proposal has been prototyped in LLVM, Wasmtime/Cranelift, and Rust (all just living in forks for now) with the intention of providing guidance on the direction this proposal should take. To that end, many of the above alternatives, as well as this proposal itself, were implemented and evaluated.
Each benchmark below frobs various custom flags in LLVM to ensure that only one alternative of this proposal is in use at a time. Output was hand-verified to contain the desired instructions, and profiling confirmed that Cranelift was generating the expected instruction sequences.
Benchmarks were collected on an Intel(R) Core(TM) i9-14900K on a Linux machine. Other platforms will be tested later if this proposal advances through the CG stages.
### Benchmark: addition/subtraction
For this benchmark the Rust `num-bigint` crate was used. In the Rust source for this benchmark, each pair of limbs is first cast from `u64` to `u128`, then added, and finally split into the low and high `u64` halves. In my experience this is a common pattern for arbitrary-precision arithmetic implementations in both C and Rust.
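In rough terms (the actual `num-bigint` code differs in detail), the limb-addition loop in question follows this pattern:

```rust
// Sketch of the widen-add-split limb pattern described above; `acc` and `rhs`
// are assumed to have the same length. With this proposal, the `u128`
// addition can lower to a single `i64.add128`.
fn add_assign_limbs(acc: &mut [u64], rhs: &[u64]) -> u64 {
    let mut carry = 0u64;
    for (a, &b) in acc.iter_mut().zip(rhs) {
        let sum = *a as u128 + b as u128 + carry as u128;
        *a = sum as u64;            // low half becomes the new limb
        carry = (sum >> 64) as u64; // high half becomes the carry
    }
    carry
}
```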
We compiled the benchmark to native x86-64 (to use as a baseline), to current WebAssembly, and to each of three variations on the proposed extension. We compared runtime for each variant of WebAssembly against the native code baseline and report how much slower than native each variant is (lower is better):
- current wasm - 125%
- `i64.add_overflow_u` - 88%
- `i64.add_with_carry_u` - 155%
- `i64.add128` (this proposal) - 18%
Here current WebAssembly starts at 125% slower than native, and the 128-bit ops in this proposal were able to close that gap to 18% slower than native. Other shapes of this proposal, as naively implemented in LLVM/Cranelift, would require more advanced optimizations, specifically around management of the flags register, to get closer to native performance.
### Benchmark: multiplication
For this benchmark the `blind-sig` benchmark in the Sightglass repository was selected. This benchmark is bottlenecked on the `__multi3` helper in LLVM's compiler-rt, which is a library implementation of 128-bit multiplication compiled to WebAssembly.
Benchmarking was performed as above; numbers here are reported as slowdowns relative to native code, where lower is better:
- current wasm - 637%
- `i64.mul_wide_u` - 100%
- `i64.mul128` (this proposal) - 125%
Here the 637% slower-than-native gap was closed to ~100%. The alternative instruction `i64.mul_wide_u` was the best performing, but this is because Wasmtime is missing some easy-to-implement optimizations for full 128-bit multiplication (e.g. here the upper halves of the 128-bit numbers are always zero). We're confident that with a minor amount of elbow grease the proposed `i64.mul128` instruction can be as good as `i64.mul_wide_u`.
## Open Questions
- Are there other use cases where instructions like these would be useful?
- Should we introduce an `i128` type and require that all integer operations be implemented for it, or stick with the most useful subset and represent values as pairs of `i64`, as proposed above?
- Are there other motivating use cases for `*_overflow_*` or `*_with_carry_*` instructions? Ones that perhaps won't require "advanced flag optimizations" to showcase their benefit?
## Meta
This proposal is a joint effort of @alexcrichton (Fermyon) and @jameysharp (Fastly). We intend to start discussion with this issue and then bring a more complete proposal to a future CG meeting.
This issue is also related to historical discussions such as: