## Summary
Provide a limited set of instructions for 128-bit integer arithmetic.
## Motivation
Two known use cases for 128-bit arithmetic are:
- Arbitrary-precision integer arithmetic with 64-bit limbs. Used among other places in the implementation of public-key cryptography such as RSA.
- Efficiently checking for overflow on 64-bit arithmetic.
Some benchmarks are included below, but as compiled to wasm today these use cases are 2-7x slower than native code. In native code, 128-bit operations rely on tight control over the flags register to achieve much better performance than wasm is able to express today.
The major goal of this proposal is to close the performance gap between native and wasm in these use cases where 128-bit arithmetic is in a hot loop.
## Proposal
Add the following new instructions, each with a type signature of `[i64 i64 i64 i64] -> [i64 i64]`. Each instruction represents 128-bit integers as a pair of `i64` values but is otherwise identical to the corresponding narrower operation (a reference sketch of the semantics follows the list):
- `i64.add128`, extended from `i64.add`
- `i64.sub128`, extended from `i64.sub`
- `i64.mul128`, extended from `i64.mul`
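As a rough reference for the intended behavior, the Rust sketch below models `i64.add128` as a wrapping 128-bit add over (low, high) pairs of `u64` limbs; the low-before-high operand and result ordering is an assumption here, consistent with the example in the next section. `i64.sub128` and `i64.mul128` would follow the same shape with `wrapping_sub` and `wrapping_mul`.

```rust
// Sketch of the intended `i64.add128` semantics, modeled with Rust's `u128`.
// Assumption: each 128-bit value is passed and returned as (low limb, high limb).
fn add128(lhs_lo: u64, lhs_hi: u64, rhs_lo: u64, rhs_hi: u64) -> (u64, u64) {
    let lhs = ((lhs_hi as u128) << 64) | (lhs_lo as u128);
    let rhs = ((rhs_hi as u128) << 64) | (rhs_lo as u128);
    let result = lhs.wrapping_add(rhs);
    // Split the wrapping 128-bit sum back into (low, high) limbs.
    (result as u64, (result >> 64) as u64)
}
```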
### Example
This function performs a 64-bit integer add and returns both the 64-bit result and the unsigned carry, roughly equivalent to Rust's `u64::overflowing_add`:
```wat
(module
  (func (param i64 i64) (result i64 i64)
    local.get 0
    i64.const 0
    local.get 1
    i64.const 0
    i64.add128
  )
)
```
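For comparison, a minimal Rust sketch of the same operation is below: the low half of the 128-bit sum is the wrapping 64-bit result and the high half carries the overflow bit. With this proposal, a compiler could in principle lower such source code to the short `i64.add128` sequence above.

```rust
// Sketch of the source-level operation the wasm above corresponds to: the
// second return value (the carry) matches the high half produced by
// `i64.add128` when both high operand halves are zero.
fn overflowing_add_u64(a: u64, b: u64) -> (u64, u64) {
    let (sum, carried) = a.overflowing_add(b);
    (sum, carried as u64)
}
```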
## Rationale and Alternatives
We evaluated several alternatives and so far this one appears to be the simplest to specify and implement, while also giving the best real-world performance in our initial benchmarks.
We first considered adding an "overflow flag" output to the 64-bit arithmetic instructions, such as `i64.{add,sub,mul}_overflow_{u,s}` of type `[i64 i64] -> [i64 i32]`, where the `i32` result is a 0/1 boolean indicating whether the operation overflowed. This style of instruction is what native platforms often provide for implementing these 128-bit operations, so it seemed like a natural candidate to expose directly to wasm.
We also considered a "carry flag" on both input and output, such as `i64.{add,sub}_with_carry_{u,s}` of type `[i64 i64 i32] -> [i64 i32]`. This corresponds, for example, to the `adc` instruction on x64, which is how 128-bit addition is implemented there.
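The Rust sketch below is only an illustration (not from the proposal text) of the carry-chaining semantics such an instruction would expose: each step consumes and produces a 0/1 carry so limb additions can be chained.

```rust
// Sketch of `i64.add_with_carry_u`-style semantics ([i64 i64 i32] -> [i64 i32]):
// add two limbs plus an incoming carry and return the sum limb plus the
// outgoing carry (always 0 or 1).
fn add_with_carry_u(a: u64, b: u64, carry_in: u32) -> (u64, u32) {
    let sum = a as u128 + b as u128 + carry_in as u128;
    (sum as u64, (sum >> 64) as u32)
}
```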
We found it nontrivial, and in places quite difficult, to generate good code from these instructions, as shown in the benchmark results below.
- A naive implementation of the instructions repeatedly moved the carry flag between a general-purpose register and the flags register. Avoiding this movement would require relatively advanced instruction fusing. Even then, loops typically require instructions that clobber flags, so such optimizations would only apply to straightline code, which we suspect is not the common case for these operations. Our experiments were on x86 but we suspect this observation applies to most current CPU instruction sets.
- Another difficulty is that these instructions produce multiple results, which we have observed are difficult to integrate into many optimization frameworks. The benchmarks below show that these sorts of optimizations are likely to be crucial for performance, and when they are not implemented the new instructions are actually detrimental to performance.
In contrast, with 128-bit operations the need to fuse with other instructions is much lower, and it seems that "simply having support" is a pretty good sweet spot in terms of both performance and complexity.
In addition, we considered a "wide multiply" `i64.mul_wide_{u,s}` with type `[i64 i64] -> [i64 i64]` that takes two `i64` inputs and produces both the low and high halves of the 128-bit result. This is the minimal operation needed for arbitrary-precision multiplication. It is straightforward to build a full 128-bit multiply from such an instruction (sketched below), and the naive implementation of this instruction performed better than `i64.mul128` in the benchmarks below. This maps closely to the x64 `mul` instruction as well as to equivalent instructions on AArch64, for example. While this "primitive" worked for multiplication, the corresponding primitive for addition/subtraction did not work out (the `*_overflow_*` instructions above), so we opted to add `i64.mul128` instead of this instruction.
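To make the "straightforward to build" claim concrete, here is a sketch (not taken from the proposal text) of assembling a wrapping 128-bit multiply from a 64x64 -> 128 widening multiply, which is the operation `i64.mul_wide_u` would expose directly.

```rust
// Widening 64x64 -> 128 multiply, the primitive `i64.mul_wide_u` corresponds to.
fn mul_wide_u(a: u64, b: u64) -> (u64, u64) {
    let product = a as u128 * b as u128;
    (product as u64, (product >> 64) as u64)
}

// Wrapping 128-bit multiply over (low, high) limb pairs, built from the
// widening multiply. Only the low 128 bits of the full 256-bit product are
// kept, matching what a wrapping `i64.mul128` would compute.
fn mul128(lhs_lo: u64, lhs_hi: u64, rhs_lo: u64, rhs_hi: u64) -> (u64, u64) {
    // lo * lo contributes to both halves of the result.
    let (lo, mut hi) = mul_wide_u(lhs_lo, rhs_lo);
    // The cross terms only affect the high half; their own high halves wrap away.
    hi = hi
        .wrapping_add(lhs_lo.wrapping_mul(rhs_hi))
        .wrapping_add(lhs_hi.wrapping_mul(rhs_lo));
    (lo, hi)
}
```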
Another possible alternative is to add a native `i128` type to WebAssembly. Given the relatively niche nature of 128-bit integers, however, coupled with the pretty good code compilers already emit for 128-bit operations in source languages, we see this alternative as far too heavyweight for the benefits it could bring. By just adding a few specialized instructions for the slow use cases, it's possible to get much closer to native performance than wasm achieves today.
## Benchmark Results
This proposal has been prototyped in LLVM, Wasmtime/Cranelift, and Rust (all just living in forks for now) with the intention of providing guidance on the direction this proposal should take. To that end, many of the above alternatives, as well as this proposal itself, were implemented and evaluated.
Each benchmark below frobs various custom flags in LLVM to ensure that only one alternative of this proposal is in use at a time. Output was hand-verified to contain the desired instructions, and profiling confirmed that Cranelift was generating the expected instruction sequences.
Benchmarks were collected on an Intel(R) Core(TM) i9-14900K on a Linux machine. Other platforms will be tested later if this proposal advances through the CG stages.
### Benchmark: addition/subtraction
For this benchmark the Rust `num-bigint` crate was used. In the Rust source for this benchmark, each pair of limbs is first cast from `u64` to `u128`, then added, and finally split into the low and high `u64` halves. In my experience this is a common pattern for arbitrary-precision arithmetic implementations in both C and Rust.
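In rough terms (the actual `num-bigint` code differs in detail), the limb-addition loop in question follows this pattern:

```rust
// Sketch of the widen-add-split limb pattern described above; `acc` and `rhs`
// are assumed to have the same length. With this proposal, the `u128`
// addition can lower to a single `i64.add128`.
fn add_assign_limbs(acc: &mut [u64], rhs: &[u64]) -> u64 {
    let mut carry = 0u64;
    for (a, &b) in acc.iter_mut().zip(rhs) {
        let sum = *a as u128 + b as u128 + carry as u128;
        *a = sum as u64;            // low half becomes the new limb
        carry = (sum >> 64) as u64; // high half becomes the carry
    }
    carry
}
```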
We compiled the benchmark to native x86-64 (to use as a baseline), to current WebAssembly, and to each of three variations on the proposed extension. We compared runtime for each variant of WebAssembly against the native code baseline and report how much slower than native each variant is (lower is better):
- current wasm - 125%
- `i64.add_overflow_u` - 88%
- `i64.add_with_carry_u` - 155%
- `i64.add128` (this proposal) - 18%
Here current WebAssembly starts at 125% slower than native, and the 128-bit ops in this proposal were able to close that gap to 18% slower than native. Other shapes of this proposal, as naively implemented in LLVM/Cranelift, would require more advanced optimizations, specifically around management of the flags register, to get closer to native performance.
### Benchmark: multiplication
For this benchmark the `blind-sig` benchmark in the Sightglass repository was selected. This benchmark is bottlenecked on the `__multi3` helper in LLVM's compiler-rt, which is a library implementation of 128-bit multiplication compiled to WebAssembly.
Benchmarking was performed as above; numbers here are reported as slowdowns relative to native code, where lower is better:
- current wasm - 637%
- `i64.mul_wide_u` - 100%
- `i64.mul128` (this proposal) - 125%
Here the 637% slower-than-native gap was closed to ~100%. The alternative instruction `i64.mul_wide_u` was the best performing, but this is because Wasmtime is missing some easy-to-implement optimizations for full 128-bit multiplication (e.g. here the upper halves of the 128-bit numbers are always zero). We're confident that with a minor amount of elbow grease the proposed `i64.mul128` instruction can be as good as `i64.mul_wide_u`.
## Open Questions
- Are there other use cases where instructions like these would be useful?
- Should we introduce an `i128` type and require that all integer operations be implemented for it, or stick with the most useful subset and represent values as pairs of `i64`, as proposed above?
- Are there other motivating use cases for `*_overflow_*` or `*_with_carry_*` instructions? Ones that perhaps won't require "advanced flag optimizations" to showcase their benefit?
## Meta
This proposal is a joint effort of @alexcrichton (Fermyon) and @jameysharp (Fastly). We intend to start discussion with this issue and then bring a more complete proposal to a future CG meeting.
This issue is also related to historical discussions such as: