-
Notifications
You must be signed in to change notification settings - Fork 24
Description
Proposal
Problem statement
Carryless product (also known as XOR multiplication) is backed by hardware instructions on many platforms, but there's no way for users to write code which uses these instructions without accessing them through hardware intrinsics. This results in users having to write more platform specific code or rely on less optimized, scalar expressions of the operation for portable applications.
Motivating examples or use cases
Carryless product can be used to accelerate checksums (like CRC/cyclic redundancy checks) and aid in implementing some cryptographic algorithms. Since many left shifts and XORs are executed in parallel, there are also applications of it speeding up bit manipulation operations. In the links section are some example applications: one showing the acceleration of JSON parsing by finding pairs of quotation marks using carryless multiplication to perform a prefix XOR, another showing accelerated Morton code/Z-order curve calculations.
This operation can also be referred to as "polynomial multiplication" since it's equivalent to the multiplication of two polynomials in GF(2), the finite field with 2 elements (polynomials can be represented as bitstrings).
Solution sketch
Intrinsic support was recently added in LLVM 22 and has the following semantics according to the LangRef:
The ‘llvm.clmul’ intrinsic computes carry-less multiply of its arguments, which is the result of applying the standard multiplication algorithm, where all of the additions are replaced with XORs, and returns the low-bits. The vector variants operate lane-wise.
A widening variant is also proposed which returns a tuple of (low_bits, high_bits).
impl uN {
/// Calculates the carryless product of `self` and `rhs`, returning the low-order bits.
///
/// Instead of addition, XOR (exclusive-or) is applied as the reduction operation.
///
/// If the high-order bits are needed, see [`Self::widening_carryless_mul`].
/// ```
/// let x = 0b1101;
/// let y = 0b0110;
/// let z = 0b1111;
/// assert_eq!(0b00101110, x.carryless_mul(y));
/// assert_eq!(0b01010101, z.carryless_mul(z));
/// ```
pub const fn carryless_mul(self, rhs: Self) -> Self;
/// Calculates the carryless product of `self` and `rhs`, returning a tuple of low-order
/// bits and high-order bits in that order.
///
/// Instead of addition, XOR (exclusive-or) is applied as the reduction operation.
///
/// If only the low-order bits are needed, consider using [`Self::carryless_mul`].
/// ```
/// let x = 0b10100010;
/// let y = 0b10010110;
/// let z = 0b11111111;
/// let (lo, hi) = x.widening_carryless_mul(y);
/// assert_eq!(0b11101100, lo);
/// assert_eq!(0b01011000, hi);
/// assert_eq!((0b01010101, 0b01010101), z.widening_carryless_mul(z));
/// ```
pub const fn widening_carryless_mul(self, rhs: Self) -> (Self, Self);
}Alternatives
- Do nothing, rely on the compiler recognizing scalar implementations of these patterns or require users to use intrinsics.
- Wait until LLVM 23 for implementing this. More widespread use of this intrinsic in LLVM optimizations and architecture backends won't reach Rust until LLVM 23 since the intrinsic landed not long before the version branch.
- Should the widening variant instead return the next integer size larger, and not be implemented for
u128? Sou8->u16,u16->u32,u32->u64, but nothing foru128.
Links and related work
https://llvm.org/docs/LangRef.html#llvm-clmul-intrinsic
Jan Schultke, active C++29 proposal - Carry-less product: std::clmul
Hardware instructions
x86: pclmulqdq, wikipedia:CLMUL instruction set
Arm NEON: pmull/pmull2 for low and high bits, respectively
RISC-V: clmul/clmulh/clmulr(reversed), the Zbc extension
SPARC: XMULX/XMULXHI, "XOR multiply"
POWER: vpmsum, "vector polynomial multiply sum"
Applications/articles
Geoff Langdale - Finding quote pairs with carry-less multiply
Wunkolo - pclmulqdq Tricks
Intel white paper - Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.