Skip to content

ACP: Add carryless product for unsigned integers #738

@okaneco

Description

@okaneco

Proposal

Problem statement

Carryless product (also known as XOR multiplication) is backed by hardware instructions on many platforms, but there's no way for users to write code which uses these instructions without accessing them through hardware intrinsics. This results in users having to write more platform specific code or rely on less optimized, scalar expressions of the operation for portable applications.

Motivating examples or use cases

Carryless product can be used to accelerate checksums (like CRC/cyclic redundancy checks) and aid in implementing some cryptographic algorithms. Since many left shifts and XORs are executed in parallel, there are also applications of it speeding up bit manipulation operations. In the links section are some example applications: one showing the acceleration of JSON parsing by finding pairs of quotation marks using carryless multiplication to perform a prefix XOR, another showing accelerated Morton code/Z-order curve calculations.

This operation can also be referred to as "polynomial multiplication" since it's equivalent to the multiplication of two polynomials in GF(2), the finite field with 2 elements (polynomials can be represented as bitstrings).

Solution sketch

Intrinsic support was recently added in LLVM 22 and has the following semantics according to the LangRef:

The ‘llvm.clmul’ intrinsic computes carry-less multiply of its arguments, which is the result of applying the standard multiplication algorithm, where all of the additions are replaced with XORs, and returns the low-bits. The vector variants operate lane-wise.

A widening variant is also proposed which returns a tuple of (low_bits, high_bits).

impl uN {
    /// Calculates the carryless product of `self` and `rhs`, returning the low-order bits.
    ///
    /// Instead of addition, XOR (exclusive-or) is applied as the reduction operation.
    ///
    /// If the high-order bits are needed, see [`Self::widening_carryless_mul`].
    /// ```
    ///     let x = 0b1101;
    ///     let y = 0b0110;
    ///     let z = 0b1111;
    ///     assert_eq!(0b00101110, x.carryless_mul(y));
    ///     assert_eq!(0b01010101, z.carryless_mul(z));
    /// ```
    pub const fn carryless_mul(self, rhs: Self) -> Self;

    /// Calculates the carryless product of `self` and `rhs`, returning a tuple of low-order
    /// bits and high-order bits in that order.
    ///
    /// Instead of addition, XOR (exclusive-or) is applied as the reduction operation.
    ///
    /// If only the low-order bits are needed, consider using [`Self::carryless_mul`].
    /// ```
    ///     let x = 0b10100010;
    ///     let y = 0b10010110;
    ///     let z = 0b11111111;
    ///     let (lo, hi) = x.widening_carryless_mul(y);
    ///     assert_eq!(0b11101100, lo);
    ///     assert_eq!(0b01011000, hi);
    ///     assert_eq!((0b01010101, 0b01010101), z.widening_carryless_mul(z));
    /// ```
    pub const fn widening_carryless_mul(self, rhs: Self) -> (Self, Self); 
}

Alternatives

  1. Do nothing, rely on the compiler recognizing scalar implementations of these patterns or require users to use intrinsics.
  2. Wait until LLVM 23 for implementing this. More widespread use of this intrinsic in LLVM optimizations and architecture backends won't reach Rust until LLVM 23 since the intrinsic landed not long before the version branch.
  3. Should the widening variant instead return the next integer size larger, and not be implemented for u128? So u8->u16, u16->u32, u32->u64, but nothing for u128.

Links and related work

https://llvm.org/docs/LangRef.html#llvm-clmul-intrinsic
Jan Schultke, active C++29 proposal - Carry-less product: std::clmul

Hardware instructions

x86: pclmulqdq, wikipedia:CLMUL instruction set
Arm NEON: pmull/pmull2 for low and high bits, respectively
RISC-V: clmul/clmulh/clmulr(reversed), the Zbc extension
SPARC: XMULX/XMULXHI, "XOR multiply"
POWER: vpmsum, "vector polynomial multiply sum"

Applications/articles

Geoff Langdale - Finding quote pairs with carry-less multiply
Wunkolo - pclmulqdq Tricks
Intel white paper - Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    T-libs-apiapi-change-proposalA proposal to add or alter unstable APIs in the standard libraries

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions