Description
Feature Request
Although the code for parsing decimals is already quite nice, work I've done on lexical (my Rust float parser) shows that when the slow path is invoked, we can get significantly faster performance in a few cases, and a minimal version of this approach can be easy to implement, readable, and correct.
The speedup comes from the use of big-integer arithmetic and a few algorithms optimized for big-integer math when parsing decimal strings, which minimizes the number of operations relative to the decimal class in fast_float. There is more code involved; however, this code relies on fewer magic numbers and instead uses simple big-integer algorithms. The only additional static storage required is ~32 64-bit integers, considerably less than the decimal implementation needs.
This performance gap is quite noticeable with floats like `8.988465674311580536566680e307`, where the time per iteration goes from ~14.465us (fast_float) to ~500ns (lexical). For denormal floats or those with negative exponents like `8.442911973260991817129021e-309`, the improvement is smaller, from ~69.264ns/iter (fast_float) to ~58.073ns/iter (lexical). This also scales very well with input size, out to 768 digits (767, the maximum for a double-precision float, plus 1 extra) and beyond.
The actual implementation is quite simple: it uses a fairly minimal subset of big-integer arithmetic, which can be implemented in <700 lines of code, and the parsing algorithms themselves are easy to understand and implement. If this is of interest, I'd be happy to submit a PR.
Positive Exponent Implementation
Here is a Rust version of the code required to implement this; obviously, it would be translated to C++:
```rust
/// Generate the significant digits with a positive exponent relative to mantissa.
pub fn positive_digit_comp<F: RawFloat>(
    mut bigmant: Bigint,
    exponent: i32,
) -> ExtendedFloat80 {
    // Simple, we just need to multiply by the power of the radix.
    // Now, we can calculate the mantissa and the exponent from this.
    // The binary exponent is the binary exponent for the mantissa
    // shifted to the hidden bit.
    bigmant.pow(10, exponent as u32);

    // Get the exact representation of the float from the big integer.
    // `hi64` checks **all** the remaining bits after the mantissa,
    // so it will check if **any** truncated digits exist.
    let (mant, is_truncated) = bigmant.hi64();
    let exp = bigmant.bit_length() as i32 - 64 + F::EXPONENT_BIAS;
    let mut fp = ExtendedFloat80 { mant, exp };

    // Shift the digits into position and determine if we need to round-up.
    shared::round::<F, _>(&mut fp, |f, s| {
        shared::round_nearest_tie_even(f, s, |is_odd, is_halfway, is_above| {
            is_above || (is_halfway && is_truncated) || (is_odd && is_halfway)
        });
    });
    fp
}
```
Here, the exponent is relative to the significant digits, and `ExtendedFloat80` is just `AdjustedMantissa`, an 80-bit extended-precision float with a biased exponent. Our `bigmant` is just the significant digits parsed as a big integer. `hi64` is a function that gets the high 64 bits from the big integer (for the significant digits) and checks if any bits below are truncated (for rounding), and `EXPONENT_BIAS` is `1075` for a 64-bit float.
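For reference, here is a minimal sketch of these pieces, assuming little-endian 64-bit limbs; the names mirror the snippets in this issue, but the exact layout is my assumption:

```rust
/// An 80-bit extended-precision float with a biased binary exponent
/// (equivalent to `AdjustedMantissa` in fast_float).
pub struct ExtendedFloat80 {
    pub mant: u64,
    pub exp: i32,
}

pub struct Bigint {
    /// Little-endian limbs: the least-significant limb comes first.
    pub data: Vec<u64>,
}

impl Bigint {
    /// Get the high 64 bits, normalized so the most-significant bit is
    /// set, plus a flag for whether any lower bits were truncated.
    pub fn hi64(&self) -> (u64, bool) {
        let n = self.data.len();
        let hi = self.data[n - 1];
        debug_assert!(hi != 0, "stored big integers have a non-zero top limb");
        let lz = hi.leading_zeros();
        if lz == 0 {
            // The top limb is already normalized: any lower limb is truncation.
            (hi, self.data[..n - 1].iter().any(|&limb| limb != 0))
        } else if n == 1 {
            (hi << lz, false)
        } else {
            // Borrow the missing high bits from the next limb down.
            let lo = self.data[n - 2];
            let mant = (hi << lz) | (lo >> (64 - lz));
            let truncated = (lo << lz) != 0
                || self.data[..n - 2].iter().any(|&limb| limb != 0);
            (mant, truncated)
        }
    }
}
```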
This implementation is quite simple: first, we get the max power of the radix (in this case, 10) that can be stored in a limb, that is, the largest N where `10^N <= 2^BITS - 1`. For 32-bit limbs this is 9; for 64-bit limbs it is 19. Next, we get the max native integer for this power (`10^9` for 32-bit limbs, `10^19` for 64-bit limbs). Then we parse the maximum number of digits we can into a native limb and add that limb to the big integer, effectively using the following logic:
```rust
let mut counter: usize = 0;
let mut count: usize = 0;
let mut value: Limb = 0;
let step = ...; // max digits per limb: 9 for 32-bit limbs, 19 for 64-bit.
let mut result = Bigint::new();
for &digit in digits {
    // Add our temporary values.
    value *= 10;
    value += (digit - 0x30) as Limb;
    counter += 1;
    count += 1;
    // Check if we've reached our max native value.
    if counter == step {
        result.mul_small(max_native);
        result.add_small(value);
        counter = 0;
        value = 0;
    }
    // Check if we've exhausted our max digits.
    if count == max_digits {
        // Add temporary...
        ...
    }
}
```
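The elided flush at the end would look something like the following sketch (my assumption, not the exact code): any digits still buffered in `value` must be committed with the power of 10 matching how many digits were accumulated.

```rust
// Commit any digits still buffered in `value`: scale the big integer
// by 10^counter (the number of buffered digits), not by `max_native`.
if counter != 0 {
    result.mul_small((10 as Limb).pow(counter as u32));
    result.add_small(value);
}
```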
In total, we need operations for the following (a sketch of the scalar primitives follows this list):
- Scalar add and mul, with the ability to detect overflow or the carry.
- Addition and multiplication of a scalar to a big integer.
- Grade-school multiplication of two big integers (asymptotically faster algorithms aren't applicable, since the inputs are too small).
- A SHL operation for the big integer, shifting both bits and limbs.
- An efficient big-integer power algorithm.
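As a concrete example, the scalar primitives are nearly one-liners in modern Rust; here is a minimal sketch assuming 64-bit limbs with native 128-bit multiplication (the names are illustrative):

```rust
type Limb = u64;

/// Add a scalar to a limb, returning the result and whether it overflowed.
fn scalar_add(x: Limb, y: Limb) -> (Limb, bool) {
    x.overflowing_add(y)
}

/// Multiply two limbs plus a carry-in, returning the low limb and the carry-out.
fn scalar_mul(x: Limb, y: Limb, carry: Limb) -> (Limb, Limb) {
    let z = (x as u128) * (y as u128) + (carry as u128);
    (z as Limb, (z >> 64) as Limb)
}
```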
The entirety of the big-integer algorithms is <700 lines of code, although my version is extensively documented. The only one here that's remotely interesting is the power algorithm, which uses a single pre-computed large power-of-5 and a small number of pre-computed small powers. This is effectively just:
```rust
/// Pre-computed small powers-of-5, as native integers.
pub const SMALL_INT_POW5: [u64; 28] = [
    1, 5, 25, 125, 625, 3125, 15625, 78125, 390625, 1953125, 9765625,
    48828125, 244140625, 1220703125, 6103515625, 30517578125, 152587890625,
    762939453125, 3814697265625, 19073486328125, 95367431640625,
    476837158203125, 2384185791015625, 11920928955078125, 59604644775390625,
    298023223876953125, 1490116119384765625, 7450580596923828125,
];

/// Pre-computed large power-of-5 for 32-bit limbs.
#[cfg(not(all(target_pointer_width = "64", not(target_arch = "sparc"))))]
pub const LARGE_POW5: [u32; 10] = [
    4279965485, 329373468, 4020270615, 2137533757, 4287402176, 1057042919,
    1071430142, 2440757623, 381945767, 46164893,
];

/// Pre-computed large power-of-5 for 64-bit limbs.
#[cfg(all(target_pointer_width = "64", not(target_arch = "sparc")))]
pub const LARGE_POW5: [u64; 5] = [
    1414648277510068013,
    9180637584431281687,
    4539964771860779200,
    10482974169319127550,
    198276706040285095,
];

/// Step for the large power-of-5: `LARGE_POW5` stores `5^135`.
pub const LARGE_POW5_STEP: u32 = 135;
```
```rust
pub fn pow(x: &mut Vec<Limb>, base: u32, mut exp: u32) {
    // Note: the small-power table is for base 5, so this sketch assumes
    // `base == 5`; the real implementation splits powers of 10 into a
    // power-of-5 multiply and a left-shift (see below).
    let large = &LARGE_POW5;
    let step = LARGE_POW5_STEP;
    while exp >= step {
        large_mul(x, large);
        exp -= step;
    }

    // Now use our pre-computed small powers iteratively.
    let small_step = if LIMB_BITS == 32 { 13 } else { 27 };
    let max_native = (base as Limb).pow(small_step);
    while exp >= small_step {
        small_mul(x, max_native);
        exp -= small_step;
    }
    if exp != 0 {
        let small_power = SMALL_INT_POW5[exp as usize];
        small_mul(x, small_power as Limb);
    }
}
```
This is effectively self-explanatory: use large powers of 5 and the large, grade-school multiplication algorithm to minimize the number of multiplications, and then use a small pre-computed table for the remainder. The real power implementation splits this into a power-of-5 multiplication and a left-shift (for the power-of-2).
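That split would look something like this; a minimal sketch, assuming a `shl` helper that shifts the big integer left by a bit count (the helper name is mine):

```rust
/// Multiply the big integer by `10^exp`, using `10^exp = 5^exp * 2^exp`.
pub fn pow10(x: &mut Vec<Limb>, exp: u32) {
    pow(x, 5, exp);       // multiply by 5^exp using the tables above
    shl(x, exp as usize); // then shift left by exp bits (multiply by 2^exp)
}
```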
Negative Exponent Implementation
The negative exponent implementation is likewise simple, but a little different: we need our big mantissa and exponent from before, but we also need an extended-float prior to rounding. In order to make this work with the existing Lemire algorithm, I simply add `i16::MIN` (`-2^15`) to the exponent rather than set it to `-1`, so the real extended-float can be passed over. This requires only trivial modifications to existing code, and has no impact on the performance of the faster algorithms.
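A minimal sketch of that sentinel trick; the constant and function names here are mine, not the actual API:

```rust
/// Bias added to the exponent to signal "retry with the slow path"
/// while keeping the real extended-float recoverable.
const INVALID_BIAS: i32 = i16::MIN as i32; // -32768

fn mark_needs_slow_path(fp: &mut ExtendedFloat80) {
    fp.exp += INVALID_BIAS;
}

fn needs_slow_path(fp: &ExtendedFloat80) -> bool {
    // Real biased exponents are tiny compared to the sentinel bias.
    fp.exp < INVALID_BIAS / 2
}

fn recover(fp: &mut ExtendedFloat80) {
    fp.exp -= INVALID_BIAS;
}
```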
First, we create a big integer representing `b+h`, so we can determine if we round to `b+u` or to `b`. This is very simple:
```rust
/// Calculate `b` from a representation of `b` as a float.
#[inline]
pub fn b<F: RawFloat>(float: F) -> ExtendedFloat80 {
    ExtendedFloat80 {
        mant: float.mantissa().as_u64(),
        exp: float.exponent(),
    }
}

/// Calculate `b+h` from a representation of `b` as a float.
#[inline]
pub fn bh<F: RawFloat>(float: F) -> ExtendedFloat80 {
    let fp = b(float);
    ExtendedFloat80 {
        mant: (fp.mant << 1) + 1,
        exp: fp.exp - 1,
    }
}
```
Note that `bh` represents the halfway point exactly: `(2*mant + 1) * 2^(exp-1)` is `(mant + 0.5) * 2^exp`. Next, we calculate both the numerator and denominator of a ratio representing the float:
```rust
/// Generate the significant digits with a negative exponent relative to mantissa.
pub fn negative_digit_comp<F: RawFloat>(
    bigmant: Bigint,
    mut fp: ExtendedFloat80,
    exponent: i32,
) -> ExtendedFloat80 {
    // Ensure our preconditions are valid:
    // 1. The significant digits are not shifted into place.
    debug_assert!(fp.mant & (1 << 63) != 0);

    // Get the significant digits and radix exponent for the real digits.
    let mut real_digits = bigmant;
    let real_exp = exponent;
    debug_assert!(real_exp < 0);

    // Round down our extended-precision float and calculate `b`.
    let mut b = fp;
    shared::round::<F, _>(&mut b, shared::round_down);
    let b = extended_to_float::<F>(b);

    // Get the significant digits and the binary exponent for `b+h`.
    let theor = bh(b);
    let mut theor_digits = Bigint::from_u64(theor.mant);
    let theor_exp = theor.exp;

    // We need to scale the real digits and `b+h` digits to be the same
    // order. We currently have `real_exp`, in `radix`, that needs to be
    // shifted to `theor_digits` (since it is negative), and `theor_exp`
    // to either `theor_digits` or `real_digits` as a power of 2 (since it
    // may be positive or negative). Try to remove as many powers of 2
    // as possible. All values are relative to `theor_digits`, that is,
    // they reflect the power you need to multiply `theor_digits` by.
    let binary_exp = theor_exp - real_exp;
    let halfradix_exp = -real_exp;
    if halfradix_exp != 0 {
        theor_digits.pow(5, halfradix_exp as u32);
    }
    if binary_exp > 0 {
        theor_digits.pow(2, binary_exp as u32);
    } else if binary_exp < 0 {
        real_digits.pow(2, (-binary_exp) as u32);
    }
    ...
}
```
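To make the scaling concrete, consider illustrative numbers of my own: suppose the real value is `real_digits * 10^-5` and the theoretical value is `theor_digits * 2^-60`. Then `binary_exp = -60 - (-5) = -55` and `halfradix_exp = 5`, so `theor_digits` is multiplied by `5^5` and, since `binary_exp < 0`, `real_digits` is multiplied by `2^55`. Both values now share the common scale `10^-5 * 2^-55` and can be compared exactly as integers.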
Finally, we compare these two approximations and determine if we round up or down:
```rust
pub fn negative_digit_comp<F: RawFloat>(
    bigmant: Bigint,
    mut fp: ExtendedFloat80,
    exponent: i32,
) -> ExtendedFloat80 {
    ...
    // Compare our theoretical and real digits and round nearest, tie even.
    let ord = real_digits.data.cmp(&theor_digits.data);
    shared::round::<F, _>(&mut fp, |f, s| {
        shared::round_nearest_tie_even(f, s, |is_odd, _, _| {
            // Can ignore `is_halfway` and `is_above`, since those were
            // calculated using less significant digits.
            match ord {
                cmp::Ordering::Greater => true,
                cmp::Ordering::Less => false,
                cmp::Ordering::Equal if is_odd => true,
                cmp::Ordering::Equal => false,
            }
        });
    });
    fp
}
```
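For completeness, here is a sketch of how the two paths would be dispatched from the slow path's entry point; this is an assumption about the surrounding structure, and the function name is illustrative:

```rust
/// Dispatch to the positive- or negative-exponent algorithm, where
/// `exponent` is relative to the parsed significant digits.
pub fn digit_comp<F: RawFloat>(
    bigmant: Bigint,
    fp: ExtendedFloat80,
    exponent: i32,
) -> ExtendedFloat80 {
    if exponent >= 0 {
        positive_digit_comp::<F>(bigmant, exponent)
    } else {
        negative_digit_comp::<F>(bigmant, fp, exponent)
    }
}
```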
Specifics
First, we use 64-bit limbs on platforms where native 128-bit multiplication is supported, or where 64-bit multiplication can extract both the high and the low bits efficiently. In practice, this means practically every 64-bit architecture except SPARCv8 and SPARCv9 uses 64-bit limbs. This is detected by compiling the following code with `gcc main.c -c -S -O3 -masm=intel`:
```c
#include <stdint.h>

struct i128 {
    uint64_t hi;
    uint64_t lo;
};

// Multiply two 64-bit integers into a 128-bit result.
struct i128 square(uint64_t x, uint64_t y) {
    __int128 prod = (__int128)x * (__int128)y;
    struct i128 z;
    z.hi = (uint64_t)(prod >> 64);
    z.lo = (uint64_t)prod;
    return z;
}
```
If the compiled code contains `call __multi3`, then the 128-bit multiplication is emulated by GCC. Otherwise, there is native platform support. Architectures with 128-bit support include:
- x86_64 (supported via `MUL`).
- mips64 (supported via `DMULTU`, from which `HI` and `LO` can be read).
- s390x (supported via `MLGR`).
And architectures where 64-bit limbs are more efficient than 32-bit limbs are:
- aarch64 (requires `UMULH` and `MUL` to capture the high and low bits).
- powerpc64 (requires `MULHDU` and `MULLD` to capture the high and low bits).
- riscv64 (requires `MUL` and `MULH` to capture the high and low bits).
Caveats
This might be slower on 32-bit architectures or those without native 64-bit multiplication support. However, most modern architectures have native 64-bit or 128-bit multiplication.
Proof of Concept Code
A working implementation of the big-integer primitives can be found here. The slow-path algorithms can be found here, as `positive_digit_comp` and `negative_digit_comp`. Note that this exact implementation hasn't been fully tested, but it is a minimal fork of existing lexical code, so comparable code has been battle-tested through nearly 3 years of production use by millions of users.
Benchmarks
The benchmarks were run on the following values, using the fast-float-rust implementation integrated into the Rust standard library:
```rust
// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example long, large, near-halfway value.
const LARGE_LONG: &str = "8.9884656743115805365666807213050294962762414131308158973971342756154045415486693752413698006024096935349884403114202125541629105369684531108613657287705365884742938136589844238179474556051429647415148697857438797685859063890851407391008830874765563025951597582513936655578157348020066364210154316532161708032e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
// Example of a long, denormal, near-halfway value.
const DENORMAL_LONG: &str = "2.4703282292062327208828439643411068618252990130716238221279284125033775363510437593264991818081799618989828234772285886546332835517796989819938739800539093906315035659515570226392290858392449105184435931802849936536152500319370457678249219365623669863658480757001585769269903706311928279558551332927834338409351978015531246597263579574622766465272827220056374006485499977096599470454020828166226237857393450736339007967761930577506740176324673600968951340535537458516661134223766678604162159680461914467291840300530057530849048765391711386591646239524912623653881879636239373280423891018672348497668235089863388587925628302755995657524455507255189313690836254779186948667994968324049705821028513185451396213837722826145437693412532098591327667236328124999e-324";
```
These are meant to represent floats that use `positive_digit_comp` or `negative_digit_comp`.
The microbenchmark results are as follows:
```
core/large            time: [13.443 us 13.464 us 13.487 us]
core/large_long       time: [4.7838 us 4.8724 us 4.9711 us]
core/denormal         time: [65.280 ns 65.744 ns 66.180 ns]
core/denormal_long    time: [36.856 us 36.905 us 36.960 us]
lexical/large         time: [371.46 ns 374.19 ns 378.37 ns]
lexical/large_long    time: [1.1639 us 1.1745 us 1.1889 us]
lexical/denormal      time: [55.191 ns 55.497 ns 55.901 ns]
lexical/denormal_long time: [799.36 ns 818.20 ns 836.18 ns]
```
As you can see, the big-integer algorithm outperforms the decimal implementation in every case, and performs exceptionally well in a few notable edge cases. This was tested on x86_64, so benchmarks on other architectures may be required (specifically, 32-bit architectures and aarch64, which may make slightly less efficient use of 64-bit limbs).
License
I own all the copyright to the aforementioned code, and am happy to provide it, in a PR, under any license terms, including public domain. So, no licensing issues exist.