Description
Introduction
The "Parsing Gigabytes of JSON per second" post (ArXiv - langdale, lemire) proposes a novel approach for parsing JSON that is fast enough that on many systems it moves the bottleneck to the disk and network instead of the parser. This is done through the clever use of SIMD instructions.
Something that stood out to me from the post is that JSON is required to be valid utf-8, and they had come up with new algorithms to validate utf-8 using SIMD instructions that function much faster than conventional approaches.
Since rustc does a lot of utf-8 validation (each .rs
source file needs to be valid utf-8), it
got me curious about what rustc currently does. Validation seems to be done by the following routine:
Lines 1500 to 1618 in 2f688ac
This doesn't appear to use SIMD anywhere, not even conditionally. But it's run a lot, so I figured it might be interesting to use a more efficient algorithm for.
Performance improvements
The post "Validating UTF-8 strings using as little as 0.7 cycles per byte" shows about an order of magnitude performance improvement on validating utf-8, going from 8
cycles per byte parsed to 0.7
cycles per byte parsed.
When passing Rust's validation code through the godbolt decompiler, from_utf8_unchecked
outputs 7 instructions, and from_utf8
outputs 57 instructions. In the case of from_utf8
most instructions seem to occur inside a loop. Which makes it likely we'll be able to observe a performance improvement by using a SIMD-enabled utf-8 parsing algorithm. Especially since economies of scale would apply here -- it's not uncommon for the compiler to parse several million bytes of input in a run. Any improvements here would quickly add up.
- assembly for str::from_utf8_unchecked (godbolt) - 7 lines
- assembly for str::from_utf8 (godbolt) - 57 lines
- assembly for run_utf8_validation routine (godbolt) - 183 lines
All examples linked have been compiled with -O -C target-cpu=native
.
Also ecosystem libraries such as serde_json
perform utf-8 validation in several locations, so would likely also benefit from performance improvements to Rust's utf-8 validation routines.
Implementation
There are two known Rust implementations of lemire's algorithm available in Rust today:
The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not be the first person to think of this). But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.
I'm not overly familiar with rustc's internals. But it seems we would likely want to keep the current algorithm, and through feature detection enable SIMD algorithms. The simdjson
library has different algorithms for different architectures, but we could probably start with instructions that are widely available and supported on tier-1 targets (such as AVX2
).
These changes wouldn't require an RFC because no APIs would change. The only outcome would be a performance improvement.
Future work
Lemire's post also covers parsing ASCII in as little as 0.1 cycles per byte parsed. Rust's current ASCII validation algorithm validates bytes one at the time, and could likely benefit from similar optimizations.
Lines 4136 to 4141 in 2f688ac
Speeding this up would likely have ecosystem implications as well. For example HTTP headers must be valid ASCII, and are often performance sensitive. If the stdlib sped up ASCII validation, it would likely benefit the wider ecosystem as well.
Conclusion
In this issue I propose to use a SIMD-enabled algorithm for utf-8 validation in rustc. This seems like an interesting avenue to explore since there's a reasonable chance it might yield a performance improvement for many rust programs.
I'm somewhat excited to have stumbled upon this, but was also surprised no issue had been filed for this yet. I'm a bit self-aware posting this since I'm not a rustc compiler engineer; but I hope this proves to be useful!
cc/ @jonas-schievink @nnethercote
References
- Parsing Gigabytes of JSON per second
- simd-lite/simdjson-rs
- argnidagur/rust-isutf8
- lemire/simdjson
- Validating UTF-8 strings using as little as 0.7 cycles per byte
- assembly for str::from_utf8_unchecked (godbolt) - 7 lines
- assembly for str::from_utf8 (godbolt) - 57 lines
- assembly for run_utf8_validation routine (godbolt) - 183 lines