SIMD-enabled utf-8 validation #68455

Introduction

The "Parsing Gigabytes of JSON per second" post (ArXiv - langdale, lemire) proposes a novel approach for parsing JSON that is fast enough that on many systems it moves the bottleneck to the disk and network instead of the parser. This is done through the clever use of SIMD instructions.

Something that stood out to me from the paper is that JSON is required to be valid utf-8, and that the authors came up with new algorithms to validate utf-8 using SIMD instructions that run much faster than conventional approaches.
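To give a flavor of the approach: the core trick is classifying many bytes per instruction instead of one at a time. Below is a minimal sketch of just the ASCII fast path using AVX2 intrinsics (the function name is mine, and this is only the easy part; Lemire's full algorithm also validates multi-byte sequences with vectorized table lookups):

#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn all_ascii_avx2(bytes: &[u8]) -> bool {
    use std::arch::x86_64::*;
    let mut i = 0;
    // ASCII bytes have their high bit clear; `movemask` gathers the high
    // bit of each of the 32 bytes, so it is 0 iff the whole block is ASCII.
    while i + 32 <= bytes.len() {
        // SAFETY: AVX2 is statically enabled by the `cfg` above, and the
        // unaligned load reads 32 in-bounds bytes.
        let mask = unsafe {
            let block = _mm256_loadu_si256(bytes.as_ptr().add(i) as *const __m256i);
            _mm256_movemask_epi8(block)
        };
        if mask != 0 {
            return false;
        }
        i += 32;
    }
    // Finish the tail one byte at a time.
    bytes[i..].iter().all(|&b| b < 128)
}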

Since rustc does a lot of utf-8 validation (each .rs source file needs to be valid utf-8), this got me curious about what rustc currently does. Validation seems to be done by the following routine:

rust/src/libcore/str/mod.rs

Lines 1500 to 1618 in 2f688ac

/// Walks through `v` checking that it's a valid UTF-8 sequence,
/// returning `Ok(())` in that case, or, if it is invalid, `Err(err)`.
#[inline]
fn run_utf8_validation(v: &[u8]) -> Result<(), Utf8Error> {
    let mut index = 0;
    let len = v.len();

    let usize_bytes = mem::size_of::<usize>();
    let ascii_block_size = 2 * usize_bytes;
    let blocks_end = if len >= ascii_block_size { len - ascii_block_size + 1 } else { 0 };
    let align = v.as_ptr().align_offset(usize_bytes);

    while index < len {
        let old_offset = index;
        macro_rules! err {
            ($error_len: expr) => {
                return Err(Utf8Error { valid_up_to: old_offset, error_len: $error_len });
            };
        }

        macro_rules! next {
            () => {{
                index += 1;
                // we needed data, but there was none: error!
                if index >= len {
                    err!(None)
                }
                v[index]
            }};
        }

        let first = v[index];
        if first >= 128 {
            let w = UTF8_CHAR_WIDTH[first as usize];
            // 2-byte encoding is for codepoints \u{0080} to \u{07ff}
            //        first C2 80        last DF BF
            // 3-byte encoding is for codepoints \u{0800} to \u{ffff}
            //        first E0 A0 80     last EF BF BF
            //   excluding surrogates codepoints \u{d800} to \u{dfff}
            //               ED A0 80 to ED BF BF
            // 4-byte encoding is for codepoints \u{10000} to \u{10ffff}
            //        first F0 90 80 80  last F4 8F BF BF
            //
            // Use the UTF-8 syntax from the RFC
            //
            // https://tools.ietf.org/html/rfc3629
            // UTF8-1 = %x00-7F
            // UTF8-2 = %xC2-DF UTF8-tail
            // UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            //          %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
            // UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            //          %xF4 %x80-8F 2( UTF8-tail )
            match w {
                2 => {
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(1))
                    }
                }
                3 => {
                    match (first, next!()) {
                        (0xE0, 0xA0..=0xBF)
                        | (0xE1..=0xEC, 0x80..=0xBF)
                        | (0xED, 0x80..=0x9F)
                        | (0xEE..=0xEF, 0x80..=0xBF) => {}
                        _ => err!(Some(1)),
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(2))
                    }
                }
                4 => {
                    match (first, next!()) {
                        (0xF0, 0x90..=0xBF) | (0xF1..=0xF3, 0x80..=0xBF) | (0xF4, 0x80..=0x8F) => {}
                        _ => err!(Some(1)),
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(2))
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(3))
                    }
                }
                _ => err!(Some(1)),
            }
            index += 1;
        } else {
            // Ascii case, try to skip forward quickly.
            // When the pointer is aligned, read 2 words of data per iteration
            // until we find a word containing a non-ascii byte.
            if align != usize::max_value() && align.wrapping_sub(index) % usize_bytes == 0 {
                let ptr = v.as_ptr();
                while index < blocks_end {
                    // SAFETY: since `align - index` and `ascii_block_size` are
                    // multiples of `usize_bytes`, `block = ptr.add(index)` is
                    // always aligned with a `usize` so it's safe to dereference
                    // both `block` and `block.offset(1)`.
                    unsafe {
                        let block = ptr.add(index) as *const usize;
                        // break if there is a nonascii byte
                        let zu = contains_nonascii(*block);
                        let zv = contains_nonascii(*block.offset(1));
                        if zu | zv {
                            break;
                        }
                    }
                    index += ascii_block_size;
                }
                // step from the point where the wordwise loop stopped
                while index < len && v[index] < 128 {
                    index += 1;
                }
            } else {
                index += 1;
            }
        }
    }

    Ok(())
}
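Whatever algorithm replaces this would have to preserve the routine's error-reporting contract, which is observable through std::str::from_utf8. A quick illustration (the byte values are mine):

fn main() {
    // 0xE2 0x82 0xAC encodes '€'; truncating it mid-sequence is invalid.
    let bytes = [b'a', b'b', 0xE2, 0x82];
    let err = std::str::from_utf8(&bytes).unwrap_err();
    assert_eq!(err.valid_up_to(), 2);  // "ab" is the longest valid prefix
    assert_eq!(err.error_len(), None); // None: input ended mid-sequence
}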

This doesn't appear to use SIMD anywhere, not even conditionally. But it runs a lot, so it seemed like an interesting candidate for a more efficient algorithm.

Performance improvements

The post "Validating UTF-8 strings using as little as 0.7 cycles per byte" shows about an order of magnitude performance improvement for utf-8 validation, going from 8 cycles per byte to 0.7 cycles per byte.

When passing Rust's validation code through the Godbolt compiler explorer, from_utf8_unchecked compiles to 7 instructions, while from_utf8 compiles to 57 instructions, most of which appear to sit inside a loop. This makes it likely that we'd observe a performance improvement from a SIMD-enabled utf-8 validation algorithm, especially since the gains compound: it's not uncommon for the compiler to parse several million bytes of input in a run, so any improvement would quickly add up.

All examples linked have been compiled with -O -C target-cpu=native.
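The exact godbolt inputs aren't reproduced here; wrappers along these lines (my reconstruction) show the asymmetry:

pub fn checked(v: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Pulls in the full `run_utf8_validation` loop shown above.
    std::str::from_utf8(v)
}

pub unsafe fn unchecked(v: &[u8]) -> &str {
    // No validation: essentially a pointer cast.
    std::str::from_utf8_unchecked(v)
}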

Ecosystem libraries such as serde_json also perform utf-8 validation in several places, so they would likely benefit from performance improvements to Rust's utf-8 validation routines as well.

Implementation

There are two known Rust implementations of Lemire's algorithm available today:

The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this). But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.

I'm not overly familiar with rustc's internals, but it seems we would likely want to keep the current algorithm as a fallback and enable SIMD algorithms through feature detection. The simdjson library ships different algorithms for different architectures, but we could probably start with instructions that are widely available and supported on tier-1 targets (such as AVX2).
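As a rough sketch of what that dispatch could look like (from_utf8_avx2 is hypothetical, and note that is_x86_feature_detected! lives in std, so libcore itself would need a different mechanism, such as compile-time target features):

fn from_utf8_fast(v: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Runtime dispatch: use a vectorized validator where the CPU supports
    // it, otherwise fall back to the existing scalar routine.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // `from_utf8_avx2` is hypothetical; nothing like it exists in
            // the stdlib today, so we fall through to the scalar path.
            // return unsafe { from_utf8_avx2(v) };
        }
    }
    // Portable fallback: the existing scalar validation.
    std::str::from_utf8(v)
}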

These changes wouldn't require an RFC because no APIs would change. The only outcome would be a performance improvement.

Future work

Lemire's post also covers validating ASCII in as little as 0.1 cycles per byte. Rust's current ASCII validation routine checks bytes one at a time, and could likely benefit from similar optimizations.

rust/src/libcore/str/mod.rs

Lines 4136 to 4141 in 2f688ac

pub fn is_ascii(&self) -> bool {
    // We can treat each byte as character here: all multibyte characters
    // start with a byte that is not in the ascii range, so we will stop
    // there already.
    self.bytes().all(|b| b.is_ascii())
}
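Even without SIMD, a word-at-a-time (SWAR) version in the spirit of the validator's ASCII fast path above would likely help. A minimal sketch (names and structure are mine):

fn is_ascii_wordwise(bytes: &[u8]) -> bool {
    // 0x8080...80: the high bit of every byte in a usize.
    let high_bits = usize::max_value() / 0xff * 0x80;
    // SAFETY: any bit pattern is a valid `usize`, and `align_to` only
    // hands out an aligned, in-bounds middle slice.
    let (head, words, tail) = unsafe { bytes.align_to::<usize>() };
    head.iter().all(u8::is_ascii)
        // A word contains a non-ascii byte iff some byte has its high bit set.
        && words.iter().all(|&w| w & high_bits == 0)
        && tail.iter().all(u8::is_ascii)
}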

Speeding this up would have ecosystem implications as well: HTTP headers, for example, must be valid ASCII and are often performance sensitive. If the stdlib sped up ASCII validation, the wider ecosystem would likely benefit too.

Conclusion

In this issue I propose using a SIMD-enabled algorithm for utf-8 validation in rustc. This seems like an interesting avenue to explore, since there's a reasonable chance it might yield a performance improvement for many Rust programs.

I'm somewhat excited to have stumbled upon this, but was also surprised no issue had been filed for it yet. I'm a bit self-conscious posting this since I'm not a rustc compiler engineer, but I hope it proves useful!

cc/ @jonas-schievink @nnethercote

References

Geoff Langdale and Daniel Lemire, "Parsing Gigabytes of JSON per Second" (arXiv)
Daniel Lemire, "Validating UTF-8 strings using as little as 0.7 cycles per byte" (blog post)
