SIMD-enabled utf-8 validation #68455

Introduction

The "Parsing Gigabytes of JSON per second" post (ArXiv - langdale, lemire) proposes a novel approach for parsing JSON that is fast enough that on many systems it moves the bottleneck to the disk and network instead of the parser. This is done through the clever use of SIMD instructions.

Something that stood out to me from the paper is that JSON is required to be valid utf-8, and that the authors came up with new algorithms to validate utf-8 using SIMD instructions that run much faster than conventional approaches.
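To give a flavor of the approach: the core trick is classifying many bytes per instruction instead of one at a time. Below is a minimal sketch of just the ASCII fast path using AVX2 intrinsics (the function name is mine, and this is only the easy part; Lemire's full algorithm also validates multi-byte sequences with vectorized table lookups):

#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn all_ascii_avx2(bytes: &[u8]) -> bool {
    use std::arch::x86_64::*;
    let mut i = 0;
    // ASCII bytes have their high bit clear; `movemask` gathers the high
    // bit of each of the 32 bytes, so it is 0 iff the whole block is ASCII.
    while i + 32 <= bytes.len() {
        // SAFETY: AVX2 is statically enabled by the `cfg` above, and the
        // unaligned load reads 32 in-bounds bytes.
        let mask = unsafe {
            let block = _mm256_loadu_si256(bytes.as_ptr().add(i) as *const __m256i);
            _mm256_movemask_epi8(block)
        };
        if mask != 0 {
            return false;
        }
        i += 32;
    }
    // Finish the tail one byte at a time.
    bytes[i..].iter().all(|&b| b < 128)
}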

Since rustc does a lot of utf-8 validation (each .rs source file needs to be valid utf-8), this got me curious about what rustc currently does. Validation seems to be done by the following routine:

rust/src/libcore/str/mod.rs

Lines 1500 to 1618 in 2f688ac

/// Walks through `v` checking that it's a valid UTF-8 sequence,
/// returning `Ok(())` in that case, or, if it is invalid, `Err(err)`.
#[inline]
fn run_utf8_validation(v: &[u8]) -> Result<(), Utf8Error> {
    let mut index = 0;
    let len = v.len();

    let usize_bytes = mem::size_of::<usize>();
    let ascii_block_size = 2 * usize_bytes;
    let blocks_end = if len >= ascii_block_size { len - ascii_block_size + 1 } else { 0 };
    let align = v.as_ptr().align_offset(usize_bytes);

    while index < len {
        let old_offset = index;
        macro_rules! err {
            ($error_len: expr) => {
                return Err(Utf8Error { valid_up_to: old_offset, error_len: $error_len });
            };
        }

        macro_rules! next {
            () => {{
                index += 1;
                // we needed data, but there was none: error!
                if index >= len {
                    err!(None)
                }
                v[index]
            }};
        }

        let first = v[index];
        if first >= 128 {
            let w = UTF8_CHAR_WIDTH[first as usize];
            // 2-byte encoding is for codepoints \u{0080} to \u{07ff}
            //        first C2 80        last DF BF
            // 3-byte encoding is for codepoints \u{0800} to \u{ffff}
            //        first E0 A0 80     last EF BF BF
            //   excluding surrogates codepoints \u{d800} to \u{dfff}
            //               ED A0 80 to ED BF BF
            // 4-byte encoding is for codepoints \u{10000} to \u{10ffff}
            //        first F0 90 80 80  last F4 8F BF BF
            //
            // Use the UTF-8 syntax from the RFC
            //
            // https://tools.ietf.org/html/rfc3629
            // UTF8-1 = %x00-7F
            // UTF8-2 = %xC2-DF UTF8-tail
            // UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            //          %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
            // UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            //          %xF4 %x80-8F 2( UTF8-tail )
            match w {
                2 => {
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(1))
                    }
                }
                3 => {
                    match (first, next!()) {
                        (0xE0, 0xA0..=0xBF)
                        | (0xE1..=0xEC, 0x80..=0xBF)
                        | (0xED, 0x80..=0x9F)
                        | (0xEE..=0xEF, 0x80..=0xBF) => {}
                        _ => err!(Some(1)),
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(2))
                    }
                }
                4 => {
                    match (first, next!()) {
                        (0xF0, 0x90..=0xBF) | (0xF1..=0xF3, 0x80..=0xBF) | (0xF4, 0x80..=0x8F) => {}
                        _ => err!(Some(1)),
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(2))
                    }
                    if next!() & !CONT_MASK != TAG_CONT_U8 {
                        err!(Some(3))
                    }
                }
                _ => err!(Some(1)),
            }
            index += 1;
        } else {
            // Ascii case, try to skip forward quickly.
            // When the pointer is aligned, read 2 words of data per iteration
            // until we find a word containing a non-ascii byte.
            if align != usize::max_value() && align.wrapping_sub(index) % usize_bytes == 0 {
                let ptr = v.as_ptr();
                while index < blocks_end {
                    // SAFETY: since `align - index` and `ascii_block_size` are
                    // multiples of `usize_bytes`, `block = ptr.add(index)` is
                    // always aligned with a `usize` so it's safe to dereference
                    // both `block` and `block.offset(1)`.
                    unsafe {
                        let block = ptr.add(index) as *const usize;
                        // break if there is a nonascii byte
                        let zu = contains_nonascii(*block);
                        let zv = contains_nonascii(*block.offset(1));
                        if zu | zv {
                            break;
                        }
                    }
                    index += ascii_block_size;
                }
                // step from the point where the wordwise loop stopped
                while index < len && v[index] < 128 {
                    index += 1;
                }
            } else {
                index += 1;
            }
        }
    }

    Ok(())
}
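Whatever algorithm replaces this would have to preserve the routine's error-reporting contract, which is observable through std::str::from_utf8. A quick illustration (the byte values are mine):

fn main() {
    // 0xE2 0x82 0xAC encodes '€'; truncating it mid-sequence is invalid.
    let bytes = [b'a', b'b', 0xE2, 0x82];
    let err = std::str::from_utf8(&bytes).unwrap_err();
    assert_eq!(err.valid_up_to(), 2);  // "ab" is the longest valid prefix
    assert_eq!(err.error_len(), None); // None: input ended mid-sequence
}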

This doesn't appear to use SIMD anywhere, not even conditionally. But it runs a lot, so it seemed like an interesting candidate for a more efficient algorithm.

Performance improvements

The post "Validating UTF-8 strings using as little as 0.7 cycles per byte" shows about an order of magnitude performance improvement for utf-8 validation, going from 8 cycles per byte to 0.7 cycles per byte.

When passing Rust's validation code through the Godbolt compiler explorer, from_utf8_unchecked compiles to 7 instructions, while from_utf8 compiles to 57 instructions, most of which appear to sit inside a loop. This makes it likely that we'd observe a performance improvement from a SIMD-enabled utf-8 validation algorithm, especially since the gains compound: it's not uncommon for the compiler to parse several million bytes of input in a run, so any improvement would quickly add up.

All examples linked have been compiled with -O -C target-cpu=native.
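The exact godbolt inputs aren't reproduced here; wrappers along these lines (my reconstruction) show the asymmetry:

pub fn checked(v: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Pulls in the full `run_utf8_validation` loop shown above.
    std::str::from_utf8(v)
}

pub unsafe fn unchecked(v: &[u8]) -> &str {
    // No validation: essentially a pointer cast.
    std::str::from_utf8_unchecked(v)
}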

Ecosystem libraries such as serde_json also perform utf-8 validation in several places, so they would likely benefit from performance improvements to Rust's utf-8 validation routines as well.

Implementation

There are two known Rust implementations of Lemire's algorithm available today:

The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this). But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.

I'm not overly familiar with rustc's internals, but it seems we would likely want to keep the current algorithm as a fallback and enable SIMD algorithms through feature detection. The simdjson library ships different algorithms for different architectures, but we could probably start with instructions that are widely available and supported on tier-1 targets (such as AVX2).
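As a rough sketch of what that dispatch could look like (from_utf8_avx2 is hypothetical, and note that is_x86_feature_detected! lives in std, so libcore itself would need a different mechanism, such as compile-time target features):

fn from_utf8_fast(v: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Runtime dispatch: use a vectorized validator where the CPU supports
    // it, otherwise fall back to the existing scalar routine.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // `from_utf8_avx2` is hypothetical; nothing like it exists in
            // the stdlib today, so we fall through to the scalar path.
            // return unsafe { from_utf8_avx2(v) };
        }
    }
    // Portable fallback: the existing scalar validation.
    std::str::from_utf8(v)
}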

These changes wouldn't require an RFC because no APIs would change. The only outcome would be a performance improvement.

Future work

Lemire's post also covers validating ASCII in as little as 0.1 cycles per byte. Rust's current ASCII validation routine checks bytes one at a time, and could likely benefit from similar optimizations.

rust/src/libcore/str/mod.rs

Lines 4136 to 4141 in 2f688ac

pub fn is_ascii(&self) -> bool {
    // We can treat each byte as character here: all multibyte characters
    // start with a byte that is not in the ascii range, so we will stop
    // there already.
    self.bytes().all(|b| b.is_ascii())
}
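Even without SIMD, a word-at-a-time (SWAR) version in the spirit of the validator's ASCII fast path above would likely help. A minimal sketch (names and structure are mine):

fn is_ascii_wordwise(bytes: &[u8]) -> bool {
    // 0x8080...80: the high bit of every byte in a usize.
    let high_bits = usize::max_value() / 0xff * 0x80;
    // SAFETY: any bit pattern is a valid `usize`, and `align_to` only
    // hands out an aligned, in-bounds middle slice.
    let (head, words, tail) = unsafe { bytes.align_to::<usize>() };
    head.iter().all(u8::is_ascii)
        // A word contains a non-ascii byte iff some byte has its high bit set.
        && words.iter().all(|&w| w & high_bits == 0)
        && tail.iter().all(u8::is_ascii)
}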

Speeding this up would have ecosystem implications as well: HTTP headers, for example, must be valid ASCII and are often performance sensitive. If the stdlib sped up ASCII validation, the wider ecosystem would likely benefit too.

Conclusion

In this issue I propose using a SIMD-enabled algorithm for utf-8 validation in rustc. This seems like an interesting avenue to explore, since there's a reasonable chance it might yield a performance improvement for many Rust programs.

I'm somewhat excited to have stumbled upon this, but was also surprised no issue had been filed for it yet. I'm a bit self-conscious posting this since I'm not a rustc compiler engineer, but I hope it proves useful!

cc/ @jonas-schievink @nnethercote

References

Geoff Langdale and Daniel Lemire, "Parsing Gigabytes of JSON per Second" (arXiv)
Daniel Lemire, "Validating UTF-8 strings using as little as 0.7 cycles per byte" (blog post)
