Add fast path for ASCII in UTF-8 validation #30740

Merged
merged 2 commits into from
Jan 16, 2016
12 changes: 12 additions & 0 deletions src/libcollectionstest/str.rs
@@ -470,6 +470,18 @@ fn test_is_utf8() {
assert!(from_utf8(&[0xF4, 0x8F, 0xBF, 0xBF]).is_ok());
}

#[test]
fn from_utf8_mostly_ascii() {
// deny invalid bytes embedded in long stretches of ascii
for i in 32..64 {
let mut data = [0; 128];
data[i] = 0xC0;
assert!(from_utf8(&data).is_err());
data[i] = 0xC2;
assert!(from_utf8(&data).is_err());
}
}

#[test]
fn test_is_utf16() {
use rustc_unicode::str::is_utf16;
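The new test only asserts that validation fails. As a minimal sketch against today's stable `std` (not the 2016 in-tree `libcollectionstest` harness), the same inputs can also be used to check where the error is reported. The helper name `first_invalid_offset` is hypothetical and exists only for this illustration.

```rust
use std::str::from_utf8;

/// Hypothetical helper: returns the offset of the first invalid byte,
/// or `None` if `bytes` is valid UTF-8.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    from_utf8(bytes).err().map(|e| e.valid_up_to())
}

fn main() {
    // Same shape as the PR's test: one bad byte inside a long run of NULs.
    for i in 32..64 {
        let mut data = [0u8; 128];
        data[i] = 0xC0; // never valid in UTF-8
        assert_eq!(first_invalid_offset(&data), Some(i));
        data[i] = 0xC2; // lead byte followed by a non-continuation byte
        assert_eq!(first_invalid_offset(&data), Some(i));
    }
}
```

Checking the offset as well as the failure guards against a fast path that skips ahead correctly but reports the wrong `valid_up_to`.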
87 changes: 60 additions & 27 deletions src/libcore/str/mod.rs
@@ -32,6 +32,7 @@ use option::Option::{self, None, Some};
use raw::{Repr, Slice};
use result::Result::{self, Ok, Err};
use slice::{self, SliceExt};
use usize;

pub mod pattern;

@@ -240,7 +241,7 @@ impl Utf8Error {
/// ```
#[stable(feature = "rust1", since = "1.0.0")]
pub fn from_utf8(v: &[u8]) -> Result<&str, Utf8Error> {
try!(run_utf8_validation_iterator(&mut v.iter()));
try!(run_utf8_validation(v));
Ok(unsafe { from_utf8_unchecked(v) })
}

@@ -1074,46 +1075,44 @@ unsafe fn cmp_slice(a: &str, b: &str, len: usize) -> i32 {
}

/*
Section: Misc
Section: UTF-8 validation
*/

// use truncation to fit u64 into usize
const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
Member

Below, you chose to define the constant inside the method; here you did not. Is there a method to these decisions? FWIW, even if you put it in the method, I'd still vote to leave it as a named constant for readability.

Member Author

Didn't think much about it, but the second constant would be in the function by necessity so that it's not defined too far from its use.
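For context on the constant under discussion, here is a minimal standalone sketch of the mask trick, written for current stable Rust rather than the 2016 libcore tree. The names mirror the diff, but the `main` function and the sample words are assumptions made for illustration.

```rust
use std::mem::size_of;

// A u64 pattern with the high bit of every byte set; the `as` cast
// truncates it to the low four bytes on 32-bit targets.
const NONASCII_MASK: usize = 0x8080_8080_8080_8080u64 as usize;

/// Returns `true` if any byte of the word `x` is >= 128.
fn contains_nonascii(x: usize) -> bool {
    (x & NONASCII_MASK) != 0
}

fn main() {
    // A word made only of ASCII bytes has no high bits set.
    let ascii = [b'a'; size_of::<usize>()];
    assert!(!contains_nonascii(usize::from_ne_bytes(ascii)));

    // One non-ASCII byte anywhere in the word trips the mask.
    let mut mixed = ascii;
    mixed[1] = 0xC2;
    assert!(contains_nonascii(usize::from_ne_bytes(mixed)));
}
```

Because the mask looks only at bit 7 of each byte, the check does not depend on byte order within the word.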


/// Return `true` if any byte in the word `x` is nonascii (>= 128).
#[inline]
fn contains_nonascii(x: usize) -> bool {
(x & NONASCII_MASK) != 0
}

/// Walk through `iter` checking that it's a valid UTF-8 sequence,
/// returning `true` in that case, or, if it is invalid, `false` with
/// `iter` reset such that it is pointing at the first byte in the
/// invalid sequence.
#[inline(always)]
fn run_utf8_validation_iterator(iter: &mut slice::Iter<u8>)
-> Result<(), Utf8Error> {
let whole = iter.as_slice();
loop {
// save the current thing we're pointing at.
let old = iter.clone();

// restore the iterator we had at the start of this codepoint.
fn run_utf8_validation(v: &[u8]) -> Result<(), Utf8Error> {
let mut offset = 0;
let len = v.len();
while offset < len {
let old_offset = offset;
macro_rules! err { () => {{
*iter = old.clone();
return Err(Utf8Error {
valid_up_to: whole.len() - iter.as_slice().len()
valid_up_to: old_offset
})
}}}

macro_rules! next { () => {
match iter.next() {
Some(a) => *a,
// we needed data, but there was none: error!
None => err!(),
macro_rules! next { () => {{
offset += 1;
// we needed data, but there was none: error!
if offset >= len {
err!()
}
}}

let first = match iter.next() {
Some(&b) => b,
// we're at the end of the iterator and a codepoint
// boundary at the same time, so this string is valid.
None => return Ok(())
};
v[offset]
}}}

// ASCII characters are always valid, so only large
// bytes need more examination.
let first = v[offset];
if first >= 128 {
let w = UTF8_CHAR_WIDTH[first as usize];
let second = next!();
@@ -1156,8 +1155,42 @@ fn run_utf8_validation_iterator(iter: &mut slice::Iter<u8>)
}
_ => err!()
}
offset += 1;
} else {
// Ascii case, try to skip forward quickly.
// When the pointer is aligned, read 2 words of data per iteration
// until we find a word containing a non-ascii byte.
const BYTES_PER_ITERATION: usize = 2 * usize::BYTES;
let ptr = v.as_ptr();
let align = (ptr as usize + offset) & (usize::BYTES - 1);
if align == 0 {
if len >= BYTES_PER_ITERATION {
while offset <= len - BYTES_PER_ITERATION {
unsafe {
let u = *(ptr.offset(offset as isize) as *const usize);
let v = *(ptr.offset((offset + usize::BYTES) as isize) as *const usize);

// break if there is a nonascii byte
let zu = contains_nonascii(u);
let zv = contains_nonascii(v);
if zu || zv {
break;
}
}
offset += BYTES_PER_ITERATION;
}
}
// step from the point where the wordwise loop stopped
while offset < len && v[offset] < 128 {
offset += 1;
}
} else {
offset += 1;
}
}
}

Ok(())
}

// https://tools.ietf.org/html/rfc3629
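The ASCII branch above can be approximated in safe Rust. The sketch below is not the merged libcore code: it swaps the aligned raw-pointer reads for `usize::from_ne_bytes` on sub-slices, which has no alignment requirement, so the `align == 0` check disappears. The function name `ascii_prefix_len` is hypothetical; it returns how far the skip would advance when started at offset 0.

```rust
use std::convert::TryInto;
use std::mem::size_of;

const WORD: usize = size_of::<usize>();
// High bit of every byte; truncated by the cast on 32-bit targets.
const NONASCII_MASK: usize = 0x8080_8080_8080_8080u64 as usize;

/// Hypothetical helper: length of the leading all-ASCII run of `v`.
fn ascii_prefix_len(v: &[u8]) -> usize {
    let mut offset = 0;
    // Wordwise loop: examine two words per iteration, as in the diff.
    while offset + 2 * WORD <= v.len() {
        let a = usize::from_ne_bytes(v[offset..offset + WORD].try_into().unwrap());
        let b = usize::from_ne_bytes(v[offset + WORD..offset + 2 * WORD].try_into().unwrap());
        // Stop as soon as either word holds a byte >= 128.
        if (a | b) & NONASCII_MASK != 0 {
            break;
        }
        offset += 2 * WORD;
    }
    // Bytewise loop: step from where the wordwise loop stopped.
    while offset < v.len() && v[offset] < 128 {
        offset += 1;
    }
    offset
}

fn main() {
    let mut data = [0u8; 128];
    assert_eq!(ascii_prefix_len(&data), 128);
    data[40] = 0xC0;
    assert_eq!(ascii_prefix_len(&data), 40);
}
```

The wordwise loop only needs to find the block that contains a non-ASCII byte; the bytewise loop then locates the exact position, which is why the result does not depend on the word size.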