Use iterators for std::str::is_utf8 & part of from_utf8_lossy. #12314

huonw · 2014-02-16T08:46:22Z

See the commit messages for more details, but this makes std::str::is_utf8 slightly faster and 100% non-unsafe and uses a similar thing to make the first scan of from_utf8_lossy 100% safe & faster.

huonw · 2014-02-16T08:53:11Z

cc @alexcrichton @kballard

Also:

Warning (don't merge)

This was fixed, see #12314 (comment)

The commit message of the last commit is a lie, currently I see

Before:

test str::bench::from_utf8_lossy_100_ascii            ... bench:       145 ns/iter (+/- 11)
test str::bench::from_utf8_lossy_100_invalid          ... bench:       451 ns/iter (+/- 169)
test str::bench::from_utf8_lossy_100_multibyte        ... bench:       137 ns/iter (+/- 9)
test str::bench::from_utf8_lossy_invalid              ... bench:       130 ns/iter (+/- 10)

After:

test str::bench::from_utf8_lossy_100_ascii            ... bench:       124 ns/iter (+/- 9)
test str::bench::from_utf8_lossy_100_invalid          ... bench:       802 ns/iter (+/- 25)
test str::bench::from_utf8_lossy_100_multibyte        ... bench:       118 ns/iter (+/- 6)
test str::bench::from_utf8_lossy_invalid              ... bench:       134 ns/iter (+/- 8)

Which makes no sense at all, since I'm just changing the internal details of a function called exactly once, which exits immediately for the 100_invalid case (the first byte is invalid).

Could someone(s) run the following and tell me the results:

git fetch git@github.com:huonw/rust.git is_utf8_iter

git checkout FETCH_HEAD
echo new
make check-stage1-std TESTNAME=str::bench::from_utf8_lossy

git checkout HEAD^ 
echo old
make check-stage1-std NO_REBUILD=1 TESTNAME=str::bench::from_utf8_lossy

lilyball · 2014-02-16T19:36:55Z

Ack, I think makefiles have issues. The second make gives me "Nothing to be done for `all'"

lilyball · 2014-02-16T19:48:13Z

By adding a rm x86_64-apple-darwin/stage1/test/stdtest-* I was able to rebuild after the second checkout, and that got me these numbers:

new:
running 4 tests
test str::bench::from_utf8_lossy_100_ascii                      ... bench:        97 ns/iter (+/- 10)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       369 ns/iter (+/- 18)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       103 ns/iter (+/- 3)
test str::bench::from_utf8_lossy_invalid                        ... bench:       102 ns/iter (+/- 8)

old:
running 4 tests
test str::bench::from_utf8_lossy_100_ascii                      ... bench:       112 ns/iter (+/- 2)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       384 ns/iter (+/- 17)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       103 ns/iter (+/- 3)
test str::bench::from_utf8_lossy_invalid                        ... bench:       105 ns/iter (+/- 4)

but I'm not entirely convinced at the moment it's building correctly, because a subsequent make clean; make check-stage1-std errored out due to no such file libuv.a.

huonw · 2014-02-17T00:16:30Z

That looks nicer! It doesn't explain what's going on with me... And I don't understand why 100_invalid is faster (it spends very little time in the first_non_utf8_index) but 100_multibyte doesn't change at all.

Could someone else have a go too?

lilyball · 2014-02-17T00:17:50Z

@huonw What platform/architecture are you testing on? Maybe it's some sort of architecture-dependent issue?

huonw · 2014-02-17T00:20:36Z

x86-64 linux with a i7-3517u. Another thing: some tiny tweaks (like changing the names of the functions) gives results more similar to yours.

lilyball · 2014-02-17T00:21:34Z

Names? Really? How strange. What if you simply make clean && make, maybe rebuilding alone is sufficient to change the results?

huonw · 2014-02-17T00:23:35Z

Yeah, changing the name to first_non_utf8_index2 sometimes made it fast (I think some other names had to change too)... I'm getting the feeling that it is a broken build set up. I'm away from my Rust-computer at the moment so I can't test.

That said, I'll just iterator-ise the whole from_utf8_lossy function and hopefully that will have a similar effect as it does on is_utf8.

huonw · 2014-02-17T12:49:08Z

So anyway with a rebase and a make clean, the weird artifacts have disappeared and I'm now seeing (i.e. faster in every case but the 100_invalid where it is equal, which is what I would expect):

str::bench::from_utf8_lossy_100_ascii        ... bench:     119 ns/iter (+/- 8)
str::bench::from_utf8_lossy_100_invalid      ... bench:     454 ns/iter (+/- 16)
str::bench::from_utf8_lossy_100_multibyte    ... bench:     116 ns/iter (+/- 9)
str::bench::from_utf8_lossy_invalid          ... bench:     119 ns/iter (+/- 9)

I think this is ready for merging.

lilyball · 2014-02-17T18:14:50Z

Code looks good to me. I'll run benchmarks myself shortly.

lilyball · 2014-02-17T18:56:03Z

Latest benchmark:

test str::bench::from_utf8_lossy_100_ascii                      ... bench:        90 ns/iter (+/- 1)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       386 ns/iter (+/- 5)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:        85 ns/iter (+/- 2)
test str::bench::from_utf8_lossy_invalid                        ... bench:       106 ns/iter (+/- 1)

huonw · 2014-02-18T00:22:19Z

(I just realised I've left outdated comments in the source, I've removed the r+... don't r+ again for now, please.)

This uses a vector iterator to avoid the necessity for unsafe indexing, and makes this function slightly faster. Unfortunately rust-lang#11751 means that the iterator comes with repeated `null` checks which means the pure-ASCII case still has room for significant improvement (and the other cases too, but it's most significant for just ASCII). Before: is_utf8_100_ascii ... bench: 143 ns/iter (+/- 6) is_utf8_100_multibyte ... bench: 134 ns/iter (+/- 4) After: is_utf8_100_ascii ... bench: 123 ns/iter (+/- 4) is_utf8_100_multibyte ... bench: 115 ns/iter (+/- 5)

This makes it very slightly faster, especially when the string is valid UTF-8, and completely removes the use of `unsafe` from the first half. Before: from_utf8_lossy_100_ascii ... bench: 151 ns/iter (+/- 17) from_utf8_lossy_100_invalid ... bench: 447 ns/iter (+/- 33) from_utf8_lossy_100_multibyte ... bench: 135 ns/iter (+/- 4) from_utf8_lossy_invalid ... bench: 124 ns/iter (+/- 10 After: from_utf8_lossy_100_ascii ... bench: 119 ns/iter (+/- 8) from_utf8_lossy_100_invalid ... bench: 454 ns/iter (+/- 16) from_utf8_lossy_100_multibyte ... bench: 116 ns/iter (+/- 9) from_utf8_lossy_invalid ... bench: 119 ns/iter (+/- 9)

See the commit messages for more details, but this makes `std::str::is_utf8` slightly faster and 100% non-`unsafe` and uses a similar thing to make the first scan of `from_utf8_lossy` 100% safe & faster.

…nas-schievink minor: simplify

…=flip1995 Default test output conflict handling to error oli-obk/ui_test#175 got rid of the `bool` that controlled the default handling so we need to specify it ourselves r? `@flip1995` changelog: none

huonw mentioned this pull request Feb 16, 2014

std::str: optimize is_utf8 & first_non_utf8_index. #12241

Closed

huonw added 2 commits February 18, 2014 21:55

bors closed this Feb 18, 2014

bors merged commit a39056e into rust-lang:master Feb 18, 2014

huonw deleted the is_utf8_iter branch February 25, 2014 05:26

bors added a commit to rust-lang-ci/rust that referenced this pull request Jul 25, 2022

Auto merge of rust-lang#12314 - jonas-schievink:proc-macro-load, r=jo…

eba26af

…nas-schievink minor: simplify

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use iterators for std::str::is_utf8 & part of from_utf8_lossy. #12314

Use iterators for std::str::is_utf8 & part of from_utf8_lossy. #12314

huonw commented Feb 16, 2014

huonw commented Feb 16, 2014

lilyball commented Feb 16, 2014

lilyball commented Feb 16, 2014

huonw commented Feb 17, 2014

lilyball commented Feb 17, 2014

huonw commented Feb 17, 2014

lilyball commented Feb 17, 2014

huonw commented Feb 17, 2014

huonw commented Feb 17, 2014

lilyball commented Feb 17, 2014

lilyball commented Feb 17, 2014

huonw commented Feb 18, 2014

Use iterators for std::str::is_utf8 & part of from_utf8_lossy. #12314

Use iterators for std::str::is_utf8 & part of from_utf8_lossy. #12314

Conversation

huonw commented Feb 16, 2014

huonw commented Feb 16, 2014

Warning (don't merge)

This was fixed, see #12314 (comment)

lilyball commented Feb 16, 2014

lilyball commented Feb 16, 2014

huonw commented Feb 17, 2014

lilyball commented Feb 17, 2014

huonw commented Feb 17, 2014

lilyball commented Feb 17, 2014

huonw commented Feb 17, 2014

huonw commented Feb 17, 2014

lilyball commented Feb 17, 2014

lilyball commented Feb 17, 2014

huonw commented Feb 18, 2014