Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use iterators for std::str::is_utf8 & part of from_utf8_lossy. #12314

Merged
merged 2 commits into from
Feb 18, 2014

Conversation

huonw
Copy link
Member

@huonw huonw commented Feb 16, 2014

See the commit messages for more details, but this makes std::str::is_utf8 slightly faster and 100% non-unsafe and uses a similar thing to make the first scan of from_utf8_lossy 100% safe & faster.

@huonw
Copy link
Member Author

huonw commented Feb 16, 2014

cc @alexcrichton @kballard

Also:

Warning (don't merge)

This was fixed, see #12314 (comment)

The commit message of the last commit is a lie, currently I see

Before:

test str::bench::from_utf8_lossy_100_ascii            ... bench:       145 ns/iter (+/- 11)
test str::bench::from_utf8_lossy_100_invalid          ... bench:       451 ns/iter (+/- 169)
test str::bench::from_utf8_lossy_100_multibyte        ... bench:       137 ns/iter (+/- 9)
test str::bench::from_utf8_lossy_invalid              ... bench:       130 ns/iter (+/- 10)

After:

test str::bench::from_utf8_lossy_100_ascii            ... bench:       124 ns/iter (+/- 9)
test str::bench::from_utf8_lossy_100_invalid          ... bench:       802 ns/iter (+/- 25)
test str::bench::from_utf8_lossy_100_multibyte        ... bench:       118 ns/iter (+/- 6)
test str::bench::from_utf8_lossy_invalid              ... bench:       134 ns/iter (+/- 8)

Which makes no sense at all, since I'm just changing the internal details of a function called exactly once, which exits immediately for the 100_invalid case (the first byte is invalid).

Could someone(s) run the following and tell me the results:

git fetch git@github.com:huonw/rust.git is_utf8_iter

git checkout FETCH_HEAD
echo new
make check-stage1-std TESTNAME=str::bench::from_utf8_lossy

git checkout HEAD^ 
echo old
make check-stage1-std NO_REBUILD=1 TESTNAME=str::bench::from_utf8_lossy

@lilyball
Copy link
Contributor

Ack, I think makefiles have issues. The second make gives me "Nothing to be done for `all'"

@lilyball
Copy link
Contributor

By adding a rm x86_64-apple-darwin/stage1/test/stdtest-* I was able to rebuild after the second checkout, and that got me these numbers:

new:
running 4 tests
test str::bench::from_utf8_lossy_100_ascii                      ... bench:        97 ns/iter (+/- 10)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       369 ns/iter (+/- 18)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       103 ns/iter (+/- 3)
test str::bench::from_utf8_lossy_invalid                        ... bench:       102 ns/iter (+/- 8)

old:
running 4 tests
test str::bench::from_utf8_lossy_100_ascii                      ... bench:       112 ns/iter (+/- 2)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       384 ns/iter (+/- 17)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:       103 ns/iter (+/- 3)
test str::bench::from_utf8_lossy_invalid                        ... bench:       105 ns/iter (+/- 4)

but I'm not entirely convinced at the moment it's building correctly, because a subsequent make clean; make check-stage1-std errored out due to no such file libuv.a.

@huonw
Copy link
Member Author

huonw commented Feb 17, 2014

That looks nicer! It doesn't explain what's going on with me... And I don't understand why 100_invalid is faster (it spends very little time in the first_non_utf8_index) but 100_multibyte doesn't change at all.

Could someone else have a go too?

@lilyball
Copy link
Contributor

@huonw What platform/architecture are you testing on? Maybe it's some sort of architecture-dependent issue?

@huonw
Copy link
Member Author

huonw commented Feb 17, 2014

x86-64 linux with a i7-3517u. Another thing: some tiny tweaks (like changing the names of the functions) gives results more similar to yours.

@lilyball
Copy link
Contributor

Names? Really? How strange. What if you simply make clean && make, maybe rebuilding alone is sufficient to change the results?

@huonw
Copy link
Member Author

huonw commented Feb 17, 2014

Yeah, changing the name to first_non_utf8_index2 sometimes made it fast (I think some other names had to change too)... I'm getting the feeling that it is a broken build set up. I'm away from my Rust-computer at the moment so I can't test.

That said, I'll just iterator-ise the whole from_utf8_lossy function and hopefully that will have a similar effect as it does on is_utf8.

@huonw
Copy link
Member Author

huonw commented Feb 17, 2014

So anyway with a rebase and a make clean, the weird artifacts have disappeared and I'm now seeing (i.e. faster in every case but the 100_invalid where it is equal, which is what I would expect):

str::bench::from_utf8_lossy_100_ascii        ... bench:     119 ns/iter (+/- 8)
str::bench::from_utf8_lossy_100_invalid      ... bench:     454 ns/iter (+/- 16)
str::bench::from_utf8_lossy_100_multibyte    ... bench:     116 ns/iter (+/- 9)
str::bench::from_utf8_lossy_invalid          ... bench:     119 ns/iter (+/- 9)

I think this is ready for merging.

@lilyball
Copy link
Contributor

Code looks good to me. I'll run benchmarks myself shortly.

@lilyball
Copy link
Contributor

Latest benchmark:

test str::bench::from_utf8_lossy_100_ascii                      ... bench:        90 ns/iter (+/- 1)
test str::bench::from_utf8_lossy_100_invalid                    ... bench:       386 ns/iter (+/- 5)
test str::bench::from_utf8_lossy_100_multibyte                  ... bench:        85 ns/iter (+/- 2)
test str::bench::from_utf8_lossy_invalid                        ... bench:       106 ns/iter (+/- 1)

@huonw
Copy link
Member Author

huonw commented Feb 18, 2014

(I just realised I've left outdated comments in the source, I've removed the r+... don't r+ again for now, please.)

This uses a vector iterator to avoid the necessity for unsafe indexing,
and makes this function slightly faster. Unfortunately rust-lang#11751 means that
the iterator comes with repeated `null` checks which means the
pure-ASCII case still has room for significant improvement (and the
other cases too, but it's most significant for just ASCII).

Before:

    is_utf8_100_ascii             ... bench:       143 ns/iter (+/- 6)
    is_utf8_100_multibyte         ... bench:       134 ns/iter (+/- 4)

After:

    is_utf8_100_ascii             ... bench:       123 ns/iter (+/- 4)
    is_utf8_100_multibyte         ... bench:       115 ns/iter (+/- 5)
This makes it very slightly faster, especially when the string is valid
UTF-8, and completely removes the use of `unsafe` from the first half.

Before:

    from_utf8_lossy_100_ascii              ... bench:       151 ns/iter (+/- 17)
    from_utf8_lossy_100_invalid            ... bench:       447 ns/iter (+/- 33)
    from_utf8_lossy_100_multibyte          ... bench:       135 ns/iter (+/- 4)
    from_utf8_lossy_invalid                ... bench:       124 ns/iter (+/- 10

After:

    from_utf8_lossy_100_ascii              ... bench:       119 ns/iter (+/- 8)
    from_utf8_lossy_100_invalid            ... bench:       454 ns/iter (+/- 16)
    from_utf8_lossy_100_multibyte          ... bench:       116 ns/iter (+/- 9)
    from_utf8_lossy_invalid                ... bench:       119 ns/iter (+/- 9)
bors added a commit that referenced this pull request Feb 18, 2014
See the commit messages for more details, but this makes `std::str::is_utf8` slightly faster and 100% non-`unsafe` and uses a similar thing to make the first scan of `from_utf8_lossy` 100% safe & faster.
@bors bors closed this Feb 18, 2014
@bors bors merged commit a39056e into rust-lang:master Feb 18, 2014
@huonw huonw deleted the is_utf8_iter branch February 25, 2014 05:26
bors added a commit to rust-lang-ci/rust that referenced this pull request Jul 25, 2022
flip1995 pushed a commit to flip1995/rust that referenced this pull request Feb 26, 2024
…=flip1995

Default test output conflict handling to error

oli-obk/ui_test#175 got rid of the `bool` that controlled the default handling so we need to specify it ourselves

r? `@flip1995`

changelog: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants