Improve/modernize the handling of utf16 data in std::str #12317

huonw · 2014-02-16T13:12:35Z

Iterators! Use them (in is_utf16), create them (in utf16_items).

Handle errors gracefully: add from_utf16_lossy, and make from_utf16 return Option<~str> instead of failing.

Add a pile of tests.
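
For readers without a 2014 toolchain, the strict/lossy split described above can be sketched with the functions that current Rust exposes on `String`. This is not the `std::str` API from this PR: `~str` no longer exists, and today's `String::from_utf16` returns a `Result` rather than an `Option`. The wrapper names `decode_strict` and `decode_lossy` are hypothetical and exist only for illustration.

```rust
/// Strict decoding: invalid input is reported as `None` instead of aborting.
/// (Hypothetical wrapper; `.ok()` turns today's `Result` into the `Option`
/// shape discussed in this PR.)
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: every unpaired surrogate becomes U+FFFD.
/// (Hypothetical wrapper around the current std function.)
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

With the input `[0x0061, 0xD834, 0x0062]`, where `0xD834` is a leading surrogate with no trailing partner, the strict version yields `None` and the lossy version yields `"a\u{FFFD}b"`.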

huonw · 2014-02-16T14:26:35Z

(This doesn't pass tests yet... will fix tomorrow)

SimonSapin · 2014-02-16T15:15:14Z

I reviewed 797396b791f4c2b78e2d3ef4e75aca5d6801e594 and preceding commits. Looks good to me.

SimonSapin · 2014-02-16T19:50:30Z

To align with how from_utf8 works, from_utf16 should return an Option rather than failing. That doesn’t need to be in this PR, though.

huonw · 2014-02-16T23:20:29Z

Ready for review: fixed tests & made from_utf16 return an Option.

alexcrichton · 2014-02-16T23:25:08Z

src/libstd/str.rs

+/// use std::str;
+/// use std::str::{ScalarValue, LoneSurrogate};
+///
+/// // 


We don't quite have an equivalent for a utf8 iterator like this, so perhaps this shouldn't be a public api just yet?

I may be missing some use cases though.

Not sure, but it was previously a public internal iterator (http://static.rust-lang.org/doc/master/std/str/fn.utf16_chars.html), so this change is only modernizing & adding the error-resistance, not adding something completely new. (Don't know if that's at all relevant to a decision...)

Ah, I was unaware of that!

huonw · 2014-02-17T13:03:45Z

Failed on Windows because Windows actually uses UTF-16 for its APIs (:cry:) and callers generally want ~str rather than Option<~str>. Could someone double check the extra .expect calls I've added to std::os and native::io::file in the last commit?

huonw · 2014-02-18T08:14:59Z

It looks like some places relied on the old behaviour of utf16_chars, which stopped at the first NUL.

SimonSapin · 2014-02-18T08:50:12Z

Ideally, NUL-termination should be orthogonal to decoding 16 bit units into code points. Could we have a separate fn truncate_at_nul<'a>(&'a [u16]) -> &'a [u16], or is this code performance-sensitive enough that this needs to be done in the same pass as decoding? In any case, decoding without truncation should also be available (and be the default).
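
A minimal sketch of the separation proposed here, written in current Rust rather than the 2014 dialect. `truncate_at_nul` uses the signature from the comment above (with the lifetime elided); `decode_nul_terminated` is a hypothetical helper that shows the two steps composed, and it relies on today's `String::from_utf16`, not on the function from this PR.

```rust
/// Returns the prefix of `v` up to, but not including, the first 0 unit.
/// If `v` contains no 0 unit, the whole slice is returned unchanged.
fn truncate_at_nul(v: &[u16]) -> &[u16] {
    match v.iter().position(|&unit| unit == 0) {
        Some(i) => &v[..i],
        None => v,
    }
}

/// Hypothetical helper: truncation is a separate first step, and decoding
/// stays a plain strict decode with no knowledge of NUL termination.
fn decode_nul_terminated(buf: &[u16]) -> Option<String> {
    String::from_utf16(truncate_at_nul(buf)).ok()
}
```

Truncation only borrows a sub-slice, so keeping it as a separate step costs one extra scan of the buffer and no allocation.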

huonw · 2014-02-18T09:21:20Z

Yeah, of course; I was just explaining the failure. :)

huonw · 2014-02-18T12:48:02Z

I've added the truncation function.

alexcrichton reviewed Feb 16, 2014

huonw added 5 commits February 17, 2014 23:53

std: iteratize str::is_utf16 & add tests.

493a4b6

Most of the tests are randomly generated with Python 3 and rely on its UTF-16be encoder/decoder being correct.

std: convert str::from_utf16 to an external iterator.

b7656d0

Fixes rust-lang#12316.

str: provide lossy UTF-16 support.

a96cea4

This replaces the iterator with one that handles lone surrogates gracefully and uses that to implement `from_utf16_lossy` which replaces invalid `u16`s with U+FFFD.

std: decode even numbered non-BMP planes in the UTF-16 decoder.

35b1b62

Fixes rust-lang#12318.

std: make str::from_utf16 return an Option.

4f841ee

The rest of the codebase is moving toward avoiding `fail!` so we do it here too!

str: add a function for truncating a vector of u16 at NUL.

c9b4538

Many of the functions interacting with Windows APIs allocate a vector of 0's and do not retrieve a length directly from the API call, and so need to be sure to remove the unmodified junk at the end of the vector.
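
The iterator-based design described in the commits above can be sketched as follows, in current Rust. This is not the code from the PR. The variant names `ScalarValue` and `LoneSurrogate` come from the diff excerpt quoted in the review, and the function names `utf16_items`, `is_utf16` and `from_utf16_lossy` from the PR description; the enum name `Utf16Item`, the struct name `Utf16Items` and all of the bodies are assumptions made for illustration.

```rust
/// One decoded item: a valid scalar value or an unpaired surrogate.
/// (Enum name is hypothetical; variant names follow the quoted diff.)
#[derive(Debug, PartialEq, Clone, Copy)]
enum Utf16Item {
    ScalarValue(char),
    LoneSurrogate(u16),
}

/// External iterator over the items of a `u16` slice (hypothetical name).
struct Utf16Items<'a> {
    units: &'a [u16],
    pos: usize,
}

fn utf16_items(units: &[u16]) -> Utf16Items<'_> {
    Utf16Items { units, pos: 0 }
}

impl<'a> Iterator for Utf16Items<'a> {
    type Item = Utf16Item;

    fn next(&mut self) -> Option<Utf16Item> {
        let u = *self.units.get(self.pos)?;
        self.pos += 1;

        // Not a surrogate: the unit is itself a BMP scalar value.
        if !(0xD800..=0xDFFF).contains(&u) {
            return Some(Utf16Item::ScalarValue(char::from_u32(u as u32).unwrap()));
        }
        // A trailing surrogate with no leading surrogate before it.
        if u >= 0xDC00 {
            return Some(Utf16Item::LoneSurrogate(u));
        }
        // Leading surrogate: valid only if a trailing surrogate follows.
        match self.units.get(self.pos) {
            Some(&u2) if (0xDC00..=0xDFFF).contains(&u2) => {
                self.pos += 1;
                // Combine the two 10-bit payloads and add the 0x10000 offset.
                let c = 0x10000 + (((u - 0xD800) as u32) << 10) + (u2 - 0xDC00) as u32;
                Some(Utf16Item::ScalarValue(char::from_u32(c).unwrap()))
            }
            // Leave the next unit in place so it is decoded on its own.
            _ => Some(Utf16Item::LoneSurrogate(u)),
        }
    }
}

/// Valid UTF-16 is input in which no item is a lone surrogate.
fn is_utf16(units: &[u16]) -> bool {
    utf16_items(units).all(|item| matches!(item, Utf16Item::ScalarValue(_)))
}

/// Lossy conversion: lone surrogates become U+FFFD.
fn from_utf16_lossy(units: &[u16]) -> String {
    utf16_items(units)
        .map(|item| match item {
            Utf16Item::ScalarValue(c) => c,
            Utf16Item::LoneSurrogate(_) => '\u{FFFD}',
        })
        .collect()
}
```

Building both `is_utf16` and `from_utf16_lossy` on the same iterator keeps the surrogate logic in one place. As a spot check on a non-BMP plane, the pair `[0xD840, 0xDC00]` decodes to U+20000, the first code point of plane 2.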

bors closed this Feb 19, 2014

huonw deleted the utf16 branch June 27, 2014 06:48
