Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve/modernize the handling of utf16 data in std::str #12317

Closed
wants to merge 6 commits into from

Conversation

huonw
Copy link
Member

@huonw huonw commented Feb 16, 2014

Iterators! Use them (in is_utf16), create them (in utf16_items).

Handle errors gracefully (from_utf16_lossy) and from_utf16 returning Option<~str> instead of failing.

Add a pile of tests.

@huonw
Copy link
Member Author

huonw commented Feb 16, 2014

(This doesn't pass tests yet... will fix tomorrow)

@SimonSapin
Copy link
Contributor

I reviewed 797396b791f4c2b78e2d3ef4e75aca5d6801e594 and preceding commits. Looks good to me.

@SimonSapin
Copy link
Contributor

To align with how from_utf8 works, from_utf16 should return an Option rather than failing. That doesn’t need to be in this PR, though.

@huonw
Copy link
Member Author

huonw commented Feb 16, 2014

Ready for review: fixed tests & made from_utf16 return an Option.

/// use std::str;
/// use std::str::{ScalarValue, LoneSurrogate};
///
/// //
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't quite have an equivalent for a utf8 iterator like this, so perhaps this shouldn't be a public api just yet?

I may be missing some use cases though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but it was previously a public internal iterator (http://static.rust-lang.org/doc/master/std/str/fn.utf16_chars.html), so this change is only modernizing & adding the error-resistance, not adding something completely new. (Don't know if that's at all relevant to a decision...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was unaware of that!

Most of the tests are randomly generated with Python 3 and rely on it's
UTF-16be encoder/decoder being correct.
This replaces the iterator with one that handles lone surrogates
gracefully and uses that to implement `from_utf16_lossy` which replaces
invalid `u16`s with U+FFFD.
The rest of the codebase is moving toward avoiding `fail!` so we do it
here too!
@huonw
Copy link
Member Author

huonw commented Feb 17, 2014

Failed on windows because Windows actually uses UTF-16 for APIs (:cry:) and things generally want ~str rather than Option<~str>; could someone just double check the extra .expects I've added to std::os and native::io::file in the last commit?

@huonw
Copy link
Member Author

huonw commented Feb 18, 2014

It looks like some places relied on the old behaviour of utf8_chars which stopped on NUL.

@SimonSapin
Copy link
Contributor

Ideally, NUL-termination should be orthogonal to decoding 16 bit units into code points. Could we have a separate fn truncate_at_nul<'a>(&'a [u16]) -> &'a [u16], or is this code performance-sensitive enough that this needs to be done in the same pass as decoding? In any case, decoding without truncation should also be available (and be the default).

@huonw
Copy link
Member Author

huonw commented Feb 18, 2014

Yeah, of course; I was just explaining the failure. :)

@huonw
Copy link
Member Author

huonw commented Feb 18, 2014

I've added the truncation function.

Many of the functions interacting with Windows APIs allocate a vector of
0's and do not retrieve a length directly from the API call, and so need
to be sure to remove the unmodified junk at the end of the vector.
bors added a commit that referenced this pull request Feb 19, 2014
Iterators! Use them (in `is_utf16`), create them (in `utf16_items`).

Handle errors gracefully (`from_utf16_lossy`) and `from_utf16` returning `Option<~str>` instead of failing.

Add a pile of tests.
@bors bors closed this Feb 19, 2014
@huonw huonw deleted the utf16 branch June 27, 2014 06:48
flip1995 pushed a commit to flip1995/rust that referenced this pull request Feb 26, 2024
Add check for 'in_external_macro' and 'is_from_proc_macro' inside

changelog: Fix rust-lang#12291 #[tracing::instrument()] triggers infinite_loop

Added an in_external_macro and is_from_proc_macro check to the [infinite_loop] lint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants