-
Notifications
You must be signed in to change notification settings - Fork 13.4k
move Unicode functionality from libcore/collections/regex into libunicode ; add width() method for char ; update libunicode tables for Unicode 7.0 #15283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -355,6 +355,13 @@ pub fn len_utf8_bytes(c: char) -> uint { | |||
} | |||
} | |||
|
|||
/// Returns the displayed width of this `char`, or `None` if this is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Displayed width" in what? Pixels? Microsantas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should just not have this free function, btw, and only have the method. UFCS is close enough, and Char::width
is nice enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch on the units.
I'll go ahead and pull char::width and just leave the method.
How about 'A' (Ambiguous) characters? |
This implementation returns UAX#11, §5, makes recommendations for three situations.
Case 3 seems the closest to our purposes in providing such a function in the standard library, and case 1 the furthest. Absent explicit locale support, it seems to me that treating 'A' as narrow is the correct decision. This behavior is also consistent with the notes in Markus Kuhn's implementation. Above the alternate
We could provide an alternate function, e.g., |
Agreed. I prefer a flag parameter (e.g. |
OK, I've updated to include support for CJK. Also added some additional updates in
|
I'm a little worried about adding more and more unicode-ness into libcore, especially when you think of libcore being the core of all rust code in existence. It seems a little odd to put That being said, this definitely seems like a useful property to have, and it should likely have a home somewhere. Additionally, there's already a bunch of other unicode stuff in libcore anyway right now so this may not be that far off from home? I suppose I'm a little uneasy about functionality like this going into libcore, and I would be more comfortable with it going elsewhere, but I'd have to think about where that elsewhere should be. |
Perhaps a child trait of I'm not even sure whether it's possible to make this work out seamlessly, or if it might get slightly messy. Another thing that would be nice is to pull the unicode module out of libcollections, but that might be an orthogonal concern. Even if that is the eventual direction, this functionality might be a nice stepping stone until we get there. If you like the above idea, I can experiment with stripping down libcore's (Actually, is this even possible? Can an |
That sounds great! I would love for all real-deal unicode functionality go into a libunicode (perhaps underneath libstd, perhaps not). Things like uppercase/lowercase or utf8 conversions may need to remain in core, but those are much smaller than a bunch of tables I think. I think that your idea is definitely in the right direction, and stripping down |
Awesome. Let me look into that. Would you prefer that I drop this PR until then, or should we get this integrated and into the world while we formulate a unified plan of attack? Also, note that capitalization in the general case does require lookup tables, so it might be that we can't fully support it in libcore if the goal is ridding ourselves of lookup tables. I need to do a more thorough study of this before proposing a concrete plan, though. |
I agree with |
@alexcrichton I gave some thought to how we would disentangle Unicode from libcore, and while I believe it is doable, I have become convinced that doing it right now, piecemeal, is the wrong approach.
Recognizing the oft-quoted anatomical counterpart to opinions, I nevertheless submit to you that the project of trimming libcore is probably a full RFC and accompanying discussion in its own right, and that---especially since almost all of the code in this PR is autogenerated by unicode.py---adding this functionality to libcore as it stands now is in no way detrimental to future efforts in that vein. Toward the goal of slimming libcore, I am certainly willing to undertake some experimentation in terms of contributors to total library size. Once we have that data, perhaps an open discussion (in an issue? an RFC?) on libcore structure will be fruitful. |
1.0 is mainly aiming to stabilise the language, library changes are more flexible, especially the internals of libcore etc., furthermore, we are using stability attributes for fine-grained control over what is considered immutable and what is subject to change. |
Cool, so this re-org effort can be delayed a little bit, then. I'll start putting together some stats anyway. |
Couldn't a linker strip out unused pieces of the library? I'm sorry if my thought is irrelevant. |
It does do that. Also, if you're measuring the rlib, a large portion of that is library metadata, which doesn't appear in executables. |
I'm not primarily concerned about the actual physical size of libcore, but more of the "api surface layer" of libcore. Doing unicode correctly is something I've always heard is quite tricky, and the functionality morally doesn't belong in libcore in my opinion. @huonw is right. The object file itself (on my machine) is 981K, the bytecode is 384K, and the metadata is 23MB. If this does end up in a large restructuring, then I would definitely recommend an RFC, but @huonw is also right that we're free to experiment with structures of all modules underneath the standard library with much more free reign than the standard library itself. |
OK, new crate is called libunicode. I moved most of the functions in the Char trait in libcore to the UnicodeChar trait.
I left the following functions in libcore:
I also moved some of the
Finally, the unicode modules from libcollections and libcore are now consolidated in the tables module under libunicode. libcollections was using libcore's unicode support anyway, so there was really no sense in keeping the two separate. I added For things above libstd, this change should be completely seamless, except that there are two new traits, UnicodeStrSlice and UnicodeChar, which provide some of the functionality previously provided by Char and StrSlice. Since |
Drive by comment: the PR title and description should be (somewhat) updated before this lands. |
Thanks, good point. Done. |
@kwantam maybe |
Yes, that's also possible. The quick and obvious way is to just move the module directly over, but I'll take a careful look to see if we can't eliminate duplicated data between the two sets of tables. Given the choice, I'm going to prefer performance to codesize unless told otherwise. |
OK, moved the unicode tables from libregex and unified the functionality of both scripts ( I also moved the decomposition functions into their own module inside libunicode; no reason to have the code for those functions sitting inside a python script. Finally, I moved the export point for the decomposition functions into unicode::char, which makes them available from std::char as had previously been the case. |
Anyone have any opinions on whether it's useful to export functions that check a character against a given class? We could export functions with names corresponding to each Unicode general category. There are already some of these used internally by the UnicodeChar trait implementation, but perhaps people would find these useful? @BurntSushi in moving over the Unicode tables from the libregex module I changed the definition of the PERLW table slightly to comply with UTS#18, FYI. (Also, you write much more beautiful Python than I do, so I'm sorry to have consumed your generator script...) |
@kwantam Very nice work! Your changes to And one generator script is probably better than two generator scripts, so kudos. :-) |
Thanks! Updated the documentation to reflect how the character classes are generated. (I added a link to UTS#18 and made the |
The functionality included here is only that which is necessary to | ||
provide for basic string-related manipulations. This crate does not | ||
(yet) aim to provide a full set of Unicode tables. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our crate documentation traditionally uses //!
as the comment-style for the crate comment.
This is looking pretty nice! The commit log will need some revisiting before merging, and when doing so remember to put @aturon, @brson, do either of you have thoughts on this split? I like this for a few reasons:
Some downsides which I could think of:
|
Made updates as requested. I'm going to squash everything into one commit with the content of my initial comment, above. (ETA: added |
Not sure why locale is mentioned -- the concept of context for Ambiguous characters seems to be much more about the detailed context, i.e. if a section of text is marked as a specific language, which font is used, or which characters are adjacent. As it says Ambiguous quotes characters are narrow when enclosing narrow text and wide in the other case; it doesn't depend on locale. |
Ah, good point. I'll rephrase to address this. |
- created new crate, libunicode, below libstd - split Char trait into Char (libcore) and UnicodeChar (libunicode) - Unicode-aware functions now live in libunicode - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase, is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit, to_uppercase, to_lowercase - added width method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode) - functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right - also moved Words type alias into libunicode because words method is in UnicodeStrSlice - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated unicode.py in src/etc to generate aforementioned tables - generated new tables based on latest Unicode data - added UnicodeChar and UnicodeStrSlice traits to prelude - libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore - thus, moved doc comment for char from core::char to unicode::char - libcollections remains the collection point for std::str The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does *not* impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change]
@alexcrichton I'm in favor of this change; it fits the spirit of the facade very well, and it's a pretty natural division. Neither of the downsides you mention worry me. In particular, it's not hard to construct an equivalent to today's |
Oops, needed to modify tests.mk since libunicode tests live in the libcoretest crate. |
On this. libcollections tests use some methods that are in UnicodeChar, and I guess they don't run on check-stage1? Fix should be quick, but I'll re-run locally to be sure that I didn't miss any. Well this is odd. The test module that failed imports std::prelude::*, so it should certainly have the UnicodeChar trait imported. Perhaps this is a globbing issue? (Still compiling...) Should be good to go now, I hope... |
How annoying. The test case should have failed, because the PERLW table now includes that character (in accordance with UTS#18, and perl agrees), but I'm confused why that test didn't run when I ran tests locally before the last commit. Anyway, updated the test to a character that is definitely not in the \w character class (verified against perl). Apologies for the waste of time and resources. I'll check in once tests finish on my end. Aha! Found it. Combination of turning off ipv6 in the kernel (making libstd tests fail) and running check with -j4 (making the failure non-obvious in the output file, because other tests kept running afterward). Shouldn't be an issue in the future, one hopes. |
- unicode tests live in coretest crate - libcollections str tests need UnicodeChar trait. - libregex perlw tests were checking a char in the Alphabetic category, \x2161. Confirmed perl 5.18 considers this a \w character. Changed to \x2961, which is not \w as the test expects.
@alexcrichton should be good to go now. Sorry for spinning the wheels a bit on the tests. |
Add libunicode; move unicode functions from core - created new crate, libunicode, below libstd - split `Char` trait into `Char` (libcore) and `UnicodeChar` (libunicode) - Unicode-aware functions now live in libunicode - `is_alphabetic`, `is_XID_start`, `is_XID_continue`, `is_lowercase`, `is_uppercase`, `is_whitespace`, `is_alphanumeric`, `is_control`, `is_digit`, `to_uppercase`, `to_lowercase` - added `width` method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split `StrSlice` into `StrSlice` (libcore) and `UnicodeStrSlice` (libunicode) - functionality formerly in `StrSlice` that relied upon Unicode functionality from `Char` is now in `UnicodeStrSlice` - `words`, `is_whitespace`, `is_alphanumeric`, `trim`, `trim_left`, `trim_right` - also moved `Words` type alias into libunicode because `words` method is in `UnicodeStrSlice` - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated `unicode.py` in `src/etc` to generate aforementioned tables - generated new tables based on latest Unicode data - added `UnicodeChar` and `UnicodeStrSlice` traits to prelude - libunicode is now the collection point for the `std::char` module, combining the libunicode functionality with the `Char` functionality from libcore - thus, moved doc comment for `char` from `core::char` to `unicode::char` - libcollections remains the collection point for `std::str` The Unicode-aware functions that previously lived in the `Char` and `StrSlice` traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and `use` the `UnicodeChar` and/or `UnicodeStrSlice` traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does *not* impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change]
Add libunicode; move unicode functions from core
Char
trait intoChar
(libcore) andUnicodeChar
(libunicode)is_alphabetic
,is_XID_start
,is_XID_continue
,is_lowercase
,is_uppercase
,is_whitespace
,is_alphanumeric
,is_control
,is_digit
,to_uppercase
,to_lowercase
width
method in UnicodeChar traitStrSlice
intoStrSlice
(libcore) andUnicodeStrSlice
(libunicode)StrSlice
that relied upon Unicode functionality fromChar
is now inUnicodeStrSlice
words
,is_whitespace
,is_alphanumeric
,trim
,trim_left
,trim_right
Words
type alias into libunicode becausewords
method is inUnicodeStrSlice
unicode.py
insrc/etc
to generate aforementioned tablesUnicodeChar
andUnicodeStrSlice
traits to preludestd::char
module, combining the libunicode functionality with theChar
functionality from libcorechar
fromcore::char
tounicode::char
std::str
The Unicode-aware functions that previously lived in the
Char
andStrSlice
traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate anduse
theUnicodeChar
and/orUnicodeStrSlice
traits:NOTE: this does not impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]