move Unicode functionality from libcore/collections/regex into libunicode ; add width() method for char ; update libunicode tables for Unicode 7.0 #15283

kwantam · 2014-06-30T21:32:13Z

Add libunicode; move unicode functions from core

created new crate, libunicode, below libstd
split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
  - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
    is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit,
    to_uppercase, to_lowercase
- added width method in UnicodeChar trait
  - determines printed width of character in columns, or None if it is a non-NULL control character
  - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise)
split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode)
- functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice
  - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is in UnicodeStrSlice
unified Unicode tables from libcollections, libcore, and libregex into libunicode
updated unicode.py in src/etc to generate aforementioned tables
generated new tables based on latest Unicode data
added UnicodeChar and UnicodeStrSlice traits to prelude
libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore
- thus, moved doc comment for char from core::char to unicode::char
libcollections remains the collection point for std::str

The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits:

extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method

NOTE: this does not impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude.

closes #15224
[breaking-change]

emberian · 2014-06-30T23:54:22Z

src/libcore/char.rs

@@ -355,6 +355,13 @@ pub fn len_utf8_bytes(c: char) -> uint {
    }
 }

+/// Returns the displayed width of this `char`, or `None` if this is a


"Displayed width" in what? Pixels? Microsantas?

We should just not have this free function, btw, and only have the method. UFCS is close enough, and Char::width is nice enough.

Ah, good catch on the units.

I'll go ahead and pull char::width and just leave the method.

omasanori · 2014-07-01T04:36:09Z

How about 'A' (Ambiguous) characters?

kwantam · 2014-07-01T05:05:22Z

This implementation returns Some(1) for Ambiguous characters.

UAX#11, §5, makes recommendations for three situations.

When mapping Unicode to East Asian legacy character encodings
- treat 'A' width as fullwidth (2 column) characters
When mapping Unicode to non-East Asian legacy character encodings
- treat 'A' width as narrow (1 column) characters
When processing or displaying data
- treat 'A' width based on the present context, or, if the context cannot be reliably determined, treat 'A' width as narrow.

Case 3 seems the closest to our purposes in providing such a function in the standard library, and case 1 the furthest. Absent explicit locale support, it seems to me that treating 'A' as narrow is the correct decision.

This behavior is also consistent with the notes in Markus Kuhn's implementation. Above the alternate mk_wcwidth_cjk() function, he says:

 * The following functions are the same as mk_wcwidth() and
 * mk_wcswidth(), except that spacing characters in the East Asian
 * Ambiguous (A) category as defined in Unicode Technical Report #11
 * have a column width of 2. This variant might be useful for users of
 * CJK legacy encodings who want to migrate to UCS without changing
 * the traditional terminal character-width behaviour. It is not
 * otherwise recommended for general use.

We could provide an alternate function, e.g., width_cjk(), which returns an appropriate value for CJK locales; then it is up to the programmer to choose the correct width based on her locale. If this seems desirable, I can add another table and corresponding function to the charwidth module.

omasanori · 2014-07-01T07:19:12Z

Agreed. I prefer a flag parameter (e.g. width(AMBIGUOUS_AS_WIDE)) to an alternative just for CJK, but I think it's trivial and matter of taste.

kwantam · 2014-07-01T18:29:20Z

OK, I've updated to include support for CJK. width() takes is_cjk: bool.

Also added some additional updates in unicode.py:

minor optimization to the table, collapsing adjacent spans when possible
rewrote the function that extracts info from EastAsianWidth.txt to exclude Me, Mn, and Cf categories when extracting W, F, and A

alexcrichton · 2014-07-01T19:27:32Z

I'm a little worried about adding more and more unicode-ness into libcore, especially when you think of libcore being the core of all rust code in existence. It seems a little odd to put char.width() on the same level as slice.len().

That being said, this definitely seems like a useful property to have, and it should likely have a home somewhere. Additionally, there's already a bunch of other unicode stuff in libcore anyway right now so this may not be that far off from home?

I suppose I'm a little uneasy about functionality like this going into libcore, and I would be more comfortable with it going elsewhere, but I'd have to think about where that elsewhere should be.

kwantam · 2014-07-01T19:48:10Z

Perhaps a child trait of Char called UnicodeChar, the former having only barebones support for non-ASCII characters, and the latter having full Unicode support? Then UnicodeChar could live in (e.g.) libstd.

I'm not even sure whether it's possible to make this work out seamlessly, or if it might get slightly messy.

Another thing that would be nice is to pull the unicode module out of libcollections, but that might be an orthogonal concern.

Even if that is the eventual direction, this functionality might be a nice stepping stone until we get there. If you like the above idea, I can experiment with stripping down libcore's Char implementation and moving the advanced functionality into UnicodeChar.

(Actually, is this even possible? Can an impl for char live outside of libcore? Hrm.)

alexcrichton · 2014-07-01T19:52:27Z

That sounds great! I would love for all real-deal unicode functionality go into a libunicode (perhaps underneath libstd, perhaps not). Things like uppercase/lowercase or utf8 conversions may need to remain in core, but those are much smaller than a bunch of tables I think.

I think that your idea is definitely in the right direction, and stripping down Char may work out well. You can't have an inherent impl char { ... }, but you can have impl Trait for char { ... } (like the Char trait).

kwantam · 2014-07-01T19:56:53Z

Awesome. Let me look into that.

Would you prefer that I drop this PR until then, or should we get this integrated and into the world while we formulate a unified plan of attack?

Also, note that capitalization in the general case does require lookup tables, so it might be that we can't fully support it in libcore if the goal is ridding ourselves of lookup tables. I need to do a more thorough study of this before proposing a concrete plan, though.

omasanori · 2014-07-01T23:12:53Z

I agree with libunicode. Indeed the standard library should provide proper text handling, it sounds good to separate advanced (well, it's essential for someone, though) functionality from libcore as long as a subset in libcore does things right.

kwantam · 2014-07-02T03:41:51Z

@alexcrichton I gave some thought to how we would disentangle Unicode from libcore, and while I believe it is doable, I have become convinced that doing it right now, piecemeal, is the wrong approach.

At the moment, libcore is a bit of a sprawl---26MB on my machine!---and I'd be surprised if the Unicode tables account for even 0.1% of that.
- For example, the charwidths table has about 500 entries at 10 bytes per entry.
As a result, moving Unicode-related Char functionality out of libcore will, by itself, have almost no impact on the size of libcore.
Thus, if we want to clean up libcore, other cleanup will need to happen as well.
- Given the present size of libcore, my feeling is that this will be a major effort driven by hard data: how much does each feature in libcore cost, and how easy is it to move that feature into a non-core library?
  - This will almost certainly be a breaking change, and thus needs to happen before 1.0.
- The rudimentary Unicode support presently in libcore is, in my mind, unlikely to be even in the top 10 submodules in terms of how much it adds to libcore's size.
  - So excising it from libcore does almost nothing to address the primary concern.
  - ...and it probably doesn't have much benefit in terms of reducing future work slimming down libcore.

Recognizing the oft-quoted anatomical counterpart to opinions, I nevertheless submit to you that the project of trimming libcore is probably a full RFC and accompanying discussion in its own right, and that---especially since almost all of the code in this PR is autogenerated by unicode.py---adding this functionality to libcore as it stands now is in no way detrimental to future efforts in that vein.

Toward the goal of slimming libcore, I am certainly willing to undertake some experimentation in terms of contributors to total library size. Once we have that data, perhaps an open discussion (in an issue? an RFC?) on libcore structure will be fruitful.

huonw · 2014-07-02T04:04:45Z

This will almost certainly be a breaking change, and thus needs to happen before 1.0.

1.0 is mainly aiming to stabilise the language, library changes are more flexible, especially the internals of libcore etc., furthermore, we are using stability attributes for fine-grained control over what is considered immutable and what is subject to change.

kwantam · 2014-07-02T04:28:39Z

library changes are more flexible

Cool, so this re-org effort can be delayed a little bit, then. I'll start putting together some stats anyway.

omasanori · 2014-07-02T08:22:13Z

libcore is a bit of a sprawl---26MB on my machine!

Couldn't a linker strip out unused pieces of the library? I'm sorry if my thought is irrelevant.

huonw · 2014-07-02T08:54:55Z

It does do that. Also, if you're measuring the rlib, a large portion of that is library metadata, which doesn't appear in executables.

alexcrichton · 2014-07-02T13:09:41Z

I'm not primarily concerned about the actual physical size of libcore, but more of the "api surface layer" of libcore. Doing unicode correctly is something I've always heard is quite tricky, and the functionality morally doesn't belong in libcore in my opinion.

@huonw is right. The object file itself (on my machine) is 981K, the bytecode is 384K, and the metadata is 23MB.

If this does end up in a large restructuring, then I would definitely recommend an RFC, but @huonw is also right that we're free to experiment with structures of all modules underneath the standard library with much more free reign than the standard library itself.

kwantam · 2014-07-03T02:32:55Z

OK, new crate is called libunicode.

I moved most of the functions in the Char trait in libcore to the UnicodeChar trait.

is_alphabetic
is_XID_start
is_XID_continue
is_lowercase
is_uppercase
is_whitespace
is_alphanumeric
is_control
is_digit
to_lowercase
to_uppercase
width

I left the following functions in libcore:

is_digit_radix - this is not a Unicode-aware function
to_digit - ditto
from_digit - ditto
escape_unicode - I hesitated on this one, but it's not really aware of anything beyond the particular escape sequences we use to represent the char datatype, so I thought it should stay here.
escape_default - if the above stays, certainly this one does too
len_utf8_bytes - low-level representation--related
encode_utf8 - ditto
encode_utf16 - ditto

I also moved some of the StrSlice methods into UnicodeStrSlice, specifically, the ones that rely on methods provided by UnicodeChar:

words
is_whitespace
is_alphanumeric
trim
trim_left
trim_right
the Words type alias is also in libunicode, since it is the return type for words. Strictly speaking, this could stay in StrSlice, but I don't see the point, really. Nothing below libstd uses it anyway.

Finally, the unicode modules from libcollections and libcore are now consolidated in the tables module under libunicode. libcollections was using libcore's unicode support anyway, so there was really no sense in keeping the two separate.

I added UnicodeStrSlice and UnicodeChar to the prelude, and I changed libstd so that it re-exports the char module from libunicode rather than from libcore (and libunicode re-exports the public elements of libcore so that this is seamless). libcollections provides std::str as before, with a few elements re-exported from libunicode.

For things above libstd, this change should be completely seamless, except that there are two new traits, UnicodeStrSlice and UnicodeChar, which provide some of the functionality previously provided by Char and StrSlice. Since char and &str are the only implementors of these traits (respectively) this should be a no-op for almost everyone.

huonw · 2014-07-03T03:48:32Z

Drive by comment: the PR title and description should be (somewhat) updated before this lands.

kwantam · 2014-07-03T05:29:46Z

Thanks, good point. Done.

schmee · 2014-07-03T07:24:15Z

@kwantam maybe libregex/unicode.rs can move into libunicode as well?

kwantam · 2014-07-03T07:56:07Z

Yes, that's also possible. The quick and obvious way is to just move the module directly over, but I'll take a careful look to see if we can't eliminate duplicated data between the two sets of tables.

Given the choice, I'm going to prefer performance to codesize unless told otherwise.

kwantam · 2014-07-03T23:54:58Z

OK, moved the unicode tables from libregex and unified the functionality of both scripts (regex-unicode-tables.py and unicode.py) into unicode.py.

I also moved the decomposition functions into their own module inside libunicode; no reason to have the code for those functions sitting inside a python script.

Finally, I moved the export point for the decomposition functions into unicode::char, which makes them available from std::char as had previously been the case.

kwantam · 2014-07-04T00:14:55Z

Anyone have any opinions on whether it's useful to export functions that check a character against a given class? We could export functions with names corresponding to each Unicode general category. There are already some of these used internally by the UnicodeChar trait implementation, but perhaps people would find these useful?

@BurntSushi in moving over the Unicode tables from the libregex module I changed the definition of the PERLW table slightly to comply with UTS#18, FYI. (Also, you write much more beautiful Python than I do, so I'm sorry to have consumed your generator script...)

BurntSushi · 2014-07-04T02:37:58Z

@kwantam Very nice work! Your changes to \w look fine to me. Thank you for doing that. I would ask that you update the documentation though to be more accurate. (I'm thinking of this section in particular.)

And one generator script is probably better than two generator scripts, so kudos. :-)

kwantam · 2014-07-04T03:03:10Z

Thanks! Updated the documentation to reflect how the character classes are generated.

(I added a link to UTS#18 and made the \d, \w, and \s definitions more accurately reflect this. The only one that actually changed as a result of this was \w.)

alexcrichton · 2014-07-05T19:55:51Z

src/libunicode/lib.rs

+  The functionality included here is only that which is necessary to
+  provide for basic string-related manipulations. This crate does not
+  (yet) aim to provide a full set of Unicode tables.
+*/


Our crate documentation traditionally uses //! as the comment-style for the crate comment.

alexcrichton · 2014-07-05T20:05:18Z

This is looking pretty nice! The commit log will need some revisiting before merging, and when doing so remember to put [breaking-change] where appropriate.

@aturon, @brson, do either of you have thoughts on this split? I like this for a few reasons:

Adding new unicode functionality has a clear location at which it should be added libunicode
There's a clear split between what needs unicode and what doesn't.
All unicode modules scattered across our crates are now consolidated into one.
All usage of the unicode crate itself is #[experimental] (as it should be, unless it's reexported)
I believe that rustdoc does a good enough job with inlining/presentation that this won't degrade the quality of documentation too much.

Some downsides which I could think of:

The privileged name libunicode is taken. I believe this to be ok because as part of the facade we're at liberty to rename things as we see fit.
Common functions like is_whitespace are harder to access if you're only using libcore. I think this isn't so bad.

kwantam · 2014-07-05T20:29:02Z

Made updates as requested. I'm going to squash everything into one commit with the content of my initial comment, above.

(ETA: added #![crate_name] as well)
(E2: forgot #![allow(unused_attribute)])

bluss · 2014-07-07T18:23:45Z

Not sure why locale is mentioned -- the concept of context for Ambiguous characters seems to be much more about the detailed context, i.e. if a section of text is marked as a specific language, which font is used, or which characters are adjacent. As it says Ambiguous quotes characters are narrow when enclosing narrow text and wide in the other case; it doesn't depend on locale.

kwantam · 2014-07-07T18:48:38Z

Ah, good point. I'll rephrase to address this.

- created new crate, libunicode, below libstd - split Char trait into Char (libcore) and UnicodeChar (libunicode) - Unicode-aware functions now live in libunicode - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase, is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit, to_uppercase, to_lowercase - added width method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode) - functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right - also moved Words type alias into libunicode because words method is in UnicodeStrSlice - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated unicode.py in src/etc to generate aforementioned tables - generated new tables based on latest Unicode data - added UnicodeChar and UnicodeStrSlice traits to prelude - libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore - thus, moved doc comment for char from core::char to unicode::char - libcollections remains the collection point for std::str The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does *not* impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change]

aturon · 2014-07-08T05:03:05Z

@alexcrichton I'm in favor of this change; it fits the spirit of the facade very well, and it's a pretty natural division.

Neither of the downsides you mention worry me. In particular, it's not hard to construct an equivalent to today's libcore if you so desire.

kwantam · 2014-07-08T18:49:44Z

Oops, needed to modify tests.mk since libunicode tests live in the libcoretest crate.

kwantam · 2014-07-08T23:00:40Z

On this. libcollections tests use some methods that are in UnicodeChar, and I guess they don't run on check-stage1? Fix should be quick, but I'll re-run locally to be sure that I didn't miss any.

Well this is odd. The test module that failed imports std::prelude::*, so it should certainly have the UnicodeChar trait imported. Perhaps this is a globbing issue? (Still compiling...)

Should be good to go now, I hope...

kwantam · 2014-07-09T12:43:26Z

How annoying. The test case should have failed, because the PERLW table now includes that character (in accordance with UTS#18, and perl agrees), but I'm confused why that test didn't run when I ran tests locally before the last commit.

Anyway, updated the test to a character that is definitely not in the \w character class (verified against perl). Apologies for the waste of time and resources. I'll check in once tests finish on my end.

Aha! Found it. Combination of turning off ipv6 in the kernel (making libstd tests fail) and running check with -j4 (making the failure non-obvious in the output file, because other tests kept running afterward). Shouldn't be an issue in the future, one hopes.

- unicode tests live in coretest crate - libcollections str tests need UnicodeChar trait. - libregex perlw tests were checking a char in the Alphabetic category, \x2161. Confirmed perl 5.18 considers this a \w character. Changed to \x2961, which is not \w as the test expects.

kwantam · 2014-07-09T14:17:48Z

@alexcrichton should be good to go now. Sorry for spinning the wheels a bit on the tests.

Add libunicode; move unicode functions from core - created new crate, libunicode, below libstd - split `Char` trait into `Char` (libcore) and `UnicodeChar` (libunicode) - Unicode-aware functions now live in libunicode - `is_alphabetic`, `is_XID_start`, `is_XID_continue`, `is_lowercase`, `is_uppercase`, `is_whitespace`, `is_alphanumeric`, `is_control`, `is_digit`, `to_uppercase`, `to_lowercase` - added `width` method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split `StrSlice` into `StrSlice` (libcore) and `UnicodeStrSlice` (libunicode) - functionality formerly in `StrSlice` that relied upon Unicode functionality from `Char` is now in `UnicodeStrSlice` - `words`, `is_whitespace`, `is_alphanumeric`, `trim`, `trim_left`, `trim_right` - also moved `Words` type alias into libunicode because `words` method is in `UnicodeStrSlice` - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated `unicode.py` in `src/etc` to generate aforementioned tables - generated new tables based on latest Unicode data - added `UnicodeChar` and `UnicodeStrSlice` traits to prelude - libunicode is now the collection point for the `std::char` module, combining the libunicode functionality with the `Char` functionality from libcore - thus, moved doc comment for `char` from `core::char` to `unicode::char` - libcollections remains the collection point for `std::str` The Unicode-aware functions that previously lived in the `Char` and `StrSlice` traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and `use` the `UnicodeChar` and/or `UnicodeStrSlice` traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does *not* impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change]

Arcterus mentioned this pull request Jun 30, 2014

fmt: Knuth-Plass implementation; common: add unicode char_width function uutils/coreutils#318

Merged

emberian reviewed Jun 30, 2014
View reviewed changes

kwantam mentioned this pull request Jul 1, 2014

fmt: updated tables for charwidth uutils/coreutils#336

Merged

kwantam changed the title ~~add width() method in Char ; update Unicode tables~~ move Unicode functionality from libcore/collections into libunicode ; add width() method ; update libunicode tables for Unicode 7.0 Jul 3, 2014

alexcrichton reviewed Jul 5, 2014
View reviewed changes

bors closed this Jul 9, 2014

bors merged commit 85e2bee into rust-lang:master Jul 9, 2014

SimonSapin mentioned this pull request Jun 17, 2018

Treat gc=No characters as numeric #51609

Merged

move Unicode functionality from libcore/collections/regex into libunicode ; add width() method for char ; update libunicode tables for Unicode 7.0 #15283

move Unicode functionality from libcore/collections/regex into libunicode ; add width() method for char ; update libunicode tables for Unicode 7.0 #15283

Uh oh!

Conversation

kwantam commented Jun 30, 2014

Uh oh!

emberian Jun 30, 2014

Choose a reason for hiding this comment

Uh oh!

emberian Jun 30, 2014

Choose a reason for hiding this comment

Uh oh!

kwantam Jul 1, 2014

Choose a reason for hiding this comment

Uh oh!

omasanori commented Jul 1, 2014

Uh oh!

kwantam commented Jul 1, 2014

Uh oh!

omasanori commented Jul 1, 2014

Uh oh!

kwantam commented Jul 1, 2014

Uh oh!

alexcrichton commented Jul 1, 2014

Uh oh!

kwantam commented Jul 1, 2014

Uh oh!

alexcrichton commented Jul 1, 2014

Uh oh!

kwantam commented Jul 1, 2014

Uh oh!

omasanori commented Jul 1, 2014

Uh oh!

kwantam commented Jul 2, 2014

Uh oh!

huonw commented Jul 2, 2014

Uh oh!

kwantam commented Jul 2, 2014

Uh oh!

omasanori commented Jul 2, 2014

Uh oh!

huonw commented Jul 2, 2014

Uh oh!

alexcrichton commented Jul 2, 2014

Uh oh!

kwantam commented Jul 3, 2014

Uh oh!

huonw commented Jul 3, 2014

Uh oh!

kwantam commented Jul 3, 2014

Uh oh!

schmee commented Jul 3, 2014

Uh oh!

kwantam commented Jul 3, 2014

Uh oh!

kwantam commented Jul 3, 2014

Uh oh!

kwantam commented Jul 4, 2014

Uh oh!

BurntSushi commented Jul 4, 2014

Uh oh!

kwantam commented Jul 4, 2014

Uh oh!

alexcrichton Jul 5, 2014

Choose a reason for hiding this comment

Uh oh!

alexcrichton commented Jul 5, 2014

Uh oh!

kwantam commented Jul 5, 2014

Uh oh!

bluss commented Jul 7, 2014

Uh oh!

kwantam commented Jul 7, 2014

Uh oh!

aturon commented Jul 8, 2014

Uh oh!

kwantam commented Jul 8, 2014

Uh oh!

kwantam commented Jul 8, 2014

Uh oh!