Skip to content

move Unicode functionality from libcore/collections/regex into libunicode ; add width() method for char ; update libunicode tables for Unicode 7.0 #15283

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 9, 2014

Conversation

kwantam
Copy link
Contributor

@kwantam kwantam commented Jun 30, 2014

Add libunicode; move unicode functions from core

  • created new crate, libunicode, below libstd
  • split Char trait into Char (libcore) and UnicodeChar (libunicode)
    • Unicode-aware functions now live in libunicode
      • is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
        is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit,
        to_uppercase, to_lowercase
    • added width method in UnicodeChar trait
      • determines printed width of character in columns, or None if it is a non-NULL control character
      • takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise)
  • split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode)
    • functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice
      • words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
    • also moved Words type alias into libunicode because words method is in UnicodeStrSlice
  • unified Unicode tables from libcollections, libcore, and libregex into libunicode
  • updated unicode.py in src/etc to generate aforementioned tables
  • generated new tables based on latest Unicode data
  • added UnicodeChar and UnicodeStrSlice traits to prelude
  • libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore
    • thus, moved doc comment for char from core::char to unicode::char
  • libcollections remains the collection point for std::str

The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits:

extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method

NOTE: this does not impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude.

closes #15224
[breaking-change]

@@ -355,6 +355,13 @@ pub fn len_utf8_bytes(c: char) -> uint {
}
}

/// Returns the displayed width of this `char`, or `None` if this is a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Displayed width" in what? Pixels? Microsantas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just not have this free function, btw, and only have the method. UFCS is close enough, and Char::width is nice enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch on the units.

I'll go ahead and pull char::width and just leave the method.

@omasanori
Copy link
Contributor

How about 'A' (Ambiguous) characters?

@kwantam
Copy link
Contributor Author

kwantam commented Jul 1, 2014

This implementation returns Some(1) for Ambiguous characters.

UAX#11, §5, makes recommendations for three situations.

  1. When mapping Unicode to East Asian legacy character encodings
    • treat 'A' width as fullwidth (2 column) characters
  2. When mapping Unicode to non-East Asian legacy character encodings
    • treat 'A' width as narrow (1 column) characters
  3. When processing or displaying data
    • treat 'A' width based on the present context, or, if the context cannot be reliably determined, treat 'A' width as narrow.

Case 3 seems the closest to our purposes in providing such a function in the standard library, and case 1 the furthest. Absent explicit locale support, it seems to me that treating 'A' as narrow is the correct decision.

This behavior is also consistent with the notes in Markus Kuhn's implementation. Above the alternate mk_wcwidth_cjk() function, he says:

 * The following functions are the same as mk_wcwidth() and
 * mk_wcswidth(), except that spacing characters in the East Asian
 * Ambiguous (A) category as defined in Unicode Technical Report #11
 * have a column width of 2. This variant might be useful for users of
 * CJK legacy encodings who want to migrate to UCS without changing
 * the traditional terminal character-width behaviour. It is not
 * otherwise recommended for general use.

We could provide an alternate function, e.g., width_cjk(), which returns an appropriate value for CJK locales; then it is up to the programmer to choose the correct width based on her locale. If this seems desirable, I can add another table and corresponding function to the charwidth module.

@omasanori
Copy link
Contributor

Agreed. I prefer a flag parameter (e.g. width(AMBIGUOUS_AS_WIDE)) to an alternative just for CJK, but I think it's trivial and matter of taste.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 1, 2014

OK, I've updated to include support for CJK. width() takes is_cjk: bool.

Also added some additional updates in unicode.py:

  • minor optimization to the table, collapsing adjacent spans when possible
  • rewrote the function that extracts info from EastAsianWidth.txt to exclude Me, Mn, and Cf categories when extracting W, F, and A

@alexcrichton
Copy link
Member

I'm a little worried about adding more and more unicode-ness into libcore, especially when you think of libcore being the core of all rust code in existence. It seems a little odd to put char.width() on the same level as slice.len().

That being said, this definitely seems like a useful property to have, and it should likely have a home somewhere. Additionally, there's already a bunch of other unicode stuff in libcore anyway right now so this may not be that far off from home?

I suppose I'm a little uneasy about functionality like this going into libcore, and I would be more comfortable with it going elsewhere, but I'd have to think about where that elsewhere should be.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 1, 2014

Perhaps a child trait of Char called UnicodeChar, the former having only barebones support for non-ASCII characters, and the latter having full Unicode support? Then UnicodeChar could live in (e.g.) libstd.

I'm not even sure whether it's possible to make this work out seamlessly, or if it might get slightly messy.

Another thing that would be nice is to pull the unicode module out of libcollections, but that might be an orthogonal concern.

Even if that is the eventual direction, this functionality might be a nice stepping stone until we get there. If you like the above idea, I can experiment with stripping down libcore's Char implementation and moving the advanced functionality into UnicodeChar.

(Actually, is this even possible? Can an impl for char live outside of libcore? Hrm.)

@alexcrichton
Copy link
Member

That sounds great! I would love for all real-deal unicode functionality go into a libunicode (perhaps underneath libstd, perhaps not). Things like uppercase/lowercase or utf8 conversions may need to remain in core, but those are much smaller than a bunch of tables I think.

I think that your idea is definitely in the right direction, and stripping down Char may work out well. You can't have an inherent impl char { ... }, but you can have impl Trait for char { ... } (like the Char trait).

@kwantam
Copy link
Contributor Author

kwantam commented Jul 1, 2014

Awesome. Let me look into that.

Would you prefer that I drop this PR until then, or should we get this integrated and into the world while we formulate a unified plan of attack?

Also, note that capitalization in the general case does require lookup tables, so it might be that we can't fully support it in libcore if the goal is ridding ourselves of lookup tables. I need to do a more thorough study of this before proposing a concrete plan, though.

@omasanori
Copy link
Contributor

I agree with libunicode. Indeed the standard library should provide proper text handling, it sounds good to separate advanced (well, it's essential for someone, though) functionality from libcore as long as a subset in libcore does things right.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 2, 2014

@alexcrichton I gave some thought to how we would disentangle Unicode from libcore, and while I believe it is doable, I have become convinced that doing it right now, piecemeal, is the wrong approach.

  • At the moment, libcore is a bit of a sprawl---26MB on my machine!---and I'd be surprised if the Unicode tables account for even 0.1% of that.
    • For example, the charwidths table has about 500 entries at 10 bytes per entry.
  • As a result, moving Unicode-related Char functionality out of libcore will, by itself, have almost no impact on the size of libcore.
  • Thus, if we want to clean up libcore, other cleanup will need to happen as well.
    • Given the present size of libcore, my feeling is that this will be a major effort driven by hard data: how much does each feature in libcore cost, and how easy is it to move that feature into a non-core library?
      • This will almost certainly be a breaking change, and thus needs to happen before 1.0.
    • The rudimentary Unicode support presently in libcore is, in my mind, unlikely to be even in the top 10 submodules in terms of how much it adds to libcore's size.
      • So excising it from libcore does almost nothing to address the primary concern.
      • ...and it probably doesn't have much benefit in terms of reducing future work slimming down libcore.

Recognizing the oft-quoted anatomical counterpart to opinions, I nevertheless submit to you that the project of trimming libcore is probably a full RFC and accompanying discussion in its own right, and that---especially since almost all of the code in this PR is autogenerated by unicode.py---adding this functionality to libcore as it stands now is in no way detrimental to future efforts in that vein.

Toward the goal of slimming libcore, I am certainly willing to undertake some experimentation in terms of contributors to total library size. Once we have that data, perhaps an open discussion (in an issue? an RFC?) on libcore structure will be fruitful.

@huonw
Copy link
Member

huonw commented Jul 2, 2014

This will almost certainly be a breaking change, and thus needs to happen before 1.0.

1.0 is mainly aiming to stabilise the language, library changes are more flexible, especially the internals of libcore etc., furthermore, we are using stability attributes for fine-grained control over what is considered immutable and what is subject to change.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 2, 2014

library changes are more flexible

Cool, so this re-org effort can be delayed a little bit, then. I'll start putting together some stats anyway.

@omasanori
Copy link
Contributor

libcore is a bit of a sprawl---26MB on my machine!

Couldn't a linker strip out unused pieces of the library? I'm sorry if my thought is irrelevant.

@huonw
Copy link
Member

huonw commented Jul 2, 2014

It does do that. Also, if you're measuring the rlib, a large portion of that is library metadata, which doesn't appear in executables.

@alexcrichton
Copy link
Member

I'm not primarily concerned about the actual physical size of libcore, but more of the "api surface layer" of libcore. Doing unicode correctly is something I've always heard is quite tricky, and the functionality morally doesn't belong in libcore in my opinion.

@huonw is right. The object file itself (on my machine) is 981K, the bytecode is 384K, and the metadata is 23MB.

If this does end up in a large restructuring, then I would definitely recommend an RFC, but @huonw is also right that we're free to experiment with structures of all modules underneath the standard library with much more free reign than the standard library itself.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 3, 2014

OK, new crate is called libunicode.

I moved most of the functions in the Char trait in libcore to the UnicodeChar trait.

  • is_alphabetic
  • is_XID_start
  • is_XID_continue
  • is_lowercase
  • is_uppercase
  • is_whitespace
  • is_alphanumeric
  • is_control
  • is_digit
  • to_lowercase
  • to_uppercase
  • width

I left the following functions in libcore:

  • is_digit_radix - this is not a Unicode-aware function
  • to_digit - ditto
  • from_digit - ditto
  • escape_unicode - I hesitated on this one, but it's not really aware of anything beyond the particular escape sequences we use to represent the char datatype, so I thought it should stay here.
  • escape_default - if the above stays, certainly this one does too
  • len_utf8_bytes - low-level representation--related
  • encode_utf8 - ditto
  • encode_utf16 - ditto

I also moved some of the StrSlice methods into UnicodeStrSlice, specifically, the ones that rely on methods provided by UnicodeChar:

  • words
  • is_whitespace
  • is_alphanumeric
  • trim
  • trim_left
  • trim_right
  • the Words type alias is also in libunicode, since it is the return type for words. Strictly speaking, this could stay in StrSlice, but I don't see the point, really. Nothing below libstd uses it anyway.

Finally, the unicode modules from libcollections and libcore are now consolidated in the tables module under libunicode. libcollections was using libcore's unicode support anyway, so there was really no sense in keeping the two separate.

I added UnicodeStrSlice and UnicodeChar to the prelude, and I changed libstd so that it re-exports the char module from libunicode rather than from libcore (and libunicode re-exports the public elements of libcore so that this is seamless). libcollections provides std::str as before, with a few elements re-exported from libunicode.

For things above libstd, this change should be completely seamless, except that there are two new traits, UnicodeStrSlice and UnicodeChar, which provide some of the functionality previously provided by Char and StrSlice. Since char and &str are the only implementors of these traits (respectively) this should be a no-op for almost everyone.

@huonw
Copy link
Member

huonw commented Jul 3, 2014

Drive by comment: the PR title and description should be (somewhat) updated before this lands.

@kwantam kwantam changed the title add width() method in Char ; update Unicode tables move Unicode functionality from libcore/collections into libunicode ; add width() method ; update libunicode tables for Unicode 7.0 Jul 3, 2014
@kwantam kwantam changed the title move Unicode functionality from libcore/collections into libunicode ; add width() method ; update libunicode tables for Unicode 7.0 move Unicode functionality from libcore/collections into libunicode ; add width() method for char ; update libunicode tables for Unicode 7.0 Jul 3, 2014
@kwantam
Copy link
Contributor Author

kwantam commented Jul 3, 2014

Thanks, good point. Done.

@schmee
Copy link
Contributor

schmee commented Jul 3, 2014

@kwantam maybe libregex/unicode.rs can move into libunicode as well?

@kwantam
Copy link
Contributor Author

kwantam commented Jul 3, 2014

Yes, that's also possible. The quick and obvious way is to just move the module directly over, but I'll take a careful look to see if we can't eliminate duplicated data between the two sets of tables.

Given the choice, I'm going to prefer performance to codesize unless told otherwise.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 3, 2014

OK, moved the unicode tables from libregex and unified the functionality of both scripts (regex-unicode-tables.py and unicode.py) into unicode.py.

I also moved the decomposition functions into their own module inside libunicode; no reason to have the code for those functions sitting inside a python script.

Finally, I moved the export point for the decomposition functions into unicode::char, which makes them available from std::char as had previously been the case.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 4, 2014

Anyone have any opinions on whether it's useful to export functions that check a character against a given class? We could export functions with names corresponding to each Unicode general category. There are already some of these used internally by the UnicodeChar trait implementation, but perhaps people would find these useful?

@BurntSushi in moving over the Unicode tables from the libregex module I changed the definition of the PERLW table slightly to comply with UTS#18, FYI. (Also, you write much more beautiful Python than I do, so I'm sorry to have consumed your generator script...)

@kwantam kwantam changed the title move Unicode functionality from libcore/collections into libunicode ; add width() method for char ; update libunicode tables for Unicode 7.0 move Unicode functionality from libcore/collections/regex into libunicode ; add width() method for char ; update libunicode tables for Unicode 7.0 Jul 4, 2014
@BurntSushi
Copy link
Member

@kwantam Very nice work! Your changes to \w look fine to me. Thank you for doing that. I would ask that you update the documentation though to be more accurate. (I'm thinking of this section in particular.)

And one generator script is probably better than two generator scripts, so kudos. :-)

@kwantam
Copy link
Contributor Author

kwantam commented Jul 4, 2014

Thanks! Updated the documentation to reflect how the character classes are generated.

(I added a link to UTS#18 and made the \d, \w, and \s definitions more accurately reflect this. The only one that actually changed as a result of this was \w.)

The functionality included here is only that which is necessary to
provide for basic string-related manipulations. This crate does not
(yet) aim to provide a full set of Unicode tables.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our crate documentation traditionally uses //! as the comment-style for the crate comment.

@alexcrichton
Copy link
Member

This is looking pretty nice! The commit log will need some revisiting before merging, and when doing so remember to put [breaking-change] where appropriate.

@aturon, @brson, do either of you have thoughts on this split? I like this for a few reasons:

  • Adding new unicode functionality has a clear location at which it should be added libunicode
  • There's a clear split between what needs unicode and what doesn't.
  • All unicode modules scattered across our crates are now consolidated into one.
  • All usage of the unicode crate itself is #[experimental] (as it should be, unless it's reexported)
  • I believe that rustdoc does a good enough job with inlining/presentation that this won't degrade the quality of documentation too much.

Some downsides which I could think of:

  • The privileged name libunicode is taken. I believe this to be ok because as part of the facade we're at liberty to rename things as we see fit.
  • Common functions like is_whitespace are harder to access if you're only using libcore. I think this isn't so bad.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 5, 2014

Made updates as requested. I'm going to squash everything into one commit with the content of my initial comment, above.

(ETA: added #![crate_name] as well)
(E2: forgot #![allow(unused_attribute)])

@bluss
Copy link
Member

bluss commented Jul 7, 2014

Not sure why locale is mentioned -- the concept of context for Ambiguous characters seems to be much more about the detailed context, i.e. if a section of text is marked as a specific language, which font is used, or which characters are adjacent. As it says Ambiguous quotes characters are narrow when enclosing narrow text and wide in the other case; it doesn't depend on locale.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 7, 2014

Ah, good point. I'll rephrase to address this.

- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
  - Unicode-aware functions now live in libunicode
    - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
      is_uppercase, is_whitespace, is_alphanumeric, is_control,
      is_digit, to_uppercase, to_lowercase
  - added width method in UnicodeChar trait
    - determines printed width of character in columns, or None if it is
      a non-NULL control character
    - takes a boolean argument indicating whether the present context is
      CJK or not (characters with 'A'mbiguous widths are double-wide in
      CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
  (libunicode)
  - functionality formerly in StrSlice that relied upon Unicode
    functionality from Char is now in UnicodeStrSlice
    - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
  - also moved Words type alias into libunicode because words method is
    in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
  libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
  combining the libunicode functionality with the Char functionality
  from libcore
  - thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str

The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:

    extern crate unicode;
    use unicode::UnicodeChar;
    use unicode::UnicodeStrSlice;
    use unicode::Words; // if you want to use the words() method

NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.

closes #15224
[breaking-change]
@aturon
Copy link
Member

aturon commented Jul 8, 2014

@alexcrichton I'm in favor of this change; it fits the spirit of the facade very well, and it's a pretty natural division.

Neither of the downsides you mention worry me. In particular, it's not hard to construct an equivalent to today's libcore if you so desire.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 8, 2014

Oops, needed to modify tests.mk since libunicode tests live in the libcoretest crate.

@kwantam
Copy link
Contributor Author

kwantam commented Jul 8, 2014

On this. libcollections tests use some methods that are in UnicodeChar, and I guess they don't run on check-stage1? Fix should be quick, but I'll re-run locally to be sure that I didn't miss any.

Well this is odd. The test module that failed imports std::prelude::*, so it should certainly have the UnicodeChar trait imported. Perhaps this is a globbing issue? (Still compiling...)

Should be good to go now, I hope...

@kwantam
Copy link
Contributor Author

kwantam commented Jul 9, 2014

How annoying. The test case should have failed, because the PERLW table now includes that character (in accordance with UTS#18, and perl agrees), but I'm confused why that test didn't run when I ran tests locally before the last commit.

Anyway, updated the test to a character that is definitely not in the \w character class (verified against perl). Apologies for the waste of time and resources. I'll check in once tests finish on my end.

Aha! Found it. Combination of turning off ipv6 in the kernel (making libstd tests fail) and running check with -j4 (making the failure non-obvious in the output file, because other tests kept running afterward). Shouldn't be an issue in the future, one hopes.

- unicode tests live in coretest crate
- libcollections str tests need UnicodeChar trait.
- libregex perlw tests were checking a char in the Alphabetic category,
  \x2161. Confirmed perl 5.18 considers this a \w character. Changed to
  \x2961, which is not \w as the test expects.
@kwantam
Copy link
Contributor Author

kwantam commented Jul 9, 2014

@alexcrichton should be good to go now. Sorry for spinning the wheels a bit on the tests.

bors added a commit that referenced this pull request Jul 9, 2014
Add libunicode; move unicode functions from core

- created new crate, libunicode, below libstd
- split `Char` trait into `Char` (libcore) and `UnicodeChar` (libunicode)
  - Unicode-aware functions now live in libunicode
    - `is_alphabetic`, `is_XID_start`, `is_XID_continue`, `is_lowercase`,
      `is_uppercase`, `is_whitespace`, `is_alphanumeric`, `is_control`, `is_digit`,
      `to_uppercase`, `to_lowercase`
  - added `width` method in UnicodeChar trait
    - determines printed width of character in columns, or None if it is a non-NULL control character
    - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise)
- split `StrSlice` into `StrSlice` (libcore) and `UnicodeStrSlice` (libunicode)
  - functionality formerly in `StrSlice` that relied upon Unicode functionality from `Char` is now in `UnicodeStrSlice`
    - `words`, `is_whitespace`, `is_alphanumeric`, `trim`, `trim_left`, `trim_right`
  - also moved `Words` type alias into libunicode because `words` method is in `UnicodeStrSlice`
- unified Unicode tables from libcollections, libcore, and libregex into libunicode
- updated `unicode.py` in `src/etc` to generate aforementioned tables
- generated new tables based on latest Unicode data
- added `UnicodeChar` and `UnicodeStrSlice` traits to prelude
- libunicode is now the collection point for the `std::char` module, combining the libunicode functionality with the `Char` functionality from libcore
  - thus, moved doc comment for `char` from `core::char` to `unicode::char`
- libcollections remains the collection point for `std::str`

The Unicode-aware functions that previously lived in the `Char` and `StrSlice` traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and `use` the `UnicodeChar` and/or `UnicodeStrSlice` traits:

    extern crate unicode;
    use unicode::UnicodeChar;
    use unicode::UnicodeStrSlice;
    use unicode::Words; // if you want to use the words() method

NOTE: this does *not* impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude.

closes #15224
[breaking-change]
@bors bors closed this Jul 9, 2014
@bors bors merged commit 85e2bee into rust-lang:master Jul 9, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

adding wcwidth for char in libcore
10 participants