Skip to content

Add complex case mapping and title case mapping. #26039

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 9, 2015

Conversation

SimonSapin
Copy link
Contributor

  • Add “complex” mappings to char::to_lowercase and char::to_uppercase, making them yield sometimes more than on char: Implement full to_{upper,lower}case algorithms #25800. str::to_lowercase and str::to_uppercase are affected as well.
  • Add char::to_titlecase, since it’s the same algorithm (just different data). However this does not add str::to_titlecase, as that would require UAX#29 Unicode Text Segmentation which we decided not to include in of std: Rename or replace str::words to side-step the ambiguity of “a word”. rfcs#1054 I made char::to_titlecase immediately #[stable], since it’s so similar to char::to_uppercase that’s already stable. Let me know if it should be #[unstable] for a while.
  • Add a special case for upper-case Sigma in word-final position in str::to_lowercase: Word-final sigma in str::to_lowercase #26035. This is the only language-independent conditional mapping currently in SpecialCasing.txt.
  • Stabilize str::to_lowercase and str::to_uppercase. The &self -> String on str signature seems straightforward enough, and the only relevant issue I’ve found is Standardize on either 'upper_case' or 'uppercase' #24536 about naming. But char already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton

@rust-highfive
Copy link
Contributor

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

This adds 120 mappings:

Dž dž
Dž DŽ
Lj lj
Lj LJ
Nj nj
Nj NJ
Dz dz
Dz DZ
 Ι
ᾈ ᾀ
ᾉ ᾁ
ᾊ ᾂ
ᾋ ᾃ
ᾌ ᾄ
ᾍ ᾅ
ᾎ ᾆ
ᾏ ᾇ
ᾘ ᾐ
ᾙ ᾑ
ᾚ ᾒ
ᾛ ᾓ
ᾜ ᾔ
ᾝ ᾕ
ᾞ ᾖ
ᾟ ᾗ
ᾨ ᾠ
ᾩ ᾡ
ᾪ ᾢ
ᾫ ᾣ
ᾬ ᾤ
ᾭ ᾥ
ᾮ ᾦ
ᾯ ᾧ
ᾼ ᾳ
ῌ ῃ
ῼ ῳ
Ⅰ ⅰ
Ⅱ ⅱ
Ⅲ ⅲ
Ⅳ ⅳ
Ⅴ ⅴ
Ⅵ ⅵ
Ⅶ ⅶ
Ⅷ ⅷ
Ⅸ ⅸ
Ⅹ ⅹ
Ⅺ ⅺ
Ⅻ ⅻ
Ⅼ ⅼ
Ⅽ ⅽ
Ⅾ ⅾ
Ⅿ ⅿ
ⅰ Ⅰ
ⅱ Ⅱ
ⅲ Ⅲ
ⅳ Ⅳ
ⅴ Ⅴ
ⅵ Ⅵ
ⅶ Ⅶ
ⅷ Ⅷ
ⅸ Ⅸ
ⅹ Ⅹ
ⅺ Ⅺ
ⅻ Ⅻ
ⅼ Ⅼ
ⅽ Ⅽ
ⅾ Ⅾ
ⅿ Ⅿ
Ⓐ ⓐ
Ⓑ ⓑ
Ⓒ ⓒ
Ⓓ ⓓ
Ⓔ ⓔ
Ⓕ ⓕ
Ⓖ ⓖ
Ⓗ ⓗ
Ⓘ ⓘ
Ⓙ ⓙ
Ⓚ ⓚ
Ⓛ ⓛ
Ⓜ ⓜ
Ⓝ ⓝ
Ⓞ ⓞ
Ⓟ ⓟ
Ⓠ ⓠ
Ⓡ ⓡ
Ⓢ ⓢ
Ⓣ ⓣ
Ⓤ ⓤ
Ⓥ ⓥ
Ⓦ ⓦ
Ⓧ ⓧ
Ⓨ ⓨ
Ⓩ ⓩ
ⓐ Ⓐ
ⓑ Ⓑ
ⓒ Ⓒ
ⓓ Ⓓ
ⓔ Ⓔ
ⓕ Ⓕ
ⓖ Ⓖ
ⓗ Ⓗ
ⓘ Ⓘ
ⓙ Ⓙ
ⓚ Ⓚ
ⓛ Ⓛ
ⓜ Ⓜ
ⓝ Ⓝ
ⓞ Ⓞ
ⓟ Ⓟ
ⓠ Ⓠ
ⓡ Ⓡ
ⓢ Ⓢ
ⓣ Ⓣ
ⓤ Ⓤ
ⓥ Ⓥ
ⓦ Ⓦ
ⓧ Ⓧ
ⓨ Ⓨ
ⓩ Ⓩ
…5800

As a result, the iterator returned by `char::to_uppercase` sometimes
yields two or three `char`s instead of just one.
But not str::to_titlecase which would require UAX#29 Unicode Text Segmentation
which we decided not to include in of `std`:
rust-lang/rfcs#1054
@SimonSapin
Copy link
Contributor Author

Regarding the table sizes: there used to be two &'static [(char, char)] tables of 1100 and 1092 items, for uppercase and lowercase mapping respectively, for a total of 17 536 bytes. These are now &'static [(char, [char; 3])] with 1175 and 1092 entries, so 36 272 bytes. Adding another 1152 entries table for titlecase brings the total to 54 704 bytes.

It’s probably possible to trade off some CPU time in mapping execution to make the table smaller: most of the [char; 3] arrays are [_, 0, 0] (indicating a mapping to a single char), and the titlecase table has a large common subset with the uppercase one.

s.extend(self[..].chars().flat_map(|c| c.to_lowercase()));
for (i, c) in self[..].char_indices() {
if c == 'Σ' {
map_uppercase_sigma(self, i, &mut s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment for why this is being specially cased here?

@alexcrichton
Copy link
Member

I'm ok adding the extra table sizes for now, with --gc-sections it should get rid of unused ones anyway and we can always explore a more optimized representation later.

@alexcrichton
Copy link
Member

cc @rust-lang/libs

@@ -37,6 +37,7 @@ extern crate rustc_unicode;
mod binary_heap;
mod bit;
mod btree;
mod char; // char isn't really a collection, but didn't find a better place for this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mumble mumble collection of code points mumble mumble

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A char is one code point.

@SimonSapin
Copy link
Contributor Author

Updated.

@alexcrichton
Copy link
Member

@bors: r+ c57a412

Thanks @SimonSapin!

@bors
Copy link
Collaborator

bors commented Jun 9, 2015

⌛ Testing commit c57a412 with merge 1b36871...

@bors
Copy link
Collaborator

bors commented Jun 9, 2015

💔 Test failed - auto-win-gnu-64-nopt-t

@SimonSapin
Copy link
Contributor Author

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

That’s not because of my patch, is it?

@alexcrichton
Copy link
Member

@bors: retry

On Mon, Jun 8, 2015 at 10:57 PM, Simon Sapin notifications@github.com
wrote:

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

That’s not because of my patch, is it?


Reply to this email directly or view it on GitHub
#26039 (comment).

@bors
Copy link
Collaborator

bors commented Jun 9, 2015

⌛ Testing commit c57a412 with merge 30f6796...

@bors
Copy link
Collaborator

bors commented Jun 9, 2015

💔 Test failed - auto-mac-64-opt

@SimonSapin
Copy link
Contributor Author

Turns out char::to_{upp,low}ercase did have tests already. I hadn’t found them as I didn’t suspect libcoretest would depend on librustc_unicode.

This should fix the failure.

@SimonSapin
Copy link
Contributor Author

I’m getting an LLVM assertion error in make check locally :(

---- [run-pass] run-pass/sepcomp-lib-lto.rs stdout ----

error: compilation failed!
status: signal: 6
command: x86_64-unknown-linux-gnu/stage2/bin/rustc /home/simon/projects/rust/src/test/run-pass/sepcomp-lib-lto.rs -L x86_64-unknown-linux-gnu/test/run-pass/ --target=x86_64-unknown-linux-gnu -L x86_64-unknown-linux-gnu/test/run-pass/sepcomp-lib-lto.stage2-x86_64-unknown-linux-gnu.run-pass.libaux -o x86_64-unknown-linux-gnu/test/run-pass/sepcomp-lib-lto.stage2-x86_64-unknown-linux-gnu --cfg rtopt -O -L x86_64-unknown-linux-gnu/rt -C lto
stdout:
------------------------------------------

------------------------------------------
stderr:
------------------------------------------
rustc: ../../../../../../projects/rust/src/llvm/lib/CodeGen/LexicalScopes.cpp:179: llvm::LexicalScope* llvm::LexicalScopes::getOrCreateRegularScope(llvm::MDNode*): Assertion `DISubprogram(Scope).describes(MF->getFunction())' failed.

------------------------------------------

@alexcrichton
Copy link
Member

@bors: r+ 6369dcb

@bors
Copy link
Collaborator

bors commented Jun 9, 2015

⌛ Testing commit 6369dcb with merge 1fb2b19...

@bors
Copy link
Collaborator

bors commented Jun 9, 2015

💔 Test failed - auto-mac-64-nopt-t

@SimonSapin
Copy link
Contributor Author

command timed out: 1200 seconds without output, attempting to kill

@alexcrichton
Copy link
Member

@bors: retry

On Tue, Jun 9, 2015 at 11:26 AM, bors notifications@github.com wrote:

[image: 💔] Test failed - auto-mac-64-nopt-t
http://buildbot.rust-lang.org/builders/auto-mac-64-nopt-t/builds/5305


Reply to this email directly or view it on GitHub
#26039 (comment).

bors added a commit that referenced this pull request Jun 9, 2015
* Add “complex” mappings to `char::to_lowercase` and `char::to_uppercase`, making them yield sometimes more than on `char`: #25800. `str::to_lowercase` and `str::to_uppercase` are affected as well.
* Add `char::to_titlecase`, since it’s the same algorithm (just different data). However this does **not** add `str::to_titlecase`, as that would require UAX#29 Unicode Text Segmentation which we decided not to include in of `std`: rust-lang/rfcs#1054 I made `char::to_titlecase` immediately `#[stable]`, since it’s so similar to `char::to_uppercase` that’s already stable. Let me know if it should be `#[unstable]` for a while.
* Add a special case for upper-case Sigma in word-final position in `str::to_lowercase`: #26035. This is the only language-independent conditional mapping currently in `SpecialCasing.txt`.
* Stabilize `str::to_lowercase` and `str::to_uppercase`. The `&self -> String` on `str` signature seems straightforward enough, and the only relevant issue I’ve found is #24536 about naming. But `char` already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton
@bors
Copy link
Collaborator

bors commented Jun 9, 2015

⌛ Testing commit 6369dcb with merge f06e026...

@bors bors merged commit 6369dcb into rust-lang:master Jun 9, 2015
@SimonSapin SimonSapin deleted the case-mapping branch June 10, 2015 06:21
@brson brson added the relnotes Marks issues that should be documented in the release notes of the next release. label Jun 19, 2015
@brson
Copy link
Contributor

brson commented Jun 19, 2015

This will need to be mentioned in the 1.2 release notes and it's a breaking change.

@SimonSapin
Copy link
Contributor Author

Is it? I don’t really know how to interpret https://github.com/rust-lang/rfcs/blob/master/text/1105-api-evolution.md#behavioral-changes here, but the new behaviour in this PR is the one that has been documented as intended since before 1.0.

@brson
Copy link
Contributor

brson commented Jun 19, 2015

@SimonSapin It's allowed, but it changes behavior and so can break users' code (regardless of what the policy may consider 'breaking').

@SimonSapin
Copy link
Contributor Author

Ok. Agreed on mentioning in the release notes.

@brson brson added the regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. label Jun 19, 2015
@brson
Copy link
Contributor

brson commented Jun 19, 2015

I tagged this with stable-regression since it's a change in behavior and that's currently the best tag to track them, even though there is definitely a difference between intentional and unintentional regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. relnotes Marks issues that should be documented in the release notes of the next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants