-
Notifications
You must be signed in to change notification settings - Fork 5
Add case folding #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
wezm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for considering the options for incorporating this. Overall I think it's a reasonable addition to the crate. Comparing the size of the generated table before and after this change I see 91.5KiB vs. 102.9KiB. A modest increase but at around 10% I think it's work exploring options to avoid it if consumers of the crate are not making use of the case-folding data.
On way to go about that is to generate a separate multi-level lookup table, similar to what is already done for the current mappings but a little simpler (like unicode-script).
Alternatively given there's only a bit under 1500 entires in the case folding table it's probably worth seeing if a simple match is compact enough too. We already support this in the bidi-mirroring-glyph sub-command. It looks like it would be quite straightforward to support that in case-folding-simple too.
src/tables.rs
Outdated
| ]; | ||
|
|
||
| #[allow(dead_code)] | ||
| pub const CASE_FOLDING_SIMPLE: &'static [(u32, u32)] = &[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this came from something like yeslogic-ucd-generate case-folding-simple /home/wmoore/Downloads/ucd-14.0, It would probably best if this data was in it's own file so the header about how it was generated is included.
|
I've tried out the match strategy: https://github.com/djudd/unicode-case-mapping/pull/new/case-folding-match I'm not sure how you were measuring the size of the table directly, but it looks like this adds 53kb to the size of the rlib when I build in release mode locally, vs master. Maybe that's an improvement if this way we expect it to be pruned away (by the linker?) in cases where it's not used? Or should I look for other options? |
I looked at the file generated by To test your use std::env;
fn main() {
for arg in env::args().skip(1) {
for ch in arg.chars() {
println!(
"{}: {:?} {:?} {:?} {}",
ch,
unicode_case_mapping::to_lowercase(ch),
unicode_case_mapping::to_uppercase(ch),
unicode_case_mapping::to_titlecase(ch),
unicode_case_mapping::case_folded(ch), // this was replaced with 0 on master/no case folding cases
);
}
}
}
This shows that on the |
|
I don't think I understand the details of the table approach well enough to adapt it intelligently. I'd propose merging the match version and leaving a move to a multi-level table as a potential followup optimization. (FWIW it seems like it might also be worth splitting the other cases up into their own tables by the same don't-pay-for-what-you-don't-use logic?) |
djudd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like me to bump the version (to 0.4.0 I imagine) in this PR as well, or leave that for you to do separately?
| /// Map the supplied character to its case-folded equivalent. | ||
| /// | ||
| /// **Note:** A result of zero indicates the codepoint maps to itself. | ||
| pub fn case_folded(chr: char) -> u32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about mapping to plain u32 instead of Option; it's likely less nice to use but it's consistent with other methods which seemed maybe more important?
|
Thanks for the updates. No need to bump the version. I tweaked a couple of things and merged as edc8612. I ended up going with returning an |
This is a proposed feature, which seems to me to go naturally with the other case support in this crate (which I'm already using for the titlecasing - thanks!). I'm looking for feedback, with a few potential outcomes that make sense to me: