Skip to content

Conversation

@djudd
Copy link
Contributor

@djudd djudd commented Jul 21, 2022

This is a proposed feature, which seems to me to go naturally with the other case support in this crate (which I'm already using for the titlecasing - thanks!). I'm looking for feedback, with a few potential outcomes that make sense to me:

  • If you're interested in merging this ~as-is, I'll just update the documentation and anything else I've missed here.
  • If you're interested in the feature but have concerns about the increased table size, I can explore if I can find ways to compress that (e.g. one idea: titlecase is usually uppercase, folded is usually lowercase, so we could just store the exceptions).
  • If you'd rather not add the feature at all, that's fine of course and I can find another solution or just use my fork.

Copy link
Contributor

@wezm wezm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for considering the options for incorporating this. Overall I think it's a reasonable addition to the crate. Comparing the size of the generated table before and after this change I see 91.5KiB vs. 102.9KiB. A modest increase but at around 10% I think it's work exploring options to avoid it if consumers of the crate are not making use of the case-folding data.

On way to go about that is to generate a separate multi-level lookup table, similar to what is already done for the current mappings but a little simpler (like unicode-script).

Alternatively given there's only a bit under 1500 entires in the case folding table it's probably worth seeing if a simple match is compact enough too. We already support this in the bidi-mirroring-glyph sub-command. It looks like it would be quite straightforward to support that in case-folding-simple too.

src/tables.rs Outdated
];

#[allow(dead_code)]
pub const CASE_FOLDING_SIMPLE: &'static [(u32, u32)] = &[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this came from something like yeslogic-ucd-generate case-folding-simple /home/wmoore/Downloads/ucd-14.0, It would probably best if this data was in it's own file so the header about how it was generated is included.

@djudd
Copy link
Contributor Author

djudd commented Jul 21, 2022

I've tried out the match strategy: https://github.com/djudd/unicode-case-mapping/pull/new/case-folding-match

I'm not sure how you were measuring the size of the table directly, but it looks like this adds 53kb to the size of the rlib when I build in release mode locally, vs master. Maybe that's an improvement if this way we expect it to be pruned away (by the linker?) in cases where it's not used? Or should I look for other options?

@wezm
Copy link
Contributor

wezm commented Jul 22, 2022

I'm not sure how you were measuring the size of the table directly

I looked at the file generated by build.rs and multiplied the std::mem::size_of::<Row>() by the number of items in the table, before and after this change.

To test your match branch I added a main.rs and compared the size of a stripped release binary.

use std::env;

fn main() {
    for arg in env::args().skip(1) {
        for ch in arg.chars() {
            println!(
                "{}: {:?} {:?} {:?} {}",
                ch,
                unicode_case_mapping::to_lowercase(ch),
                unicode_case_mapping::to_uppercase(ch),
                unicode_case_mapping::to_titlecase(ch),
                unicode_case_mapping::case_folded(ch), // this was replaced with 0 on master/no case folding cases
            );
        }
    }
}
Branch Size No Case Folding Size Case Folding
master 418.5
case-folding 430.5 434.5
case-folding-match 418.5 474.5

This shows that on the case-folding-match branch that it behaves as expected. If you don't use case folding then you don't pay a size cost for it. It also shows that the multi-level table approach does seem to be more compact. If you can be bothered exploring that then that would be great. If not, that's understandable and I'm happy to proceed with the match version.

@djudd
Copy link
Contributor Author

djudd commented Jul 22, 2022

I don't think I understand the details of the table approach well enough to adapt it intelligently. I'd propose merging the match version and leaving a move to a multi-level table as a potential followup optimization.

(FWIW it seems like it might also be worth splitting the other cases up into their own tables by the same don't-pay-for-what-you-don't-use logic?)

Copy link
Contributor Author

@djudd djudd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like me to bump the version (to 0.4.0 I imagine) in this PR as well, or leave that for you to do separately?

/// Map the supplied character to its case-folded equivalent.
///
/// **Note:** A result of zero indicates the codepoint maps to itself.
pub fn case_folded(chr: char) -> u32 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about mapping to plain u32 instead of Option; it's likely less nice to use but it's consistent with other methods which seemed maybe more important?

@djudd djudd changed the title [wip] Add case folding Add case folding Jul 22, 2022
@wezm
Copy link
Contributor

wezm commented Jul 26, 2022

Thanks for the updates. No need to bump the version. I tweaked a couple of things and merged as edc8612. I ended up going with returning an Option, as you mentioned, as I think it makes more sense for this case (since it's already an Option).

@wezm wezm closed this Jul 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants