Implement hardcoded ICU transliterators #3910

Open
skius opened this issue Aug 22, 2023 · 7 comments
Labels
C-transliterator Component: transliterator

Comments

@skius
Member

skius commented Aug 22, 2023

For feature parity with ICU, we need the transliterators that ICU defines in code rather than through rule sources. A good (maybe even complete) starting point is this directory: https://github.com/unicode-org/icu/tree/main/icu4j/main/classes/translit/src/com/ibm/icu/text

For example, EscapeTransliterator.java is responsible for the many Any-Hex variants that exist.

Some transliterators, such as Any-NFC, have related components in ICU4X, so they should be implemented by reusing those ICU4X components and their data.

Users can create these transliterators using BCP-47 IDs that are defined in #3909.

@skius skius added the C-unicode Component: Props, sets, tries label Aug 22, 2023
@skius skius self-assigned this Aug 22, 2023
@robertbastian
Member

Does Any-Hex exist as a rule file as well? I.e. is implementing it in code merely a performance optimisation?

@skius
Member Author

skius commented Aug 24, 2023

> Does Any-Hex exist as a rule file as well?

Not in the usual place, so if it did, I wouldn't know where.

> I.e. is implementing it in code merely a performance optimisation?

Yes. Code-based transliterators exist only for performance and to save human implementation time, since transform rules can express arbitrary transforms.

@skius
Member Author

skius commented Aug 24, 2023

(In the specific case of Any-Hex, it should even be fairly simple to generate rule files for its variants. I'm not sure whether this also applies to NFC, etc.)

@skius
Member Author

skius commented Aug 30, 2023

There are open PRs (#3946, #3965) that add support for many such transliterators:

  • Any-Hex/{many variants}: custom code-based implementations
  • Any-{NFC, NFD, NFKC, NFKD}: implementations built on an existing ICU4X component (icu_normalizer)
  • Any-Remove/Any-Null: trivial implementations

These make most of the CLDR data usable and can serve as examples for implementing the remainder. Notably still missing for full CLDR support:

  • Any-{Upper, Lower, Title}: can probably use icu_casemap
  • Any-BreakInternal: some legacy thing, likely a mix of code-based and component-based

ICU supports more than those. See the ICU4J directory for a full list.
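
To illustrate why Any-Null and Any-Remove count as trivial, here is a minimal sketch. It is not the code from the PRs: the `Trivial` enum and `apply` function are hypothetical names, and filters and partial ranges are ignored. It only relies on the ICU semantics that Any-Null leaves text unchanged and Any-Remove deletes it.

```rust
/// Hypothetical stand-in for the two trivial transliterators.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Trivial {
    /// Any-Null: the identity transform.
    Null,
    /// Any-Remove: deletes all of the input text.
    Remove,
}

/// Applies a trivial transliterator to the whole input string.
/// Assumption: no filter is attached, so every character is in scope.
fn apply(kind: Trivial, input: &str) -> String {
    match kind {
        Trivial::Null => input.to_owned(),
        Trivial::Remove => String::new(),
    }
}
```

For example, `apply(Trivial::Null, "abc")` returns the input as-is, while `apply(Trivial::Remove, "abc")` returns an empty string.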

@skius
Member Author

skius commented Aug 31, 2023

There are a few rule-defined Upper/Lower/Title transliterators for language-specific case mapping (e.g., Turkish). Our components already support these in code, so we can use hardcoded transliterators instead of the rule definitions.
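
As a rough illustration of what "language-specific" means here, the sketch below lowercases text with and without the two best-known Turkish mappings. It is only a sketch under stated assumptions: it uses the Rust standard library instead of icu_casemap, the `lower` function and its `turkic` flag are hypothetical, and context-sensitive rules (a combining dot above after `I`, final sigma) are deliberately ignored.

```rust
/// Lowercases `input` one code point at a time.
/// `turkic` is a hypothetical flag that enables the dotted/dotless i mappings.
fn lower(input: &str, turkic: bool) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match (turkic, ch) {
            // Turkish: LATIN CAPITAL LETTER I maps to dotless i (U+0131).
            (true, 'I') => out.push('\u{0131}'),
            // Turkish: CAPITAL LETTER I WITH DOT ABOVE (U+0130) maps to plain i.
            (true, '\u{0130}') => out.push('i'),
            // Everything else: the default, locale-independent mapping from std.
            _ => out.extend(ch.to_lowercase()),
        }
    }
    out
}
```

With this sketch, `lower("ISTANBUL", false)` gives `istanbul`, whereas `lower("ISTANBUL", true)` starts with the dotless `ı`. A real hardcoded transliterator would delegate to the case-mapping component rather than special-casing characters like this.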

@skius skius removed their assignment Sep 1, 2023
@sffc sffc added the C-transliterator Component: transliterator label Oct 5, 2023
@sffc sffc added this to the Backlog ⟨P4⟩ milestone Oct 5, 2023
@sffc sffc removed the C-unicode Component: Props, sets, tries label Oct 5, 2023
@sffc
Member

sffc commented Aug 30, 2024

Is it correct that Lower was not yet implemented?

@skius
Member Author

skius commented Sep 1, 2024

> Is it correct that Lower was not yet implemented?

Correct! IIRC there are no dangling implementations; everything that exists should be linked in load_special:

fn load_special<P>(
    special: &str,
    normalizer_provider: &P,
) -> Result<InternalTransliterator, DataError>
where
    P: DataProvider<CanonicalDecompositionDataV1Marker>
        + DataProvider<CompatibilityDecompositionSupplementV1Marker>
        + DataProvider<CanonicalDecompositionTablesV1Marker>
        + DataProvider<CompatibilityDecompositionTablesV1Marker>
        + DataProvider<CanonicalCompositionsV1Marker>
        + ?Sized,
{
    // TODO(#3909, #3910): add more
    match special {
        "any-nfc" => Ok(InternalTransliterator::Composing(
            ComposingTransliterator::try_nfc(normalizer_provider)?,
        )),
        "any-nfkc" => Ok(InternalTransliterator::Composing(
            ComposingTransliterator::try_nfkc(normalizer_provider)?,
        )),
        "any-nfd" => Ok(InternalTransliterator::Decomposing(
            DecomposingTransliterator::try_nfd(normalizer_provider)?,
        )),
        "any-nfkd" => Ok(InternalTransliterator::Decomposing(
            DecomposingTransliterator::try_nfkd(normalizer_provider)?,
        )),
        "any-null" => Ok(InternalTransliterator::Null),
        "any-remove" => Ok(InternalTransliterator::Remove),
        "any-hex/unicode" => Ok(InternalTransliterator::Hex(
            hardcoded::HexTransliterator::new("U+", "", 4, Case::Upper),
        )),
        "any-hex/rust" => Ok(InternalTransliterator::Hex(
            hardcoded::HexTransliterator::new("\\u{", "}", 2, Case::Lower),
        )),
        "any-hex/xml" => Ok(InternalTransliterator::Hex(
            hardcoded::HexTransliterator::new("&#x", ";", 1, Case::Upper),
        )),
        "any-hex/perl" => Ok(InternalTransliterator::Hex(
            hardcoded::HexTransliterator::new("\\x{", "}", 1, Case::Upper),
        )),
        "any-hex/plain" => Ok(InternalTransliterator::Hex(
            hardcoded::HexTransliterator::new("", "", 4, Case::Upper),
        )),
        s => Err(DataError::custom("unavailable transliterator").with_debug_context(s)),
    }
}
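
The `any-hex/*` arms differ only in four arguments: a prefix, a suffix, a minimum digit count, and the digit case. The following self-contained sketch shows how such a parameterised escaper could behave. It is not the actual `hardcoded::HexTransliterator`: the `HexEscaper` type, its fields, and the local `Case` enum are hypothetical, and it assumes every code point is escaped, with no filter applied.

```rust
/// Letter case of the hexadecimal digits (local stand-in for the `Case` argument).
#[derive(Clone, Copy)]
enum Case {
    Upper,
    Lower,
}

/// Hypothetical escaper configured like the `HexTransliterator::new` calls.
struct HexEscaper {
    prefix: &'static str,
    suffix: &'static str,
    min_length: usize,
    case: Case,
}

impl HexEscaper {
    fn new(prefix: &'static str, suffix: &'static str, min_length: usize, case: Case) -> Self {
        Self { prefix, suffix, min_length, case }
    }

    /// Replaces each code point with prefix + zero-padded hex digits + suffix.
    fn transliterate(&self, input: &str) -> String {
        let mut out = String::new();
        for ch in input.chars() {
            let cp = ch as u32;
            let width = self.min_length;
            // Pad with leading zeros up to `width`; longer values are not truncated.
            let digits = match self.case {
                Case::Upper => format!("{:0width$X}", cp, width = width),
                Case::Lower => format!("{:0width$x}", cp, width = width),
            };
            out.push_str(self.prefix);
            out.push_str(&digits);
            out.push_str(self.suffix);
        }
        out
    }
}
```

Under these assumptions, `HexEscaper::new("U+", "", 4, Case::Upper).transliterate("a")` yields `U+0061`, and the XML-style configuration `("&#x", ";", 1, Case::Upper)` turns `A` into `&#x41;`.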
