Implement hardcoded ICU transliterators #3910

skius · 2023-08-22T20:08:04Z

For feature parity with ICU, we need some transliterators that ICU defines in code rather than through rule sources. A good (maybe even complete) starting point is this directory: https://github.com/unicode-org/icu/tree/main/icu4j/main/classes/translit/src/com/ibm/icu/text

For example, EscapeTransliterator.java is responsible for the many Any-Hex variants that exist.

Some transliterators also have related components in ICU4X, like Any-NFC, so those should be implemented by reusing the ICU4X components and data.

Users can create these transliterators using BCP-47 IDs that are defined in #3909.


robertbastian · 2023-08-24T08:44:53Z

Does Any-Hex exist as a rule file as well? I.e. is implementing it in code merely a performance optimisation?

skius · 2023-08-24T10:51:24Z

> Does Any-Hex exist as a rule file as well?

Not in the usual place, so if it did, I wouldn't know where.

> I.e. is implementing it in code merely a performance optimisation?

All code-based transliterators exist purely for performance and to save human implementation time, since transform rules can implement arbitrary transforms.

skius · 2023-08-24T10:55:05Z

(In the specific case of Any-Hex, it should even be fairly simple to generate rule files for it. I'm not sure if this also applies to NFC, etc.)

skius · 2023-08-30T00:01:53Z

There are open PRs (#3946, #3965) that add support for many such transliterators:

Any-Hex/{many variants} - custom code-based implementations
Any-{NFC, NFD, NFKC, NFKD} - existing ICU4X component-based implementations (based on icu_normalizer)
Any-Remove/Any-Null - trivial implementations
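
To illustrate why Any-Null and Any-Remove count as trivial, here is a minimal sketch. The `Transliterate` trait and the `Null` and `Remove` types are hypothetical and are not the ICU4X API; filters and ranges are ignored.

```rust
/// Hypothetical, simplified interface; not the ICU4X trait.
pub trait Transliterate {
    fn transliterate(&self, input: &str) -> String;
}

/// Sketch of Any-Null: returns the input unchanged.
pub struct Null;

/// Sketch of Any-Remove: deletes everything it is applied to.
pub struct Remove;

impl Transliterate for Null {
    fn transliterate(&self, input: &str) -> String {
        input.to_string()
    }
}

impl Transliterate for Remove {
    fn transliterate(&self, _input: &str) -> String {
        String::new()
    }
}
```

In practice these are mostly useful in combination with a filter, e.g. removing only the characters in a given set.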

These make most of CLDR data usable, and can serve as examples for implementing the remainder. Notably still missing for full CLDR support:

Any-{Upper, Lower, Title} - can probably use icu_casemap
Any-BreakInternal - some legacy thing, likely a mix of code based and component based

ICU supports more than those. See the ICU4J directory for a full list.

skius · 2023-08-31T12:40:36Z

There are a few rule-defined Upper/Lower/Title transliterators for language-specific casemapping (e.g., Turkish). Our components support these in code, so we don't have to use the rule definitions and can instead use hardcoded transliterators.
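
As a rough illustration of what "language-specific" means here, the sketch below hand-rolls the two best-known Turkish lowercasing exceptions using only the standard library. The function name `lower_turkish` is made up for this example, it does not use icu_casemap, and it ignores further special cases (such as `I` followed by a combining dot above).

```rust
/// Hypothetical helper: lowercases `input` with Turkish dotted/dotless i handling.
/// Everything else falls back to the locale-independent std mapping.
pub fn lower_turkish(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            // LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
            'I' => out.push('\u{0131}'),
            // LATIN CAPITAL LETTER I WITH DOT ABOVE -> LATIN SMALL LETTER I
            '\u{0130}' => out.push('i'),
            // Default, locale-independent lowercasing
            other => out.extend(other.to_lowercase()),
        }
    }
    out
}
```

A hardcoded transliterator would delegate this to the casemapping component instead of encoding the exceptions as rules or by hand.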

sffc · 2024-08-30T18:07:07Z

Is it correct that Lower was not yet implemented?

skius · 2024-09-01T00:06:18Z

> Is it correct that Lower was not yet implemented?

Correct! IIRC there are no dangling implementations; everything should be linked in load_special:

icu4x/components/experimental/src/transliterate/transliterator/mod.rs

Lines 341 to 386 in 6b5a69c

    fn load_special<P>(
        special: &str,
        normalizer_provider: &P,
    ) -> Result<InternalTransliterator, DataError>
    where
        P: DataProvider<CanonicalDecompositionDataV1Marker>
            + DataProvider<CompatibilityDecompositionSupplementV1Marker>
            + DataProvider<CanonicalDecompositionTablesV1Marker>
            + DataProvider<CompatibilityDecompositionTablesV1Marker>
            + DataProvider<CanonicalCompositionsV1Marker>
            + ?Sized,
    {
        // TODO(#3909, #3910): add more
        match special {
            "any-nfc" => Ok(InternalTransliterator::Composing(
                ComposingTransliterator::try_nfc(normalizer_provider)?,
            )),
            "any-nfkc" => Ok(InternalTransliterator::Composing(
                ComposingTransliterator::try_nfkc(normalizer_provider)?,
            )),
            "any-nfd" => Ok(InternalTransliterator::Decomposing(
                DecomposingTransliterator::try_nfd(normalizer_provider)?,
            )),
            "any-nfkd" => Ok(InternalTransliterator::Decomposing(
                DecomposingTransliterator::try_nfkd(normalizer_provider)?,
            )),
            "any-null" => Ok(InternalTransliterator::Null),
            "any-remove" => Ok(InternalTransliterator::Remove),
            "any-hex/unicode" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("U+", "", 4, Case::Upper),
            )),
            "any-hex/rust" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("\\u{", "}", 2, Case::Lower),
            )),
            "any-hex/xml" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("&#x", ";", 1, Case::Upper),
            )),
            "any-hex/perl" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("\\x{", "}", 1, Case::Upper),
            )),
            "any-hex/plain" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("", "", 4, Case::Upper),
            )),
            s => Err(DataError::custom("unavailable transliterator").with_debug_context(s)),
        }
    }
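
For readers unfamiliar with the Hex arms in that match, here is a minimal, self-contained sketch of what such a code-based escaper could look like. It assumes the four constructor arguments mean prefix, suffix, minimum digit count, and digit case; the names `HexEscaper` and `HexCase` are hypothetical and this is not the actual `hardcoded::HexTransliterator`.

```rust
/// Letter case of the emitted hex digits (hypothetical name).
#[derive(Clone, Copy)]
pub enum HexCase {
    Upper,
    Lower,
}

/// Hypothetical stand-in for a code-based Any-Hex variant.
pub struct HexEscaper {
    prefix: &'static str,
    suffix: &'static str,
    min_length: usize,
    case: HexCase,
}

impl HexEscaper {
    pub fn new(
        prefix: &'static str,
        suffix: &'static str,
        min_length: usize,
        case: HexCase,
    ) -> Self {
        Self { prefix, suffix, min_length, case }
    }

    /// Replaces every code point of `input` with prefix + hex digits + suffix.
    pub fn transliterate(&self, input: &str) -> String {
        let mut out = String::new();
        for c in input.chars() {
            let cp = c as u32;
            let width = self.min_length;
            // Zero-pad the code point to at least `min_length` hex digits.
            let digits = match self.case {
                HexCase::Upper => format!("{:0width$X}", cp, width = width),
                HexCase::Lower => format!("{:0width$x}", cp, width = width),
            };
            out.push_str(self.prefix);
            out.push_str(&digits);
            out.push_str(self.suffix);
        }
        out
    }
}
```

Under these assumptions, `HexEscaper::new("U+", "", 4, HexCase::Upper).transliterate("a")` yields `U+0061`. A real implementation would additionally need to respect filters and the range being transliterated.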

skius added the C-unicode Component: Props, sets, tries label Aug 22, 2023

skius self-assigned this Aug 22, 2023

This was referenced Aug 22, 2023

Invent BCP47 IDs for hardcoded transliterators #3909

Open

Checklist for Transliteration #3736

Closed

This was referenced Aug 29, 2023

Stabilize Transliterators #3961

Open

Add basic hardcoded Any-Hex transliterators #3965

Merged

skius removed their assignment Sep 1, 2023

sffc added the C-transliterator Component: transliterator label Oct 5, 2023

sffc added this to the Backlog ⟨P4⟩ milestone Oct 5, 2023

sffc removed the C-unicode Component: Props, sets, tries label Oct 5, 2023

sffc mentioned this issue Aug 30, 2024

Add test for custom transliterator using Latin-ASCII and Lower #5469

Merged
