Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate Transliterator VarTables across directions #3646

Open
skius opened this issue Jul 7, 2023 · 3 comments
Open

Deduplicate Transliterator VarTables across directions #3646

skius opened this issue Jul 7, 2023 · 3 comments
Labels
A-performance Area: Performance (CPU, Memory) C-transliterator Component: transliterator C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-enhancement Type: Nice-to-have but not required

Comments

@skius
Copy link
Member

skius commented Jul 7, 2023

Existing data struct draft: #3627

Transform rule source files can specify a direction attribute, and in the case of direction: "both", such source files A-B.xml define both forward (A-B) and backward (B-A) transliterators. Bidirectional rule files can have rules that affect only A-B, only B-A, or both A-B and B-A.

How should we create data for bidirectional sources?

  1. Compile A-B into a data struct and B-A into a data struct (duplicating the rules that affect both directions)
  2. Create a bidirectional data struct that tries to avoid duplicate data

Discussed with @skius, @sffc, @Manishearth, @eggrobin, @younies:

  • @robertbastian - (offline concern) I do not want to duplicate too much data, the VarTable in particular
  • @skius - Storing bidirectional rules in a single struct affects locality at runtime, as a lot of rules (the ones from B-A) will just be skipped over when transliterating from A-B.
  • @eggrobin - There are also validation issues when storing both directions in a single struct, as constructs on the source side might not be legal on the target side and vice versa.
  • @sffc - A future enhancement concerning data duplication is to split out the VarTables as a separate data key which the transliterator data structs (otherwise as in suggestion 1.) will refer to. This would deduplicate the VarTables if they turn out to be a big issue size-wise.
  • @younies - I am concerned about duplication in the individual rules, e.g., storing both a > b and a < b instead of a <> b.
  • @sffc - Using option 1. actually helps against duplication, as users who only want one direction will not have the rules only affecting the other direction.
  • @skius - Switches discussion to the data loading/API side. ICU supports specifying a direction when loading transliterators, i.e., load("A-B", Backwards) loads B-A. If we want to support this, should we just require the user to specify the necessary transliterators at datagen time? (eg, the user must specify the B-A transliterator)
  • @sffc - I don't think our API has to support backwards loading. If users want A-B backwards, they should just load B-A. At datagen time, users should specify explicitly the ID of the transliterators they want, e.g., B-A.
  • @Manishearth - (Reversing an ID might not be trivial)

Proposal: Data structs only store the forward direction, so bidirectional sources get compiled into two data structs (thus any shared data is duplicated). Datagen requires the user to specify the explicit transliterator including the direction (through ID syntax, not through “forward”/”backward” notation) they want, anything else (except transitive dependencies) does not get included. Future enhancement to avoid VarTable duplication is separating out the VarTable into a separate data key and referring to that from the transliterator data structs. Runtime also only accepts the direction implicitly through the bcp47 ID.

LGTM: @Manishearth @younies @eggrobin @sffc @skius

@robertbastian thoughts?

@skius skius self-assigned this Jul 7, 2023
@skius skius added the C-unicode Component: Props, sets, tries label Jul 7, 2023
@robertbastian
Copy link
Member

LGTM

@sffc sffc added this to the 1.3 Blocking ⟨P1⟩ milestone Jul 27, 2023
@skius skius mentioned this issue Aug 3, 2023
41 tasks
@sffc
Copy link
Member

sffc commented Aug 24, 2023

The low hanging optimization is done. We can still follow up on this part:

Future enhancement to avoid VarTable duplication is separating out the VarTable into a separate data key and referring to that from the transliterator data structs.

@sffc sffc added S-medium Size: Less than a week (larger bug fix or enhancement) T-enhancement Type: Nice-to-have but not required A-performance Area: Performance (CPU, Memory) labels Aug 24, 2023
@skius skius removed their assignment Aug 29, 2023
@skius skius changed the title Transliterator data structs for bidirectional source files Deduplicate Transliterator VarTables across directions Aug 29, 2023
@skius
Copy link
Member Author

skius commented Aug 29, 2023

Thought: Would be cool if we could use a VarTable that's specialized for a single direction if the transliterator for the other direction is not part of the user requested data.

@Manishearth Manishearth added the C-transliterator Component: transliterator label Sep 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-performance Area: Performance (CPU, Memory) C-transliterator Component: transliterator C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) T-enhancement Type: Nice-to-have but not required
Projects
None yet
Development

No branches or pull requests

4 participants