Localization Units Formatting #118

This is a complete braindump of my late night revelation that may be genius, crazy, foolish or any combination of those.

# Background

It started with the realization that the irk I have with the name of our group overlaps with the irk that Mihai expressed, but for different reasons. Mihai said "I think we may come up with something very different than MF 1.0, so naming it 2.0 is misleading and may implicitly steer us toward trying to salvage MF similarity for compatibility reasons, which may be a sunk cost fallacy" (paraphrase mine).

I reacted positively to that, because I recognize that there is a natural drift to "add to MF 1.0" just like I may have a drift to "bring Fluent to MF 2.0", and I think it may be limiting us in designing the optimal solution.

But as I dug deeper I realized that the concern I have is with the word "Message". The fact that we talk about formatting messages is already misaligned with how I think modern UI localization mental models should work.

For a simple textual app, you can have something like:

```cpp
printf("You have 5 new messages.\n");
```

and MessageFormat 1.0 contains the data model, syntax, logic and API to *internationalize* this line of code.

But UI paradigms are fundamentally different.

Let me give you an example:

# Example

![dialog-boxes-messagebox-default-button](https://user-images.githubusercontent.com/449986/94376678-d9128800-00d0-11eb-8858-398427744144.png)

What does it mean to localize it? What is the "message" and what do we mean by "formatting" it in such a context?

There's definitely going to be some formatting going on: there are 4 strings in this widget, and an icon, but what is the "message"?

Well, you can decompose this widget into four separate widgets (title, label, button-ok, button-cancel) and try to say "each one of those has a value and that value is a message!", and I believe that's the most common model of approaching it.

But it doesn't scale in so many ways:

1) If there's a relation between the message and buttons (see Welsh, where there is no generic yes/no and a button label has to depend on the question it answers), we lose it
2) If there's any meta information about the widget, or its localization, it is now decomposed into four independent messages
3) If there is any behavior between localization and the widget, we need to perform it four times, once per message
4) If there are any arguments that are required to localize this widget, we need to send them to four messages
5) If we'd want the UI toolkit to plug "localize" step before layout/paint, we need to write some code that formats those 4 messages and applies them onto that widget
6) Is the icon a fifth message? It may flip in RTL contexts, and icons may contain text or culturally specific graphics that may have to be part of the localization of this widget.
7) What if the button-ok, button-cancel, icon, label or the whole modal window have tooltips?
8) What if they have accesskeys?
9) What happens when there's any error in applying localization onto this widget? Are we falling back onto another locale? For one of two buttons? For label but not for buttons? How do we reconcile?
10) Is localization of the button synchronous or asynchronous? If there's fallback, which may require I/O for resources, is it synchronous or asynchronous? What does the binding function that applies those 4-5-10 messages onto the widget look like?
11) Can you retranslate this widget to a different locale during the UI lifetime, or do you have to recreate it in a different locale, remove the old one, and add the new one? If so, are you losing event bindings and state?
12) Can you cache the state of this element pre-localization and post-localization? Can you invalidate the cache of this widget if, while loading, you realize that the translation is obsolete?
13) What if the widget text is more complicated: a paragraph of text with images, stylistic annotations, or smart sentences like "Refresh the page every *5* minutes" where `5` is actually a numerical text input or a select dropdown, or a list of items where the structure and number of items should be controlled by the localizer? How do you handle that when you are merely formatting a single string and you don't have a notion that it is part of a UI that is a nested tree structure with attributes, events, text, icons and data?
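
To make the boilerplate in points 3 to 5 concrete, here is a minimal sketch of the decomposed approach. Everything in it is an assumption made for illustration: the message ids, the `format` helper, the `{name}` placeholder syntax and the shape of the dialog object are hypothetical and do not come from MessageFormat or any real toolkit.

```javascript
// Hypothetical message catalog: four independent, flat messages.
const messages = {
  "prompt-title": "Greeting",
  "prompt-label": "Hello, {userName}!",
  "prompt-ok": "Ok",
  "prompt-cancel": "Cancel",
};

// Hypothetical formatter: replaces {name} placeholders with argument values.
function format(id, args = {}) {
  const pattern = messages[id];
  if (pattern === undefined) throw new Error(`Missing message: ${id}`);
  return pattern.replace(/\{(\w+)\}/g, (_, name) => String(args[name]));
}

// The glue code a call site has to carry: one format call per string,
// the same arguments passed each time, and no shared error handling.
function localizePrompt(dialog, args) {
  dialog.title = format("prompt-title", args);
  dialog.label = format("prompt-label", args);
  dialog.okButton.label = format("prompt-ok", args);
  dialog.cancelButton.label = format("prompt-cancel", args);
  return dialog;
}

const dialog = localizePrompt(
  { okButton: {}, cancelButton: {} },
  { userName: "John" }
);
console.log(dialog.label); // Hello, John!
```

Nothing in this code tells the toolkit that the four strings belong together, so fallback, retranslation and caching would each have to be handled per string.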

# Two topics, that are intertwined but separate

I recognize that there are two topics here; my last question is from a slightly different category.

1) Do we want to support localization of UI elements/widgets which are usually much more compound than a single string
2) Do we want to support localization of messages that have semantic fragments inside them

I believe that the questions are related, because they relate to breaking with the idea that a message is a string and a UI is a list of messages.
In this model, UI is a tree (not list!) of compound widgets, each having multiple strings inside it, and each string may have its own UI fragment inside it.

Both of those issues are rooted in how UI is different from plain text, but we should imho treat those two questions separately and be open to having different solutions, or even considering one in scope, and another out of scope.

I'm bringing them up here because I want to challenge us with thinking about end-to-end localization of UI, and then you need to consider both.

# How to design it?

Designing that system is actually very tricky if you stick to thinking of the localization step of the UI toolkit as taking messages (strings), formatting them, and then applying them in the correct positions in the UI widget.
You need a lot of boilerplate code that has to be controlled either by the developer writing the code, by the widget code, or by the toolkit. In each case it is non-trivial, makes sync/async hard to handle, limits fallbacks and, I will argue, ...

*misses the point*.

# Localization Unit

Because you cannot localize compound, nested, rich User Interface widgets by formatting "messages".
You need a concept that is broader than a single string - something I started calling in my mind "Localization Unit".
Think of all the data needed to localize the above example:

```json
hello-prompt  = {
    "meta": {
      "role": "modal window",
      "description": "..."
    },
    "elements": {
         "label": ["Hello, ", Element("strong", [Argument("userName")]), "!"],
         "button-ok": {
           "label": "Ok", //  In Welsh `[Reference("self", "label"), "lorem ipsum"]`
           "accesskey": "O",
           "tooltip": "Click to accept"
         },
         "button-cancel": {
           "label": "Cancel",
           "accesskey": "C",
           "tooltip": "Click to reject"
         },
         "close-icon": {
           "tooltip": "Close the prompt"
         },
         "main-icon": {
           "url": "@icon-path",
           "aria-label": "Question mark icon"
         }
    }
}
```
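
The notation above is informal. As a rough sketch of how the same unit could be held as plain data, here is a JavaScript version. The `Argument` and `Element` helpers, the `type` tags and the `collectArguments` function are assumptions made for this example, not an existing API.

```javascript
// Hypothetical helpers that build tagged pattern parts.
const Argument = (name) => ({ type: "argument", name });
const Element = (tag, children) => ({ type: "element", tag, children });

// The unit from the example above, as a plain nested object.
const helloPrompt = {
  id: "hello-prompt",
  meta: { role: "modal window", description: "..." },
  elements: {
    "label": ["Hello, ", Element("strong", [Argument("userName")]), "!"],
    "button-ok": { label: "Ok", accesskey: "O", tooltip: "Click to accept" },
    "button-cancel": { label: "Cancel", accesskey: "C", tooltip: "Click to reject" },
    "close-icon": { tooltip: "Close the prompt" },
    "main-icon": { url: "@icon-path", "aria-label": "Question mark icon" },
  },
};

// Walk the whole unit and collect the names of the arguments it needs.
function collectArguments(node, found = new Set()) {
  if (Array.isArray(node)) {
    node.forEach((part) => collectArguments(part, found));
  } else if (node && typeof node === "object") {
    if (node.type === "argument") found.add(node.name);
    else Object.values(node).forEach((value) => collectArguments(value, found));
  }
  return found;
}

console.log([...collectArguments(helloPrompt)]); // [ 'userName' ]
```

Because the unit is one tree, a single walk answers questions such as "which arguments does this widget need?" for the whole widget rather than per string.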

And once you have it, you can do the most natural thing: you can bind such a UI element to a corresponding localization unit.

```
<prompt
  l10n-id="hello-prompt"
  l10n-args="{userName: 'John'}"
>
```

or:

```js
prompt.l10n.id = "hello-prompt";
prompt.l10n.args.set("userName", "John");
```

Such binding is declarative, just like applying a CSS class onto an element is, and it allows the engine to understand that before layout and painting steps for this element some resources need to be retrieved, their Localization Units must be resolved and the combination of the element and its localization unit is what gets laid out and painted.
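
A minimal sketch of what the toolkit side of such a binding might do, assuming a hypothetical `units` registry, a hypothetical `localizeBeforePaint` hook, and patterns simplified to strings with `{name}` placeholders. None of these names come from a real toolkit; the element here is a plain object standing in for a widget.

```javascript
// Hypothetical registry of localization units, keyed by id.
const units = new Map([
  ["hello-prompt", {
    "label": "Hello, {userName}!",
    "button-ok": { label: "Ok", accesskey: "O" },
    "button-cancel": { label: "Cancel", accesskey: "C" },
  }],
]);

// Recursively resolve every string in a unit against the arguments.
function resolve(node, args) {
  if (typeof node === "string") {
    return node.replace(/\{(\w+)\}/g, (_, name) => String(args[name]));
  }
  return Object.fromEntries(
    Object.entries(node).map(([key, value]) => [key, resolve(value, args)])
  );
}

// Hypothetical hook the toolkit would run before layout and paint.
// The element only carries the annotation; the toolkit owns resolution
// and can re-run it whenever the id, the arguments or the locale change.
function localizeBeforePaint(element) {
  const unit = units.get(element.l10n.id);
  if (!unit) throw new Error(`Unknown localization unit: ${element.l10n.id}`);
  element.localized = resolve(unit, Object.fromEntries(element.l10n.args));
  return element;
}

const promptElement = { l10n: { id: "hello-prompt", args: new Map() } };
promptElement.l10n.args.set("userName", "John");
localizeBeforePaint(promptElement);
console.log(promptElement.localized.label); // Hello, John!

// Changing an argument only requires re-running the hook.
promptElement.l10n.args.set("userName", "Jane");
localizeBeforePaint(promptElement);
console.log(promptElement.localized.label); // Hello, Jane!
```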

This model has a huge number of benefits:
* Localization Unit may be nested
* it may have multiple messages and icons and other data inside it
* its shape corresponds to the shape of the UI element it is meant to localize
* it has meta data associated with the unit
* it has a fallback that is reasonable and operates per-widget
* the toolkit can apply the localization unit, reapply it, remove it, and modify when it needs to because all the information needed to localize the element is in the annotation for that element
* tooling understands that this unit is a compound structure that doesn't pretend to be "flat" and "printf with params"
* CAT tools can reason about what the element looks like, and even pull such an element and apply the translation onto it WYSIWYG as the localizer is translating it.
* data and contexts are relevant to the widget, rather than pretending that a tooltip of `button-ok` is a standalone message
* the developer writes code synchronously, just annotating the UI, which is a synchronous operation, and lets the UI toolkit react to changes in the animation frame cycle
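
To illustrate the per-widget fallback point, here is one possible policy, sketched under assumptions: the `resources` table, the `requiredKeys` list and the `resolveUnit` function are hypothetical, and "use the first locale that has the complete unit" is just one choice among several.

```javascript
// Hypothetical per-locale resources. The French unit is incomplete:
// it lacks the "button-cancel" entry.
const resources = {
  fr: {
    "hello-prompt": {
      "label": "Bonjour !",
      "button-ok": { label: "OK" },
    },
  },
  en: {
    "hello-prompt": {
      "label": "Hello!",
      "button-ok": { label: "Ok" },
      "button-cancel": { label: "Cancel" },
    },
  },
};

// Parts the widget needs in order to be considered fully localized.
const requiredKeys = ["label", "button-ok", "button-cancel"];

// Pick the first locale in the chain that provides the complete unit,
// so the widget never ends up with a mix of languages.
function resolveUnit(id, localeChain) {
  for (const locale of localeChain) {
    const unit = (resources[locale] || {})[id];
    if (unit && requiredKeys.every((key) => key in unit)) {
      return { locale, unit };
    }
  }
  throw new Error(`No complete localization unit for ${id}`);
}

const result = resolveUnit("hello-prompt", ["fr", "en"]);
console.log(result.locale); // en
console.log(result.unit["button-cancel"].label); // Cancel
```

With flat messages the same situation would typically produce a French label and "OK" button next to an English "Cancel" button.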

# LocalizationUnitFormatter

In ICU we actually already have a notion of such intermediate representation of data - `FormattedX`. For example, `DateTimeFormatter` produces `FormattedDateTime` which has a lot of information allowing users to introspect, operate and maybe even manipulate formatted data. The user can also just `toString()` it to get the result.

What if we had `LocalizationFormatter` which has a `format` method that returns `FormattedLocalizationUnit` which has all the information needed for a UI toolkit to combine it with `Label`, `MenuItem` or `Button` or any other widget and produce a `LocalizedElement` or `LocalizedWidget` that will be then laid out and painted?

And for the imperative case, we could still have `toString` which would take the value of the `LocalizationUnit` if it has one, and just print it as a string for the familiar `printf` experience.
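
A rough sketch of how that could look, to make the idea easier to discuss. The class shapes, the `value`/`attributes` split and the `{name}` placeholder syntax are all assumptions for this example; this is not an ICU API.

```javascript
// Hypothetical result type, loosely following the FormattedX idea:
// structured data for a toolkit, plus toString() for the simple case.
class FormattedLocalizationUnit {
  constructor(value, attributes) {
    this.value = value;           // main text of the unit, if it has one
    this.attributes = attributes; // e.g. accesskey, tooltip
  }
  toString() {
    return this.value === undefined ? "" : this.value;
  }
}

// Hypothetical formatter over a table of units.
class LocalizationFormatter {
  constructor(units) {
    this.units = units;
  }
  format(id, args = {}) {
    const unit = this.units[id];
    if (!unit) throw new Error(`Unknown localization unit: ${id}`);
    const fill = (text) =>
      text.replace(/\{(\w+)\}/g, (_, name) => String(args[name]));
    const attributes = Object.fromEntries(
      Object.entries(unit.attributes || {}).map(([key, text]) => [key, fill(text)])
    );
    const value = unit.value === undefined ? undefined : fill(unit.value);
    return new FormattedLocalizationUnit(value, attributes);
  }
}

const formatter = new LocalizationFormatter({
  "button-ok": {
    value: "Ok",
    attributes: { accesskey: "O", tooltip: "Click to accept" },
  },
  "greeting": { value: "Hello, {userName}!" },
});

// A toolkit can read the structured result...
const ok = formatter.format("button-ok");
console.log(ok.attributes.tooltip); // Click to accept

// ...while imperative code can still just stringify it.
console.log(String(formatter.format("greeting", { userName: "John" }))); // Hello, John!
```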

# What's in scope?

I don't know yet. It's kind of a fresh realization and I'm not sure if my recommendation for the group is to:

a) Consider `Localization Unit` in scope as a level above MessageFormatter.
b) Consider `Localization Unit` out of scope, but the right paradigm for UI localization and therefore work on having MessageFormat 2.0 be a good lower level API for it
c) Consider `Localization Unit` one of many paradigms for UI localization and not tie our work to it
d) Consider `Localization Unit` a bad paradigm and design a better one

# Why am I raising it?

The reason I think it is important is that we need to decide early on whether our target is:

```
printf("Hello, { $user }");
Label.textContent = format("Hello, { $user } ");
```

and we are ok thinking of the receiving end as flat textual strings, or whether we want to embrace the fact that this is not how UI localization works today.
That `Label` may have multiple attributes, and icons, and other values, and each one may be a nested structure of data, and localization may bring its own UI fragments that need to be overlapped with source fragments.
That the function in which you call `printf` is not the right place to synchronously annotate the UI with a string, because then the toolkit doesn't know that the UI is localized, cannot retranslate, cannot cache, cannot invalidate that cache, and cannot have responsive localization.

I think that decisions around it will have deep consequences for our thinking about many items on our wishlist (#3):
* what is the data model of the single Message (#26)
* how/if we want to allow inter-message references, inter-message data, inter-message meta-data (#98)
* do we want to allow association of non-text with our Messages (or Localization Unit containing Messages and NonMessages)
* how does fallback work in a UI setup? Is it performed synchronously at the callsite? Is every callsite asynchronous? Do we need to resolve all messages before we begin applying translations? (#45)
* Can we have responsive localization (localization that reacts to locale changes, or argument changes) (#65)

I wrote a separate comment on Raph's new UI toolkit paradigm over the last day of wrangling with this concept. If you're interested in a more tangible application of how it may look, consider reading https://github.com/raphlinus/crochet/issues/7
