MF2.0 compromise syntax #266

Closed
wants to merge 4 commits into from
256 changes: 256 additions & 0 deletions spec/compromise-syntax.md
@@ -0,0 +1,256 @@
# MF2.0 compromise syntax

# Intro

This syntax builds on the one from https://github.com/unicode-org/message-format-wg/pull/230
but is modified to address
[@markusicu’s comments there](https://github.com/unicode-org/message-format-wg/pull/230#issuecomment-1116903103).

# Basic syntax

Messages need to delineate between literal text, placeholders, and other “code”.
We should start in “code mode” and always enclose “patterns” (text+placeholders) in curly braces.
```
{Hello world!}
{Hello {$name}!}
```

This is unusual for formatting syntaxes, but useful.
We need to support selecting among multiple patterns anyway,
and delimiting the patterns makes it unambiguous
which white space is part of the pattern and which serves to delimit “code” tokens.
For consistency, we should always enclose a pattern,
even if the message consists only of that pattern.
That also helps with embedding messages in various resource file formats,
because they can freely trim surrounding white space without
requiring escapes when a message pattern wants to start or end with spaces.

By contrast, consider the experience with the existing ICU MessageFormat syntax
which does start in “text mode”.
ICU MessageFormat has pioneered the selection among multiple patterns based on run-time arguments.
It represents selection using complex placeholders,
which has the side effect of allowing literal text and other placeholders
before and after the top-level selection placeholder.
However, for reliable translations,
there should be no translatable contents before or after the selection placeholder;
instead, each selectable pattern should form one complete “translation unit”.
Because the existing ICU MessageFormat starts in “text mode”,
even though it looks like there is no extraneous text,
spurious white space creeps in from developers’ line breaking of long message strings.
The remedy is to always use syntax to indicate the start of translatable contents.

We use curly braces to delimit patterns because
`{}` are the paired ASCII punctuation characters least commonly used in normal text.
For the same reason, we also use them for embedding placeholders in patterns.

Literal text can use any characters except for curly braces,
and except for the backslash, which we use as usual for escaping.
That is, the only special characters inside a pattern are `{}\`.
The only allowed escape sequences are `\{`, `\}`, and `\\`.
It is an error if `\` is followed by any other character.
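
The escaping rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not part of the proposal; the function name `unescape_pattern_text` is hypothetical, and the input is assumed to be a run of literal text between placeholders, so an unescaped brace is treated as an error here.

```python
def unescape_pattern_text(text: str) -> str:
    r"""Resolve the escapes \{, \} and \\ in a run of literal pattern text.

    Raises ValueError for an unescaped brace, or for a backslash that is
    followed by any other character (or by nothing).
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Only three escape sequences are allowed.
            if i + 1 >= len(text) or text[i + 1] not in "{}\\":
                raise ValueError(f"invalid escape at index {i}")
            out.append(text[i + 1])
            i += 2
        elif ch in "{}":
            # Braces delimit patterns and placeholders, so they cannot
            # appear unescaped inside literal text.
            raise ValueError(f"unescaped brace at index {i}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `unescape_pattern_text(r"a \{b\} c")` returns `a {b} c`, while `unescape_pattern_text(r"\n")` raises `ValueError`.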

The message syntax does not use `'` or `"`,
so that it is easy to hard-code message strings in programming language source code.

# Placeholders

Formatting a message replaces placeholders with values based on run-time arguments or special functions.
We also allow for value literals specified inside the placeholder,
instead of using an argument name;
and we also allow for invoking functions without using argument names or value literals.
```
{$name}
Member

I guess I'm discovering a little reluctance around using $ as the variable identifier, mainly based on "I have tons of strings with placeholders like {someVar} that need to be {$someVar}." It also means that I can't just take my arg map--I need to decorate the variable names with a $ before I can use it. Since function and format names are decorated with a : and literals are delimited with <>, do we need the $?

Member Author

This compromise syntax builds on what the committee has done. AFAICT the dollar prefix seemed part of what consensus was able to form. I am personally not particularly wedded to it.

For a parser, it would be slightly easier to look for one of very few special characters. If argument names didn't have a prefix, then a parser would have to look for any identifier-start character. Given that it has to anyway do so immediately after a prefix character, it would not really add significant complication. It just comes down to what we think developers reading and writing message strings will find helpful or confusing.

{$count :number}
{$fraction :number style=percent minFractions=2}
{(25) :number}
{:specialFunction optionKey=optionValue key2=<value with spaces>}
```

An argument name is a `$` immediately followed by an identifier.
A message formatting function will typically accept a Map of argument keys to values
where the keys match argument name identifiers in the patterns of the message.

An argument name identifier may contain one or more dot (`.`) characters.
The meaning of dotted names is implementation-defined.
For example, some implementations may support some kind of multi-segment lookup
in structured value objects.
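
Since the meaning of dotted names is implementation-defined, the following Python sketch shows just one possible interpretation: a multi-segment lookup in nested mappings, with an exact key match taking priority. The function name `lookup_argument` and the precedence rule are assumptions made for illustration, not something the proposal specifies.

```python
from collections.abc import Mapping
from typing import Any


def lookup_argument(name: str, args: Mapping) -> Any:
    """Look up an argument name such as 'user.name' (without the leading '$').

    Assumption: an exact key match wins, so flat maps whose keys contain
    dots keep working; otherwise each dot-separated segment descends one
    level into nested mappings.
    """
    if name in args:
        return args[name]
    value: Any = args
    for segment in name.split("."):
        if isinstance(value, Mapping) and segment in value:
            value = value[segment]
        else:
            raise KeyError(name)
    return value
```

With this interpretation, `{$user.name}` formatted against `{"user": {"name": "Ana"}}` would look up `"Ana"`.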

TODO: For the definition of identifiers we should consult with the Unicode Source Code Working Group.

If the placeholder specifies only an argument name,
then the formatting function is inferred from the run-time type of the argument value.
For example, a string value would simply be inserted,
and a numeric type could be formatted using some kind of default number formatter.
- TODO: In the registry, specify the default formatters for a small set of value types.
Member

Perhaps this is upside down. The registry should specify a set of formatters (to which an implementation can add) and these can "register" what types they service (and in what priority order). At Amazon our message formatter has a currency formatter function (PriceFormat) that handles Price objects--the Price object extends Number, but PriceFormat takes priority for that type.

Member Author

With “the registry” I mean the future CLDR file that defines functions with their names, options, and semantics. That should include what formatter to use for a numeric argument when no function is explicitly specified in the message. This registry could specify a different formatter for a subtype.

It sounds like what you are referring to would be some runtime object that can dynamically handle types and formatters. I think that's out of scope for this document.


The function is specified via a `:` immediately followed by an identifier.
If an argument name or a value literal is given,
then the function is usually a formatter for its expected input types.
- TODO: There still seems to be discussion about the function prefix character.
It could be some other ASCII punctuation, for example `@`.
- TODO: Functions must be listed in a registry.
Member

... or installed by the implementation

Probably specify that unrecognized formats are an error or run toString equivalent?

Member Author

Right, “private use” functions need not be in the CLDR registry. I suspect that each organization would have its own registry of some kind, but mostly what this means is that there is documentation for the name, options, and semantics of each function. I don't expect this sort of registry to be parsed by implementations to actually implement formatters -- only to do validation and linting. So the formatter implementations are of course implementation-defined.

I think that a message formatting library should by default fail with an error when it does not recognize a function name. That includes functions that are registered, but not supported by a particular implementation.

- TODO: Functions that accept value literals must specify their syntax.
- TODO: Reserve a naming convention for private use functions (not in the standard registry). Examples:
- Probably best: Contains interior dots – e.g., com.google.fancyNumber –
with reverse-domain-name namespaces like Java packages.
- Starts with `_`
- Starts with `x`

When a function is specified, it can be optionally followed by options which are key-value pairs,
with `=` (and no white space) between the key identifier and the value.
The option value can contain any character other than curly braces and white space,
unless delimited like literal values.
- TODO: Each registered function must define the available options and their value syntax.
- TODO: If we allow white space in option values, then we need optional delimiters for such values. Probably the same delimiters as for literal values.

The option value can be a `$` immediately followed by an identifier.
- TODO: Define what this can mean. There is at least a use case for allowing an argument name,
to be looked up like placeholder arguments,
in the same Map given to the message formatting function.

Options are not allowed when no function is specified.
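
To make the placeholder rules concrete, here is a minimal Python sketch of a placeholder parser. Several points are assumptions because the proposal leaves them open: identifiers are taken to be ASCII letters, digits, `_` and `.`; value literals are assumed to be delimited by `()` as in the `{(25) :number}` example; and delimited option values (such as `<value with spaces>`) are not handled. The names `Placeholder` and `parse_placeholder` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumption: the real identifier definition is still a TODO in the proposal.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


@dataclass
class Placeholder:
    argument: Optional[str] = None  # argument name without the '$' prefix
    literal: Optional[str] = None   # value literal without its delimiters
    function: Optional[str] = None  # function name without the ':' prefix
    options: Dict[str, str] = field(default_factory=dict)


def parse_placeholder(source: str) -> Placeholder:
    if len(source) < 2 or source[0] != "{" or source[-1] != "}":
        raise ValueError("placeholder must be enclosed in {}")
    body = source[1:-1]
    if not body:
        raise ValueError("placeholder must not be empty")
    if "{" in body or "}" in body:
        raise ValueError("curly braces are not allowed inside a placeholder")
    if body[0].isspace() or body[-1].isspace():
        raise ValueError("no white space after '{' or before '}'")

    result = Placeholder()
    rest = body
    if body.startswith("("):
        # Assumed literal syntax: everything up to the first ')'.
        end = body.find(")")
        if end < 0:
            raise ValueError("unterminated value literal")
        result.literal = body[1:end]
        rest = body[end + 1:]
        if rest and not rest[0].isspace():
            raise ValueError("white space required after a value literal")

    tokens = rest.split()
    if result.literal is None and tokens and tokens[0].startswith("$"):
        name = tokens.pop(0)[1:]
        if not IDENT.fullmatch(name):
            raise ValueError(f"invalid argument name {name!r}")
        result.argument = name

    if tokens and tokens[0].startswith(":"):
        name = tokens.pop(0)[1:]
        if not IDENT.fullmatch(name):
            raise ValueError(f"invalid function name {name!r}")
        result.function = name
        for token in tokens:
            # key=value with no white space around '='
            key, sep, value = token.partition("=")
            if not sep or not value or not IDENT.fullmatch(key):
                raise ValueError(f"invalid option {token!r}")
            result.options[key] = value
    elif tokens:
        # Covers options given without a function, and any other stray token.
        raise ValueError(f"unexpected token {tokens[0]!r}")
    return result
```

Parsing `{$fraction :number style=percent minFractions=2}` with this sketch yields the argument `fraction`, the function `number`, and the options `style=percent` and `minFractions=2`, whereas `{$fraction style=percent}` is rejected because options appear without a function.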

Value literals are important for developers to control the output.
For example, certain strings may need to be inlined as literals so that
they are not changed during translation.
Numeric constants need to be formatted differently depending on the target language
(e.g., which digits and separators, and the grouping style).
Date constants need to be formatted according to the target language’s calendar system.

If only a value literal is given, without specifying a function,
then its string value is used verbatim and it is read-only for translators.
- TODO: Value literals need to be delimited (they may contain spaces),
and the starting delimiter needs to be distinct from the prefixes for
argument names and functions.
Reasonable choices include `<>`, `()`, `[]`, `||`, or a pair of `` characters.
Collaborator

I am not sure how to use Markdown to show a pair of grave accents in code style...

By using a longer string of enclosing backticks (cf. CommonMark Code spans).

Suggested change
Reasonable choices include `<>`, `()`, `[]`, `||`, or a pair of `` characters.
Reasonable choices include `<>`, `()`, `[]`, `||`, or ``` `` ```.

We could actually allow *both* `''` and `""` so that
a programmer who puts a message string into a string literal using one of these delimiters
could escape a value literal using the opposite delimiter.
Consider that the same delimiters should also be usable (not visually confusing)
when used in a list of selection values (see below); that probably excludes `||` and `[]`.
For a list of space-separated literals,
it would be best to use a pair of delimiters that visually indicate and distinguish
the start and end of each literal. That suggests using `()`.
For example: `[(ab c) (d ef) (g h)]`
- TODO: Define escaping inside constant values.
Member

This is necessary given the markers available.

Member Author

I expect so. I was just too lazy to spell it out while there is still bike-shedding on which delimiters to use :-)

Probably the pattern escapes plus escapes for the constant delimiters.

A placeholder must not be an empty pair of `{}` braces.

Any character that does not fit defined syntax is an error.
This leaves room for future extensions.
For example, a placeholder must start with `{` immediately followed by
the prefix character for an argument name, literal value, or function;
and after the function name there must be only white-space-separated options which
start with identifier-start characters.

# Syntactic white space

We use white space inside placeholders and in “code mode” (outside patterns) as token separators.
White space is a sequence of one or more of the characters TAB, LF, CR, SP, and maybe some more.
- TODO: For the definition of white space we should consult with the Unicode Source Code Working Group.
- TODO: Decide whether to use Unicode Pattern_White_Space or otherwise allow RLM and LRM characters.

White space can also be useful for line breaking long messages, indentation, and alignment.
However, we should not allow white space everywhere possible,
because that just leads to confusing variations in style,
and the creation of formatting tools to enforce certain styles.
For example, there is no reason to allow white space between a name or function prefix and its identifier,
around the `=` of an option, after the `{` of a placeholder, or before the `}` of a placeholder.

# Pattern selection

Messages need the ability to choose among variants of a pattern based on certain argument values.
Common examples include selecting the right plural form, and variants for different person genders.

There should be a single level of selection (not nested like in ICU MessageFormat).
It needs to support multiple selectors.

In this syntax, a list of N selectors is followed by a list of pairs where
the first element of each pair is a list of N value literals and
the second element of each pair is a pattern.
A `*` is a wildcard value that always matches.
The last variant must have a list of all wildcard values.
```
[{$count :plural offset=1 grouping=always} {$gender}]
[1 female] {{$name} added you to her circles.}
[1 male] {{$name} added you to his circles.}
[1 *] {{$name} added you to their circles.}
[* *] {{$name} added you and {#count} others to their circles.}
```

Lists are enclosed in square brackets, reminiscent of Python lists.
The opening `[` also distinguishes the selection syntax from a simple pattern.
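
The selection structure described above can be sketched in Python as follows. The sketch assumes that each selector has already produced a string value, and that variants are tried in order with the first full match winning; the proposal text does not spell out the matching order, so that part is an assumption. The name `select_variant` is hypothetical.

```python
from typing import List, Sequence, Tuple

Variant = Tuple[Sequence[str], str]  # (list of N value literals, pattern)


def select_variant(selector_values: Sequence[str], variants: List[Variant]) -> str:
    """Return the pattern of the first variant whose values all match.

    A '*' value is a wildcard that always matches. The last variant must
    consist only of wildcards, so a match is always found.
    """
    n = len(selector_values)
    if not variants:
        raise ValueError("at least one variant is required")
    for values, _ in variants:
        if len(values) != n:
            raise ValueError(f"each variant needs exactly {n} values")
    if any(value != "*" for value in variants[-1][0]):
        raise ValueError("the last variant must have only wildcard values")

    for values, pattern in variants:
        if all(v == "*" or v == s for v, s in zip(values, selector_values)):
            return pattern
    raise AssertionError("unreachable: the all-wildcard variant always matches")


# The variants from the example above, with patterns kept as raw strings.
CIRCLES = [
    (["1", "female"], "{$name} added you to her circles."),
    (["1", "male"], "{$name} added you to his circles."),
    (["1", "*"], "{$name} added you to their circles."),
    (["*", "*"], "{$name} added you and {#count} others to their circles."),
]
```

Calling `select_variant(["1", "male"], CIRCLES)` picks the second pattern, and any selector values not listed explicitly fall through to the final all-wildcard variant.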

TODO: Decide whether to enclose each value literal in
the same pair of delimiters as literals in placeholder (for consistency),
or whether to make that optional.
(The `[]` value list syntax already indicates that value literals are enclosed.)
Some literals may require it if they contain spaces.
The `*` should probably never be enclosed in literal delimiters.

Selector syntax follows placeholder syntax,
except that a function must be specified.
For the purpose of selection, there are three types of functions:
1. Select-and-format functions combine the two functionalities,
and the selection is informed by the formatting.
For example, selectors for plural variants
(different selectors for cardinal-number vs. ordinal-number variants)
Comment on lines +197 to +198
Member

that's not formatting: that's the selector type. If I say [{$count :plural type=ordinal}] I expect to get keywords out like one, few, etc. or access the numeric value of $count for selectors such as =2---just like plural rules work today.

I agree that options are needed for the selector (as shown), but tend to expect that I can still format the value with a placeholder later. In fact, I might format differently several times.

Member Author

> that's not formatting: that's the selector type. If I say [{$count :plural type=ordinal}] I expect to get keywords out like one, few, etc. or access the numeric value of $count for selectors such as =2---just like plural rules work today.

I assumed that there would be different function names for plural/cardinal vs. plural/ordinal, like we have in ICU MessageFormat. But yes, it could be one "plural" function with an option.

> I agree that options are needed for the selector (as shown), but tend to expect that I can still format the value with a placeholder later. In fact, I might format differently several times.

That should be strongly discouraged, especially looking at plurals. Formatting differently from what the selection was based on creates a jarring mismatch. We should design this to make it easy to do the right thing. If you need different formatting in a different part of the sentence, you can pass the same value in another argument, and you can also define a named expression for it.

have to take into account how the number is formatted.
2. Format-only functions can be used as selectors via
Member

but only if the output doesn't contain spaces?

Member Author

We can decide to support spaces by allowing or requiring delimiters around the variant values.

simple string matching of their output with the variant values.
3. Select-only functions select among variant values, but they cannot be used in pattern placeholders.

There is a simple format-only function that can be used for simple string matching.
TODO: Decide on a name for this format-only function. Consider `:string`.
Member

:select recommends itself, since we already have one just like this? Or is this different?

Member Author

The suggestion is to build on allowing format-only functions as selectors. Calling a formatter :string makes more sense than calling it :select.


Inside a selection-variant pattern,
there is a special placeholder syntax for inserting the formatting result of a select-and-format function.
This placeholder only specifies the selector’s argument name with a distinct prefix.
It must not specify a function.
In the example above, the `{#count}` value is the input $count minus the offset,
like the `#` in an ICU PluralFormat, which is the input to the plural rules evaluation.
This is not allowed for argument names used in select-only functions.
- TODO: Bike-shedding on the prefix character, shown as `#` here.

Inside selected patterns,
the selector argument variables must not be used with the normal `$` placeholder syntax –
for example, the patterns in the preceding example must not use `{$count}`.
Comment on lines +216 to +218
Member

So I can't write:

[{$count :plural}]
[=0] {You have no items in your cart}
[one] {You have {$count :number style=spellout} item in your cart}
[_] {You have {$count : number style=spellout} items in your cart}

This seems hard for users to understand. They passed the argument by name. Why can't they format it? It isn't like the value has been consumed by whatever selector ate it previously.

Member Author

For plurals in particular, the formatting and selection are tied at the hip. If the spelled-out version of the number does not work grammatically like the :plural select-and-format function expected, then you get unhappy users.

Member

Ignore the style. The point I'm making is that your text says I cannot use the variable $count and a different formatter after having used it with the plural selector.

Member Author

Yes, for the stated reasons. Don't give users rope to hang themselves if we can avoid it :-)

Member

There is a problem with the following

[{$count :plural}]
[=0] {You have no items in your cart}
[one] {You have {$count :number style=spellout} item in your cart}
[_] {You have {$count : number style=spellout} items in your cart}

The plural categories are joined at the hip with the formatting. With the input number 1.01d, in some languages the category is 'one' if the format is an integer, but 'other' if the format has one (or more) decimals. So you can't actually correctly compute the plural category until you've formatted.

There are two ways to solve this:

  1. tie the plural category to the formatted value, by having the formatting information up front, or
  2. require the formatting information to be identical for every instance of the placeholder (eg it is an error if they are different)

It actually works pretty nicely to have the formatter return the plural category as an (optional) byproduct of formatting, because an intermediate step to producing the formatted number is typically the exact data necessary to compute the plural category. So the cleanest is to have a syntax that draws on that in some way. There are of course a few ways to do that. One is to use an assignment, and the other would be to have the formatting options in the selector, eg

[{$count :number style=spellout}]
[=0] {You have no items in your cart}
[one] {You have {$count} item in your cart}
[_] {You have {$count} items in your cart}

Allowing that would be doubly confusing:
- It would not be clear which value is inserted.
In the example, the plural offset is subtracted from the input value,
and the formatted version of that is what is used for
evaluating the plural rules and inserting into the pattern.
- It would not be clear what formatting is applied.
The formatting function and options specified in the selector must be used,
but `{$count}` would look like the default formatter might be used.
Allowing a function-and-options specification here would be even worse.
- (If a developer does need a pattern with both the selector-modified and also the original value,
then they can pass the value twice into the message formatting function,
under different argument names.)
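
The coupling between selection and formatting that motivates this restriction can be illustrated with a small Python sketch of a select-and-format function. It subtracts the offset, formats the result, and derives the plural category from the formatted digits. The English cardinal rule used here ("one" only when the integer part is 1 and there are no visible fraction digits) follows CLDR; everything else is an assumption for illustration, including the name `select_and_format`, the `fraction_digits` parameter, and the use of plain ASCII formatting instead of a locale-aware number formatter.

```python
from typing import Tuple


def select_and_format(value: float, offset: int = 0,
                      fraction_digits: int = 0) -> Tuple[str, str]:
    """Return (plural category, formatted text) for English cardinals.

    The offset is subtracted first; the formatted result is what gets
    inserted into the pattern and what the plural rule is evaluated on.
    """
    formatted = format(value - offset, f".{fraction_digits}f")
    integer_part, _, fraction_part = formatted.lstrip("-").partition(".")
    # English: "one" requires integer part 1 and no visible fraction digits.
    category = "one" if integer_part == "1" and fraction_part == "" else "other"
    return category, formatted
```

With this sketch, the input 1.01 formatted without fraction digits gives the category `one` and the text `1`, while the same input formatted with two fraction digits gives `other` and `1.01`. Formatting the value one way for selection and another way for display would therefore risk a grammatical mismatch.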

# Named expressions

When a message contains many variants, it is tedious, verbose, and error-prone to
repeat complicated placeholders in many of those variants.
We allow the definition of named expressions before the selection.
The patterns could then use those names.
```
$relDate={$date :relativeDateTime fields=Mdjm}
[{$count :plural offset=1} {$gender}]
[1 female] {{$name} added you to her circles {$relDate}.}
[1 male] {{$name} added you to his circles {$relDate}.}
[1 *] {{$name} added you to their circles {$relDate}.}
[* *] {{$name} added you and {#count} others to their circles {$relDate}.}
```

When a named expression is used in a pattern placeholder, the placeholder must not specify a function.
The formatting is determined by the given expression.
- TODO: Decide whether to use a different prefix for
a pattern placeholder that refers to a named expression.
Using `$` looks familiar, but
a distinct prefix would signal that this is not a normal placeholder,
and it would allow for a syntax definition (in the BNF) limited to
only the named-expression insertion.

The expression name must not be the same as that for any placeholder argument.
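
One way to picture named expressions is as a textual substitution performed before formatting, sketched below in Python. This is only an illustration under stated assumptions: identifiers are ASCII letters, digits, `_` and `.`; references use the `$` prefix (the prefix is still an open TODO above); and the function name `expand_named_expressions` is hypothetical. A real implementation would more likely resolve the reference on a parsed message than on raw text.

```python
import re
from typing import Iterable, Mapping

# Matches a placeholder that consists only of an argument name, e.g. {$relDate}.
PLAIN_REFERENCE = re.compile(r"\{\$([A-Za-z_][A-Za-z0-9_.]*)\}")


def expand_named_expressions(pattern: str,
                             named: Mapping[str, str],
                             argument_names: Iterable[str]) -> str:
    """Replace references to named expressions with their definitions.

    `named` maps an expression name (without '$') to its placeholder text.
    Names that clash with argument names are rejected, as required above.
    """
    clashes = sorted(set(named) & set(argument_names))
    if clashes:
        raise ValueError(f"expression names clash with arguments: {clashes}")
    # Placeholders that do not refer to a named expression are left unchanged.
    return PLAIN_REFERENCE.sub(
        lambda match: named.get(match.group(1), match.group(0)), pattern
    )
```

For the example above, expanding `{$name} added you to her circles {$relDate}.` with `relDate` defined as `{$date :relativeDateTime fields=Mdjm}` leaves `{$name}` untouched and substitutes the full date placeholder for `{$relDate}`.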