-
-
Notifications
You must be signed in to change notification settings - Fork 36
MF2.0 compromise syntax #266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,256 @@ | ||||||
# MF2.0 compromise syntax | ||||||
|
||||||
# Intro | ||||||
|
||||||
This syntax builds on the one from https://github.com/unicode-org/message-format-wg/pull/230 | ||||||
but modified to address | ||||||
[@markusicu’s comments there](https://github.com/unicode-org/message-format-wg/pull/230#issuecomment-1116903103). | ||||||
|
||||||
# Basic syntax | ||||||
|
||||||
Messages need to delineate between literal text, placeholders, and other “code”. | ||||||
We should start in “code mode” and always enclose “patterns” (text+placeholders) in curly braces. | ||||||
``` | ||||||
{Hello world!} | ||||||
{Hello {$name}!} | ||||||
``` | ||||||
|
||||||
This is unusual for formatting syntaxes, but useful. | ||||||
We anyway need to support selecting from among multiple patterns, | ||||||
and delimiting the patterns makes it unambiguous | ||||||
what white space is part of the pattern vs. serves as delimiters of “code” tokens. | ||||||
For consistency, we should always enclose a pattern, | ||||||
even if the message consists only of that pattern. | ||||||
That also helps with embedding messages in various resource file formats, | ||||||
because they can freely trim surrounding white space without | ||||||
requiring escapes when a message pattern wants to start or end with spaces. | ||||||
|
||||||
By contrast, consider the experience with the existing ICU MessageFormat syntax | ||||||
which does start in “text mode”. | ||||||
ICU MessageFormat has pioneered the selection among multiple patterns based on run-time arguments. | ||||||
It represents selection using complex placeholders, | ||||||
which has the side effect of allowing literal text and other placeholders | ||||||
before and after the top-level selection placeholder. | ||||||
However, for reliable translations, | ||||||
there should be no translatable contents before or after the selection placeholder; | ||||||
instead, each selectable pattern should form one complete “translation unit”. | ||||||
Because the existing ICU MessageFormat starts in “text mode”, | ||||||
even though it looks like there is no extraneous text, | ||||||
spurious white space creeps in from developers’ line breaking of long message strings. | ||||||
The remedy is to always use syntax to indicate the start of translatable contents. | ||||||
|
||||||
We use curly braces to delimit patterns because | ||||||
`{}` are the paired ASCII punctuation characters least commonly used in normal text. | ||||||
For the same reason, we also use them for embedding placeholders in patterns. | ||||||
|
||||||
Literal text can use any characters except for curly braces, | ||||||
and except for the backslash, which we use as usual for escaping. | ||||||
That is, the only special characters inside a pattern are `{}\`. | ||||||
The only allowed escape sequences are `\{`, `\}`, and `\\`. | ||||||
It is an error if `\` is followed by any other character. | ||||||
|
||||||
The message syntax does not use `'` or `"`, | ||||||
so that it is easy to hard-code message strings in programming language source code. | ||||||
markusicu marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
# Placeholders | ||||||
|
||||||
Formatting a message replaces placeholders with values based on run-time arguments or special functions. | ||||||
We also allow for value literals specified inside the placeholder, | ||||||
instead of using an argument name; | ||||||
and we also allow for invoking functions without using argument names or value literals. | ||||||
``` | ||||||
{$name} | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess I'm discovering a little reluctance around using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This compromise syntax builds on what the committee has done. AFAICT the dollar prefix seemed part of what consensus was able to form. I am personally not particularly wedded to it. For a parser, it would be slightly easier to look for one of very few special characters. If argument names didn't have a prefix, then a parser would have to look for any identifier-start character. Given that it has to anyway do so immediately after a prefix character, it would not really add significant complication. It just comes down to what we think developers reading and writing message strings will find helpful or confusing. |
||||||
{$count :number} | ||||||
{$fraction :number style=percent minFractions=2} | ||||||
{(25) :number} | ||||||
{:specialFunction optionKey=optionValue key2=<value with spaces>} | ||||||
``` | ||||||
|
||||||
An argument name is a `$` immediately followed by an identifier. | ||||||
A message formatting function will typically accept a Map of argument keys to values | ||||||
where the keys match argument name identifiers in the patterns of the message. | ||||||
|
||||||
An argument name identifier may contain one or more dot (`.`) characters. | ||||||
The meaning of dotted names is implementation-defined. | ||||||
For example, some implementations may support some kind of multi-segment lookup | ||||||
in structured value objects. | ||||||
|
||||||
TODO: For the definition of identifiers we should consult with the Unicode Source Code Working Group. | ||||||
|
||||||
If the placeholder specifies only an argument name, | ||||||
then the formatting function is inferred from the run-time type of the argument value. | ||||||
For example, a string value would simply be inserted, | ||||||
and a numeric type could be formatted using some kind of default number formatter. | ||||||
- TODO: In the registry, specify the default formatters for a small set of value types. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps this is upside down. The registry should specify a set of formatters (to which an implementation can add) and these can "register" what types they service (and in what priority order). At Amazon our message formatter has a currency formatter function ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With “the registry” I mean the future CLDR file that defines functions with their names, options, and semantics. That should include what formatter to use for a numeric argument when no function is explicitly specified in the message. This registry could specify a different formatter for a subtype. It sounds like what you are referring to would be some runtime object that can dynamically handle types and formatters. I think that's out of scope for this document. |
||||||
|
||||||
The function is specified via a `:` immediately followed by an identifier. | ||||||
If an argument name or a value literal is given, | ||||||
then the function is usually a formatter for its expected input types. | ||||||
- TODO: There still seems to be discussion about the function prefix character. | ||||||
It could be some other ASCII punctuation, for example `@`. | ||||||
- TODO: Functions must be listed in a registry. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ... or installed by the implementation Probably specify that unrecognized formats are an error or run toString equivalent? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, “private use” functions need not be in the CLDR registry. I suspect that each organization would have its own registry of some kind, but mostly what this means is that there is documentation for the name, options, and semantics of each function. I don't expect this sort of registry to be parsed by implementations to actually implement formatters -- only to do validation and linting. So the formatter implementations are of course implementation-defined. I think that a message formatting library should by default fail with an error when it does not recognize a function name. That includes functions that are registered, but not supported by a particular implementation. |
||||||
- TODO: Functions that accept value literals must specify their syntax. | ||||||
- TODO: Reserve a naming convention for private use functions (not in the standard registry). Examples: | ||||||
- Probably best: Contains interior dots – e.g., com.google.fancyNumber – | ||||||
with reverse-domain-name namespaces like Java packages. | ||||||
- Starts with `_` | ||||||
- Starts with `x` | ||||||
|
||||||
When a function is specified, it can be optionally followed by options which are key-value pairs, | ||||||
with `=` (and no white space) between the key identifier and the value. | ||||||
The option value can contain any character other than curly braces and white space, | ||||||
unless delimited like literal values. | ||||||
- TODO: Each registered function must define the available options and their value syntax. | ||||||
- TODO: If we allow white space in option values, then we need optional delimiters for such values. Probably the same delimiters as for literal values. | ||||||
|
||||||
The option value can be a `$` immediately followed by an identifier. | ||||||
- TODO: Define what this can mean. There is at least a use case for allowing an argument name, | ||||||
to be looked up like placeholder arguments, | ||||||
in the same Map given to the message formatting function. | ||||||
|
||||||
Options are not allowed when no function is specified. | ||||||
|
||||||
Value literals are important for developers to control the output. | ||||||
For example, certain strings may need to be inlined as literals so that | ||||||
they are not changed during translation. | ||||||
Numeric constants need to be formatted differently depending on the target language | ||||||
(e.g., which digits and separators, and the grouping style). | ||||||
Date constants need to be formatted according to the target language’s calendar system. | ||||||
|
||||||
If only a value literal is given, without specifying a function, | ||||||
then its string value is used verbatim and it is read-only for translators. | ||||||
- TODO: Value literals need to be delimited (they may contain spaces), | ||||||
and the starting delimiter needs to be distinct from the prefixes for | ||||||
argument names and functions. | ||||||
Reasonable choices include `<>`, `()`, `[]`, `||`, or a pair of `` characters. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
By using a longer string of enclosing backticks (cf. CommonMark Code spans).
Suggested change
|
||||||
We could actually allow *both* `''` and `""` so that | ||||||
a programmer who puts a message string into a string literal using one of these delimiters | ||||||
could escape a value literal using the opposite delimiter. | ||||||
Consider that the same delimiters should also be usable (not visually confusing) | ||||||
when used in a list of selection values (see below); that probably excludes `||` and `[]`. | ||||||
For a list of space-separated literals, | ||||||
it would be best to use a pair of delimiters that visually indicate and distinguish | ||||||
the start and end of each literal. That suggests using `()`. | ||||||
For example: `[(ab c) (d ef) (g h)]` | ||||||
- TODO: Define escaping inside constant values. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is necessary given the markers available. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I expect so. I was just too lazy to spell it out while there is still bike-shedding on which delimiters to use :-) |
||||||
Probably the pattern escapes plus escapes for the constant delimiters. | ||||||
|
||||||
A placeholder must not be an empty pair of `{}` braces. | ||||||
|
||||||
Any character that does not fit defined syntax is an error. | ||||||
This leaves room for future extensions. | ||||||
For example, a placeholder must start with `{` immediately followed by | ||||||
the prefix character for an argument name, literal value, or function; | ||||||
and after the function name there must be only white-space-separated options which | ||||||
start with identifier-start characters. | ||||||
|
||||||
# Syntactic white space | ||||||
|
||||||
We use white space inside placeholders and in “code mode” (outside patterns) as token separators. | ||||||
White space is a sequence of one or more of the characters TAB, LF, CR, SP, and maybe some more. | ||||||
- TODO: For the definition of white space we should consult with the Unicode Source Code Working Group. | ||||||
- TODO: Decide whether to use Unicode Pattern_White_Space or otherwise allow RLM and LRM characters. | ||||||
|
||||||
White space can also be useful for line breaking long messages, indentation, and alignment. | ||||||
However, we should not allow white space everywhere possible, | ||||||
because that just leads to confusing variations in style, | ||||||
and the creation of formatting tools to enforce certain styles. | ||||||
For example, there is no reason to allow white space between a name or function prefix and its identifier, | ||||||
around the `=` of an option, after the `{` of a placeholder, or before the `}` of a placeholder. | ||||||
|
||||||
# Pattern selection | ||||||
|
||||||
Messages need the ability to choose among variants of a pattern based on certain argument values. | ||||||
Common examples include selecting the right plural form, and variants for different person genders. | ||||||
|
||||||
There should be a single level of selection (not nested like in ICU MessageFormat). | ||||||
It needs to support multiple selectors. | ||||||
|
||||||
In this syntax, a list of N selectors is followed by a list of pairs where | ||||||
the first element of each pair is a list of N value literals and | ||||||
the second element of each pair is a pattern. | ||||||
A `*` is a wildcard value that always matches. | ||||||
The last variant must have a list of all wildcard values. | ||||||
``` | ||||||
[{$count :plural offset=1 grouping=always} {$gender}] | ||||||
[1 female] {{$name} added you to her circles.} | ||||||
[1 male] {{$name} added you to his circles.} | ||||||
[1 *] {{$name} added you to their circles.} | ||||||
[* *] {{$name} added you and {#count} others to their circles.} | ||||||
``` | ||||||
|
||||||
Lists are enclosed in square brackets, reminiscent of Python lists. | ||||||
The opening `[` also distinguishes the selection syntax from a simple pattern. | ||||||
|
||||||
TODO: Decide whether to enclose each value literal in | ||||||
the same pair of delimiters as literals in placeholder (for consistency), | ||||||
or whether to make that optional. | ||||||
(The `[]` value list syntax already indicates that value literals are enclosed.) | ||||||
Some literals may require it if they contain spaces. | ||||||
The `*` should probably never be enclosed in literal delimiters. | ||||||
|
||||||
Selector syntax follows placeholder syntax, | ||||||
except that a function must be specified. | ||||||
For the purpose of selection, there are three types of functions: | ||||||
1. Select-and-format functions combine the two functionalities, | ||||||
and the selection is informed by the formatting. | ||||||
For example, selectors for plural variants | ||||||
(different selectors for cardinal-number vs. ordinal-number variants) | ||||||
Comment on lines
+197
to
+198
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. that's not formatting: that's the selector type. If I say I agree that options are needed for the selector (as shown), but tend to expect that I can still format the value with a placeholder later. In fact, I might format differently several times. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I assumed that there would be different function names for plural/cardinal vs. plural/ordinal, like we have in ICU MessageFormat. But yes, it could be one "plural" function with an option.
That should be strongly discouraged, especially looking at plurals. Formatting differently from what the selection was based on creates a jarring mismatch. We should design this to make it easy to do the right thing. If you need different formatting in a different part of the sentence, you can pass the same value in another argument, and you can also define a named expression for it. |
||||||
have to take into account how the number is formatted. | ||||||
2. Format-only functions can be used as selectors via | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. but only if the output doesn't contain spaces? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can decide to support spaces by allowing or requiring delimiters around the variant values. |
||||||
simple string matching of their output with the variant values. | ||||||
3. Select-only functions select among variant values, but they cannot be used in pattern placeholders. | ||||||
|
||||||
There is a simple format-only function that can be used for simple string matching. | ||||||
TODO: Decide on a name for this format-only function. Consider `:string`. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The suggestion is to build on allowing format-only functions as selectors. Calling a formatter |
||||||
|
||||||
Inside a selection-variant pattern, | ||||||
there is a special placeholder syntax for inserting the formatting result of a select-and-format function. | ||||||
This placeholder only specifies the selector’s argument name with a distinct prefix. | ||||||
It must not specify a function. | ||||||
In the example above, the `{#count}` value is the input $count minus the offset, | ||||||
like the `#` in an ICU PluralFormat, which is the input to the plural rules evaluation. | ||||||
This is not allowed for argument names used in select-only functions. | ||||||
- TODO: Bike-shedding on the prefix character, shown as `#` here. | ||||||
|
||||||
Inside selected patterns, | ||||||
the selector argument variables must not be used with the normal `$` placeholder syntax – | ||||||
for example, the patterns in the preceding example must not use `{$count}`. | ||||||
Comment on lines
+216
to
+218
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So I can't write:
This seems hard for users to understand. They passed the argument by name. Why can't the format it? It isn't like the value has been consumed by whatever selector ate it previously. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For plurals in particular, the formatting and selection are tied at the hip. If the spelled-out version of the number does not work grammatically like the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ignore the style. The point I'm making is that your text says I cannot use the variable There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, for the stated reasons. Don't give users rope to hang themselves if we can avoid it :-) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is a problem with the following
The plural categories are tied to the hip with the formatting. With the input number 1.01d, in some languages the category is 'one' if the format is an integer, but 'other' if the format has one (or more) decimals. So you can't actually correctly compute the plural category until you've formatted. There are two ways to solve this:
It actually works pretty nicely to have the formatter return the plural category as an (optional) byproduct of formatting, because an intermediate step to producing the formatted number is typically the exact data necessary to compute the plural category. So the cleanest is to have a syntax that draws on that in some way. There are of course a few ways to do that. One is to use an assignment, and the other would be to have the formatting options in the selector, eg
|
||||||
Allowing that would be doubly confusing: | ||||||
- It would not be clear which value is inserted. | ||||||
In the example, the plural offset is subtracted from the input value, | ||||||
and the formatted version of that is what is used for | ||||||
evaluating the plural rules and inserting into the pattern. | ||||||
- It would not be clear what formatting is applied. | ||||||
The formatting function and options specified in the selector must be used, | ||||||
but `{$count}` would look like the default formatter might be used. | ||||||
Allowing a function-and-options specification here would be even worse. | ||||||
- (If a developer does need a pattern with both the selector-modified and also the original value, | ||||||
then they can pass the value twice into the message formatting function, | ||||||
under different argument names.) | ||||||
|
||||||
# Named expressions | ||||||
|
||||||
When a message contains many variants, it is tedious, verbose, and error-prone to | ||||||
repeat complicated placeholders in many of those variants. | ||||||
We allow the definition of named expressions before the selection. | ||||||
The patterns could then use those names. | ||||||
``` | ||||||
$relDate={$date :relativeDateTime fields=Mdjm} | ||||||
[{$count :plural offset=1} {$gender}] | ||||||
[1 female] {{$name} added you to her circles {$relDate}.} | ||||||
[1 male] {{$name} added you to his circles {$relDate}.} | ||||||
[1 *] {{$name} added you to their circles {$relDate}.} | ||||||
[* *] {{$name} added you and {#count} others to their circles {$relDate}.} | ||||||
``` | ||||||
|
||||||
When a named expression is used in a pattern placeholder, then no function must be specified. | ||||||
The formatting is determined by the given expression. | ||||||
- TODO: Decide whether to use a different prefix for | ||||||
a pattern placeholder that refers to a named expression. | ||||||
Using `$` looks familiar, but | ||||||
a distinct prefix would signal that this is not a normal placeholder, | ||||||
and it would allow for a syntax definition (in the BNF) limited to | ||||||
only the named-expression insertion. | ||||||
|
||||||
The expression name must not be the same as that for any placeholder argument. |
Uh oh!
There was an error while loading. Please reload this page.