CLDR-16811 Update UTS35 UnicodeSet syntax (#3089)
* CLDR-16811 Update UnicodeSet syntax to reflect ICU and edge cases

* CLDR-16811 Sembr changes from #3083

* CLDR-16811 Significant whitespace before ^

* CLDR-16811 Incorporate additional changes from #3083

* CLDR-16811 Change words

* CLDR-16811 Remove playground comment

* CLDR-16811 Update '$' phrasing from #3083

* CLDR-16811 Document syntax edge cases, add TODOs

* CLDR-16811 Char-name escapes should reference charName

* CLDR-16811 Clarify meaning of syntax chars in strings

* CLDR-16811 Add @eggrobin's suggestions

* CLDR-16811 Add @eggrobin's suggestions take 2

* CLDR-16811 Put element nonterminal after range

* CLDR-16811 Replace root nonterminal with unicodeSet

* CLDR-16811 Remove support for multi-codepoint-string-ranges

* CLDR-16811 Remove mention of single-quote escaping

* CLDR-16811 Significant whitespace in bracketedHex worded out

* CLDR-16811 Significant whitespace in bracketedHex in EBNF

* CLDR-16811 Copy POSIX special case phrasing from #3083

* CLDR-16811 Improve presentation of single-codepoint range constraint

* CLDR-16811 Rephrase variable section

* CLDR-16811 Switch to XIDS/XIDC

* CLDR-16811 Rename property value nonterminals

* CLDR-16811 Convert syntax special cases to table

* CLDR-16811 Trying to reach a conclusion using @eggrobin's suggestions

* CLDR-16811 Be more explicit about variables in strings

* CLDR-16811 Add syntax error examples for multi escape ranges

* CLDR-16811 Fix missing backslash

* CLDR-16811 Fix syntax edge case

* CLDR-16811 Clarify variable example

* CLDR-16811 Reformat syntax special case table
skius authored Jul 27, 2023
1 parent 7c9612d commit a163c9e
Showing 1 changed file with 117 additions and 27 deletions.
144 changes: 117 additions & 27 deletions docs/ldml/tr35.md
@@ -2861,38 +2861,54 @@ Element content whose display may be affected in this way should include an expl

#### <a name="Unicode_Sets" href="#Unicode_Sets">Unicode Sets</a>

Some attribute values or element contents use _UnicodeSet_ notation. A UnicodeSet represents a finite set of Unicode code points and strings, and is defined by lists of code points and strings, Unicode property sets, and set operators, all bounded by square brackets. In this context, a code point means a string consisting of exactly one code point.
Some attribute values or element contents use _UnicodeSet_ notation.
A UnicodeSet represents a finite set of Unicode code points and strings, and is defined by lists of code points and strings, Unicode property sets, and set operators, with square brackets for groupings.
In this context, a code point means a string consisting of exactly one code point.

A UnicodeSet implements the semantics in _UTS #18: Unicode Regular Expressions_ [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)] Levels 1 & 2 that are relevant to determining sets of characters. Note however that it may deviate from the syntax provided in [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)], which is illustrative rather than a requirement. There is one exception to the supported semantics, Section [RL2.6](https://www.unicode.org/reports/tr18/#RL2.6) _Wildcards in Property Values_. That feature can be supported in clients such as ICU by implementing a “hook” as is done in the [online UnicodeSet utilities](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bname%3D%2FAPPLE%2F%7D).
A UnicodeSet implements the semantics in _UTS #18: Unicode Regular Expressions_ [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)] Levels 1 & 2 that are relevant to determining sets of characters.
Note however that it may deviate from the syntax provided in [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)].
In particular, Section [RL2.6](https://www.unicode.org/reports/tr18/#RL2.6) _Wildcards in Property Values_ is not supported.
However, that feature can be supported in clients such as ICU by implementing a “hook” as is done in the [online UnicodeSet utilities](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bname%3D%2FAPPLE%2F%7D).

A UnicodeSet may be cited in specifications outside of the domain of LDML. In such a case, the specification may specify a subset of the syntax provided here.
A UnicodeSet may be cited in specifications outside of the domain of LDML.
In such a case, that specification may specify a subset or superset of the syntax provided here.

The following provides EBNF syntax for a UnicodeSet:
##### UnicodeSet syntax #####

| Symbol | Expression | Examples |
| -------------- | -------------------------------------------------------------- | --------------------------------------- |
| `root` | <pre>= prop<br/>\| '[-]'<br/>\| '[' [\\-\\^]? s seq+ ']'</pre> | \\p{x=y},<br/>[abc] |
| `seq` | <pre>= root (s [\\&\\-] s root)* s<br/>\| range s</pre> | [abc]-[cde], a |
| `range` | <pre>= char ('-' char)?<br/>\| '{' (s char)+ s '}'</pre> | a, a-c, \{abc} |
| `prop` | <pre>= '\\' [pP] '{' propName ([≠=] s value1+)? '}'<br/>\| '[:' '^'? propName ([≠=] s value2+)? ':]'</pre> | \\p\{x=y}, [:x=y:]<br/> |
| `propName` | <pre>= s [A-Za-z0-9] [A-Za-z0-9_\\x20]* s</pre> | General_Category,<br/>General Category |
| `value1` | <pre>= [^\\}]<br/>\| '\\' quoted</pre> | Lm,<br/>\\n,<br/>\\} |
| `value2` | <pre>= [^:]<br/>\| '\\' quoted</pre> | Lm,<br/>\\n,<br/>\\: |
| `char` | <pre>= [^\\& \\- \\[ \\[ \\] \\\\ \\} \\{ [:Pat_WS:]]<br/>\| '\\' quoted</pre> | a, b, c, \\n |
| `quoted` | <pre>= 'u' (hex{4} \| bracketedHex)<br/>\| 'x' (hex{2} \| bracketedHex)<br/>\| 'U00' ('0' hex{5} \| '10' hex{4})<br/>\| 'N{' propName '}'<br/>\| [[\u0000-\U00010FFFF]-[uxUN]]</pre> | _**error** if lengths not exact_ |
| `charName` | <pre>= s [A-Za-z0-9] [-A-Za-z0-9_\x20]* s</pre> | TIBETAN LETTER -A |
| `bracketedHex` | <pre>= '{' s hexCodePoint (s hexCodePoint)* s '}'</pre> | \{61 2019 62} |
| `hexCodePoint` | <pre>= hex{1,5} \| '10' hex{4}</pre> | |
| `hex` | <pre>= [0-9A-Fa-f]</pre> | |
| `s` | <pre>= [:Pattern_White_Space:]*</pre> | optional whitespace |

Some constraints on UnicodeSet syntax are not captured by this EBNF. Notably, property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)]. In addition, quoted values that resolve to more than one code point are disallowed in ranges of the form `char '-' char`.
| `unicodeSet` | <pre>= prop<br/>\| '\[' '^'? s '-'? s seq\* \[\\$ \\-\]? s '\]' <br/>\| var</pre> | \\p\{x=y\},<br/>[abc],<br/>$myset |
| `seq` | <pre>= unicodeSet \(s \[\\&\\\-\] s unicodeSet\)\* s<br/>\| range s</pre> | \[abc\]\-\[cde\], a |
| `range` | <pre>= element \('\-' element\)?</pre> | a, a\-c, \{abc\}, a\-\{z\} <br/> _note: in ranges, elements must resolve to exactly one code point._ |
| `element` | <pre>= char \| string \| var </pre> | %, b, \{hello\}, \{\}, \\x\{61 62\} |
| `prop` | <pre>= '\\' \[pP\] '\{' propName \(\[≠=\] s pValuePerl\+\)? '\}'<br/>\| '\[:' '^'? propName \(\[≠=\] s pValuePosix\+\)? ':\]'</pre> | \\p\{x=y\}, \[:x=y:\]<br/> |
| `propName` | <pre>= s \[A\-Za\-z0\-9\] \[A\-Za\-z0\-9\_\\x20\]\* s</pre> | General\_Category,<br/>General Category |
| `pValuePerl` | <pre>= \[^\\\}\]<br/>\| '\\' quoted</pre> | Lm,<br/>\\n,<br/>\\\} |
| `pValuePosix` | <pre>= \[^:\]<br/>\| '\\' quoted</pre> | Lm,<br/>\\n,<br/>\\: |
| `string` | <pre>= '\{' \(s charInString\)\* s '\}' </pre> | \{hello\} |
| `char` | <pre>= \[^ \\^ \\& \\\- \\\[ \\\] \\\\ \\\{ \\$ \[:Pat_WS:\]\]<br/>\| '\\' quoted</pre> | a, b, c, \\n, \\\{, \\$ |
| `charInString` | <pre>= \[^ \\\\ \\\} \[:Pat_WS:\]\]<br/>\| '\\' quoted</pre> | a, b, c, \\n, \{, $ |
| `quoted` | <pre>= 'u' \(hex\{4\} \| bracketedHex\)<br/>\| 'x' \(hex\{2\} \| bracketedHex\)<br/>\| 'U00' \('0' hex\{5\} \| '10' hex\{4\}\)<br/>\| 'N\{' charName '\}'<br/>\| \[\[\\u0000\-\\U00010FFFF\]\-\[uxUN\]\]</pre> | n, U0000FFFE, \{, $, \] <br/> _note: lengths are exact_ |
| `charName` | <pre>= s \[A\-Za\-z0\-9\] \[\-A\-Za\-z0\-9\_\\x20\]\* s</pre> | TIBETAN LETTER \-A |
| `bracketedHex` | <pre>= '\{' s hexCodePoint \(sRequired hexCodePoint\)\* s '\}'</pre> | \{61 2019 62\}, \{61\} |
| `hexCodePoint` | <pre>= hex\{1,5\} \| '10' hex\{4\}</pre> | |
| `hex` | <pre>= \[0\-9A\-Fa\-f\]</pre> | |
| `var` | <pre>= '$' \[:XID_Start:\] \[:XID_Continue:\]\*</pre> | $a, $elt5 (optional support) |
| `s` | <pre>= \[:Pattern_White_Space:\]\*</pre> | optional whitespace |
| `sRequired` | <pre>= \[:Pattern_White_Space:\]\+</pre> | required whitespace |

Some constraints on UnicodeSet syntax are not captured by this EBNF.
Notably:
1. Property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)].
2. Escapes that use multiple code points are equivalent to their flattened representation, i.e., `\x{61 62}` is equivalent to `\x{61}\x{62}`. These can also occur in strings, so **\[\{\\x\{ 061 62 0063\}\}\]** is equivalent to **\[\{abc\}\]**.
3. Ranges (**X**-**Y**) are only supported when elements **X** and **Y** each resolve to a single code point. Because a single-code-point string is equivalent to that code point, **\[a-b\]** and **\[\{a\}-\{b\}\]** are both supported, while **\[a-\{bz\}\]** and **\[\{ax\}-\{bz\}\]** are not.
4. If **\[\]** starts with \[:, then it begins a prop, and must also terminate with :\]. Thus **\[:di:\]** is a valid property expression, **\[di:\]** is a 3 code-point set, and **\[:di\]** raises an error. Whitespace is significant when initiating/terminating a POSIX property expression, so **\[ :\]** is syntactically valid and equivalent to **\[\\:\]**.
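
The flattening rule in item 2 can be illustrated with a minimal Python sketch of the `bracketedHex` nonterminal. The function name `parse_bracketed_hex` and the regular expressions are assumptions made for illustration; they are not part of this specification or of any ICU API.

```python
import re

# Pattern_White_Space code points, as listed in the syntax character table.
PATTERN_WHITE_SPACE = "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u200E\u200F\u2028\u2029"
_WS = "[" + PATTERN_WHITE_SPACE + "]"

# hexCodePoint = hex{1,5} | '10' hex{4}
_HEX_CP = r"(?:10[0-9A-Fa-f]{4}|[0-9A-Fa-f]{1,5})"

# bracketedHex = '{' s hexCodePoint (sRequired hexCodePoint)* s '}'
_BRACKETED_HEX = re.compile(
    r"\{" + _WS + "*" + _HEX_CP + "(?:" + _WS + "+" + _HEX_CP + ")*" + _WS + r"*\}"
)


def parse_bracketed_hex(text):
    """Return the flattened string for a bracketedHex such as '{61 62}'.

    Hypothetical helper: raises ValueError if the text does not match
    the bracketedHex production.
    """
    if _BRACKETED_HEX.fullmatch(text) is None:
        raise ValueError("not a valid bracketedHex: %r" % text)
    # After validation, every run of hex digits is one hexCodePoint.
    return "".join(chr(int(tok, 16)) for tok in re.findall(r"[0-9A-Fa-f]+", text))
```

Under these assumptions, `parse_bracketed_hex("{ 061 62 0063}")` yields the same string as the literal `abc`, while `{10FFFF1}` is rejected because it is neither a five-digit code point nor a six-digit one starting with `10`.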

The syntax characters are listed in the table below:

| Char | Hex | Name | Usage |
| ---- | ------ | -------------------- | ------------------------------------------ |
| $ | U+0024 | DOLLAR SIGN | Equivalent of \\uFFFF (This is for implementations that return \\uFFFF when accessing before the first or after the last character) |
| $ | U+0024 | DOLLAR SIGN | Equivalent to \\uFFFF when followed by '\]', initiator for variable identifiers otherwise |
| & | U+0026 | AMPERSAND | Intersecting UnicodeSets |
| - | U+002D | HYPHEN-MINUS | Ranges of characters; also set difference. |
| : | U+003A | COLON | POSIX-style property syntax |
@@ -2904,17 +2920,55 @@ The syntax characters are listed in the table below:
| } | U+007D | RIGHT CURLY BRACKET | Strings in set; Perl property syntax |
| | U+0020 U+0009..U+000D U+0085<br/>U+200E U+200F<br/>U+2028 U+2029 | ASCII whitespace,<br/>LRM, RLM,<br/>LINE/PARAGRAPH SEPARATOR | Ignored except when escaped |

Note that some syntax characters only have a special meaning in a certain context. In particular:
* Out of all above syntax characters, only \\, \}, and whitespace have a special meaning inside strings (**\[\{\[a-z\]\}\]** is the set of the string '\[a-z\]', **\[\{\$blah\}\]** is the set of the string '\$blah').
* \$ is equivalent to \uFFFF when appearing at the very end of a set, with or without trailing whitespace (**[a-z\$]**, **[a-z\$ ]**), and is used as the starting indicator for a variable reference elsewhere, in which case the variable name is the longest match on the `var` nonterminal (such as **[\$my_set]**).
* \- is equivalent to the literal character \\- when occurring at the very beginning of a set, after a \^ at the beginning of a set, or at the very end of a set, in all cases with or without whitespace around the \- (**[-abc]**, **[^ -abc]**, **[abc-]**), and is used as the set difference or range operator elsewhere (**[[abc]-[bc]]**, **[a-z]**).
* \: initiates a POSIX property set when directly after a \[ without whitespace in between (**[:L:]**), ends a POSIX property set when directly before a \] without whitespace in between (**[:L:]**), and is equivalent to the literal character \\\: in any other place (**[ \:]**, **[L\:]**).
* \} ends a string when occurring inside a string (**[{hello}]**), and is equivalent to the literal character \\\} in any other place (**[}a]**).

###### Syntax Special Case Examples
The following table gives examples, including common sources of confusion concerning the UnicodeSet syntax:
| Expression | Contained Elements | Syntax Errors |
| - | - | - |
| **\[^a\]** | All Unicode code points except 'a' | **\[ ^a\]**, **\[a^\]** |
| **\[\\^a\]** | 'a' and '^' | |
| **\[:L:\]** | All code points with Unicode property 'General_Category' equal to 'Letter' | **\[:L\]**, **\[:\]** |
| **\[ :\]** | ':' | |
| **\[L:\]** | 'L' and ':' | |
| **\[-\]** | '-' | |
| **\[ - \]** | '-' | |
| **\[a-\]**, **\[-a\]** | 'a' and '-' | |
| **\[a -b\]** | All code points between 'a' and 'b' (inclusive) | |
| **\[\[a-b\] -\[b\]\]**, **\[\[a\]-\[b\]-\[c\]\]** | 'a' | **\[a-b-c\]** |
| **\[^ - \]** | All Unicode code points except '-' | **\[ ^ - \]** |
| **\[\$\]**, **\[ \$ \]** | U+FFFF | |
| **\[\$a\]** | The value of the variable '\$a' | **\[\$ a\]**, **\[\$und\]** |
| **\[\$a\$\]** | U+FFFF and the value of the variable '\$a' | |
| **\[a\$\]** | 'a' and U+FFFF | |
| **\[\}\]** | '\}' | **\[\{\]** |
| **\[\{\}\]** | the empty string, '' | |
| **\[\{\}\}\]** | '\}' and the empty string, '' | |
| **\[\{\{\}\]** | '\{' | |
| **\[\{\$var\}\]** | the string '\$var' | |
| **\[\{\[a-z\}\]**, **\[\{ \[ a - z\}\]** | the string '\[a-z' | |
| **\[\\x\{10FFFF 1\}\]** | U+10FFFF and U+1 | **\[\\x\{10FFFF1\}\]** |
| **\[\\x\{61\}-d\]** | 'a', 'b', 'c', and 'd' | **\[\\x\{61 63\}-d\]**, **\[\\x\{61 63\}-\\x\{62 64\}\]** |

*Note: the above assumes that variables are supported, \$a is defined as a full UnicodeSet, a string, or a char, and \$und is not defined at all.*





##### <a name="Lists_of_Code_Points" href="#Lists_of_Code_Points">Lists of Code Points</a>

Lists are a sequence of strings that may include ranges, which are indicated by a '-' between two code points, as in "a-z". The sequence _start-end_ specifies the range of all code points from the start to end, inclusive, in Unicode order. For example, **[a c d-f m]** is equivalent to **[a c d e f m]**. Whitespace can be freely used for clarity, as **[a c d-f m]** means the same as **[acd-fm]**.

A string with multiple code points is represented in a list by being surrounded by curly braces, such as in **[a-z \{ch}]**. It can be used with the range notation, as described in _Section String Range](#String_Range)_ . There is an additional restriction on string ranges in a UnicodeSet: the number of codepoints in the first string of the range must be identical to the number in the second. Thus [\{ab}-\{c}] and [\{ab}-c] are invalid.
A string with multiple code points is represented in a list by being surrounded by curly braces, such as in **[a-z \{ch}]**. It can be used with the range notation, with the restriction that each string contains exactly one code point. Thus **\[\{ab\}-\{c\}\]**, **\[\{ax\}-\{bz\}\]**, and **\[\{ab\}-c\]** are invalid. A string consisting of a single code point is equivalent to that code point, that is, **[\{a}-c]** is valid and equivalent to **[a b c]**.
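
As a rough illustration of the range rule, the following Python sketch expands a range whose endpoints have already been resolved to strings. The name `expand_range` is hypothetical, and treating a reversed range as an error is an assumption of this sketch that the text above does not state.

```python
def expand_range(start, end):
    """Expand start-end into the list of code points it covers, inclusive.

    Both endpoints must resolve to exactly one code point, so '{a}' (resolved
    to 'a') is acceptable while '{ab}' (resolved to 'ab') is not.
    """
    if len(start) != 1 or len(end) != 1:
        raise ValueError("range endpoints must resolve to exactly one code point")
    if ord(start) > ord(end):
        # Assumption of this sketch: a reversed range is rejected.
        raise ValueError("range start must not be greater than range end")
    return [chr(cp) for cp in range(ord(start), ord(end) + 1)]
```

Here `expand_range("a", "c")` corresponds to **[\{a}-c]**, and `expand_range("ab", "c")` corresponds to the invalid **\[\{ab\}-c\]**.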

In UnicodeSets, there are two ways to quote syntax code points:

<a name="Backslash_Escapes"></a>
Outside of single quotes, certain backslashed code point sequences can be used to quote code points:
##### <a name="Backslash_Escapes" href="#Backslash_Escapes">Backslash Escapes</a>
Certain backslashed code point sequences can be used to quote code points:

| Sequence | Code point |
| --------------- | ------------------------------------ |
@@ -2972,6 +3026,43 @@ The binary operators '&', '-', and the implicit union have equal precedence and

**One caution:** the '&' and '-' operators operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern **[[:Lu:]-A]** is illegal, since it is interpreted as the set **[:Lu:]** followed by the incomplete range **-A**. To specify the set of upper case letters except for 'A', enclose the 'A' in brackets: **[[:Lu:]-[A]]**.

##### <a name="Variables_in_UnicodeSets" href="#Variables_in_UnicodeSets">Variables in UnicodeSets</a>

Support for variable identifiers (var) is optional.
They are used in certain contexts such as in [Transforms](https://cldr-smoke.unicode.org/spec/main/ldml/tr35-general.html#Transforms).
When they are used, they are defined as follows:

UnicodeSets may contain variables (`$my_char`, `$the_set`, ...) in place of full UnicodeSets and strings/characters. If variable support is enabled, variables must be defined (out-of-scope for UnicodeSets). In particular, referring to undefined variables is an error.

Not all variable maps are valid for a given expression in UnicodeSet syntax.
For instance, consider **[$a-$b]**; this may be a range of characters if both **$a** and **$b** are characters,
or a difference of sets if they are both sets; but given the map `{ a => '0', b => [:L:] }`, it is invalid.

**Note:** In particular, the variable map is needed not just to compute the actual set of characters and strings represented by the UnicodeSet,
but also to parse the UnicodeSet syntax: if **$a** and **$b** were unknown, the parsing of **[$a-$b]** would be ambiguous.
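
The dependence of parsing on the variable map can be sketched in Python as follows. This is only an illustration under stated assumptions: set-valued variables are modeled as `frozenset` objects, char and string values as `str`, and the function name `interpret_var_dash_var` is hypothetical.

```python
def interpret_var_dash_var(a, b):
    """Interpret [$a-$b] given the values of $a and $b.

    Assumed model: a frozenset stands for a full UnicodeSet value,
    a str stands for a char or string value.
    """
    a_is_set = isinstance(a, frozenset)
    b_is_set = isinstance(b, frozenset)
    if a_is_set and b_is_set:
        # Both are sets: '-' is the set difference operator.
        return a - b
    if not a_is_set and not b_is_set:
        # Both are chars/strings: '-' is the range operator, which requires
        # each side to resolve to exactly one code point.
        if len(a) == 1 and len(b) == 1:
            return frozenset(chr(cp) for cp in range(ord(a), ord(b) + 1))
        raise ValueError("range endpoints must be single code points")
    # One char and one set: neither a range nor a difference.
    raise ValueError("[$a-$b] is invalid for this variable map")
```

With this model, a map such as `{ a => '0', b => [:L:] }` falls into the last branch and is rejected.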

Variables are replaced by value, that is, **[a \$minus z]** with a variable map `{ minus => '-' }` is equivalent to **[-az]**, not **[a-z]** (i.e., cardinality of 3 instead of 26).
The full `var` nonterminal is replaced, i.e., the variable name together with the prefixed \$.
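
Replacement by value can be illustrated with a deliberately small Python sketch that handles only flat sets of literal characters, ranges, and variables with single-character values. Everything here is an assumption for illustration: the name `flat_set`, the use of `str.isspace` as a stand-in for Pattern_White_Space, and the use of `str.isalnum` plus `_` as a stand-in for XID_Continue.

```python
def flat_set(pattern, variables):
    """Evaluate a flat set such as '[a-z]' or '[a $minus z]' (sketch only)."""
    body = pattern[1:-1]
    items = []  # ('value', ch) for elements, ('dash', None) for a syntactic '-'
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == "$":
            j = i + 1
            while j < len(body) and (body[j].isalnum() or body[j] == "_"):
                j += 1
            # The whole '$name' is replaced by the variable's value, which is
            # always an element and never a syntactic operator.
            items.append(("value", variables[body[i + 1:j]]))
            i = j
        elif ch == "-":
            items.append(("dash", None))
            i += 1
        else:
            items.append(("value", ch))
            i += 1

    result = set()
    k = 0
    while k < len(items):
        kind, val = items[k]
        is_range = (
            kind == "value"
            and k + 2 < len(items)
            and items[k + 1][0] == "dash"
            and items[k + 2][0] == "value"
        )
        if is_range:
            end = items[k + 2][1]
            result.update(chr(cp) for cp in range(ord(val), ord(end) + 1))
            k += 3
        elif kind == "value":
            result.add(val)
            k += 1
        elif k == 0 or k == len(items) - 1:
            result.add("-")  # a '-' at the very beginning or end is literal
            k += 1
        else:
            raise ValueError("misplaced '-'")
    return result
```

Under these assumptions, `flat_set("[a $minus z]", {"minus": "-"})` has three elements, whereas `flat_set("[a-z]", {})` has 26.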

The variable syntax implements UAX31-R1-2 with XID_Start and XID_Continue. For more information, see [[UAX31](https://www.unicode.org/reports/tr41/#UAX31)].
Variable names are treated as equivalent normalized identifiers under Normalization Form C, implementing UAX31-R4. Furthermore, variable names are case-sensitive.


Notes:
1. The 'type' of a variable value is not specified syntactically.
Thus \[\$a\-\$b\] can resolve whether \$a and \$b are chars/strings (e.g., \$a=δ, \$b=θ) or full UnicodeSets (e.g., \$a=\\p\{script=greek\}, \$b=\\p\{general_category=letter\}).
The only restriction is that the result be syntactic; thus (\$a=w, \$b=xy) would raise an error.
2. Variable substitution is currently disallowed inside of property expressions.
Thus \\p{gc=\$blah} raises an error.
3. '\$' when followed by '\]' is interpreted as \\uFFFF, and is used to match before the start of a string or after the end.
Thus \[ab\$\] matches the string "xaby" in the locations (marked with '()'): "()xaby", "x(a)by", "xa(b)y", "xaby()".
4. If an unescaped '\$' is neither followed by a character of type \[:XID_Start:\] nor a '\]', it is a syntax error.
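
Notes 3 and 4 can be sketched as a small classifier for an unescaped '\$'. This Python sketch is an approximation under stated assumptions: the names `classify_dollar`, `is_xid_start`, and `is_xid_continue` are hypothetical, and the XID checks rely on `str.isidentifier`, which reflects the Unicode version bundled with the Python interpreter. It does not check that the following '\]' actually closes the enclosing set, nor that the variable is defined.

```python
PATTERN_WHITE_SPACE = "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u200E\u200F\u2028\u2029"


def is_xid_start(ch):
    # Python identifiers may start with XID_Start or '_', so '_' is excluded.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch):
    # Appending ch to a valid start character tests XID_Continue membership.
    return ("a" + ch).isidentifier()


def classify_dollar(pattern, i):
    """Classify the unescaped '$' at pattern[i].

    Returns ('var', name) for a variable reference, using the longest match
    (name is given without the '$' prefix), or ('anchor', None) when the '$'
    stands for U+FFFF. Raises ValueError otherwise.
    """
    if pattern[i] != "$":
        raise ValueError("no '$' at index %d" % i)
    j = i + 1
    if j < len(pattern) and is_xid_start(pattern[j]):
        k = j + 1
        while k < len(pattern) and is_xid_continue(pattern[k]):
            k += 1
        return ("var", pattern[j:k])
    # Optional whitespace may separate the '$' from the closing ']'.
    while j < len(pattern) and pattern[j] in PATTERN_WHITE_SPACE:
        j += 1
    if j < len(pattern) and pattern[j] == "]":
        return ("anchor", None)
    raise ValueError("'$' must be followed by a variable name or ']'")
```

For example, the second '\$' in **\[\$a\$\]** is classified as the U+FFFF anchor, while **\[\$ a\]** is rejected.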

**Backwards compatibility**: In prior versions of this document, the character \$ was a valid element of the `char` nonterminal with the special meaning of `\uFFFF`.
In current versions, the \$ character may only appear by itself at the end of a UnicodeSet, e.g., **[a-z\$]**, where it keeps that interpretation.
In any other location, \$ is allowed only as the prefix for variables.
The previous behavior of allowing \$ in the `char` nonterminal is considered obsolete and must be avoided by new implementations.

##### <a name="UnicodeSet_Examples" href="#UnicodeSet_Examples">UnicodeSet Examples</a>

The following table summarizes the syntax that can be used.
@@ -2986,7 +3077,6 @@ The following table summarizes the syntax that can be used.
| [[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2 |
| [a \{ab} \{ac}] | The code point 'a' and the multi-code point strings "ab" and "ac" |
| [x\\u\{61 2019 62}y] | Equivalent to [x\\u0061\\u2019\\u0062y] (= [xa’by]) |
| [\{ax}-\{bz}] | The set containing [\{ax} \{ay} \{az} \{bx} \{by} \{bz}], using the range syntax to get all the strings from \{ax} to \{bz} as described in _Section String Range](#String_Range)_. |
| [:Lu:] | The set of code points with a given property value, as defined by PropertyValueAliases.txt. In this case, these are the Unicode upper case letters. The long form for this is **[:General_Category=Uppercase_Letter:]**. |
| [:L:] | The set of code points belonging to all Unicode categories starting with 'L', that is, **[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]**. The long form for this is **[:General_Category=Letter:]**. |
