<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<head>
<title>
Named universal character escapes
</title>
<style type="text/css">
body {
max-width: 1600px;
}
table#header th,
table#header td
{
text-align: left;
}
table#references th,
table#references td
{
vertical-align: top;
}
#hideins:checked ~ * ins, #hideins:checked ~ * ins * { display:none; visibility:hidden }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }
ins, ins *
{
text-decoration: underline;
color: #000000;
background-color:#C8FFC8
}
del, del *
{
text-decoration: line-through;
color: #000000;
background-color:#FFA0A0
}
blockquote
{
color: #000000;
background-color: #F1F1F1;
border: 1px solid #D1D1D1;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.stdins
{
color: #000000;
background-color: #C8FFC8;
border: 1px solid #B3EBB3;
padding: 0.5em;
}
blockquote.stddel
{
text-decoration: line-through;
color: #000000;
background-color: #FFA0A0;
border: 1px solid #ECD7EC;
padding-left: 0.5em;
padding-right: 0.5em;
}
</style>
</head>
<body>
<table id="header">
<tr>
<th>Document Number:</th>
<td>P2071R0</td>
</tr>
<tr>
<th>Date:</th>
<td>2020-01-13</td>
</tr>
<tr>
<th>Audience:</th>
<td>SG16, EWG</td>
</tr>
<tr>
<th>Reply-to:</th>
<td>Tom Honermann &lt;tom@honermann.net&gt;<br/>
Peter Bindels &lt;peterbindels@gmail.com&gt;</td>
</tr>
</table>
<h1>
Named universal character escapes
</h1>
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#motivation">Motivation</a></li>
<li><a href="#design">Design considerations</a>
<ul>
<li><a href="#design_syntax">Syntax</a>
<li><a href="#design_names">Name sources</a>
<li><a href="#design_matching">Name matching</a>
<li><a href="#design_portability">Portable names</a>
<li><a href="#design_existing_practice">Existing practice</a>
<li><a href="#design_compat">Backward compatibility</a>
<li><a href="#design_impact">Implementor impact</a>
<li><a href="#design_alt">Design alternatives</a>
</ul>
</li>
<li><a href="#proposal">Proposal</a></li>
<li><a href="#proposal_opts">Proposal options</a></li>
<li><a href="#future">Possible future extensions</a></li>
<li><a href="#implementation_exp">Implementation experience</a></li>
<li><a href="#acknowledgements">Acknowledgements</a></li>
<li><a href="#references">References</a></li>
<li><a href="#core_wording">Core wording</a></li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>
This proposal continues the effort R. Martinho Fernandes initiated that
culminated in
<a title="Named character escapes"
href="https://wg21.link/p1097r2">
P1097R2</a><sup><a title="Named character escapes"
href="#ref_p1097r2">[P1097R2]</a></sup>.
This proposal does not deviate from the general design intent in Fernandes'
work, but does deviate in the following specific details:
<ul>
<li>This proposal uses
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
for matching names rather than just case-insensitive matching. This is
primarily motivated by implementation concerns; ignoring spaces allows
for a more efficient implementation.
</li>
<li>This proposal includes a feature test macro.</li>
</ul>
</p>
<p>
C++ programmers have been able to portably use characters outside of the basic
source character set in character and string literals since the introduction of
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
in C++11. For example:
<div style="margin-left: 1em;">
<pre><code class="c++">U'\u0100' // UTF-32 character literal with U+0100 {LATIN CAPITAL LETTER A WITH MACRON}
u8"\u0100\u0300" // UTF-8 string literal with U+0100 {LATIN CAPITAL LETTER A WITH MACRON} U+0300 {COMBINING GRAVE ACCENT}</code></pre>
</div>
</p>
<p>
This proposal enables the above literals to be written using Unicode assigned
names instead of Unicode code point values.
<div style="margin-left: 1em;">
<pre><code class="c++">U'\N{LATIN CAPITAL LETTER A WITH MACRON}' // Equivalent to U'\u0100'
u8"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}" // Equivalent to u8"\u0100\u0300"</code></pre>
</div>
</p>
<p>
Prior presentations of P1097 to EWG-I and EWG received strong encouragement:
<ul>
<li>Poll of
<a title="P1097R1: Named character escapes"
href="https://wg21.link/p1097r1">
P1097R1</a><sup><a title="P1097R1: Named character escapes"
href="#ref_p1097r1">[P1097R1]</a></sup>
in
<a href="http://wiki.edg.com/bin/view/Wg21sandiego2018/P1097R1">EWG-I in San Diego, 2018</a>:
<div style="margin-left: 1em;">
Do we want named escape sequences?
<table border="1" style="border-collapse: collapse">
<tr><th>SF</th><th>F</th><th>N</th><th>A</th><th>SA</th></tr>
<tr><td>5</td><td>9</td><td>7</td><td>0</td><td>0</td></tr>
</table>
</div>
</li>
<li>Poll of
<a title="P1097R2: Named character escapes"
href="https://wg21.link/p1097r2">
P1097R2</a><sup><a title="P1097R2: Named character escapes"
href="#ref_p1097r2">[P1097R2]</a></sup>
in
<a href="http://wiki.edg.com/bin/view/Wg21belfast/P1097-EWG">EWG in Belfast, 2019</a>:
<div style="margin-left: 1em;">
EWG wants to encourage further work in this area
<table border="1" style="border-collapse: collapse">
<tr><th>SF</th><th>F</th><th>N</th><th>A</th><th>SA</th></tr>
<tr><td>8</td><td>16</td><td>8</td><td>1</td><td>1</td></tr>
</table>
</div>
</li>
</ul>
</p>
<p>
Two areas of concern were raised during
<a href="http://wiki.edg.com/bin/view/Wg21belfast/P1097-EWG">discussion in EWG in Belfast, 2019</a>:
<ul>
<li><b>Implementation impact</b><br/>
The Unicode name database (names and aliases), in text form, is ~1.5 MiB
and a naive implementation could significantly impact the size of compiler
distributions. This was of particular concern to organizations that
distribute compilers as part of a distributed build process.
</li>
<li><b>Design concerns</b><br/>
One EWG member strongly preferred a library based design that would have
a smaller impact on the core language. For example, a string
interpolation based design.
</li>
</ul>
This paper discusses and links to work completed by Corentin Jabot that
investigates implementation impact, though an implementation has not yet been
completed. This paper also includes discussion regarding alternative design
possibilities.
</p>
<h1 id="motivation">Motivation</h1>
<p>
The introduction of
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
in C++11 benefitted programmers by allowing them to portably encode characters
outside of the basic source character set without having to resort to use of
octal or hexadecimal
<a href="http://eel.is/c++draft/lex.ccon#nt:escape-sequence"><em>escape-sequence</em></a>s
to explicitly encode code units. However, Unicode code points by themselves do
not clearly communicate to readers of the code which character is to be encoded;
hence the code comments included with the code examples in the introduction.
Allowing programmers to directly use Unicode assigned character names avoids the
need for side channel communications, like code comments, that might get out of
sync over time.
</p>
<p>
Use of UTF-8 as the encoding for source files has increased over time, but
impediments to adoption remain. For example, Microsoft Visual C++ still
defaults to a locale dependent encoding and that encourages limiting source
files to ASCII. If the C++ community were to migrate en masse to UTF-8,
then one might question whether
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
would become a legacy backward compatibility feature since programmers could
reliably type the intended character in their source code directly. And if
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
were to become an anachronism, then what use would be served by introducing
a named character escape?
</p>
<p>
Unicode defines a number of characters that, even when they can be typed
directly, can result in confusion. These include invisible characters
such as U+200B {ZERO WIDTH SPACE}, combining characters such as U+0300
{COMBINING GRAVE ACCENT}, visually indistinct characters such as U+003B
{SEMICOLON} and U+037E {GREEK QUESTION MARK}, and characters with
RTL (right-to-left) directionality. Consider how the following string
literals containing these characters are rendered. In cases like these,
use of escape sequences improves clarity; thus motivation for use of Unicode
escape sequences will remain.
<div style="margin-left: 1em;">
<table style="border:1px solid black">
<tr>
<td>
<tt>""</tt><br/>
<tt>""</tt><br/>
<tt>"̀"</tt><br/>
<tt>";"</tt><br/>
<tt>";"</tt><br/>
<tt>"´"</tt><br/>
<tt>"́"</tt><br/>
<tt>"´"</tt><br/>
<tt>"Ω"</tt><br/>
<tt>"Ω"</tt><br/>
<tt>"A"</tt><br/>
<tt>"Α"</tt><br/>
<tt>"А"</tt><br/>
<tt>"Ꭺ"</tt><br/>
<tt>"ꓮ"</tt><br/>
<tt>"𐊠" </tt><br/>
<tt>"𖽀" </tt><br/>
</td>
<td>
<tt>// U+0000200B {ZERO WIDTH SPACE}</tt><br/>
<tt>// U+0000200F {RIGHT-TO-LEFT MARK}</tt><br/>
<tt>// U+00000300 {COMBINING GRAVE ACCENT}</tt><br/>
<tt>// U+0000003B {SEMICOLON}</tt><br/>
<tt>// U+0000037E {GREEK QUESTION MARK}</tt><br/>
<tt>// U+000000B4 {ACUTE ACCENT}</tt><br/>
<tt>// U+00000301 {COMBINING ACUTE ACCENT}</tt><br/>
<tt>// U+00001FFD {GREEK OXIA}</tt><br/>
<tt>// U+000003A9 {GREEK CAPITAL LETTER OMEGA}</tt><br/>
<tt>// U+00002126 {OHM SIGN}</tt><br/>
<tt>// U+00000041 {LATIN CAPITAL LETTER A}</tt><br/>
<tt>// U+00000391 {GREEK CAPITAL LETTER ALPHA}</tt><br/>
<tt>// U+00000410 {CYRILLIC CAPITAL LETTER A}</tt><br/>
<tt>// U+000013AA {CHEROKEE LETTER GO}</tt><br/>
<tt>// U+0000A4EE {LISU LETTER A}</tt><br/>
<tt>// U+000102A0 {CARIAN LETTER A}</tt><br/>
<tt>// U+00016F40 {MIAO LETTER ZZYA}</tt><br/>
</td>
</tr>
</table>
</div>
</p>
<p>
Named character escapes are supported in various forms in other programming
languages. The following is the result of a brief survey of various languages.
For languages that include such support, more details can be found in the
<a href="#design">Design considerations</a> section.
<div style="margin-left: 1em;">
<table border="1" style="border-collapse: collapse">
<tr>
<th style="text-align:left">Language</th>
<th style="text-align:left">Named character escape support</th>
</tr>
<tr> <td>C#</td> <td>No</td> </tr>
<tr> <td>D</td> <td>Yes; HTML 5 named character references</td> </tr>
<tr> <td>Go</td> <td>No</td> </tr>
<tr> <td>Java</td> <td>No</td> </tr>
<tr> <td>JavaScript</td> <td>No</td> </tr>
<tr> <td>Perl</td> <td>Yes; Unicode names, aliases, and named sequences</td> </tr>
<tr> <td>PHP</td> <td>No</td> </tr>
<tr> <td>Python</td> <td>Yes; Unicode names and aliases</td> </tr>
<tr> <td>Raku</td> <td>Yes; Unicode names, aliases, named sequences, and emoji sequences</td> </tr>
<tr> <td>Ruby</td> <td>No</td> </tr>
<tr> <td>Rust</td> <td>No</td> </tr>
<tr> <td>Swift</td> <td>No</td> </tr>
<tr> <td>Visual Basic</td> <td>No</td> </tr>
</table>
</div>
</p>
<h1 id="design">Design considerations</h1>
<p>
There are numerous choices for how support for named characters can be
integrated into C++. Useful questions for making design choices include:
<ul>
<li>Which names will be recognized? Can multiple names for the same character exist?</li>
<li>How will names be matched? Must they be exact? Case insensitive?</li>
<li>How will support for new names affect backward compatibility?</li>
<li>How will the requirement for a name database impact implementations?</li>
<li>What syntax to use?</li>
<li>What is existing practice in other languages?</li>
</ul>
This section analyzes the various options considered for this proposal.
</p>
<h2 id="design_syntax">Syntax</h2>
<p>
Named character escapes are proposed as a more readable alternative to
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name">universal-character-name</a>s.
As such, it is desirable that they be similar in syntax to
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name">universal-character-name</a>s
and other existing escape sequences.
</p>
<p>
The syntax proposed by Fernandes in
<a title="Named character escapes"
href="https://wg21.link/p1097r2">
P1097R2</a><sup><a title="Named character escapes"
href="#ref_p1097r2">[P1097R2]</a></sup>
is modeled after the syntax adopted for Python and consists of a <tt>\N</tt>
escape introducer followed by a name enclosed in curly brackets. For example:
<div style="margin-left: 1em;">
<pre><code class="c++">'\N{LATIN CAPITAL LETTER A}'
"\N{LATIN CAPITAL LETTER A WITH MACRON}"
</code></pre>
</div>
</p>
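<p>
A minimal sketch of how a lexer might recognize this syntax follows. The helper
<tt>named_escape_name</tt> is hypothetical and is not part of this proposal or
of any existing compiler; it only extracts the text between the curly brackets
and performs no name lookup or validation of the name itself.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (an illustration only): given text that starts at a
// backslash, return the name enclosed by a named character escape, or nothing
// if the text does not begin with a complete escape of that form.
std::optional<std::string_view> named_escape_name(std::string_view text) {
    constexpr std::string_view introducer = "\\N{";
    // The escape must begin with the introducer and an opening curly bracket.
    if (text.substr(0, introducer.size()) != introducer) {
        return std::nullopt;
    }
    text.remove_prefix(introducer.size());
    // The name runs up to the first closing curly bracket and may not be empty.
    const std::size_t close = text.find('}');
    if (close == std::string_view::npos || close == 0) {
        return std::nullopt;
    }
    return text.substr(0, close);
}
```

<p>
Under these assumptions, the text <tt>\N{LATIN CAPITAL LETTER A}</tt> yields the
name <tt>LATIN CAPITAL LETTER A</tt>, while an escape that lacks its closing
curly bracket yields nothing and would be diagnosed by the caller.
</p>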
<p>
Other choices for the escape introducer are possible; the
<a href="#design_compat">Backward compatibility</a>
section discusses some possible motivation for preferring
<tt>\u</tt> and/or <tt>\U</tt> and the
<a href="#proposal_opts">Proposal options</a>
section includes this alternate syntax as an option.
</p>
<p>
Options for recognized names and how to match them are discussed in
subsequent sections.
</p>
<p>
As proposed, only one name is allowed per named character escape, but that
is an artificial limitation. Raku allows a sequence of comma separated names
to be specified in a single escape. This is a natural extension if names
are permitted to identify sequences of characters instead of a single
character. The following would all be equivalent. This proposal leaves this
option to a future extension; see the
<a href="#future">Possible future extensions</a> section.
<div style="margin-left: 1em;">
<pre><code class="c++">"\N{LATIN CAPITAL LETTER A WITH MACRON, COMBINING GRAVE ACCENT}"
"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}"
"\u0100\u0300"
</code></pre>
</div>
</p>
<p>
Perl and Raku both allow Unicode code point numbers to be specified as character
names. Adopting this in C++ could enable a syntax that avoids the strict
four or eight hexadecimal digit requirement of
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name">universal-character-name</a>s
and that supports the natural <tt>U+NNNN</tt> style frequently used to identify
Unicode characters. The following could all be equivalent. This proposal
also leaves this option for a future extension as discussed in the
<a href="#future">Possible future extensions</a> section.
<div style="margin-left: 1em;">
<pre><code class="c++">"\N{U+0100}"
"\N{U+100}"
"\N{U+00000100}"
"\N{0x0100}"
"\N{256}"
"\u0100"
</code></pre>
</div>
</p>
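<p>
The spellings above could be reduced to a code point roughly as follows. This
is only a sketch of the possible future extension, not proposed behavior; the
helper name <tt>parse_code_point</tt> and the exact set of accepted spellings
(a <tt>U+</tt> or <tt>0x</tt> prefix for hexadecimal, decimal otherwise) are
assumptions made for illustration.
</p>

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (an illustration only): convert a code point spelling
// such as "U+0100", "0x0100", or "256" to a Unicode scalar value.
std::optional<char32_t> parse_code_point(std::string_view text) {
    int base = 10;
    const bool has_prefix = text.size() > 2 &&
        ((text[0] == 'U' && text[1] == '+') ||
         (text[0] == '0' && (text[1] == 'x' || text[1] == 'X')));
    if (has_prefix) {
        base = 16;
        text.remove_prefix(2);
    }
    std::uint32_t value = 0;
    const char* const last = text.data() + text.size();
    const auto [ptr, ec] = std::from_chars(text.data(), last, value, base);
    // Reject empty input, trailing characters, and out-of-range numbers.
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;
    }
    // Reject values beyond U+10FFFF and the surrogate code points.
    if (value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF)) {
        return std::nullopt;
    }
    return static_cast<char32_t>(value);
}
```

<p>
With this sketch, <tt>U+0100</tt>, <tt>U+100</tt>, <tt>U+00000100</tt>,
<tt>0x0100</tt>, and <tt>256</tt> all produce the same code point as
<tt>\u0100</tt>.
</p>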
<h2 id="design_names">Name sources</h2>
<p>
A named character escape feature is not particularly useful unless accompanied
by at least one source of character names. The following list contains sources
of character names that are consulted by at least one implementation of named
character escapes in another programming language.
<ul>
<li>Unicode assigned names (synchronized with ISO/IEC 10646)<br/>
<a href="https://www.unicode.org/Public/12.0.0/ucd/NamesList.txt">https://www.unicode.org/Public/12.0.0/ucd/NamesList.txt</a>
</li>
<li>Unicode aliases (synchronized with ISO/IEC 10646)<br/>
<a href="https://www.unicode.org/Public/12.0.0/ucd/NameAliases.txt">https://www.unicode.org/Public/12.0.0/ucd/NameAliases.txt</a>
</li>
<li>Unicode named sequences (synchronized with ISO/IEC 10646)<br/>
<a href="https://www.unicode.org/Public/12.0.0/ucd/NamedSequences.txt">https://www.unicode.org/Public/12.0.0/ucd/NamedSequences.txt</a>
</li>
<li>Emoji ZWJ sequences<br/>
<a href="https://www.unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt">https://www.unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt</a>
</li>
<li>Emoji sequences<br/>
<a href="https://www.unicode.org/Public/emoji/4.0/emoji-sequences.txt">https://www.unicode.org/Public/emoji/4.0/emoji-sequences.txt</a>
</li>
<li>HTML named character references<br/>
<a href="https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references">https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references</a>
</li>
</ul>
</p>
<p>
The first three are defined by the Unicode Consortium, are part of the Unicode
standard, and are synchronized with ISO/IEC 10646. The names specified in each
are designed in concert, share a common namespace, and are immutable once
published; Unicode guarantees no conflicts between them. See the
<a title="Unicode Character Encoding Stability Policies"
href="https://www.unicode.org/policies/stability_policy.html">
Unicode character encoding stability policy</a><sup><a title="Unicode Character Encoding Stability Policies"
href="#ref_ucesp">[UCESP]</a></sup>
for more details. These sources are consulted for named character escapes
in Perl, Python, and Raku.
</p>
<p>
The next two sources specify emoji character sequences. Though produced
by the Unicode Consortium, they are not part of the Unicode standard, and
are not covered by the
<a title="Unicode Character Encoding Stability Policies"
href="https://www.unicode.org/policies/stability_policy.html">
Unicode character encoding stability policy</a><sup><a title="Unicode Character Encoding Stability Policies"
href="#ref_ucesp">[UCESP]</a></sup>.
These two sources don't technically provide names; they provide optional
descriptions. The provided descriptions use characters, particularly
<tt>:</tt> and <tt>,</tt>, that are disallowed in the names provided
by the first three sources. These sources are consulted for named character
escapes in Raku.
</p>
<p>
The last source is the specification of names recognized for use as named
character references in HTML documents. This source is used for the
implementation of named character escapes in the D programming language.
</p>
<p>
The stability guarantees offered by the Unicode standard are a strong motivator
for their use and, as such, this proposal adopts them as the name sources to
use.
</p>
<p>
The list of Unicode assigned names associates at most one name with each
character. Some characters are not assigned a name in this list; for example,
U+0080 is simply listed as a <tt>&lt;control&gt;</tt> character with no name.
In some of these cases, the Unicode aliases list provides one or more names.
For example, U+0080 has assigned aliases of
<tt>PADDING CHARACTER</tt> (a figment alias) and <tt>PAD</tt>
(an abbreviation alias).
</p>
<p>
Unicode aliases provide another critical service. As mentioned above,
once assigned, names are immutable. Corrections are only offered by providing
an alias. Aliases come in five varieties:
<ul>
<li><b>correction</b><br/>
Aliases for cases where an incorrect assigned name was published.
For example, U+FE18 has an assigned name of
<tt>PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET</tt>
and a correction alias of
<tt>PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET</tt>
(note the typo correction).
</li>
<li><b>control</b><br/>
Aliases for various control characters. For example, U+0000 has a
control alias of <tt>NULL</tt>.
</li>
<li><b>alternate</b><br/>
Aliases for widely used alternate names. For example,
<tt>BYTE ORDER MARK</tt> for U+FEFF.
</li>
<li><b>figment</b><br/>
Aliases for names that were documented, but never accepted in a standard.
For example, <tt>HIGH OCTET PRESET</tt> for U+0081.
</li>
<li><b>abbreviation</b><br/>
Aliases for common abbreviations. For example,
<tt>NBSP</tt> for U+00A0.
</li>
</ul>
</p>
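<p>
Because assigned names and aliases share a single namespace, a lookup can
treat them uniformly. The following sketch assumes a tiny hand-written table
holding only the examples mentioned in this section; the names
<tt>NameEntry</tt>, <tt>name_table</tt>, and <tt>find_code_point</tt> are
hypothetical, and a real implementation would generate its data from the
Unicode name sources rather than list entries by hand.
</p>

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string_view>

// One row of the illustrative table: a name or alias and its code point.
struct NameEntry {
    std::string_view name;
    char32_t code_point;
};

// Hypothetical table containing only the examples from this section.
constexpr std::array<NameEntry, 9> name_table{{
    // Assigned names (the second one contains the published typo).
    {"LATIN CAPITAL LETTER A WITH MACRON", 0x0100},
    {"PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET", 0xFE18},
    // Correction alias for the name above.
    {"PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET", 0xFE18},
    // Control alias.
    {"NULL", 0x0000},
    // Alternate alias.
    {"BYTE ORDER MARK", 0xFEFF},
    // Figment aliases.
    {"PADDING CHARACTER", 0x0080},
    {"HIGH OCTET PRESET", 0x0081},
    // Abbreviation aliases.
    {"PAD", 0x0080},
    {"NBSP", 0x00A0},
}};

// Exact-match lookup over names and aliases alike; a linear scan is enough
// for a table this small.
constexpr std::optional<char32_t> find_code_point(std::string_view name) {
    for (const NameEntry& entry : name_table) {
        if (entry.name == name) {
            return entry.code_point;
        }
    }
    return std::nullopt;
}
```

<p>
In this sketch, the misspelled assigned name and its correction alias both
resolve to U+FE18, and <tt>PAD</tt> and <tt>PADDING CHARACTER</tt> both resolve
to U+0080.
</p>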
<p>
It is conceivable that implementors could desire, or be asked, to support
additional implementation-defined names, perhaps including names from the
additional sources listed above. Since new characters and names will continue
to be added to the Unicode standard, caution is warranted to avoid the
possibility of introducing conflicting names over time. The description of the
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
name matching algorithm describes a historical case of how such a conflict
once occurred. Any support for additional names should ensure that they
occupy a non-overlapping namespace with the Unicode assigned names. Out of
caution, this proposal disallows additional implementation-defined names.
</p>
<h2 id="design_matching">Name matching</h2>
<p>
Names can be finicky things. Having to remember whether a name is, for example,
<tt>ZERO WIDTH SPACE</tt> or <tt>ZERO-WIDTH SPACE</tt> is likely to frustrate
programmers. Some programmers might prefer <tt>zero width space</tt>.
</p>
<p>
Unicode provides a straightforward algorithm for matching names with various
allowances including case-insensitivity, omission of some hyphens (<tt>-</tt>),
and substitution of underscore (<tt>_</tt>) for space characters.
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
is included in the Unicode standard via
<a title="Unicode Standard Annex #44 - Unicode Character Database"
href="https://www.unicode.org/reports/tr44/tr44-24.html">
Unicode Standard Annex #44</a><sup><a title="Unicode Standard Annex #44 - Unicode Character Database"
href="#ref_uax44">[UAX#44]</a></sup>.
</p>
<p>
The
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
matching rule would accept any of the following names as a match for
U+200B {ZERO WIDTH SPACE}:
<div style="margin-left: 1em;">
<pre><code class="c++"><tt>ZERO WIDTH SPACE</tt>
<tt>ZERO-WIDTH SPACE</tt>
<tt>zero-width space</tt>
<tt>ZERO width S P_A_C E</tt></code></pre>
</div>
</p>
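<p>
One way to implement this kind of loose matching is to reduce both the name
written by the programmer and each known name to a comparison key and then
compare the keys. The sketch below is a simplification under stated
assumptions: the function name <tt>loose_key</tt> is hypothetical, only ASCII
input is considered, a hyphen is treated as medial when it sits directly
between two letters or digits in the text as written, and the exception that
UAX44-LM2 makes for U+1180 {HANGUL JUNGSEONG O-E} is not handled.
</p>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (an illustration only): reduce a character name to a
// key that ignores case, spaces, underscores, and medial hyphens.
std::string loose_key(std::string_view name) {
    const auto is_alnum = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    std::string key;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char c = name[i];
        if (c == ' ' || c == '_') {
            continue;  // spaces and underscores never take part in matching
        }
        if (c == '-') {
            // A hyphen directly between two letters or digits is medial and
            // is dropped; any other hyphen is kept as part of the key.
            const bool medial = i > 0 && i + 1 < name.size() &&
                                is_alnum(name[i - 1]) && is_alnum(name[i + 1]);
            if (medial) {
                continue;
            }
        }
        key += static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return key;
}
```

<p>
With this sketch, all four spellings listed above reduce to the same key,
<tt>ZEROWIDTHSPACE</tt>, whereas the non-medial hyphen in
<tt>TIBETAN MARK TSA -PHRU</tt> is preserved.
</p>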
<h2 id="design_portability">Portable names</h2>
<p>
Portably using named character escapes will require implementations to agree
on a minimum version of the name sources.
</p>
<p>
Thanks to the adoption of
<a title="Update The Reference To The Unicode Standard"
href="https://wg21.link/p1025r1">
P1025R1</a><sup><a title="Update The Reference To The Unicode Standard"
href="#ref_p1025r1">[P1025R1]</a></sup>
in Rapperswil, 2018, the C++ standard has a normative floating reference to
<a title="Information technology — Universal Coded Character Set (UCS)"
href="https://www.iso.org/standard/69119.html">
ISO/IEC 10646</a><sup><a title="Information technology — Universal Coded Character Set (UCS)"
href="#ref_10646">[ISO/IEC10646]</a></sup>,
the ISO/IEC standard that specifies a subset of what is specified in the
Unicode standard and is kept synchronized with it. ISO/IEC 10646:2017
includes the
Unicode assigned names (in section 33),
name aliases (in section 33), and
named character sequences (in section 27).
</p>
<p>
The floating reference to ISO/IEC 10646 indicates a dependence on the version
that is current at the time of standardization. Thus, conformance with the
C++ standard will require conformance with the latest available publication
of ISO/IEC 10646.
</p>
<p>Implementors must be allowed, and encouraged, to conform to more recent
versions of ISO/IEC 10646 as they are published.
</p>
<h2 id="design_existing_practice">Existing practice</h2>
<p>
Support for named escape sequences exists in several programming languages.
The following details of existing practice were obtained from these
documentation sources. The author has not verified the accuracy of this
information.
<table style="border:1px solid black">
<tr><th>Language</th><th>Documentation link</th></tr>
<tr>
<td>D</td>
<td><a href="https://dlang.org/spec/lex.html#StringLiteral">https://dlang.org/spec/lex.html#StringLiteral</a></td>
</tr>
<tr>
<td>Perl</td>
<td><a href="https://perldoc.perl.org/charnames.html">https://perldoc.perl.org/charnames.html</a></td>
</tr>
<tr>
<td>Python</td>
<td><a href="https://docs.python.org/3.8/reference/lexical_analysis.html#literals">https://docs.python.org/3.8/reference/lexical_analysis.html#literals</a></td>
</tr>
<tr>
<td>Raku</td>
<td><a href="https://docs.raku.org/language/unicode#Entering_unicode_codepoints_and_codepoint_sequences">https://docs.raku.org/language/unicode#Entering_unicode_codepoints_and_codepoint_sequences</a></td>
</tr>
</table>
</p>
<p>
Capabilities vary across languages:
<table border="1" style="border-collapse: collapse">
<tr>
<th>Language</th>
<th>Name sources</th>
<th>Comma separated names</th>
<th>Name matching</th>
<th>Matches code<br/>point numbers</th>
</tr>
<tr>
<td>D</td>
<td>HTML 5</td>
<td>No</td>
<td>Exact match?</td>
<td>No</td>
</tr>
<tr>
<td>Perl</td>
<td>Unicode names<br/>
Unicode name aliases<br/>
Unicode named sequences<br/>
registered custom aliases<br/>
</td>
<td>No</td>
<td>
Optionally, script qualified short names<br/>
Optionally, loose matching (case insensitive, ignore underscore, most spaces, and most non-medial hyphens)
</td>
<td>Yes</td>
</tr>
<tr>
<td>Python</td>
<td>Unicode names<br/>
Unicode name aliases<br/>
</td>
<td>No</td>
<td>Case-insensitive</td>
<td>No</td>
</tr>
<tr>
<td>Raku</td>
<td>Unicode names<br/>
Unicode name aliases<br/>
Unicode named sequences<br/>
emoji ZWJ sequences<br/>
emoji sequences<br/>
</td>
<td>Yes</td>
<td>Exact match?</td>
<td>Yes</td>
</tr>
</table>
</p>
<p>
Examples:
<table border="1" style="border-collapse: collapse">
<tr>
<th>Language</th>
<th>Code</th>
</tr>
<tr>
<td>D</td>
<td>
<pre><code class="D">"\&amp;Amacr;"</code></pre>
</td>
</tr>
<tr>
<td>Perl</td>
<td>
<pre><code class="perl">"\N{LATIN CAPITAL LETTER A WITH MACRON}"
"\N{U+0100}"
</code></pre>
</td>
</tr>
<tr>
<td>Python</td>
<td>
<pre><code class="python">"\N{LATIN CAPITAL LETTER A WITH MACRON}"</code></pre>
</td>
</tr>
<tr>
<td>Raku</td>
<td>
<pre><code class="raku">"\c[LATIN CAPITAL LETTER A WITH MACRON]"
"\c[256]"
"\c[LATIN CAPITAL LETTER A WITH MACRON,COMBINING GRAVE ACCENT]"
"\c[LATIN CAPITAL LETTER A WITH MACRON AND GRAVE]"</code></pre>
</td>
</tr>
</table>
</p>
<h2 id="design_compat">Backward compatibility</h2>
<p>
Escape sequences beyond those required in the standard are
conditionally-supported
(<a href="http://eel.is/c++draft/lex.ccon#7.sentence-3">[lex.ccon]p7</a>).
For implementations that currently define a meaning for <tt>\N</tt> in
character or string literals, the use of <tt>\N</tt> in this proposal is
technically a breaking change.
</p>
<p>
GCC, Clang, and Microsoft Visual C++ all accept <tt>\N</tt> as an escape
sequence with the semantic effect of substituting <tt>N</tt> such that
<tt>"\N{xxx}"</tt> is equivalent to <tt>"N{xxx}"</tt>. However, they each
emit a warning regarding an unrecognized escape sequence, so reliance on
this behavior is not likely to be common. Still, there are likely to be
some uses in the wild (some percentage of which were probably intended to
be <tt>\n</tt>).
</p>
<p>
Another option would be to reuse the <tt>\u</tt> and/or <tt>\U</tt> introducer
used for
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s.
GCC and Clang both reject code like <tt>"\u{xxx}"</tt> and <tt>"\U{xxx}"</tt>
as containing ill-formed
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s.
However, Microsoft Visual C++ accepts such uses without a warning and
treats them as equivalent to <tt>"u{xxx}"</tt> and <tt>"U{xxx}"</tt>
respectively.
</p>
<p>
The implementation divergence that occurs for the <tt>\u</tt> and <tt>\U</tt>
cases above suggests that repurposing them may result in less backward
compatibility. Use of <tt>\u</tt> and/or <tt>\U</tt> would potentially
require more wording changes to distinguish named character escapes from
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s,
but would be unlikely to pose a significant additional impact to implementors.
</p>
<p>
For now, this proposal adheres to Fernandes' original design and retains use
of <tt>\N</tt> as the introducer for named character escapes.
</p>
<h2 id="design_impact">Implementor impact</h2>
<p>
The sources of character names listed in the
<a href="#design_names">Name sources</a>
section do not constitute big data by today's standards, but that does not mean
that the volume of data and potential for impact to compiler distributions and
compiler performance is insignificant. As mentioned earlier, some organizations
have valid technical reasons to be sensitive to the size of the compiler
distributions they use; in a distributed build environment that distributes
compilers, the size of the distribution impacts latency and can therefore
negatively impact build times.
</p>
<p>
The combined size of the Unicode 12.0 text files containing the Unicode assigned
names, aliases, and named character sequences is approximately 1.5 MiB. A
naive implementation might contribute 2+ MiB of code/data to a compiler. Some
EWG members indicated that an increase of that size is a cause for concern.
</p>
<p>
Fortunately, naive implementations are not the only option. Corentin Jabot
has done some excellent work to demonstrate that an implementation should be
possible that increases the code/data size of a compiler by less than 300 KiB.
See the
<a href="#implementation_exp">Implementation experience</a> section for details.
Corentin's approach is promising, but the additional complexity carries
additional implementation and maintenance cost.
</p>
<p>
Staying up to date with new Unicode releases will also, of course, impose an
additional cost on implementors.
</p>
<h2 id="design_alt">Design alternatives</h2>
<p>
As indicated previously, at least one EWG member in Belfast was strongly
interested in a more general core language feature, presumably a string
interpolation facility, that would allow named character escapes to be
implemented as a library feature. Such a feature could take many forms,
but might look something like the following where <tt>\{</tt> is an
escape sequence followed by a call to a <tt>constexpr</tt> function named
<tt>nce</tt> with arguments passed in some form.
<div style="margin-left: 1em;">
<pre><code class="c++">"\{nce(LATIN CAPITAL LETTER A WITH GRAVE)}"</code></pre>
</div>
</p>
<p>
Such a feature could certainly be implemented, but would seem to necessarily
be more verbose and would necessitate inclusion of appropriate headers.
Those headers would either be quite large, in the case of a named character
database, or would make use of a compiler intrinsic, which would put the
complexity back in the compiler (though in implementation-defined territory
rather than in the standard core language). The verbosity concern could
potentially be reduced by introducing core language sugar for lowering the
proposed syntax to the example string interpolation syntax above.
</p>
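<p>
To make the library-based alternative more concrete, the following sketch
shows what the <tt>constexpr</tt> function side of such a design might look
like. It assumes C++17; the <tt>name_entry</tt> type, the
<tt>name_table</tt> array, and the signature of <tt>nce</tt> are all
hypothetical, and the three-entry table merely stands in for a full character
name database. The interpolation syntax itself is not modeled.
</p>

```cpp
#include <string_view>

// Hypothetical table entry; a real database would hold every assigned name.
struct name_entry {
    std::string_view name;
    char32_t code_point;
};

// Tiny stand-in for the character name database.
inline constexpr name_entry name_table[] = {
    {"COMBINING GRAVE ACCENT", 0x0300},
    {"LATIN CAPITAL LETTER A WITH GRAVE", 0x00C0},
    {"LATIN CAPITAL LETTER A WITH MACRON", 0x0100},
};

// Hypothetical library function: map an exact name to its code point.
constexpr char32_t nce(std::string_view name) {
    for (const name_entry& entry : name_table) {
        if (entry.name == name) {
            return entry.code_point;
        }
    }
    // Reaching this point during constant evaluation makes the call
    // ill-formed, so an unknown name is diagnosed at compile time.
    throw "unknown character name";
}

static_assert(nce("LATIN CAPITAL LETTER A WITH GRAVE") == 0x00C0,
              "lookup is usable in a constant expression");
```

<p>
Even in this reduced form, the table must be visible to every translation unit
that uses the function, which illustrates the header size concern noted above.
</p>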
<h1 id="proposal">Proposal</h1>
<p>
The wording included in this proposal is for the following design:
<ul>
<li>Context:
<ul>
<li>Named character escapes are valid only in character and string
literals.
</li>
</ul>
</li>
<li>Syntax:
<ul>
<li><tt>\N{xxx}</tt> where the name is substituted for <tt>xxx</tt>.</li>
</ul>
</li>
<li>Name sources:
<ul>
<li>ISO/IEC 10646 assigned names.</li>
<li>ISO/IEC 10646 assigned name aliases.</li>
<li>No allowance for additional implementation-defined names.</li>
</ul>
</li>
<li>Name matching:
<ul>
<li>As specified by rule
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
in
<a title="Unicode Standard Annex #44 - Unicode Character Database"
href="https://www.unicode.org/reports/tr44/tr44-24.html">
UAX#44</a><sup><a title="Unicode Standard Annex #44 - Unicode Character Database"
href="#ref_uax44">[UAX#44]</a></sup>.
</li>
</ul>
</li>
<li>Feature test macro:
<ul>
<li><tt>__cpp_named_character_escapes</tt></li>
</ul>
</li>
</ul>
</p>
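<p>
Rule UAX44-LM2 directs implementations to ignore case, whitespace, underscores,
and all medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E. One
way to apply it is to reduce each name to a key and compare keys. The
following is a minimal sketch under stated assumptions: the function names
are hypothetical, names are assumed to be ASCII, and a hyphen is treated as
medial when it sits directly between two alphanumeric characters.
</p>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: reduce a character name to a loose matching key.
inline std::string loose_key(std::string_view name) {
    auto is_alnum = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    std::string key;
    bool pending_hyphen = false;      // a medial hyphen was just dropped
    bool hyphen_before_last = false;  // ...directly before the last key character
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char c = name[i];
        if (c == '_' || std::isspace(static_cast<unsigned char>(c))) {
            continue;  // ignore underscores and whitespace
        }
        if (c == '-' && i > 0 && i + 1 < name.size() &&
            is_alnum(name[i - 1]) && is_alnum(name[i + 1])) {
            pending_hyphen = true;  // ignore a medial hyphen
            continue;
        }
        // Ignore case by folding everything to upper case.
        key.push_back(static_cast<char>(
            std::toupper(static_cast<unsigned char>(c))));
        hyphen_before_last = pending_hyphen;
        pending_hyphen = false;
    }
    // The one exception: restore the hyphen of U+1180 HANGUL JUNGSEONG O-E so
    // that it stays distinct from U+116C HANGUL JUNGSEONG OE.
    if (hyphen_before_last && key == "HANGULJUNGSEONGOE") {
        key.insert(key.size() - 1, 1, '-');
    }
    return key;
}

// Hypothetical helper: two names match loosely when their keys are equal.
inline bool names_match_loosely(std::string_view a, std::string_view b) {
    return loose_key(a) == loose_key(b);
}

// Usage examples; this function is illustrative and is not called elsewhere.
inline void loose_key_examples() {
    assert(names_match_loosely("latin capital letter a with grave",
                               "LATIN_CAPITAL_LETTER_A_WITH_GRAVE"));
    assert(names_match_loosely("ZERO-WIDTH SPACE", "zero width space"));
    // A hyphen that follows a space is not medial, so it is significant.
    assert(!names_match_loosely("TIBETAN LETTER -A", "TIBETAN LETTER A"));
    assert(!names_match_loosely("HANGUL JUNGSEONG O-E", "HANGUL JUNGSEONG OE"));
}
```

<p>
An implementation could precompute keys for the stored names so that only the
name written in the source needs to be reduced at translation time.
</p>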
<h1 id="proposal_opts">Proposal options</h1>
<p>
The following options are <em>not</em> currently proposed, but could be adopted
as modifications of the current proposal.
<ol>
<li>Instead of <tt>\N</tt>, reuse the <tt>\u</tt> and/or <tt>\U</tt> introducers from
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
to introduce a named character escape. For example:
<ul>
<li><tt>"\u{LATIN CAPITAL LETTER A WITH GRAVE}"</tt></li>
<li><tt>"\U{LATIN CAPITAL LETTER A WITH GRAVE}"</tt></li>
</ul>
See the
<a href="#design_compat">Backward compatibility</a>
section for more discussion of this option.
</li>
<li>Allow names to match ISO/IEC 10646 named sequences such that the
following would be equivalent:
<ul>
<li><tt>"\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"</tt></li>
<li><tt>"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}"</tt></li>
<li><tt>"\u0100\u0300"</tt></li>
</ul>
</li>
</ol>
</p>
<h1 id="future">Possible future extensions</h1>
<p>
The following options are <em>not</em> currently proposed but could be considered
for future extension.
<ol>
<li>Allow named character escapes to be used outside of character and string
literals (e.g., in identifiers) analogously to
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s.
</li>
<li>Allow comma-separated names. For example:
<ul>
<li><tt>"\N{LATIN CAPITAL LETTER A WITH MACRON, COMBINING GRAVE ACCENT}" // Equivalent to "\u0100\u0300"</tt></li>
</ul>
</li>
<li>Allow code point numbers as names. For example:
<ul>
<li><tt>"\N{U+00C0}" // U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE}</tt></li>
<li><tt>"\N{0x00C0}" // U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE}</tt></li>
<li><tt>"\N{192}" // U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE}</tt></li>
</ul>
</li>
<li>Allow names to match Unicode emoji named sequences</li>
<li>Allow names to match Unicode emoji ZWJ named sequences</li>
<li>Allow names to match HTML 5 named character references by surrounding
them with <tt>&amp;</tt> and <tt>;</tt>. For example:
<ul>
<li><tt>"\N{&amp;Agrave;}" // U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE}</tt></li>
</ul>
</li>
</ol>
</p>
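<p>
If code point numbers were ever accepted as names, an implementation would
need to recognize the three spellings shown above before falling back to a
name lookup. The following is a minimal sketch of such a recognizer, assuming
C++17 and <tt>std::from_chars</tt>; the function name
<tt>parse_code_point_number</tt> is hypothetical, and the choice to reject
surrogates and values above U+10FFFF is an assumption rather than something
this paper specifies.
</p>

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: interpret the text between the braces as a code point
// number written as U+XXXX, 0xXXXX, or a decimal integer.
inline std::optional<char32_t> parse_code_point_number(std::string_view text) {
    int base = 10;
    if (text.substr(0, 2) == "U+" || text.substr(0, 2) == "0x") {
        base = 16;
        text.remove_prefix(2);
    }
    std::uint32_t value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    const auto result = std::from_chars(first, last, value, base);
    if (result.ec != std::errc{} || result.ptr != last) {
        return std::nullopt;  // empty, non-numeric, trailing text, or overflow
    }
    const bool is_surrogate = value >= 0xD800 && value <= 0xDFFF;
    if (value > 0x10FFFF || is_surrogate) {
        return std::nullopt;  // assumed policy: only Unicode scalar values
    }
    return static_cast<char32_t>(value);
}

// Usage examples; this function is illustrative and is not called elsewhere.
inline void parse_code_point_number_examples() {
    assert(parse_code_point_number("U+00C0").value() == 0x00C0);
    assert(parse_code_point_number("0x00C0").value() == 0x00C0);
    assert(parse_code_point_number("192").value() == 0x00C0);
    // Ordinary names are not numbers and would go to the name lookup instead.
    assert(!parse_code_point_number("LATIN CAPITAL LETTER A").has_value());
}
```

<p>
Note that a purely decimal spelling is ambiguous to a human reader, since
<tt>192</tt> could be misread as hexadecimal, which is one reason such an
extension would need careful design.
</p>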
<h1 id="implementation_exp">Implementation experience</h1>
<p>
This proposal has not yet been implemented in an existing compiler. However,
the implementation concerns raised in Belfast prompted Corentin Jabot to conduct
an experiment to determine how far the implementation overhead, in terms of
data and code within the compiler, could be reduced. His
<a title="Storing Unicode: Character Name to Codepoint Mapping"
href="https://cor3ntin.github.io/posts/cp_to_name">
blog post</a><sup><a title="Storing Unicode: Character Name to Codepoint Mapping"
href="#ref_cj_blog">[CJ_BLOG]</a></sup>
on the experiment reports that he was able to implement a function
(<a href="https://github.com/cor3ntin/ext-unicode-db/blob/name_to_cp/name_to_cp.hpp#L215-L260"><tt>cp_from_name</tt></a>)
that accepts a Unicode 12.0 name or name alias and returns a code point value,
with code and data totaling under 300 KiB. His implementation is available in
the <tt>name_to_cp</tt>
branch of his <tt>ext-unicode-db</tt> GitHub repository at
<a title="ext-unicode-db"
href="https://github.com/cor3ntin/ext-unicode-db/tree/name_to_cp">
https://github.com/cor3ntin/ext-unicode-db/tree/name_to_cp</a><sup><a title="ext-unicode-db"
href="#ref_cj_impl">[CJ_IMPL]</a></sup>.
</p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>