<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<head>
<title>
std::format() fill character allowances;
proposed resolution for LWG issues 3576 and 3639
</title>
<style type="text/css">
pre {
display: inline;
}
table#header th,
table#header td
{
text-align: left;
}
table#references th,
table#references td
{
vertical-align: top;
}
#hideins:checked ~ * ins, #hideins:checked ~ * ins * { display:none; visibility:hidden }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }
ins, ins *
{
text-decoration: underline;
color: #000000;
background-color:#C8FFC8
}
del, del *
{
text-decoration: line-through;
color: #000000;
background-color:#FFA0A0
}
nop, nop *
{
color: #000000;
background-color:#B0B0FF
}
blockquote
{
color: #000000;
background-color: #F1F1F1;
border: 1px solid #D1D1D1;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.stdins
{
/* text-decoration: underline; */
color: #000000;
background-color: #C8FFC8;
border: 1px solid #B3EBB3;
padding: 0.5em;
}
blockquote.stddel
{
text-decoration: line-through;
color: #000000;
background-color: #FFA0A0;
border: 1px solid #ECD7EC;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.stdnop
{
color: #000000;
background-color: #B0B0FF;
border: 1px solid #ECD7EC;
padding-left: 0.5em;
padding-right: 0.5em;
}
</style>
</head>
<body style="max-width: 8.5in">
<table id="header">
<tr>
<th>Document Number:</th>
<td>P2572R0</td>
</tr>
<tr>
<th>Date:</th>
<td>2022-06-11</td>
</tr>
<tr>
<th>Audience:</th>
<td>SG16, LEWG</td>
</tr>
<tr>
<th>Reply-to:</th>
<td>Tom Honermann &lt;tom@honermann.net&gt;</td>
</tr>
</table>
<h1 style="margin-bottom: 0.0em"><tt>std::format()</tt> fill character allowances</h1>
<h2 style="margin-top: 0.0em">(Proposed resolution for LWG issues 3576 and 3639)</h2>
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#design">Design considerations</a>
<ul>
<li><a href="#design-char-restrict">Character encoding restrictions</a></li>
<li><a href="#design-width-restrict">Estimated display width restrictions</a></li>
</ul>
</li>
<li><a href="#existing-practice">Existing practice</a></li>
<li><a href="#proposal">Proposal</a></li>
<li><a href="#future">Future considerations and ABI</a></li>
<li><a href="#implementation-exp">Implementation experience</a></li>
<li><a href="#impact">Implementation impact</a></li>
<li><a href="#ack">Acknowledgements</a></li>
<li><a href="#references">References</a></li>
<li><a href="#wording">Wording</a></li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>
Presented is a proposed resolution for the following LWG issues concerning
the specification of fill characters in <tt>std::format()</tt>.
<ul>
<li><a href="https://wg21.link/lwg3576">LWG issue 3576: Clarifying fill character in std::format</a></li>
<li><a href="https://wg21.link/lwg3639">LWG issue 3639: Handling of fill character width is underspecified in std::format</a></li>
</ul>
</p>
<p>
This proposal follows prior discussion as recorded in the following:
<ul>
<li><a href="https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2021.md#august-25th-2021">SG16 meeting summary for 2021-08-25</a>.</li>
<li><a href="https://lists.isocpp.org/sg16/2021/11/2851.php">SG16 mailing list archives beginning 2021-11-27 with subject "Agenda for the 2021-12-01 SG16 telecon"</a>.</li>
<li><a href="https://lists.isocpp.org/sg16/2021/11/2867.php">SG16 mailing list archives beginning 2021-12-01 with subject "Proposed resolution for LWG3639: Handling of fill character width is underspecified in std::format"</a>.</li>
<li><a href="https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2021.md#december-1st-2021">SG16 meeting summary for 2021-12-01</a>.</li>
<li><a href="https://lists.isocpp.org/sg16/2021/12/2887.php">SG16 mailing list archives beginning 2021-12-02 with subject "Use of U+3000 IDEOGRAPHIC SPACE is common practice"</a>.</li>
<li><a href="https://lists.isocpp.org/sg16/2021/12/2889.php">SG16 mailing list archives beginning 2021-12-02 with subject "More Ruminations about fill characters and alignement (LWG3639)"</a>.</li>
</ul>
</p>
<p>
The current wording in
<a href="http://eel.is/c++draft/format.string.std#1">[format.string.std]p1</a>
restricts fill characters to
"any character other than <tt>{</tt> or <tt>}</tt>".
Depending on how "character" is interpreted, this may permit
characters with a negative display width,
characters with no display width,
characters with a display width greater than one,
characters with a varying display width,
characters with an actual display width that differs from their estimated width,
combining characters (with or without a non-combining lead character),
decomposed characters,
characters with right-to-left directionality,
control characters,
formatting characters, and
emoji.
The following table presents some examples of such characters.
<table border="1">
<tr>
<th>Glyph</th>
<th>Estimated width</th>
<th>Code point(s)</th>
<th>Character name</th>
</tr>
<tr>
<td><tt>><</tt></td>
<td>1</td>
<td>U+0007</td>
<td>BELL</td>
</tr>
<tr>
<td><tt>><</tt></td>
<td>1</td>
<td>U+0008</td>
<td>BACKSPACE</td>
</tr>
<tr>
<td><tt>><</tt></td>
<td>1</td>
<td>U+007F</td>
<td>DELETE</td>
</tr>
<tr>
<td><tt>>	<</tt></td>
<td>1</td>
<td>U+0009</td>
<td>CHARACTER TABULATION</td>
</tr>
<tr>
<td><tt>>​<</tt></td>
<td>1</td>
<td>U+200B</td>
<td>ZERO WIDTH SPACE</td>
</tr>
<tr>
<td><tt>>́<</tt></td>
<td>1</td>
<td>U+0301</td>
<td>COMBINING ACUTE ACCENT</td>
</tr>
<tr>
<td><tt>>é<</tt></td>
<td>1</td>
<td>U+00E9</td>
<td>LATIN SMALL LETTER E WITH ACUTE</td>
</tr>
<tr>
<td><tt>>é<</tt></td>
<td>1</td>
<td>U+0065<br/>U+0301</td>
<td>LATIN SMALL LETTER E<br/>COMBINING ACUTE ACCENT</td>
</tr>
<tr>
<td><tt>>è́̂̃̄<</tt></td>
<td>1</td>
<td>U+0065<br/>U+0300<br/>U+0301<br/>U+0302<br/>U+0303<br/>U+0304</td>
<td>LATIN SMALL LETTER E<br/>COMBINING GRAVE ACCENT<br/>COMBINING ACUTE ACCENT<br/>COMBINING CIRCUMFLEX ACCENT<br/>COMBINING TILDE<br/>COMBINING MACRON</td>
</tr>
<tr>
<td><tt>>&#xFF45;<</tt></td>
<td>2</td>
<td>U+FF45</td>
<td>FULLWIDTH LATIN SMALL LETTER E</td>
</tr>
<tr>
<td><tt>>ェ<</tt></td>
<td>2</td>
<td>U+30A7</td>
<td>KATAKANA LETTER SMALL E</td>
</tr>
<tr>
<td><tt>>&#xFF6A;<</tt></td>
<td>1</td>
<td>U+FF6A</td>
<td>HALFWIDTH KATAKANA LETTER SMALL E</td>
</tr>
<tr>
<td><tt>>&#x3000;<</tt></td>
<td>2</td>
<td>U+3000</td>
<td>IDEOGRAPHIC SPACE</td>
</tr>
<tr>
<td><tt>>ת<</tt></td>
<td>1</td>
<td>U+05EA</td>
<td>HEBREW LETTER TAV (a right-to-left character)</td>
</tr>
<tr>
<td><tt>>🤡<</tt></td>
<td>2</td>
<td>U+1F921</td>
<td>CLOWN FACE</td>
</tr>
<tr>
<td><tt>>﷽<</tt></td>
<td>1</td>
<td>U+FDFD</td>
<td><span style="white-space:nowrap;">ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM</span></td>
</tr>
</table>
[ <em>Note:</em>
The glyphs are presented in monospace font and may render inconsistently across
browsers and operating systems.
The glyphs are displayed between '>' and '<' characters to make it easier
to see their presentation width.
The estimated width value corresponds to
<a href="http://eel.is/c++draft/format.string.std#11">[format.string.std]p11</a>;
that paragraph associates an estimated width of either one or two with all
characters.
In each case, the code point sequence constitutes a single extended grapheme
cluster.
— <em>end note</em> ]
</p>
<p>
It is likely that the actual display width differs from the estimated width in
at least some of the cases above, most likely in the last one.
Unfortunately, there is no specification currently available that governs
character display width; actual width may vary based on font selection.
</p>
<p>
Use of a fill character with a display width other than one potentially
prevents a <tt>std::format()</tt> implementation from properly aligning
fields.
Consider a format specification for a field of width four and a field argument
with an estimated field width of one.
The implementation is expected to insert fill characters to consume an estimated
field width of three, but that is not possible if the fill character has an
estimated field width of two.
Portable behavior requires that the standard clarify the intended behavior
for such characters.
</p>
<p>
A <tt>std::format()</tt> implementation must store or reference a fill character
in some way.
Fill character allowances may impose dynamic memory management requirements or
increase the complexity of parsing standard format specifiers depending on
implementation choices.
Implementation choices may also cause fill character restrictions to be
reflected in the ABI thus making it difficult to relax restrictions later.
Portable behavior requires that the standard specify whether fill characters are
restricted to those that are encoded as, for example, a single code unit, a
single UCS scalar value, a
<a href="https://unicode.org/reports/tr15/#Stream_Safe_Text_Format">stream-safe extended grapheme cluster</a>
<sup><a title="Unicode Standard Annex #15 - Unicode Normalization Forms"
href="#ref_uax15">[UAX#15]</a></sup>,
or an extended grapheme cluster of unbounded length.
</p>
<h1 id="design">Design considerations</h1>
<h2 id="design-char-restrict">Character encoding restrictions</h2>
<p>
Fill character allowances pose a performance and overhead tradeoff.
Consider the following four options for fill character support.
<ol>
<li>Allow any extended grapheme cluster (EGC).</li>
<li>Allow any
<a href="https://unicode.org/reports/tr15/#Stream_Safe_Text_Format">stream-safe EGC</a>
<sup><a title="Unicode Standard Annex #15 - Unicode Normalization Forms"
href="#ref_uax15">[UAX#15]</a></sup>.</li>
<li>Allow any single UCS scalar value.</li>
<li>Allow any single UCS scalar value that is encoded using a single code
unit.</li>
</ol>
</p>
<p>
The first option (any EGC) would require implementations to support EGCs that
consist of an unbounded number of code points.
This option implies dynamic memory management and would require implementations
to identify EGC boundaries in the format string; a requirement that otherwise
does not exist at present (implementations are currently required to identify
EGC boundaries in formatted field arguments for the purpose of computing the
estimated width, but not in the format string itself).
</p>
<p>
The second option (any stream-safe EGC) would require implementations to support
EGCs that consist of up to 32 code points.
This option allows an implementation to trade off dynamic memory allocation in
favor of larger data structures, but still requires EGC boundary analysis of
format strings.
</p>
<p>
The third option (any single UCS scalar value) avoids dynamic memory
requirements and significant increases to sizes of data structures;
the fill character could be stored in a single <tt>char32_t</tt> object.
</p>
<p>
The fourth option (any single code unit) reduces fill character storage
requirements to a single code unit (<tt>char</tt> or <tt>wchar_t</tt>), but has
the unfortunate side effect of making the permissible set of fill characters
dependent on encoding. For example, U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
would be rejected in a UTF-8 encoded format string, but would be accepted in a
UTF-16 encoded one. Similarly, U+1F921 (CLOWN FACE) would be rejected in a
UTF-16 encoded format string, but accepted in a UTF-32 encoded one.
</p>
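<p>
The encoding dependence of the fourth option can be checked by counting code
units in string literals.
The following is a minimal sketch and is not taken from any implementation;
the helper names <tt>code_units</tt> and <tt>allowed_by_option_4</tt> are
hypothetical and exist only for illustration.
</p>

```cpp
#include <cstddef>

// Count the code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// Hypothetical check for option 4: a fill character is accepted only if it
// is encoded as exactly one code unit in the encoding of the format string.
template <class CharT, std::size_t N>
constexpr bool allowed_by_option_4(const CharT (&literal)[N]) {
    return code_units(literal) == 1;
}

// U+00E9 LATIN SMALL LETTER E WITH ACUTE
static_assert(!allowed_by_option_4(u8"\u00E9"));     // two UTF-8 code units
static_assert(allowed_by_option_4(u"\u00E9"));       // one UTF-16 code unit

// U+1F921 CLOWN FACE
static_assert(!allowed_by_option_4(u"\U0001F921"));  // UTF-16 surrogate pair
static_assert(allowed_by_option_4(U"\U0001F921"));   // one UTF-32 code unit
```

<p>
The same scalar value is accepted or rejected depending only on the encoding of
the literal, which is the portability concern raised above.
</p>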
<h2 id="design-width-restrict">Estimated display width restrictions</h2>
<p>
The following behaviors represent possible options for formatting fields when
the fill character has an estimated width other than one.
<ol>
<li>Use an estimated width of one for the fill character regardless of the
value specified in
<a href="http://eel.is/c++draft/format.string.std#11">[format.string.std]p11</a>.</li>
<li>Overfill<br/>
Insert fill characters until the estimated width of the formatted field
argument and the fill characters meets or exceeds the field width.</li>
<li>Underfill<br/>
Insert fill characters so long as the estimated width of the formatted
field argument and the fill characters does not exceed the field
width.</li>
<li>Pad with a different fill character<br/>
Insert an alternate fill character known to have an estimated width of
one when inserting the requested fill character would have caused the
field width to be exceeded.</li>
<li>Undefined, unspecified, or implementation-defined behavior<br/>
Impose no portable behavior.</li>
<li>Error unconditionally<br/>
Throw a <tt>format_error</tt> exception.</li>
<li>Error if the alignment to the field width is not possible<br/>
Throw a <tt>format_error</tt> exception if the remainder of the field
width after subtracting the estimated width of the formatted field
argument is not evenly divisible by the estimated width of the fill
character.</li>
</ol>
</p>
<p>
The following table illustrates the above options for
<tt>std::format(">{:🤡^4}<\n", 'X')</tt>.
Font selection will determine to what degree the results shown deviate from
the reference alignment.
<table border="1">
<tr>
<th>Behavioral choice</th>
<th>Result</th>
</tr>
<tr>
<td>(reference alignment)</td>
<td><tt>>-X--<</tt></td>
</tr>
<tr>
<td>Use an estimated width of one</td>
<td><tt>>🤡X🤡🤡<</tt></td>
</tr>
<tr>
<td>Overfill</td>
<td><tt>>🤡X🤡<</tt></td>
</tr>
<tr>
<td>Underfill</td>
<td><tt>>X🤡<</tt></td>
</tr>
<tr>
<td>Pad with a different fill character (space)</td>
<td><tt>> X🤡<</tt></td>
</tr>
<tr>
<td>Undefined, unspecified, or implementation-defined behavior</td>
<td>???</td>
</tr>
<tr>
<td>Error (unconditionally or due to inability to align)</td>
<td>N/A</td>
</tr>
</table>
</p>
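<p>
The arithmetic behind the table rows can be sketched as follows.
This is an illustration only: the enumeration <tt>fill_policy</tt> and the
functions <tt>fill_count</tt> and <tt>alternate_pad_count</tt> are hypothetical
names, and the undefined-behavior and unconditional-error options are not
modeled.
</p>

```cpp
#include <cassert>
#include <optional>

// Hypothetical names for four of the behaviors listed above.
enum class fill_policy { assume_width_one, overfill, underfill, error_if_inexact };

// Number of fill characters to insert for a field of width `field_width`, an
// argument of estimated width `arg_width`, and a fill character of estimated
// width `fill_width`. An empty result stands in for throwing format_error.
std::optional<int> fill_count(fill_policy policy, int field_width,
                              int arg_width, int fill_width) {
    assert(fill_width > 0);
    const int remaining = field_width > arg_width ? field_width - arg_width : 0;
    switch (policy) {
    case fill_policy::assume_width_one:
        return remaining;                                  // ignore fill_width
    case fill_policy::overfill:
        return (remaining + fill_width - 1) / fill_width;  // round up
    case fill_policy::underfill:
        return remaining / fill_width;                     // round down
    case fill_policy::error_if_inexact:
        if (remaining % fill_width != 0) return std::nullopt;
        return remaining / fill_width;
    }
    return std::nullopt;  // not reached for valid enumerators
}

// "Pad with a different fill character": the underfill count of requested
// fill characters plus this many width-one alternate characters.
int alternate_pad_count(int field_width, int arg_width, int fill_width) {
    assert(fill_width > 0);
    const int remaining = field_width > arg_width ? field_width - arg_width : 0;
    return remaining % fill_width;
}
```

<p>
For the example in the table (field width 4, argument width 1, fill width 2)
this yields 3, 2, and 1 fill characters for the first three policies, one
alternate pad character for the fourth, and an error for the last.
</p>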
<h1 id="existing-practice">Existing practice</h1>
<p>
The following table illustrates existing behavior for several
<tt>std::format()</tt> implementations when the example characters from the
<a href="#introduction">introduction</a>
are used as the fill character with a directionally neutral field argument of
<tt>'#'</tt> (the directionality affects the behavior of the U+05EA example).
The first row illustrates a reference alignment.
<table border="1">
<tr>
<th>Code point(s)</th>
<th>Format string</th>
<th>Clang 15 trunk<br/>with libc++</th>
<th>Gcc 12.1<br/>with fmt 8.1.1</th>
<th>MSVC 19.31</th>
</tr>
<tr>
<td>U+002D HYPHEN-MINUS</td>
<td><tt>">{:-^4}<"</tt></td>
<td><tt>>-#--<</tt></td>
<td><tt>>-#--<</tt></td>
<td><tt>>-#--<</tt></td>
</tr>
<tr>
<td>U+0007 BELL</td>
<td><tt>">{:^4}<"</tt></td>
<td><tt>>#<</tt></td>
<td><tt>>#<</tt></td>
<td><tt>>#<</tt></td>
</tr>
<tr>
<td>U+0008 BACKSPACE</td>
<td><tt>">{:^4}<"</tt></td>
<td><tt>>#<</tt></td>
<td><tt>>#<</tt></td>
<td><tt>>#<</tt></td>
</tr>
<tr>
<td>U+007F DELETE</td>
<td><tt>">{:^4}<"</tt></td>
<td><tt>>#<</tt></td>
<td><tt>>#<</tt></td>
<td><tt>>#<</tt></td>
</tr>
<tr>
<td>U+0009 CHARACTER TABULATION</td>
<td><tt>">{:	^4}<"</tt></td>
<td><tt>><span style="white-space: pre">	#		</span><</tt></td>
<td><tt>><span style="white-space: pre">	#		</span><</tt></td>
<td><tt>><span style="white-space: pre">	#		</span><</tt></td>
</tr>
<tr>
<td>U+200B ZERO WIDTH SPACE</td>
<td><tt>">{:​^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>​#​​<</tt></td>
<td><tt>>​#​​<</tt></td>
</tr>
<tr>
<td>U+0301 COMBINING ACUTE ACCENT</td>
<td><tt>">{:́^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>́#́́<</tt></td>
<td><tt>>́#́́<</tt></td>
</tr>
<tr>
<td>U+00E9 LATIN SMALL LETTER E WITH ACUTE</td>
<td><tt>">{:é^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>é#éé<</tt></td>
<td><tt>>é#éé<</tt></td>
</tr>
<tr>
<td>U+0065 LATIN SMALL LETTER E<br/>U+0301 COMBINING ACUTE ACCENT</td>
<td><tt>">{:é^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><b>Error</b><sup>2</sup></td>
<td><b>Error</b><sup>3</sup></td>
</tr>
<tr>
<td>U+0065 LATIN SMALL LETTER E<br/>U+0300 COMBINING GRAVE ACCENT<br/>U+0301 COMBINING ACUTE ACCENT<br/>U+0302 COMBINING CIRCUMFLEX ACCENT<br/>U+0303 COMBINING TILDE<br/>U+0304 COMBINING MACRON</td>
<td><tt>">{:è́̂̃̄^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><b>Error</b><sup>2</sup></td>
<td><b>Error</b><sup>3</sup></td>
</tr>
<tr>
<td>U+FF45 FULLWIDTH LATIN SMALL LETTER E</td>
<td><tt>">{:&#xFF45;^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>&#xFF45;#&#xFF45;&#xFF45;<</tt></td>
<td><tt>>&#xFF45;#&#xFF45;&#xFF45;<</tt></td>
</tr>
<tr>
<td>U+30A7 KATAKANA LETTER SMALL E</td>
<td><tt>">{:ェ^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>ェ#ェェ<</tt></td>
<td><tt>>ェ#ェェ<</tt></td>
</tr>
<tr>
<td>U+FF6A HALFWIDTH KATAKANA LETTER SMALL E</td>
<td><tt>">{:&#xFF6A;^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>&#xFF6A;#&#xFF6A;&#xFF6A;<</tt></td>
<td><tt>>&#xFF6A;#&#xFF6A;&#xFF6A;<</tt></td>
</tr>
<tr>
<td>U+3000 IDEOGRAPHIC SPACE</td>
<td><tt>">{:&#x3000;^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>&#x3000;#&#x3000;&#x3000;<</tt></td>
<td><tt>>&#x3000;#&#x3000;&#x3000;<</tt></td>
</tr>
<tr>
<td>U+05EA HEBREW LETTER TAV</td>
<td><tt>">{:ת‎^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>ת#תת<‎<sup>4</sup></tt><br/>
<tt>>תXתת<‎<sup>4</sup></tt></td>
<td><tt>>ת#תת<‎<sup>4</sup></tt><br/>
<tt>>תXתת<‎<sup>4</sup></tt></td>
</tr>
<tr>
<td>U+1F921 CLOWN FACE</td>
<td><tt>">{:🤡^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>🤡#🤡🤡<</tt></td>
<td><tt>>🤡#🤡🤡<</tt></td>
</tr>
<tr>
<td><span style="white-space:nowrap;">U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM</span></td>
<td><tt>">{:﷽^4}<"</tt></td>
<td><b>Error</b><sup>1</sup></td>
<td><tt>>﷽#﷽﷽<</tt></td>
<td><tt>>﷽#﷽﷽<</tt></td>
</tr>
</table>
<p>
1) Clang with libc++ restricts fill characters to characters that are encoded
as a single code unit; an exception is thrown that, if not caught, aborts
the process with the following error message.
<div style="margin-left:1em; background-color:lemonchiffon"><tt>
libc++abi: terminating with uncaught exception of type std::__1::format_error: The format-spec should consume the input or end with a '}'
</tt></div>
</p>
<p>
2) Gcc with fmt restricts fill characters to characters that are encoded as a
single code point. Compilation fails with the following error message.
<div style="margin-left:1em; background-color:lemonchiffon"><tt>
error: call to non-'constexpr' function 'void fmt::v8::detail::error_handler::on_error(const char*)'
</tt></div>
</p>
<p>
3) MSVC restricts fill characters to characters that are encoded as a single
code point. Compilation is successful, but program execution terminates
with an exit code of
3221226505 (0xC0000409: <tt>STATUS_STACK_BUFFER_OVERRUN</tt>).
The buffer overflow has been corrected for the next MSVC release and these
cases are now rejected with the following error message.
<div style="margin-left:1em; background-color:lemonchiffon"><tt>
error C7595: 'std::_Basic_format_string&lt;char,char&gt;::_Basic_format_string': call to immediate function is not a constant expression<br/>
.../msvc/14.32.31302/include\format(1378): note: failure was caused by evaluating a throw sub-expression
</tt></div>
</p>
<p>
4) Use of a fill character with right-to-left directionality potentially causes
the formatted field to be rendered right to left depending on the formatted
field argument.
Two examples are provided, one in which the directionally neutral character
<tt>'#'</tt> is used as the formatted field argument and one in which the
left-to-right character <tt>'X'</tt> is used.
U+200E LEFT-TO-RIGHT MARK characters have been inserted by the paper
author to negate the right-to-left effect on surrounding text.
In practice, the right-to-left directionality may affect how surrounding
text from the format string or other format fields are presented.
</p>
<p>
All surveyed implementations assume an estimated width of 1 for fill characters
regardless of the estimated width values specified in
<a href="http://eel.is/c++draft/format.string.std#11">[format.string.std]p11</a>.
</p>
<h1 id="proposal">Proposal</h1>
<p>
Standardize the behavior exhibited by gcc with fmt and by MSVC:
<ul>
<li>Restrict fill characters to a single UCS scalar value.<br/>
(This restriction can be lifted in the future if motivation arises for
support of EGCs that contain multiple UCS scalar values but may require
an ABI break depending on implementation details)</li>
<li>Always use an estimated width of one for fill characters.<br/>
(Ignore
<a href="http://eel.is/c++draft/format.string.std#11">[format.string.std]p11</a>
when determining how many fill characters to insert)</li>
<li>Add a note that alignment options have no effect if the estimated width
of the formatted field argument exceeds the field width.</li>
<li>Clarify wording to explicitly describe how fill characters are inserted
in order to achieve field alignment.</li>
</ul>
</p>
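<p>
The proposed insertion rule can be sketched as follows, assuming that the
caller supplies the estimated width of the formatted argument and passes the
fill character as an already encoded code unit sequence.
The function <tt>pad</tt> and its parameters are hypothetical; this is not how
any surveyed implementation is written.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Pads `arg` to `field_width`. `fill` holds one encoded fill character (a
// single UCS scalar value, possibly several code units) and always counts as
// width one. `arg_width` is the estimated width of `arg`, computed elsewhere.
std::string pad(std::string_view arg, std::size_t arg_width,
                std::size_t field_width, std::string_view fill, char align) {
    assert(align == '<' || align == '>' || align == '^');
    // n is the field width minus the estimated width, or 0 if that is negative.
    const std::size_t n = field_width > arg_width ? field_width - arg_width : 0;
    std::size_t before = 0;                 // '<': all fill goes after
    if (align == '>') before = n;           // '>': all fill goes before
    else if (align == '^') before = n / 2;  // '^': floor(n/2) before
    const std::size_t after = n - before;   // '^': ceil(n/2) after

    std::string out;
    for (std::size_t i = 0; i < before; ++i) out.append(fill);
    out.append(arg);
    for (std::size_t i = 0; i < after; ++i) out.append(fill);
    return out;
}
```

<p>
With a field width of 6 and the argument <tt>"x"</tt>, a <tt>*</tt> fill and
<tt>^</tt> alignment give <tt>"**x***"</tt>; a U+1F921 fill inserts the same
number of fill characters because its estimated width of 2 is ignored.
</p>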
<h1 id="future">Future considerations and ABI</h1>
<p>
Programmers may find use cases where it is necessary for the number of inserted
fill characters to depend on the estimated width of the fill character.
Some of those use cases may warrant support in the standard.
If such motivation arises, there are at least two methods by which support
could be added.
<ol>
<li>The
<a href="http://eel.is/c++draft/format.string.std"><em>std-format-spec</em> format specification</a>
could be extended to allow an additional option to be specified to opt-in
to the desired behavior.</li>
<li>Specializations of <tt>std::formatter</tt> could be defined to provide
custom formatting on a per-type basis as is done for the chrono
library.</li>
</ol>
</p>
<p>
Motivation may arise in the future to permit the use of an EGC that consists of
multiple code points as a fill character.
Implementations that store a single <tt>char32_t</tt> or short sequence of code
units in their <tt>formatter</tt> class specializations
(<a href="http://eel.is/c++draft/format#formatter.spec">[format.formatter.spec]</a>)
may be unable to accommodate such a change without an ABI break.
Implementations are encouraged to instead store a view (an iterator pair, start
and end index, or start index and length) into the <em>std-format-spec</em>
(<a href="http://eel.is/c++draft/format#string.std-1">[format.string.std]p1</a>)
string so that code unit sequences of arbitrary length can be referenced.
However, because format strings are evaluated at compile time, there is
currently no need for them to persist until run time; retaining the string so
that a view into it remains valid may therefore impose storage overhead.
</p>
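<p>
The two storage strategies can be contrasted with a small sketch.
Both structures are hypothetical and do not reflect the layout of any shipping
implementation; the view variant assumes that the <em>std-format-spec</em>
string outlives the stored offsets.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical layout A: the fill is stored by value. Widening it later to
// hold several scalar values would change the size of the structure.
struct specs_by_value {
    char32_t fill = U' ';
};

// Hypothetical layout B: the fill is stored as an offset and a length into
// the std-format-spec, so a longer fill needs no change to the layout.
struct specs_by_view {
    std::size_t fill_offset = 0;
    std::size_t fill_length = 0;  // 0 means "no fill option; use a space"

    // Resolves the stored view against the spec string it was parsed from.
    std::string_view fill(std::string_view spec) const {
        assert(fill_offset + fill_length <= spec.size());
        return fill_length == 0 ? std::string_view(" ")
                                : spec.substr(fill_offset, fill_length);
    }
};
```

<p>
Layout B can reference a fill such as U+0065 followed by U+0301 (three UTF-8
code units) without modification, at the cost of keeping the specification
string available for as long as the view is used.
</p>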
<p>
It appears that the Microsoft implementation is currently susceptible to such
ABI breaks based on the implementation of their
<a href="https://github.com/microsoft/STL/blob/1a20fe1133d711a647bbb135d98743f91b7be323/stl/inc/format#L1327-L1340"><tt>_Basic_format_specs</tt> class template</a>.
Specializations of <tt>_Basic_format_specs</tt> form the base class of their
<a href="https://github.com/microsoft/STL/blob/1a20fe1133d711a647bbb135d98743f91b7be323/stl/inc/format#L1342-L1349"><tt>_Dynamic_format_specs</tt> class template</a>
for which a specialization is stored in their
<a href="https://github.com/microsoft/STL/blob/1a20fe1133d711a647bbb135d98743f91b7be323/stl/inc/format#L3263-L3297"><tt>_Formatter_base</tt> class template</a>
that forms the base class of their
<a href="https://github.com/microsoft/STL/blob/1a20fe1133d711a647bbb135d98743f91b7be323/stl/inc/format#L3299-L3347"><tt>std::formatter</tt> specializations</a>.
Microsoft is already shipping their implementation and is thus already locked
into their current ABI.
</p>
<p>
The author has not researched the ABI break susceptibility of other
implementations.
</p>
<h1 id="implementation-exp">Implementation experience</h1>
<p>
This proposal standardizes the behavior exhibited by both gcc with fmt and MSVC
and therefore reflects existing practice.
However, the ABI mitigations described in the prior section are not known to have
been implemented.
</p>
<h1 id="impact">Implementation impact</h1>
<p>
Some implementations, libc++ for example, will require changes to allow any
single UCS scalar value to be specified as a fill character.
This may impose new encoding awareness requirements on format string parsers so
that fill characters encoded with more than one code unit are correctly decoded.
</p>
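<p>
One way a parser might locate a fill character that spans several code units
in a UTF-8 encoded format specification is sketched below.
This is an illustration under stated assumptions: the helper names are
hypothetical, only the lead code unit is inspected, continuation code units are
not validated, and encodings other than UTF-8 are not handled.
</p>

```cpp
#include <cstddef>
#include <string_view>

// Length in code units of the UTF-8 sequence introduced by `lead`, or 0 if
// `lead` cannot begin a sequence.
constexpr std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

constexpr bool is_align(char c) { return c == '<' || c == '>' || c == '^'; }

// Number of code units occupied by the fill option at the start of `spec`
// (the text following ':'), or 0 if no fill option is present. The fill is
// recognized only when one scalar value is followed by an alignment option.
constexpr std::size_t fill_length(std::string_view spec) {
    if (spec.empty()) return 0;
    if (spec[0] == '{' || spec[0] == '}') return 0;
    const std::size_t n =
        utf8_sequence_length(static_cast<unsigned char>(spec[0]));
    if (n == 0 || spec.size() <= n) return 0;
    return is_align(spec[n]) ? n : 0;
}
```

<p>
A parser limited to single code unit fill characters effectively tests only
<tt>spec[1]</tt> for an alignment option; the sketch instead skips a whole
encoded scalar value before making that test.
</p>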
<h1 id="ack">Acknowledgements</h1>
<p>
Thank you to Victor Zverovich, Corentin Jabot, Peter Brett, and Mark de Wever
for their insights; their commentary shaped much of this proposal.
</p>
<h1 id="references">References</h1>
<table id="references">
<tr>
<td id="ref_n4910"><sup>[N4910]</sup></td>
<td>
"Working Draft, Standard for Programming Language C++", N4910, 2022.<br/>
<a href="https://wg21.link/n4910">
https://wg21.link/n4910</a></td>
</tr>
<tr>
<td id="ref_uax15"><sup>[UAX#15]</sup></td>
<td>
Ken Whistler,<br/>
"Unicode Standard Annex #15 - Unicode Normalization Forms",<br/>
Revision 51, Unicode 14.0.0, 2021.<br/>
<a href="https://www.unicode.org/reports/tr15/tr15-51.html">https://www.unicode.org/reports/tr15/tr15-51.html</a>
</td>
</tr>
</table>
<h1 id="wording">Wording</h1>
<p>These changes are relative to
<a title="Working Draft, Standard for Programming Language C++"
href="https://wg21.link/n4910">
N4910</a>
<sup><a title="Working Draft, Standard for Programming Language C++"
href="#ref_n4910">[N4910]</a></sup>.
</p>
<input type="checkbox" id="hideins">Hide inserted text</input><br/>
<input type="checkbox" id="hidedel">Hide deleted text</input>
<p>
Change in
<a href="http://eel.is/c++draft/format.string.std#2">22.14.2.2 [format.string.std] paragraph 2</a>:
<blockquote>
<ins>The <em>fill character</em> is the character denoted by the <em>fill</em>
option or, if the <em>fill</em> option is absent, the space
character.
For a format specification in a Unicode encoding, the fill character is a
single UCS scalar value.
<br/>
<br/>
</ins>
[<em>Note 2</em>:
<del>The <em>fill</em> character can be any character other than { or }.</del>
The presence of a
<del>fill character</del><ins><em>fill</em> option</ins>
is signaled by the character following it,
which must be one of the alignment options.
If the second character of <em>std-format-spec</em> is not a valid alignment
option, then it is assumed that
<del>both the fill character and the alignment option are</del><ins>the
<em>fill-and-align</em> option is</ins>
absent. — <em>end note</em>]
</blockquote>
</p>
<p>
Change in
<a href="http://eel.is/c++draft/format.string.std#3">22.14.2.2 [format.string.std] paragraph 3</a>:<br/>
<em>Drafting note:</em> The change of "specifier" to "option" for the
<em>align</em> grammar element is intended to improve consistency with
other grammar elements.
Inconsistencies remain, however.
The terms currently used for each of the grammar elements are below.
LWG may wish to review further.
<ul>
<li><em>fill-and-align</em>: none (neither specifier nor option)</li>
<li><em>fill</em>: none (neither specifier nor option)</li>
<li><em>align</em>: specifier</li>
<li><em>sign</em>: option</li>
<li><em>#</em>: option</li>
<li><em>0</em>: none (neither specifier nor option)</li>
<li><em>width</em>: none (neither specifier nor option)</li>
<li><em>precision</em>: none (neither specifier nor option)</li>
<li><em>L</em>: option</li>
<li><em>type</em>: option</li>
</ul>
<blockquote>
The <em>align</em> <del>specifier</del><ins>option</ins> applies to all
argument types.
The meaning of the various alignment options is as specified in
<a href="http://eel.is/c++draft/format.string.std#tab:format.align">Table 64</a>.
<br/><br/>
[ <em>Example 1</em>:
<div style="margin-left: 1em;"><pre>
char c = 120;
string s0 = format("{:6}", 42);       // value of s0 is "    42"
string s1 = format("{:6}", 'x');      // value of s1 is "x     "
string s2 = format("{:*<6}", 'x');    // value of s2 is "x*****"
string s3 = format("{:*>6}", 'x');    // value of s3 is "*****x"
string s4 = format("{:*^6}", 'x');    // value of s4 is "**x***"
string s5 = format("{:6d}", c);       // value of s5 is "   120"
string s6 = format("{:6}", true);     // value of s6 is "true  "
<ins>string s7 = format("{:🤡^6}", "x"); // value of s7 is "🤡🤡x🤡🤡🤡"</ins>
<ins>string s8 = format("{:*^6}", "🤡🤡🤡"); // value of s8 is "🤡🤡🤡"</ins>
<ins>string s9 = format("{:*>6}", "12345678"); // value of s9 is "12345678"</ins>
</pre></div>
— <em>end example</em> ]
<br/><br/>
[ <em>Note 3</em>:
<del>Unless a minimum field width is defined, the field width is determined by
the size of the content and the alignment option has no effect.</del><ins>If
the <em>width</em> option is absent, then the field width is the estimated
width of the formatted argument and the alignment option has no effect.
If the estimated width of the formatted argument matches or exceeds the field
width, then both the alignment and width options have no effect.
The width of any fill character is assumed to be 1.
The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2.
The examples above that include that character illustrate the effect of the
estimated width when that character is used as a fill character as opposed to
when it is used as a formatting argument.</ins>
— <em>end note</em> ]
<br/><br/>
<table>
<tr>
<td>
<table style="width:100%">
<tr>
<td style="text-align:center">Table
<a href="http://eel.is/c++draft/format.string.std#tab:format.align">64</a>:
Meaning of align options</td>
<td>[<a href="http://eel.is/c++draft/tab:format.align">tab:format.align</a>]</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>
<table style="width:100%; border:1px solid black">
<tr style="border:1px solid black">
<th style="border-bottom:1px solid black">Option</th>
<th style="border-bottom:1px solid black">Meaning</th>
</tr>
<tr>
<td style="border-bottom:1px solid black"><</td>
<td style="border-bottom:1px solid black">
Forces the field to be aligned to the start of the available
space<ins> by inserting <em>n</em> fill characters after the
formatted argument where <em>n</em> is the field width minus the
estimated width of the formatted argument or, if the subtraction
results in a negative value, 0</ins>.
This is the default for non-arithmetic non-pointer types,
<tt>charT</tt>, and <tt>bool</tt>, unless an integer presentation
type is specified.</td>
</tr>
<tr>
<td style="border-bottom:1px solid black">></td>
<td style="border-bottom:1px solid black">
Forces the field to be aligned to the end of the available
space<ins> by inserting <em>n</em> fill characters before the
formatted argument where <em>n</em> is the field width minus the
estimated width of the formatted argument or, if the subtraction
results in a negative value, 0</ins>.
This is the default for arithmetic types other than
<tt>charT</tt> and <tt>bool</tt>, pointer types, or when an integer
presentation type is specified.</td>
</tr>
<tr>
<td style="border-bottom:1px solid black">^</td>
<td style="border-bottom:1px solid black">
Forces the field to be centered within the available space by
inserting ⌊<em>n</em>/2⌋ <ins>fill</ins> characters before and
⌈<em>n</em>/2⌉ <ins>fill </ins>characters after the
<ins>formatted argument</ins><del>value</del>, where <em>n</em> is
<del>the total number of fill characters to insert</del><ins>the
field width minus the estimated width of the formatted argument
or, if the subtraction results in a negative value, 0</ins>.</td>
</tr>
</table>
</td>
</tr>
</table>
</blockquote>
</p>
<p>
Change in
<a href="http://eel.is/c++draft/format.string.std#11">22.14.2.2 [format.string.std] paragraph 11</a>:<br/>
<em>Drafting note:</em> This change is a minor terminology correction
unrelated to the primary goals of this proposal.
<blockquote>
For a string in a Unicode encoding, implementations should estimate the width
of a string as the sum of <ins>the </ins>estimated widths of the first
<del>code points</del><ins>UCS scalar values</ins>
in its extended grapheme clusters.
The extended grapheme clusters of a string are defined by UAX #29.
The estimated width of the following
<del>code points</del><ins>UCS scalar values</ins>
is 2:<br/>
<br/>
<em>[ … ]</em><br/>
<br/>
The estimated width of other
<del>code points</del><ins>UCS scalar values</ins>
is 1.
</blockquote>
</p>
</body>