<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<head>
<title>char8_t backward compatibility remediation</title>
<link rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css"/>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<style type="text/css">
pre {
display: inline;
}
table#header th,
table#header td
{
text-align: left;
}
table#references th,
table#references td
{
vertical-align: top;
}
ins, ins * { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
del, del * { text-decoration:line-through; background-color:#FFA0A0 }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }
blockquote
{
color: #000000;
background-color: #F1F1F1;
border: 1px solid #D1D1D1;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.stdins
{
text-decoration: underline;
color: #000000;
background-color: #C8FFC8;
border: 1px solid #B3EBB3;
padding: 0.5em;
}
blockquote.stddel
{
text-decoration: line-through;
color: #000000;
background-color: #FFEBFF;
border: 1px solid #ECD7EC;
padding-left: 0.5em;
padding-right: 0.5em;
}
</style>
</head>
<body>
<table id="header">
<tr>
<th>Document Number:</th>
<td>P1423R0</td>
</tr>
<tr>
<th>Date:</th>
<td>2019-01-20</td>
</tr>
<tr>
<th>Audience:</th>
<td>Evolution Working Group<br/>
Library Evolution Working Group</td>
</tr>
<tr>
<th>Reply-to:</th>
<td>Tom Honermann <tom@honermann.net></td>
</tr>
</table>
<h1>char8_t backward compatibility remediation</h1>
<ul>
<li><a href="#introduction">
Introduction</a></li>
<li><a href="#examples">
Examples</a></li>
<li><a href="#impact">
Anticipated impact</a></li>
<li><a href="#remediation">
Remediation approaches</a>
<ul>
<li><a href="#disable">
Disable <tt>char8_t</tt> support</a></li>
<li><a href="#overload">
Add overloads</a></li>
<li><a href="#ordinary">
Change <tt>u8</tt> literals to ordinary literals with escape sequences</a></li>
<li><a href="#reinterpret_cast">
reinterpret_cast <tt>u8</tt> literals to <tt>char</tt></a></li>
<li><a href="#emulate">
Emulate C++17 <tt>u8</tt> literals</a></li>
<li><a href="#array-subst">
Substitute class types for C arrays initialized with <tt>u8</tt> string literals</a></li>
<li><a href="#conversion_fns">
Use explicit conversion functions</a></li>
<li><a href="#tooling">
Tooling</a></li>
</ul>
</li>
<li><a href="#options">
Options considered to reduce backward compatibility impact</a>
<ul>
<li><a href="#option1">
1) Reinstate <tt>u8</tt> literals as type <tt>char</tt> and introduce a new literal prefix for <tt>char8_t</tt></a></li>
<li><a href="#option2">
2) Allow implicit conversions from <tt>char8_t</tt> to <tt>char</tt></a></li>
<li><a href="#option3">
3) Allow initializing an array of <tt>char</tt> with a <tt>u8</tt> string literal</a></li>
<li><a href="#option4">
4) Allow initializing an array with a reference to an array</a></li>
<li><a href="#option5">
5) Allow <tt>std::string</tt> to be initialized with <tt>char8_t</tt> based types</a></li>
<li><a href="#option6">
6) Allow implicit conversions from <tt>std::u8string</tt> to <tt>std::string</tt></a></li>
<li><a href="#option7">
7) Add deleted ostream inserters for <tt>char8_t</tt>, <tt>char16_t</tt>, and <tt>char32_t</tt></a></li>
<li><a href="#option8">
8) Allow <tt>std::filesystem::u8path</tt> to accept ranges and iterators with <tt>char8_t</tt> value types</a></li>
</ul>
</li>
<li><a href="#proposal">
Proposal</a></li>
<li><a href="#wording">
Wording</a>
<ul>
<li><a href="#library_wording">
Library wording</a></li>
<li><a href="#annex_c_wording">
Annex C Compatibility wording</a></li>
<li><a href="#annex_d_wording">
Annex D Compatibility features wording</a></li>
</ul>
</li>
<li><a href="#references">
References</a></li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>The support for <tt>char8_t</tt> as adopted for C++20 via
<a title="char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r6.html">
P0482R6</a>
<sup><a title="char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="#ref_p0482r6">
[P0482R6]</a></sup> affects backward
compatibility for existing C++17 programs in at least the following ways:
<ol>
<li>Introduction of a new <tt>char8_t</tt> keyword, new
<tt>std::u8string</tt>,
<tt>std::u8string_view</tt>,
<tt>std::u8streampos</tt> type aliases and
<tt>std::mbrtoc8</tt> and
<tt>std::c8rtomb</tt> functions; these names may conflict with existing
uses of these names.
</li>
<li>Change of return type for <tt>std::filesystem::path</tt> member functions
<tt>u8string</tt> and <tt>generic_u8string</tt>.
</li>
<li>Change of type for <tt>u8</tt> character and string literals.</li>
</ol>
</p>
<p>This paper does <em>not</em> further discuss case 1 above. Adding new
keywords and new members to the <tt>std</tt> namespace is business as usual;
see
<a title="SD-8: Standard Library Compatibility"
href="https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility">
SD-8</a>
<sup><a title="https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility"
href="#ref_sd8">
[SD-8]</a></sup>. It is
acknowledged that these additions will affect some code bases. Code surveys
have found that these names have generally been used to emulate the set of
features introduced with the adoption of
<a title="char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r6.html">
P0482R6</a>
<sup><a title="char8_t: A type for UTF-8 characters and strings (Revision 6)"
href="#ref_p0482r6">
[P0482R6]</a></sup>. In some cases, existing code has already been updated to
adapt to the new standard features. For example,
<a href="https://github.com/electronicarts/EASTL">EASTL</a> now uses the
standard-provided <tt>char8_t</tt> type when available instead of the type
alias previously used. The pull request for this change can be found at
<a href="https://github.com/electronicarts/EASTL/pull/239">
https://github.com/electronicarts/EASTL/pull/239</a>.
</p>
<p>Case 2 above is a change that does <em>not</em> fit into the set of standard
library rights reserved in
<a title="SD-8: Standard Library Compatibility"
href="https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility">
SD-8</a>
<sup><a title="https://isocpp.org/std/standing-documents/sd-8-standard-library-compatibility"
href="#ref_sd8">
[SD-8]</a></sup>.
This is a cause for concern, but is somewhat mitigated by the fact that
<tt>std::filesystem</tt> is new with C++17 and therefore does not have a long
history of use. Some options for dealing with this change are discussed later
in this paper.
</p>
<p>Case 3 above is the change responsible for most of the backward
compatibility impact.
</p>
<p>This paper is motivated by three goals:
<ul>
<li>To document a set of options available to programmers to facilitate
migration of existing code to C++20. Where possible, options are
presented for writing code that is compatible with both C++17 and C++20.
</li>
<li>To ensure that WG21 members are aware of the backward compatibility
issues and anticipated impact, and find the set of options available to
mitigate the impact acceptable.
</li>
<li>To consider options available to reduce backward compatibility impact.
This paper documents a number of such options, but only proposes two
small standard library changes intended to remove backward compatibility
impact that was not intended by the adoption of P0482R6.
</li>
</ul>
</p>
<h1 id="examples">Examples</h1>
<p>The following table presents examples of well-formed C++17 code that is
either ill-formed or behaves differently in C++20. The table also reflects the
intended changes proposed in this paper. Note that most of these examples
remain ill-formed with this proposal. This is intentional as the examples
reflect problematic code that leads to mojibake in C++17 code due to use of the
same type (<tt>char</tt>) for multiple encodings (execution encoding and UTF-8).
</p>
<p>
<table border="1">
<tr>
<th>Code</th>
<th>C++17</th>
<th>C++20 with P0482R6</th>
<th>C++20 with this proposal</th>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">const char *p = u8"text";</code></pre>
</fieldset>
</td>
<td>Initializes <tt>p</tt> with the address of the UTF-8 encoded string.</td>
<td>Ill-formed.</td>
<td>Ill-formed.</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">char a[] = u8"text";</code></pre>
</fieldset>
</td>
<td>Initializes <tt>a</tt> with the UTF-8 encoded string.</td>
<td>Ill-formed.</td>
<td>Ill-formed.</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">int operator ""_udl(const char*, std::size_t);
int v = u8"text"_udl;</code></pre>
</fieldset>
</td>
<td>Initializes <tt>v</tt> with the result of calling
<tt>operator ""_udl</tt> with the UTF-8 encoded string literal.
</td>
<td>Ill-formed.</td>
<td>Ill-formed.</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">std::string s(u8"text");</code></pre>
</fieldset>
</td>
<td>Initializes <tt>s</tt> with the UTF-8 encoded string.</td>
<td>Ill-formed.</td>
<td>Ill-formed.</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">std::filesystem::path p = ...;
std::string s = p.u8string();</code></pre>
</fieldset>
</td>
<td>Initializes <tt>s</tt> with the UTF-8 encoded representation
of the file path stored in <tt>p</tt>.
</td>
<td>Ill-formed.</td>
<td>Ill-formed.</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">std::cout << u8'x';
std::cout << u8"text";</code></pre>
</fieldset>
</td>
<td>Writes a sequence of UTF-8 code units as characters to stdout.<br/>
(mojibake if the execution character encoding is not UTF-8)
</td>
<td>Writes an integer or pointer value to stdout.<br/>
(consistent with handling of char16_t and char32_t)
</td>
<td>Ill-formed.<br/>
(for all of char8_t, char16_t, and char32_t)
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">std::filesystem::u8path(u8"filename");</code></pre>
</fieldset>
</td>
<td>Constructs a <tt>std::filesystem::path</tt> object from the UTF-8
encoded string.</td>
<td>Ill-formed.</td>
<td>Constructs a <tt>std::filesystem::path</tt> object from the UTF-8
encoded string.</td>
</tr>
</table>
</p>
<h1 id="impact">Anticipated impact</h1>
<p>Code surveys have so far revealed little use of <tt>u8</tt> literals.
Google and Facebook have both reported fewer than 1000 occurrences in their
code bases, approximately half of which occur in test code. Representatives
of both organizations have stated that, given the size of their code bases,
this is approximately equivalent to 0.
</p>
<p>Searches on Debian code search found uses in only a few packages and, within
those packages, a small number of uses (mostly single digit use counts), most
of which occurred in tests.
</p>
<p>Searches have been done on GitHub as well, but GitHub search doesn't
facilitate distinguishing uses of <tt>u8</tt> as an identifier (which is quite
common) from uses as a UTF-8 literal prefix. Further, GitHub doesn't provide a
search that filters out duplicate hits for the same source code in different
repositories. As a result, finding instances of <tt>u8</tt> literals is
challenging. Most cases that were identified were in tests included in clones
of Clang and gcc.
</p>
<p><tt>u8</tt> string literals were added in C++11, but support for <tt>u8</tt>
character literals was only added in C++17.
</p>
<h1 id="remediation">Remediation approaches</h1>
<p>A single approach to addressing backward compatibility impact is unlikely to
be the best approach for all projects. This section presents a number of
options to address various types of backward compatibility impact. In some
cases, the best solution may involve a mix of these options.
</p>
<p>Each of these approaches assumes a requirement for continued use of UTF-8
encoded literals with <tt>char</tt> based types. For most projects, such a
requirement is expected to be temporary while the project is fully migrated to
C++20. However, some projects may retain a sustained need for such literals.
For those projects, the <a href="#emulate">Emulate C++17 <tt>u8</tt>
literals</a> approach is able to address most cases of backward compatibility
impact.
</p>
<h2 id="disable">Disable <tt>char8_t</tt> support</h2>
<p>The simplest short-term solution is to disable the
new features completely. Clang and gcc will allow disabling <tt>char8_t</tt>
features in both the language and standard library, via a <tt>-fno-char8_t</tt>
option. It is expected that Microsoft and EDG based compilers will offer a
similar option.
</p>
<p>This option should be considered a short-term solution to enable testing
existing C++17 code compiled as C++20 with minimal effort. This isn't a
viable long-term option as continued use would potentially complicate
composition with code that depends on the new features.
</p>
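<p>Whether <tt>char8_t</tt> support is active in a given build can be checked
with the <tt>__cpp_char8_t</tt> feature test macro used throughout this paper.
The following is a minimal sketch; the names <tt>char8_t_enabled</tt> and
<tt>u8_code_unit</tt> are illustrative assumptions, not standard facilities. It
is intended to compile as C++17, as C++20, and as C++20 with <tt>char8_t</tt>
support disabled.
</p>

```cpp
#include <type_traits>

// True when the compiler provides char8_t as a distinct type.
#if defined(__cpp_char8_t)
constexpr bool char8_t_enabled = true;
#else
constexpr bool char8_t_enabled = false;
#endif

// The code unit type of a u8 string literal in the current build:
// char8_t when support is enabled, char otherwise.
using u8_code_unit =
    std::remove_cv_t<std::remove_reference_t<decltype(u8"x"[0])>>;

// u8 literals use char exactly when char8_t support is not enabled.
static_assert(std::is_same<u8_code_unit, char>::value != char8_t_enabled,
              "u8 literal type is consistent with the feature test macro");
```

<p>A check of this kind can help confirm that a disabling option was actually
applied to every translation unit of a mixed build.
</p>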
<h2 id="overload">Add overloads</h2>
<p>Adding function overloads that accept <tt>char8_t</tt> based types is an
effective step towards full migration to C++20. Ideally, older <tt>char</tt>
based functions would eventually be removed.
</p>
<table border="1">
<tr>
<th>Before</th>
<th>After</th>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">int ft(const char*);
ft(u8"text");</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">int ft(const char*);
<ins>#if defined(__cpp_char8_t)
int ft(const char8_t*);
#endif</ins>
ft(u8"text"); <ins>// C++17 or C++20</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">int operator ""_udl(const char*, std::size_t);
int v = u8"text"_udl;</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">int operator ""_udl(const char*, std::size_t);
<ins>#if defined(__cpp_char8_t)
int operator ""_udl(const char8_t*, std::size_t);
#endif</ins>
int v = u8"text"_udl; <ins>// C++17 or C++20</ins></code></pre>
</fieldset>
</td>
</tr>
</table>
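<p>The added overload does not need to duplicate the existing implementation.
As a minimal sketch, both overloads of the hypothetical function <tt>ft</tt>
from the table above can delegate to one function template. The helper name
<tt>ft_impl</tt>, the function body, and the use of <tt>constexpr</tt> are
assumptions made for illustration only.
</p>

```cpp
// Shared implementation written once for any code unit type. For
// illustration only, it counts code units up to the terminating null.
// constexpr is used so the result can be checked with static_assert.
template<typename CharT>
constexpr int ft_impl(const CharT* s) {
    int n = 0;
    while (s[n] != CharT()) ++n;
    return n;
}

// Existing char based interface.
constexpr int ft(const char* s) { return ft_impl(s); }

#if defined(__cpp_char8_t)
// Overload selected for u8 literals when char8_t support is enabled.
constexpr int ft(const char8_t* s) { return ft_impl(s); }
#endif

// Well-formed in C++17 and C++20.
static_assert(ft(u8"text") == 4, "u8 literal is accepted by an ft overload");
```

<p>Once all callers pass <tt>char8_t</tt> based arguments, the <tt>char</tt>
based overload can be removed without touching the shared implementation.
</p>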
<h2 id="ordinary">Change <tt>u8</tt> literals to ordinary literals with escape sequences</h2>
<p>This approach may be a reasonable option when the execution encoding is
ASCII based (but not UTF-8; otherwise just use ordinary literals) and
characters outside the basic source character set are infrequently used in
existing <tt>u8</tt> literals. This approach matches how code using UTF-8
had to be written prior to C++11.
</p>
<table border="1">
<tr>
<th>Before</th>
<th>After</th>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">u8"\u00E1"<br/></code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++"><ins>"\xC3\xA1" // U+00E1</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">u8"á"<br/>(assuming source encoding is UTF-8)</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++"><ins>"\xC3\xA1" // U+00E1</ins><br/>(works with any source encoding)</code></pre>
</fieldset>
</td>
</tr>
</table>
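<p>The equivalence relied on in the table above can be checked at compile
time. This is a minimal sketch that relies on U+00E1 being encoded in UTF-8 as
the two code units 0xC3 0xA1; the helper name <tt>same_code_units</tt> is
hypothetical.
</p>

```cpp
#include <cstddef>

// Compares two arrays code unit by code unit, independent of whether the
// element types are char or char8_t.
template<typename A, typename B, std::size_t N>
constexpr bool same_code_units(const A (&a)[N], const B (&b)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        if (static_cast<unsigned char>(a[i]) !=
            static_cast<unsigned char>(b[i]))
            return false;
    }
    return true;
}

// The ordinary literal with hexadecimal escapes holds the same code units
// as the u8 literal, in both C++17 and C++20.
static_assert(same_code_units("\xC3\xA1", u8"\u00E1"),
              "escaped ordinary literal matches the UTF-8 encoding of U+00E1");
```

<p>A check like this can be kept next to rewritten literals to guard against
transcription errors in the escape sequences.
</p>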
<h2 id="reinterpret_cast">reinterpret_cast <tt>u8</tt> literals to <tt>char</tt></h2>
<p>Common uses of <tt>u8</tt> literals can be handled in a backward compatible
manner through use of <tt>reinterpret_cast</tt>. Note that use of
<tt>reinterpret_cast</tt> is well-defined in these situations since
<a href="http://eel.is/c++draft/expr#basic.lval-11">lvalues of type
<tt>char</tt> may be used to access values of other types</a>. Such code is
valid in both C++17 and C++20.
</p>
<p>This approach may suffice when there are just a few uses of UTF-8 literals
that need to be addressed and the uses do not appear in <tt>constexpr</tt>
context. In general, sprinkling <tt>reinterpret_cast</tt> all over a code
base is not desirable.
</p>
<table border="1">
<tr>
<th>Before</th>
<th>After</th>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">const char &r = u8'x';</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">const char &r = <ins>reinterpret_cast<const char &>(</ins>u8'x'<ins>)</ins>; <ins>// C++17 or C++20</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">const char *p = u8"text";</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">const char *p = <ins>reinterpret_cast<const char *>(</ins>u8"text"<ins>)</ins>; <ins>// C++17 or C++20</ins></code></pre>
</fieldset>
</td>
</tr>
</table>
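<p>Where more than a handful of casts are needed, they can be confined to a
single helper so that call sites stay readable. The following is a minimal
sketch; the names <tt>as_char_ptr</tt> and <tt>usage_example</tt> are
hypothetical and are not provided by the standard library.
</p>

```cpp
#include <cassert>
#include <cstring>

// In C++17, or with char8_t support disabled, u8 string literals already
// decay to const char*, so no cast is needed.
inline const char* as_char_ptr(const char* p) {
    return p;
}

#if defined(__cpp_char8_t)
// With char8_t enabled, u8 string literals decay to const char8_t*. Reading
// the code units through an lvalue of type char is permitted.
inline const char* as_char_ptr(const char8_t* p) {
    return reinterpret_cast<const char*>(p);
}
#endif

// The same call is well-formed in C++17 and C++20.
inline void usage_example() {
    const char* p = as_char_ptr(u8"text");
    assert(std::strlen(p) == 4);
}
```

<p>As with direct use of <tt>reinterpret_cast</tt>, the <tt>char8_t</tt>
overload cannot be used in a <tt>constexpr</tt> context.
</p>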
<h2 id="emulate">Emulate C++17 <tt>u8</tt> literals</h2>
<p>The techniques applied here are also applicable to the examples illustrated
in the prior section regarding use of <tt>reinterpret_cast</tt>. This approach
makes use of
<a title="Class Types in Non-Type Template Parameters"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0732r2.pdf">
P0732R2</a>
<sup><a title="Class Types in Non-Type Template Parameters"
href="#ref_p0732r2">
[P0732R2]</a></sup>
to enable constexpr UTF-8 encoded <tt>char</tt> based literals using a
user-defined literal. The example code below defines overloaded character and
string UDL operators named <tt>_as_char</tt>. These UDLs can then be used in
place of existing UTF-8 character and string literals.
</p>
<p>
<fieldset>
<pre><code class="c++">#include <cstddef>
#include <utility>
template<std::size_t N>
struct char8_t_string_literal {
static constexpr inline std::size_t size = N;
template<std::size_t... I>
constexpr char8_t_string_literal(
const char8_t (&r)[N],
std::index_sequence<I...>)
:
s{r[I]...}
{}
constexpr char8_t_string_literal(
const char8_t (&r)[N])
:
char8_t_string_literal(r, std::make_index_sequence<N>())
{}
auto operator <=>(const char8_t_string_literal&) = default;
char8_t s[N];
};
template<char8_t_string_literal L, std::size_t... I>
constexpr inline const char as_char_buffer[sizeof...(I)] =
{ static_cast<char>(L.s[I])... };
template<char8_t_string_literal L, std::size_t... I>
constexpr auto& make_as_char_buffer(std::index_sequence<I...>) {
return as_char_buffer<L, I...>;
}
constexpr char operator ""_as_char(char8_t c) {
return c;
}
template<char8_t_string_literal L>
constexpr auto& operator""_as_char() {
return make_as_char_buffer<L>(std::make_index_sequence<decltype(L)::size>());
}
</code></pre>
</fieldset>
</p>
<table border="1">
<tr>
<th>Before</th>
<th>After</th>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">constexpr const char &r = u8'x';</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">constexpr const char &r = u8'x'<ins>_as_char</ins>; <ins>// C++20 only</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">constexpr const char *p = u8"text";</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">constexpr const char *p = u8"text"<ins>_as_char</ins>; <ins>// C++20 only</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">// gcc extension in C++17; standard C++ doesn't permit conversion
// to arrays of unknown bound.
constexpr const char (&r)[] = u8"text";</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">// Ok in C++20 with <a title="Permit conversions to arrays of unknown bound" href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0388r2.html">P0388R2</a> <sup><a title="Permit conversions to arrays of unknown bound" href="#ref_p0388r2">[P0388R2]</a></sup>
constexpr const char (&r)[] = u8"text"<ins>_as_char</ins>; <ins>// C++20 only</ins></code></pre>
</fieldset>
</td>
</tr>
</table>
<p>When wrapped in macros, the above UDL can be used to retain source
compatibility across C++17 and C++20 for all known scenarios except for
array initialization.
<fieldset><pre><code class="c++">#if defined(__cpp_char8_t)
#define U8(x) u8##x##_as_char
#else
#define U8(x) u8##x
#endif
</code></pre></fieldset>
</p>
<table border="1">
<tr>
<th>Before</th>
<th>After</th>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">constexpr const char &r = u8'x';</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">constexpr const char &r = <ins>U8('x')</ins>; <ins>// C++17 or C++20</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">constexpr const char *p = u8"text";</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">constexpr const char *p = <ins>U8("text")</ins>; <ins>// C++17 or C++20</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">// gcc extension in C++17; standard C++ doesn't permit conversion
// to arrays of unknown bound.
constexpr const char (&r)[] = u8"text";</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">// Ok in C++20 with <a title="Permit conversions to arrays of unknown bound" href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0388r2.html">P0388R2</a> <sup><a title="Permit conversions to arrays of unknown bound" href="#ref_p0388r2">[P0388R2]</a></sup>
constexpr const char (&r)[] = <ins>U8("text")</ins>; <ins>// C++17 or C++20</ins></code></pre>
</fieldset>
</td>
</tr>
</table>
<h2 id="array-subst">Substitute class types for C arrays initialized with <tt>u8</tt> string literals</h2>
<p>In C++17, arrays of <tt>char</tt> may be initialized with <tt>u8</tt> string
literals, but such initialization is ill-formed in C++20. C++17 behavior can
be emulated by substituting a class type with appropriate class template
argument deduction guides.
</p>
<p>
<fieldset>
<pre><code class="c++">#include <cstddef>
#include <type_traits>
#include <utility>
template<std::size_t N>
struct char_array {
template<std::size_t P, std::size_t... I>
constexpr char_array(
const char (&r)[P],
std::index_sequence<I...>)
:
data{(I<P?r[I]:'\0')...}
{}
template<std::size_t P, typename = std::enable_if_t<(P<=N)>>
constexpr char_array(const char(&r)[P])
: char_array(r, std::make_index_sequence<N>())
{}
#if defined(__cpp_char8_t)
template<std::size_t P, std::size_t... I>
constexpr char_array(
const char8_t (&r)[P],
std::index_sequence<I...>)
:
data{(I<P?static_cast<char>(r[I]):'\0')...}
{}
template<std::size_t P, typename = std::enable_if_t<(P<=N)>>
constexpr char_array(const char8_t(&r)[P])
: char_array(r, std::make_index_sequence<N>())
{}
#endif
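// Conversion functions to array references must name the target type
// through an alias; an array declarator is not permitted here.
using array_type = char[N];
constexpr operator const array_type&() const {
return data;
}
constexpr operator array_type&() {
return data;
}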
char data[N];
};
template<std::size_t N>
char_array(const char(&)[N]) -> char_array<N>;
#if defined(__cpp_char8_t)
template<std::size_t N>
char_array(const char8_t(&)[N]) -> char_array<N>;
#endif
</code></pre>
</fieldset>
</p>
<table border="1">
<tr>
<th>Before</th>
<th>After</th>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">char a[] = u8"text";</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++"><ins>char_array</ins> a = u8"text"; <ins>// Ok, initialized with "text\0"</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">constexpr char a[] = u8"text";</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">constexpr <ins>char_array</ins> a = u8"text"; <ins>// Ok, initialized with "text\0"</ins></code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">constexpr char a[3] = u8"text"; // ill-formed</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">constexpr <ins>char_array<3></ins> a = u8"text"; // ill-formed (too many initializers)</code></pre>
</fieldset>
</td>
</tr>
<tr>
<td>
<fieldset><pre><code class="c++">constexpr char a[6] = u8"text";</code></pre>
</fieldset>
</td>
<td>
<fieldset><pre><code class="c++">constexpr <ins>char_array<6></ins> a = u8"text"; <ins>// Ok, initialized with "text\0\0"</ins></code></pre>
</fieldset>
</td>
</tr>
</table>
<h2 id="conversion_fns">Use explicit conversion functions</h2>
<p>Explicit conversion functions can be used, in a C++17 compatible manner,
to cope with the change of return type to the <tt>std::filesystem::path</tt>
member functions when a UTF-8 encoded path is desired in an object of type
<tt>std::string</tt>. For example:
</p>
<p>
<fieldset><pre><code class="c++">std::string from_u8string(const std::string &s) {
return s;
}
std::string from_u8string(std::string &&s) {
return std::move(s);
}
#if defined(__cpp_lib_char8_t)
std::string from_u8string(const std::u8string &s) {
return std::string(s.begin(), s.end());
}
#endif
std::filesystem::path p = ...;
std::string s = from_u8string(p.u8string()); // C++17 or C++20</code></pre>
</fieldset>
</p>
<p>This naturally incurs a cost when building with <tt>char8_t</tt> support
enabled due to the need to copy the path contents.
</p>
<h2 id="tooling">Tooling</h2>
<p>Tooling could potentially assist programmers in migrating code. Several of
the approaches discussed above could be applied mechanically to an existing
code base. For example, re-writing existing <tt>u8</tt> literals to ordinary
literals with escape sequences, or adding an <tt>_as_char</tt> UDL suffix to
existing literals (inserting include directives as needed).
</p>
<h1 id="options">Options considered to reduce backward compatibility impact</h1>
<p>The following sections summarize options that have been considered to
reduce backward compatibility impact. Most of these options are <em>not</em>
proposed in this paper because they would actively interfere with a goal of
the <tt>char8_t</tt> proposal: enabling the type system to protect against
inadvertent mixing of UTF-8 data and the execution encoding. However, some
of these options may be useful for some code bases and could be provided by
implementations as opt-in extensions.
</p>
<p>Only two of these options (7 and 8) are proposed for inclusion in the
standard. In both of these cases, the concern that is addressed was not
specifically intended by the changes adopted in P0482R6. These are
effectively bug fixes.
</p>
<h2 id="option1">1) Reinstate <tt>u8</tt> literals as type <tt>char</tt> and introduce a new literal prefix for <tt>char8_t</tt></h2>
<p><em>Not proposed</em></p>
<p>Many of the backward compatibility concerns could be avoided by reinstating
<tt>u8</tt> literals as having type <tt>char</tt> and introducing a new prefix,
for example <tt>U8</tt>, to specify UTF-8 literals with type <tt>char8_t</tt>.
</p>
<p>The visible difference between <tt>u8</tt> and <tt>U8</tt> is subtle. Some
coding compliance standards, such as MISRA, forbid use of identifiers that
differ only in case. It has been suggested that C++11's use of <tt>u</tt> and
<tt>U</tt> to denote UTF-16 and UTF-32 literals was a mistake because the
visual distinction is too subtle. To avoid these subtle visual differences,
new literal prefixes such as <tt>utf8</tt>, <tt>utf16</tt>, and <tt>utf32</tt>
could be introduced and the old ones deprecated. The downside of these
prefixes is, of course, that they are longer.
</p>
<p>Implementing this option would continue enabling problems with encoding
confusion that we see today. The execution encoding is not UTF-8 on some
popular platforms and continuing to use <tt>char</tt> based types for
execution encoding and UTF-8 (and other untrusted input or encodings) is a
recipe for continued occurrences of mojibake in applications. For platforms
that use UTF-8 as the execution encoding, ordinary literals are already UTF-8
encoded. This option would introduce three distinct ways of writing UTF-8
literals on such platforms; having two ways to do (almost) the same things is
usually one too many already.
</p>
<h2 id="option2">2) Allow implicit conversions from <tt>char8_t</tt> to <tt>char</tt></h2>
<p><em>Not proposed</em></p>
<p>Allowing implicit conversions from <tt>char8_t</tt> to <tt>char</tt> was
considered with the original P0482 proposal. The concerns with this approach
are the same as in option 1; this enables continued, potentially unintended,
mixing of UTF-8 data with non-UTF-8 data resulting in mojibake.
</p>
<p>Additionally, allowing implicit conversions would not address all
compatibility concerns. For example:
<fieldset><pre><code class="c++">template<typename T> void f(T); // #1
void f(char); // #2
f(u8'x'); // Calls #2 in C++17, would still call #1 in C++20.</code></pre></fieldset>
</p>
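<p>The overload resolution behavior described above can be observed directly.
In this minimal sketch the functions return an identifying value and are
declared <tt>constexpr</tt>; both are additions made for illustration so that
the selected overload can be checked with <tt>static_assert</tt>.
</p>

```cpp
// #1: matches an argument of any type exactly.
template<typename T> constexpr int f(T) { return 1; }
// #2: non-template overload for char.
constexpr int f(char) { return 2; }

#if defined(__cpp_char8_t)
// u8'x' has type char8_t: #1 is an exact match, whereas #2 would require
// a conversion, so #1 is selected.
static_assert(f(u8'x') == 1, "template overload selected");
#else
// u8'x' has type char: both overloads match exactly and the non-template
// overload #2 is preferred.
static_assert(f(u8'x') == 2, "char overload selected");
#endif
```

<p>Because an exact match is always preferred over a conversion, no implicit
conversion rule can restore the C++17 result for this example.
</p>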
<p>However, such implicit conversions could still be useful for some existing
code. Implementations could offer extensions to enable such conversions.
</p>
<h2 id="option3">3) Allow initializing an array of <tt>char</tt> with a <tt>u8</tt> string literal</h2>
<p><em>Not proposed</em></p>
<p>This option would allow the following code to remain well-formed in C++20.
</p>
<fieldset><pre><code class="c++">char a[] = u8"text";</code></pre></fieldset>
<p>Array initialization is the one context in which neither the previously
discussed use of <tt>reinterpret_cast</tt> nor the <tt>_as_char</tt> UDL is an option.
This option would allow array initializations to remain well-formed and avoid
the need for workarounds like the previously discussed <tt>char_array</tt>
template. However, this option would continue to promote mixing of UTF-8 data
with non-UTF-8 data potentially resulting in mojibake.
</p>
<p>Implementations could allow these initializations as a conforming extension.
</p>
<h2 id="option4">4) Allow initializing an array with a reference to an array</h2>
<p><em>Not proposed</em></p>
<p>This option would enable use of the previously discussed <tt>_as_char</tt>
UDL to initialize an array without the need for workarounds like the previously
discussed <tt>char_array</tt> template. However, this option would continue to
promote mixing of UTF-8 data with non-UTF-8 data potentially resulting in
mojibake.
</p>
<fieldset><pre><code class="c++">char a[] = u8"text"_as_char;</code></pre></fieldset>
<p>Implementations could allow these initializations as a conforming extension.
</p>
<h2 id="option5">5) Allow <tt>std::string</tt> to be initialized with <tt>char8_t</tt> based types</h2>
<p><em>Not proposed</em></p>
<p>This option has been suggested as a way to allow some existing uses of
<tt>std::string</tt> to hold UTF-8 data to remain valid in C++20. For
example:
</p>
<fieldset><pre><code class="c++">std::string s1 = u8"text";
std::string s2 = s1 + u8"text";</code></pre></fieldset>
<p>This option constitutes a narrow fix for a few specific use cases within a
considerably larger problem space. Further, it would require changes to
<tt>std::basic_string</tt> specifically for its <tt>char</tt>-based
specializations. As with previously discussed options, this would again
continue to promote mixing of UTF-8 data with non-UTF-8 data potentially
resulting in mojibake.
</p>
<h2 id="option6">6) Allow implicit conversions from <tt>std::u8string</tt> to <tt>std::string</tt></h2>
<p><em>Not proposed</em></p>
<p>This option has been suggested as a means to address the backward
compatibility impact due to the changes to the <tt>std::filesystem::path</tt>
<tt>u8string</tt> and <tt>generic_u8string</tt> member functions. It would
allow code like the following to continue to work as expected:
</p>
<fieldset><pre><code class="c++">std::filesystem::path p = ...;
std::string s1 = p.u8string();</code></pre></fieldset>
<p>This option is, again, not proposed because it would allow unintended
mixing of UTF-8 encoded data and the execution character encoding.
</p>
<h2 id="option7">7) Add deleted ostream inserters for <tt>char8_t</tt>, <tt>char16_t</tt>, and <tt>char32_t</tt></h2>
<p><em>Proposed</em></p>
<p>An unintended and silent behavioral change was introduced with the adoption
of P0482R6. In C++17, the following code wrote the code units of the literals
to stdout. In C++20, this code now writes the character literal as a number,
and the address of the string literal, to stdout.
</p>