[lex.charset] Define 'valid encoding' #5101

Open: wants to merge 1 commit into base: main
26 changes: 18 additions & 8 deletions source/lex.tex
@@ -346,9 +346,11 @@
 to a wide character or string literal.

 \pnum
-A literal encoding or a locale-specific encoding of one of
-the execution character sets\iref{character.seq}
-encodes each element of the basic literal character set as
+An encoding is \defnx{valid}{encoding!valid} if all of the following
+conditions are satisfied:
+\begin{itemize}
+\item
+Each element of the basic literal character set is encoded as
 a single code unit with non-negative value,
 distinct from the code unit for any other such element.
 \begin{note}
@@ -357,22 +359,30 @@
 the value of such a code unit can be the same as
 that of a code unit for an element of the basic literal character set.
 \end{note}
+\item
 \indextext{character!null}%
 \indextext{wide-character!null}%
-The \unicode{0000}{null} character is encoded as the value \tcode{0}.
-No other element of the translation character set
+The \unicode{0000}{null} character is encoded as the value \tcode{0};
+no other element of the translation character set
 is encoded with a code unit of value \tcode{0}.
+\item
 The code unit value of each decimal digit character after the digit \tcode{0} (\ucode{0030})
-shall be one greater than the value of the previous.
-The ordinary and wide literal encodings are otherwise
-\impldef{ordinary and wide literal encodings}.
+is one greater than the value of the previous.
+\end{itemize}
+
+\pnum
+The ordinary and wide literal encodings are valid encodings,
+but are otherwise \impldef{ordinary and wide literal encodings}.
 \indextext{UTF-8}%
 \indextext{UTF-16}%
 \indextext{UTF-32}%
 For a UTF-8, UTF-16, or UTF-32 literal,
 the UCS scalar value
 corresponding to each character of the translation character set
 is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
+\begin{note}
+Those encodings are also valid encodings.
+\end{note}
 \indextext{character set|)}

 \rSec1[lex.pptoken]{Preprocessing tokens}
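Illustration (not part of the change): the three conditions for a valid encoding can be spot-checked at compile time against the ordinary literal encoding of whatever implementation compiles the code. The sketch below assumes C++14 or later; the helper names digits_are_consecutive and sample_is_distinct_and_non_negative are hypothetical, and the second helper tests only a sample of the basic literal character set, not every element.

```cpp
#include <cstddef>

// Hypothetical helper: checks that each decimal digit after '0' has a code
// unit value one greater than the previous digit (third condition).
constexpr bool digits_are_consecutive() {
    constexpr char digits[] = "0123456789";
    for (std::size_t i = 1; i < 10; ++i) {
        if (digits[i] != digits[i - 1] + 1) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: checks that a sample of basic literal characters is
// encoded with non-negative, pairwise distinct code units (first condition).
// This is a sample only; it does not cover the whole character set.
constexpr bool sample_is_distinct_and_non_negative() {
    constexpr char sample[] =
        "abcxyzABCXYZ0189 _{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    constexpr std::size_t n = sizeof(sample) - 1;  // exclude the terminator
    for (std::size_t i = 0; i < n; ++i) {
        if (sample[i] < 0) {
            return false;
        }
        for (std::size_t j = i + 1; j < n; ++j) {
            if (sample[i] == sample[j]) {
                return false;
            }
        }
    }
    return true;
}

// Second condition: the null character is encoded as the value 0, and the
// sampled characters above are not (they precede the terminator).
static_assert('\0' == 0, "null character must have value 0");
static_assert(L'\0' == 0, "wide null character must have value 0");
static_assert(digits_are_consecutive(), "digits must be consecutive");
static_assert(sample_is_distinct_and_non_negative(),
              "sampled characters must be distinct and non-negative");
```

Because the checks are static_assert declarations, a translation unit containing them compiles only if the implementation's ordinary literal encoding satisfies the sampled conditions. The locale-specific execution encodings addressed in lib-intro.tex cannot be checked this way, since they are only known at run time.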
7 changes: 4 additions & 3 deletions source/lib-intro.tex
@@ -654,9 +654,10 @@
 \item
 The \defnadj{execution}{character set} and
 the \defnadj{execution}{wide-character set}
-are supersets of the basic literal character set\iref{lex.charset}.
-The encodings of the execution character sets and
-the sets of additional elements (if any) are locale-specific.
+are supersets of the basic literal character set.
+The sets of additional elements (if any) are locale-specific.
+The encodings of the execution character sets are locale-specific,
+but valid\iref{lex.charset}.
 \begin{note}
 The encodings of the execution character sets can be unrelated
 to any literal encoding.