Commit cd11068

miss-islington and Ma Lin authored
bpo-38056: overhaul Error Handlers section in codecs documentation (GH-15732)
* Some handlers were wrongly described as text-encoding only, but actually they can also be used in text-decoding.
* Add more description to each handler.
* Add two REPL examples.
* Add indexes for Error Handler's name.

Co-authored-by: Kyle Stanley <aeros167@gmail.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
(cherry picked from commit 5bc2390)

Co-authored-by: Ma Lin <animalize@users.noreply.github.com>
1 parent 1585796 commit cd11068

3 files changed

+127
-74
lines changed

Doc/glossary.rst

Lines changed: 10 additions & 1 deletion
@@ -1136,7 +1136,16 @@ Glossary
11361136
See also :term:`borrowed reference`.
11371137

11381138
text encoding
1139-
A codec which encodes Unicode strings to bytes.
1139+
A string in Python is a sequence of Unicode code points (in range
1140+
``U+0000``--``U+10FFFF``). To store or transfer a string, it needs to be
1141+
serialized as a sequence of bytes.
1142+
1143+
Serializing a string into a sequence of bytes is known as "encoding", and
1144+
recreating the string from the sequence of bytes is known as "decoding".
1145+
1146+
There are a variety of different text serialization
1147+
:ref:`codecs <standard-encodings>`, which are collectively referred to as
1148+
"text encodings".
11401149

11411150
text file
11421151
A :term:`file object` able to read and write :class:`str` objects.
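
The encode/decode round trip described in the new "text encoding" glossary entry can be sketched as follows. This is a minimal illustration, not part of the commit; the sample string and the choice of UTF-8 are assumptions made for the example.

```python
# A str is a sequence of Unicode code points; bytes are its serialized form.
text = "German ß, ♬"

# "Encoding": serialize the string into a sequence of bytes.
data = text.encode("utf-8")

# "Decoding": recreate the string from the sequence of bytes.
restored = data.decode("utf-8")

# Every code point of a str lies in the range U+0000..U+10FFFF.
in_range = all(0 <= ord(ch) <= 0x10FFFF for ch in text)

print(data)
print(restored == text, in_range)
```

Any other text encoding from the standard encodings list could be substituted for UTF-8 here, provided it can represent every character in the string.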

Doc/library/codecs.rst

Lines changed: 116 additions & 73 deletions
@@ -23,11 +23,11 @@
2323
This module defines base classes for standard Python codecs (encoders and
2424
decoders) and provides access to the internal Python codec registry, which
2525
manages the codec and error handling lookup process. Most standard codecs
26-
are :term:`text encodings <text encoding>`, which encode text to bytes,
27-
but there are also codecs provided that encode text to text, and bytes to
28-
bytes. Custom codecs may encode and decode between arbitrary types, but some
29-
module features are restricted to use specifically with
30-
:term:`text encodings <text encoding>`, or with codecs that encode to
26+
are :term:`text encodings <text encoding>`, which encode text to bytes (and
27+
decode bytes to text), but there are also codecs provided that encode text to
28+
text, and bytes to bytes. Custom codecs may encode and decode between arbitrary
29+
types, but some module features are restricted to be used specifically with
30+
:term:`text encodings <text encoding>` or with codecs that encode to
3131
:class:`bytes`.
3232

3333
The module defines the following functions for encoding and decoding with
@@ -297,58 +297,56 @@ codec will handle encoding and decoding errors.
297297
Error Handlers
298298
^^^^^^^^^^^^^^
299299

300-
To simplify and standardize error handling,
301-
codecs may implement different error handling schemes by
302-
accepting the *errors* string argument. The following string values are
303-
defined and implemented by all standard Python codecs:
300+
To simplify and standardize error handling, codecs may implement different
301+
error handling schemes by accepting the *errors* string argument:
304302

305-
.. tabularcolumns:: |l|L|
306-
307-
+-------------------------+-----------------------------------------------+
308-
| Value | Meaning |
309-
+=========================+===============================================+
310-
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); |
311-
| | this is the default. Implemented in |
312-
| | :func:`strict_errors`. |
313-
+-------------------------+-----------------------------------------------+
314-
| ``'ignore'`` | Ignore the malformed data and continue |
315-
| | without further notice. Implemented in |
316-
| | :func:`ignore_errors`. |
317-
+-------------------------+-----------------------------------------------+
318-
319-
The following error handlers are only applicable to
320-
:term:`text encodings <text encoding>`:
303+
>>> 'German ß, ♬'.encode(encoding='ascii', errors='backslashreplace')
304+
b'German \\xdf, \\u266c'
305+
>>> 'German ß, ♬'.encode(encoding='ascii', errors='xmlcharrefreplace')
306+
b'German &#223;, &#9836;'
321307

322308
.. index::
309+
pair: strict; error handler's name
310+
pair: ignore; error handler's name
311+
pair: replace; error handler's name
312+
pair: backslashreplace; error handler's name
313+
pair: surrogateescape; error handler's name
323314
single: ? (question mark); replacement character
324315
single: \ (backslash); escape sequence
325316
single: \x; escape sequence
326317
single: \u; escape sequence
327318
single: \U; escape sequence
328-
single: \N; escape sequence
319+
320+
The following error handlers can be used with all Python
321+
:ref:`standard-encodings` codecs:
322+
323+
.. tabularcolumns:: |l|L|
329324

330325
+-------------------------+-----------------------------------------------+
331326
| Value | Meaning |
332327
+=========================+===============================================+
333-
| ``'replace'`` | Replace with a suitable replacement |
334-
| | marker; Python will use the official |
335-
| | ``U+FFFD`` REPLACEMENT CHARACTER for the |
336-
| | built-in codecs on decoding, and '?' on |
337-
| | encoding. Implemented in |
338-
| | :func:`replace_errors`. |
328+
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass), |
329+
| | this is the default. Implemented in |
330+
| | :func:`strict_errors`. |
339331
+-------------------------+-----------------------------------------------+
340-
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character |
341-
| | reference (only for encoding). Implemented |
342-
| | in :func:`xmlcharrefreplace_errors`. |
332+
| ``'ignore'`` | Ignore the malformed data and continue without|
333+
| | further notice. Implemented in |
334+
| | :func:`ignore_errors`. |
335+
+-------------------------+-----------------------------------------------+
336+
| ``'replace'`` | Replace with a replacement marker. On |
337+
| | encoding, use ``?`` (ASCII character). On |
338+
| | decoding, use ```` (U+FFFD, the official |
339+
| | REPLACEMENT CHARACTER). Implemented in |
340+
| | :func:`replace_errors`. |
343341
+-------------------------+-----------------------------------------------+
344342
| ``'backslashreplace'`` | Replace with backslashed escape sequences. |
343+
| | On encoding, use hexadecimal form of Unicode |
344+
| | code point with formats ``\xhh`` ``\uxxxx`` |
345+
| | ``\Uxxxxxxxx``. On decoding, use hexadecimal |
346+
| | form of byte value with format ``\xhh``. |
345347
| | Implemented in |
346348
| | :func:`backslashreplace_errors`. |
347349
+-------------------------+-----------------------------------------------+
348-
| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences |
349-
| | (only for encoding). Implemented in |
350-
| | :func:`namereplace_errors`. |
351-
+-------------------------+-----------------------------------------------+
352350
| ``'surrogateescape'`` | On decoding, replace byte with individual |
353351
| | surrogate code ranging from ``U+DC80`` to |
354352
| | ``U+DCFF``. This code will then be turned |
@@ -358,27 +356,55 @@ The following error handlers are only applicable to
358356
| | more.) |
359357
+-------------------------+-----------------------------------------------+
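
As a rough illustration of the table above (not part of the commit), the following sketch applies the generally available handlers in both directions, which is the point the commit message makes about decoding. The sample values and the use of the ASCII codec are assumptions chosen to force an error.

```python
text = "German ß"      # 'ß' cannot be encoded as ASCII
raw = b"caf\xe9"       # byte 0xE9 cannot be decoded as ASCII

# Encoding direction (str -> bytes).
enc_ignore = text.encode("ascii", errors="ignore")
enc_replace = text.encode("ascii", errors="replace")
enc_backslash = text.encode("ascii", errors="backslashreplace")

# Decoding direction (bytes -> str).
dec_ignore = raw.decode("ascii", errors="ignore")
dec_replace = raw.decode("ascii", errors="replace")
dec_backslash = raw.decode("ascii", errors="backslashreplace")

# 'surrogateescape' maps the bad byte to a lone surrogate on decoding
# and turns it back into the same byte on encoding.
dec_escape = raw.decode("ascii", errors="surrogateescape")
round_trip = dec_escape.encode("ascii", errors="surrogateescape")

# 'strict' is the default and raises a UnicodeError subclass.
try:
    text.encode("ascii")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

print(enc_ignore, enc_replace, enc_backslash)
print(ascii(dec_ignore), ascii(dec_replace), ascii(dec_backslash))
print(ascii(dec_escape), round_trip, strict_raised)
```

The values are printed with ``ascii()`` because a string containing a lone surrogate cannot always be written to a terminal directly.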
360358

359+
.. index::
360+
pair: xmlcharrefreplace; error handler's name
361+
pair: namereplace; error handler's name
362+
single: \N; escape sequence
363+
364+
The following error handlers are only applicable to encoding (within
365+
:term:`text encodings <text encoding>`):
366+
367+
+-------------------------+-----------------------------------------------+
368+
| Value | Meaning |
369+
+=========================+===============================================+
370+
| ``'xmlcharrefreplace'`` | Replace with XML/HTML numeric character |
371+
| | reference, which is a decimal form of Unicode |
372+
| | code point with format ``&#num;``. Implemented |
373+
| | in :func:`xmlcharrefreplace_errors`. |
374+
+-------------------------+-----------------------------------------------+
375+
| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences, |
376+
| | what appears in the braces is the Name |
377+
| | property from Unicode Character Database. |
378+
| | Implemented in :func:`namereplace_errors`. |
379+
+-------------------------+-----------------------------------------------+
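
The two encoding-only handlers can be sketched like this. The example is illustrative and not part of the commit; the sample string and the ASCII target codec are assumptions.

```python
text = "ß €"

# Decimal numeric character references: U+00DF -> 223, U+20AC -> 8364.
as_xml = text.encode("ascii", errors="xmlcharrefreplace")

# \N{...} escapes using the Name property from the Unicode Character Database.
as_names = text.encode("ascii", errors="namereplace")

print(as_xml)
print(as_names)
```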
380+
381+
.. index::
382+
pair: surrogatepass; error handler's name
383+
361384
In addition, the following error handler is specific to the given codecs:
362385

363386
+-------------------+------------------------+-------------------------------------------+
364387
| Value | Codecs | Meaning |
365388
+===================+========================+===========================================+
366-
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate |
367-
| | utf-16-be, utf-16-le, | codes. These codecs normally treat the |
368-
| | utf-32-be, utf-32-le | presence of surrogates as an error. |
389+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding surrogate code|
390+
| | utf-16-be, utf-16-le, | point (``U+D800`` - ``U+DFFF``) as normal |
391+
| | utf-32-be, utf-32-le | code point. Otherwise these codecs treat |
392+
| | | the presence of surrogate code point in |
393+
| | | :class:`str` as an error. |
369394
+-------------------+------------------------+-------------------------------------------+
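
A short sketch of ``'surrogatepass'`` with the UTF-8 codec, assuming a string that contains the lone surrogate ``U+D800``. It is an illustration added here, not text from the commit.

```python
lone = "\ud800"  # a lone surrogate code point

# Without the handler, the UTF-8 codec treats the surrogate as an error.
try:
    lone.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# With 'surrogatepass' the surrogate is encoded like a normal code point.
passed = lone.encode("utf-8", errors="surrogatepass")

# The same handler is needed to decode those bytes again.
try:
    passed.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

back = passed.decode("utf-8", errors="surrogatepass")

print(passed, strict_encode_failed, strict_decode_failed, back == lone)
```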
370395

371396
.. versionadded:: 3.1
372397
The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.
373398

374399
.. versionchanged:: 3.4
375-
The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.
400+
The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\*
401+
codecs.
376402

377403
.. versionadded:: 3.5
378404
The ``'namereplace'`` error handler.
379405

380406
.. versionchanged:: 3.5
381-
The ``'backslashreplace'`` error handlers now works with decoding and
407+
The ``'backslashreplace'`` error handler now works with decoding and
382408
translating.
383409

384410
The set of allowed values can be extended by registering a new named error
@@ -421,42 +447,59 @@ functions:
421447

422448
.. function:: strict_errors(exception)
423449

424-
Implements the ``'strict'`` error handling: each encoding or
425-
decoding error raises a :exc:`UnicodeError`.
450+
Implements the ``'strict'`` error handling.
426451

452+
Each encoding or decoding error raises a :exc:`UnicodeError`.
427453

428-
.. function:: replace_errors(exception)
429454

430-
Implements the ``'replace'`` error handling (for :term:`text encodings
431-
<text encoding>` only): substitutes ``'?'`` for encoding errors
432-
(to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
433-
character) for decoding errors.
455+
.. function:: ignore_errors(exception)
434456

457+
Implements the ``'ignore'`` error handling.
435458

436-
.. function:: ignore_errors(exception)
459+
Malformed data is ignored; encoding or decoding is continued without
460+
further notice.
437461

438-
Implements the ``'ignore'`` error handling: malformed data is ignored and
439-
encoding or decoding is continued without further notice.
440462

463+
.. function:: replace_errors(exception)
441464

442-
.. function:: xmlcharrefreplace_errors(exception)
465+
Implements the ``'replace'`` error handling.
443466

444-
Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
445-
:term:`text encodings <text encoding>` only): the
446-
unencodable character is replaced by an appropriate XML character reference.
467+
Substitutes ``?`` (ASCII character) for encoding errors or ```` (U+FFFD,
468+
the official REPLACEMENT CHARACTER) for decoding errors.
447469

448470

449471
.. function:: backslashreplace_errors(exception)
450472

451-
Implements the ``'backslashreplace'`` error handling (for
452-
:term:`text encodings <text encoding>` only): malformed data is
453-
replaced by a backslashed escape sequence.
473+
Implements the ``'backslashreplace'`` error handling.
474+
475+
Malformed data is replaced by a backslashed escape sequence.
476+
On encoding, use the hexadecimal form of Unicode code point with formats
477+
``\xhh`` ``\uxxxx`` ``\Uxxxxxxxx``. On decoding, use the hexadecimal form of
478+
byte value with format ``\xhh``.
479+
480+
.. versionchanged:: 3.5
481+
Works with decoding and translating.
482+
483+
484+
.. function:: xmlcharrefreplace_errors(exception)
485+
486+
Implements the ``'xmlcharrefreplace'`` error handling (for encoding within
487+
:term:`text encoding` only).
488+
489+
The unencodable character is replaced by an appropriate XML/HTML numeric
490+
character reference, which is a decimal form of Unicode code point with
491+
format ``&#num;`` .
492+
454493

455494
.. function:: namereplace_errors(exception)
456495

457-
Implements the ``'namereplace'`` error handling (for encoding with
458-
:term:`text encodings <text encoding>` only): the
459-
unencodable character is replaced by a ``\N{...}`` escape sequence.
496+
Implements the ``'namereplace'`` error handling (for encoding within
497+
:term:`text encoding` only).
498+
499+
The unencodable character is replaced by a ``\N{...}`` escape sequence. The
500+
set of characters that appear in the braces is the Name property from
501+
Unicode Character Database. For example, the German lowercase letter ``'ß'``
502+
will be converted to byte sequence ``\N{LATIN SMALL LETTER SHARP S}`` .
460503

461504
.. versionadded:: 3.5
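
The handler functions documented above take an exception instance and return a ``(replacement, position)`` tuple, so they can be called directly. The sketch below assumes hand-built exception objects, and the handler name ``example-hexmarker`` and function ``hex_marker`` are hypothetical names invented for this illustration; only ``codecs.register_error`` and the ``*_errors`` functions come from the module itself.

```python
import codecs

# Hand-built exceptions describing one bad character / one bad byte.
enc_exc = UnicodeEncodeError("ascii", "aßb", 1, 2, "ordinal not in range(128)")
dec_exc = UnicodeDecodeError("ascii", b"a\xffb", 1, 2, "ordinal not in range(128)")

# Each built-in handler returns (replacement, position to resume at).
replace_enc = codecs.replace_errors(enc_exc)
replace_dec = codecs.replace_errors(dec_exc)
ignore_enc = codecs.ignore_errors(enc_exc)
backslash_enc = codecs.backslashreplace_errors(enc_exc)
xml_enc = codecs.xmlcharrefreplace_errors(enc_exc)
name_enc = codecs.namereplace_errors(enc_exc)


def hex_marker(exc):
    """Hypothetical custom handler: mark unencodable characters as <U+XXXX>."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(f"<U+{ord(ch):04X}>" for ch in bad), exc.end
    raise exc  # this sketch does not handle decoding errors


# Register the custom handler under an illustrative name, then use it.
codecs.register_error("example-hexmarker", hex_marker)
custom = "aßb".encode("ascii", errors="example-hexmarker")

print(replace_enc, replace_dec, ignore_enc)
print(backslash_enc, xml_enc, name_enc)
print(custom)
```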
462505

@@ -470,7 +513,7 @@ The base :class:`Codec` class defines these methods which also define the
470513
function interfaces of the stateless encoder and decoder:
471514

472515

473-
.. method:: Codec.encode(input[, errors])
516+
.. method:: Codec.encode(input, errors='strict')
474517

475518
Encodes the object *input* and returns a tuple (output object, length consumed).
476519
For instance, :term:`text encoding` converts
@@ -488,7 +531,7 @@ function interfaces of the stateless encoder and decoder:
488531
of the output object type in this situation.
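
The ``(output object, length consumed)`` return value of the stateless interface, and the ``errors='strict'`` default shown in the new signature, can be sketched as below. The example goes through ``codecs.lookup()`` and shows the decoding counterpart as well; the sample inputs are assumptions.

```python
import codecs

utf8 = codecs.lookup("utf-8")

# Stateless encode: one character consumed, two bytes produced.
encoded, chars_consumed = utf8.encode("ß")

# Stateless decode: two bytes consumed, one character produced.
decoded, bytes_consumed = utf8.decode(b"\xc3\x9f")

# *errors* defaults to 'strict'; another handler can be passed explicitly.
ascii_codec = codecs.lookup("ascii")
replaced, _ = ascii_codec.encode("ß", "replace")

print(encoded, chars_consumed)
print(decoded, bytes_consumed)
print(replaced)
```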
489532

490533

491-
.. method:: Codec.decode(input[, errors])
534+
.. method:: Codec.decode(input, errors='strict')
492535

493536
Decodes the object *input* and returns a tuple (output object, length
494537
consumed). For instance, for a :term:`text encoding`, decoding converts
@@ -555,7 +598,7 @@ define in order to be compatible with the Python codec registry.
555598
object.
556599

557600

558-
.. method:: encode(object[, final])
601+
.. method:: encode(object, final=False)
559602

560603
Encodes *object* (taking the current state of the encoder into account)
561604
and returns the resulting encoded object. If this is the last call to
@@ -612,7 +655,7 @@ define in order to be compatible with the Python codec registry.
612655
object.
613656

614657

615-
.. method:: decode(object[, final])
658+
.. method:: decode(object, final=False)
616659

617660
Decodes *object* (taking the current state of the decoder into account)
618661
and returns the resulting decoded object. If this is the last call to
@@ -746,7 +789,7 @@ compatible with the Python codec registry.
746789
:func:`register_error`.
747790

748791

749-
.. method:: read([size[, chars, [firstline]]])
792+
.. method:: read(size=-1, chars=-1, firstline=False)
750793

751794
Decodes data from the stream and returns the resulting object.
752795

@@ -772,7 +815,7 @@ compatible with the Python codec registry.
772815
available on the stream, these should be read too.
773816

774817

775-
.. method:: readline([size[, keepends]])
818+
.. method:: readline(size=None, keepends=True)
776819

777820
Read one line from the input stream and return the decoded data.
778821

@@ -783,7 +826,7 @@ compatible with the Python codec registry.
783826
returned.
784827

785828

786-
.. method:: readlines([sizehint[, keepends]])
829+
.. method:: readlines(sizehint=None, keepends=True)
787830

788831
Read all lines available on the input stream and return them as a list of
789832
lines.
@@ -874,7 +917,7 @@ Encodings and Unicode
874917
---------------------
875918

876919
Strings are stored internally as sequences of code points in
877-
range ``0x0``--``0x10FFFF``. (See :pep:`393` for
920+
range ``U+0000``--``U+10FFFF``. (See :pep:`393` for
878921
more details about the implementation.)
879922
Once a string object is used outside of CPU and memory, endianness
880923
and how these arrays are stored as bytes become an issue. As with other
@@ -955,7 +998,7 @@ encoding was used for encoding a string. Each charmap encoding can
955998
decode any random byte sequence. However that's not possible with UTF-8, as
956999
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
9571000
sequences. To increase the reliability with which a UTF-8 encoding can be
958-
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
1001+
detected, Microsoft invented a variant of UTF-8 (that Python calls
9591002
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
9601003
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
9611004
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
Overhaul the :ref:`error-handlers` documentation in :mod:`codecs`.
