169 changes: 135 additions & 34 deletions pep-0597.rst
@@ -21,6 +21,9 @@ The warning is disabled by default. New ``-X warn_encoding``
command-line option and ``PYTHONWARNENCODING`` environment variable
are used to enable the warnings.

An ``encoding="locale"`` option is added too. It is used to specify
the locale encoding explicitly.


Motivation
==========
@@ -39,34 +42,57 @@ in the ``README.md`` file which is encoded in UTF-8.
For example, 489 of the 4000 most downloaded packages from
PyPI used non-ASCII characters in their README. Of those, 82 packages
cannot be installed from the source package when the locale encoding is
ASCII. [1]_ They used the default encoding to read a README or TOML
file.

Another example is ``logging.basicConfig(filename="log.txt")``.
Some users expect UTF-8 to be used by default, but the locale encoding
is actually used. [2]_

Even Python experts assume that the default encoding is UTF-8.
This creates bugs that happen only on Windows. See [3]_, [4]_, [5]_,
and [6]_ for examples.

Emitting a warning when the ``encoding`` option is omitted will help
to find such mistakes.


Explicit way to use locale-specific encoding
--------------------------------------------

``open(filename)`` isn't explicit about which encoding is expected:

* Expects ASCII (not a bug, but inefficient on Windows)
* Expects UTF-8 (bug or platform specific script)
* Expects the locale encoding.

From this point of view, ``open(filename)`` is not readable.

``encoding=locale.getpreferredencoding(False)`` can be used to
specify the locale encoding explicitly. But it is too long and easy
to misuse (e.g. by forgetting to pass ``False`` as its argument).

This PEP provides an explicit way to specify the locale encoding.
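
As a minimal sketch of the long spelling described above, reading a
file with the locale encoding made explicit might look like the
following. The helper name ``read_locale_text`` is hypothetical and is
not part of this PEP.

```python
import locale
import os
import tempfile


def read_locale_text(path):
    """Read a text file using the locale encoding, spelled out explicitly.

    Hypothetical helper for illustration only.
    """
    # Pass False so getpreferredencoding() does not call setlocale();
    # this is the argument that is easy to forget.
    encoding = locale.getpreferredencoding(False)
    with open(path, encoding=encoding) as f:
        return f.read()


# Usage: round-trip an ASCII-only file, which typical locale
# encodings can decode.
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="ascii") as f:
    f.write("hello\n")
try:
    content = read_locale_text(tmp_path)
finally:
    os.remove(tmp_path)
```

With the option proposed here, the same intent could be written as
``open(path, encoding="locale")``.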


Prepare to change the default encoding to UTF-8
-----------------------------------------------

We chose the locale encoding as the default text encoding in
Python 3.0. But UTF-8 has been adopted very widely since then.
Since UTF-8 has become the de facto standard text encoding, we might
change the default text encoding to UTF-8 in the future.

But this change will affect many applications and libraries. If we
start emitting ``DeprecationWarning`` by default everywhere the
``encoding`` option is omitted, it will be too noisy and painful.

Although this PEP doesn't propose to change the default encoding,
this PEP will help with such a change:

* It will reduce the number of omitted ``encoding`` options in many
  libraries before the warning is emitted by default.

* Users will be able to use the ``encoding="locale"`` option to suppress
  the warning without dropping Python 3.10 support.


Specification
@@ -75,7 +101,7 @@ Specification
``EncodingWarning``
--------------------

Add a new ``EncodingWarning`` warning class which is a subclass of
``Warning``. It is used to warn when the ``encoding`` option is
omitted and the default encoding is locale-specific.

@@ -94,6 +120,9 @@ When the option is enabled, ``io.TextIOWrapper()``, ``open()``, and
other modules using them will emit ``EncodingWarning`` when
``encoding`` is omitted.

Since ``EncodingWarning`` is a subclass of ``Warning``, the warnings
are shown by default, unlike ``DeprecationWarning``.


``encoding="locale"`` option
----------------------------
@@ -102,21 +131,6 @@ other modules using them will emit ``EncodingWarning`` when
same as the current ``encoding=None``. But ``io.TextIOWrapper`` doesn't
emit ``EncodingWarning`` when ``encoding="locale"`` is specified.

Add an ``io.LOCALE_ENCODING = "locale"`` constant too. This constant can
be used to avoid a confusing ``LookupError: unknown encoding: locale``
error when the code is accidentally run on an old Python.

The constant can also be used to test whether the ``encoding="locale"``
option is supported. For example,

.. code-block::

    import io

    # Suppress EncodingWarning while still supporting old Python
    # versions: fall back to None where the constant is missing.
    locale_encoding = getattr(io, "LOCALE_ENCODING", None)
    with open(filename, encoding=locale_encoding) as f:
        ...


``io.text_encoding()``
-----------------------
@@ -145,7 +159,7 @@ Pure Python implementation will be like this::
                import warnings
                warnings.warn("'encoding' option is omitted",
                              EncodingWarning, stacklevel + 2)
            encoding = "locale"
        return encoding

For example, ``pathlib.Path.read_text()`` can use the function like:
@@ -158,20 +172,20 @@ For example, ``pathlib.Path.read_text()`` can use the function like:
return f.read()

By using ``io.text_encoding()``, ``EncodingWarning`` is emitted for
the caller of ``read_text()`` instead of ``read_text()`` itself.
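
The stack-level behavior can be tried out with a self-contained
sketch. It assumes a stand-in warning class (``EncodingWarningSketch``),
a stand-in opt-in flag (``WARN_ENCODING``), and a stand-in wrapper
(``choose_encoding``, playing the role of ``read_text()``). All three
names are hypothetical and exist only for this illustration.

```python
import warnings


class EncodingWarningSketch(Warning):
    """Hypothetical stand-in for the proposed EncodingWarning class."""


# Hypothetical stand-in for the opt-in flag set by the command-line
# option or environment variable.
WARN_ENCODING = True


def text_encoding(encoding, stacklevel=1):
    """Return "locale" when encoding is None, warning if opted in."""
    if encoding is None:
        if WARN_ENCODING:
            # stacklevel + 2 skips this helper and the function that
            # called it, so the warning points at that function's caller.
            warnings.warn("'encoding' option is omitted",
                          EncodingWarningSketch, stacklevel + 2)
        encoding = "locale"
    return encoding


def choose_encoding(encoding=None):
    """Hypothetical wrapper playing the role of read_text()."""
    return text_encoding(encoding)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = choose_encoding()          # encoding omitted: warns
    explicit = choose_encoding("utf-8")   # encoding given: no warning
```

Only the call that omits ``encoding`` records a warning, and it is
attributed to the line that called the wrapper rather than to the
wrapper itself.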


Affected stdlibs
-----------------

Many stdlibs will be affected by this change.

Most APIs accepting ``encoding=None`` will use ``io.text_encoding()``
as written in the previous section.

Where using the locale encoding as the default encoding is reasonable,
``encoding="locale"`` will be used instead. For example,
the ``subprocess`` module will use the locale encoding for the default
encoding of the pipes.

Many tests use ``open()`` without ``encoding`` specified to read
@@ -185,7 +199,7 @@ Opt-in warning
---------------

Although ``DeprecationWarning`` is suppressed by default, always
emitting ``DeprecationWarning`` when the ``encoding`` option is omitted
would be too noisy.

Noisy warnings may lead developers to dismiss the
@@ -203,12 +217,82 @@ when ``encoding=None``. This behavior can not be implemented in
the codec.


Backward Compatibility
======================

The new warning is not emitted by default. So this PEP is 100%
backward compatible.


Forward Compatibility
=====================

The ``encoding="locale"`` option is not forward compatible. Code
using the option will not work on Python older than 3.10. It will
raise ``LookupError: unknown encoding: locale``.

Until developers can drop Python 3.9 support, ``EncodingWarning``
can be used only for finding missing ``encoding="utf-8"`` options.
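
One possible way to handle this during the transition is sketched
below, under the assumption that checking ``sys.version_info`` is
acceptable in the code base. The helper name ``locale_encoding_arg`` is
hypothetical and is not proposed by this PEP.

```python
import os
import sys
import tempfile


def locale_encoding_arg():
    """Return an encoding argument that selects the locale encoding.

    Hypothetical helper: "locale" on Python 3.10+, None on older versions.
    """
    if sys.version_info >= (3, 10):
        return "locale"   # explicit form, accepted from Python 3.10
    return None           # older Pythons only understand the implicit form


# Usage: read back an ASCII-only file with the selected argument.
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="ascii") as f:
    f.write("data\n")
try:
    with open(tmp_path, encoding=locale_encoding_arg()) as f:
        text = f.read()
finally:
    os.remove(tmp_path)
```

On older Pythons this falls back to the implicit form, so the code
keeps working there at the cost of being less explicit.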


How to teach this
=================

For new users
-------------

Since ``EncodingWarning`` is a tool for writing cross-platform code,
there is no need to teach it to new users.

We can just recommend using UTF-8 for text files and using
``encoding="utf-8"`` when opening text files.


For experienced users
---------------------

Using ``open(filename)`` to read text files encoded in UTF-8 is a
common mistake. It may not work on Windows because UTF-8 is not the
default encoding.

You can use ``-X warn_encoding`` or ``PYTHONWARNENCODING=1`` to find
this type of mistake.

Omitting the ``encoding`` option is not a bug when opening text files
encoded in the locale encoding. But ``encoding="locale"`` is recommended
on Python 3.10 and later because it is more explicit.
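
As a small illustration of the fix for the common mistake described
above, the following sketch writes and reads a UTF-8 file with the
encoding stated explicitly, so the result does not depend on the
platform's locale encoding. The file content is invented for the
example.

```python
import os
import tempfile

text = "caf\u00e9 \u2603\n"  # non-ASCII content, as found in many READMEs

fd, tmp_path = tempfile.mkstemp(suffix=".md")
os.close(fd)
try:
    # Explicit encoding on both sides: no reliance on the locale encoding.
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(tmp_path, encoding="utf-8") as f:
        round_tripped = f.read()
finally:
    os.remove(tmp_path)
```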


Reference Implementation
========================

https://github.com/python/cpython/pull/19481


Discussions
===========

* Why not implement this in linters?

  * ``encoding="locale"`` and ``io.text_encoding()`` must be in
    Python itself.

  * It is difficult to find all callers of functions wrapping
    ``open()`` or ``TextIOWrapper()``. (See the ``io.text_encoding()``
    section.)

* Many developers will not use the option.

  * Some developers will use the option and report the warnings to
    the libraries they use. So the option is worthwhile even though
    many developers won't use it.

  * For example, I found [7]_ and [8]_ by running
    ``pip install -U pip``, and found [9]_ by running ``tox``
    with the reference implementation. This demonstrates how the
    option finds potential issues.


References
==========

@@ -225,11 +309,28 @@ References
.. [4] ``json.tool`` had used locale encoding to read JSON files.
       (https://bugs.python.org/issue33684)

.. [5] site: Potential UnicodeDecodeError when handling pth file
       (https://bugs.python.org/issue33684)

.. [6] pypa/pip: "Installing packages fails if Python 3 installed
       into path with non-ASCII characters"
       (https://github.com/pypa/pip/issues/9054)

.. [7] "site: Potential UnicodeDecodeError when handling pth file"
       (https://bugs.python.org/issue43214)

.. [8] "[pypa/pip] Use ``encoding`` option or binary mode for open()"
       (https://github.com/pypa/pip/pull/9608)

.. [9] "Possible UnicodeError caused by missing encoding="utf-8""
       (https://github.com/tox-dev/tox/issues/1908)


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..