Remove chardet/charset-normalizer. Add fallback_charset_resolver ClientSession parameter. #7561

Merged
31 commits
ebf65e8
Add 'fallback_encoding' to ClientSession
john-parton Aug 26, 2023
3a85c65
Remove unmaintained admonition from glossary.
john-parton Aug 29, 2023
3335841
Update glossary entries for charset-normalizer and cchardet with note…
john-parton Aug 29, 2023
6a173d7
Document raising UnicodeDecodeError and link get_encoding.
john-parton Aug 29, 2023
a39683a
Add advanced usage for character set detection.
john-parton Aug 29, 2023
de6eacf
Update docs/client_advanced.rst
john-parton Aug 29, 2023
e9f919e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 29, 2023
f9df6d2
Find/replace -> .
john-parton Aug 29, 2023
1a3cb6c
Fix None error
Dreamsorcerer Aug 29, 2023
e61a02a
Doc fixes
Dreamsorcerer Aug 29, 2023
4f136c1
Nitpick
Dreamsorcerer Aug 29, 2023
98846f6
Update 7561.feature
Dreamsorcerer Aug 29, 2023
eed6f0f
Fix random type error
Dreamsorcerer Aug 29, 2023
522724f
Hack for tests
Dreamsorcerer Aug 29, 2023
911cdc5
Update client_reqrep.py
Dreamsorcerer Aug 29, 2023
7cdb499
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 29, 2023
849a928
Update client_reqrep.py
Dreamsorcerer Aug 29, 2023
40f4821
Update client_reference.rst
Dreamsorcerer Aug 29, 2023
b6161ab
Update client_reference.rst
Dreamsorcerer Aug 29, 2023
83802d3
Fix test
Dreamsorcerer Aug 29, 2023
527afdc
Update test_client_response.py
Dreamsorcerer Aug 29, 2023
b512d76
Update client_reference.rst
Dreamsorcerer Aug 29, 2023
ec5a67c
Update client_advanced.rst
Dreamsorcerer Aug 29, 2023
f3d0e4f
Update client.py
Dreamsorcerer Aug 29, 2023
ce8f59b
Update client_reqrep.py
Dreamsorcerer Aug 29, 2023
f80d16e
Update 7561.feature
Dreamsorcerer Aug 29, 2023
f1f58d1
Update client_advanced.rst
Dreamsorcerer Aug 29, 2023
b459bbf
Update client_reference.rst
Dreamsorcerer Aug 29, 2023
e1e8346
Update test_client_response.py
Dreamsorcerer Aug 29, 2023
fe16619
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 29, 2023
ef83b00
Update client_reference.rst
Dreamsorcerer Aug 29, 2023
3 changes: 0 additions & 3 deletions .mypy.ini
@@ -35,9 +35,6 @@ ignore_missing_imports = True
[mypy-brotli]
ignore_missing_imports = True

[mypy-cchardet]
ignore_missing_imports = True

[mypy-gunicorn.*]
ignore_missing_imports = True

2 changes: 2 additions & 0 deletions CHANGES/7561.feature
@@ -0,0 +1,2 @@
Replace automatic character set detection with a `fallback_charset_resolver` parameter
in `ClientSession` to allow user-supplied character set detection functions.
1 change: 1 addition & 0 deletions CONTRIBUTORS.txt
@@ -175,6 +175,7 @@ Jesus Cea
Jian Zeng
Jinkyu Yi
Joel Watts
John Parton
Jon Nabozny
Jonas Krüger Svensson
Jonas Obrist
6 changes: 1 addition & 5 deletions README.rst
@@ -162,21 +162,17 @@ Requirements
============

- async-timeout_
- charset-normalizer_
- multidict_
- yarl_
- frozenlist_

Optionally you may install the cChardet_ and aiodns_ libraries (highly
recommended for sake of speed).
Optionally you may install the aiodns_ library (highly recommended for sake of speed).

.. _charset-normalizer: https://pypi.org/project/charset-normalizer
.. _aiodns: https://pypi.python.org/pypi/aiodns
.. _multidict: https://pypi.python.org/pypi/multidict
.. _frozenlist: https://pypi.org/project/frozenlist/
.. _yarl: https://pypi.python.org/pypi/yarl
.. _async-timeout: https://pypi.python.org/pypi/async_timeout
.. _cChardet: https://pypi.python.org/pypi/cchardet

License
=======
5 changes: 5 additions & 0 deletions aiohttp/client.py
@@ -162,6 +162,7 @@ class ClientTimeout:
DEFAULT_TIMEOUT: Final[ClientTimeout] = ClientTimeout(total=5 * 60)

_RetType = TypeVar("_RetType")
_CharsetResolver = Callable[[ClientResponse, bytes], str]


@final
@@ -192,6 +193,7 @@ class ClientSession:
"_read_bufsize",
"_max_line_size",
"_max_field_size",
"_resolve_charset",
)

def __init__(
@@ -221,6 +223,7 @@ def __init__(
read_bufsize: int = 2**16,
max_line_size: int = 8190,
max_field_size: int = 8190,
fallback_charset_resolver: _CharsetResolver = lambda r, b: "utf-8",
) -> None:
if base_url is None or isinstance(base_url, URL):
self._base_url: Optional[URL] = base_url
@@ -291,6 +294,8 @@ def __init__(
for trace_config in self._trace_configs:
trace_config.freeze()

self._resolve_charset = fallback_charset_resolver

def __init_subclass__(cls: Type["ClientSession"]) -> None:
raise TypeError(
"Inheritance class {} from ClientSession "
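To illustrate the `_CharsetResolver` signature added above (`Callable[[ClientResponse, bytes], str]`), here is a minimal sketch of a resolver that inspects the body for a byte order mark. The name `bom_charset_resolver` is hypothetical and not part of this PR, and `resp` is typed as `object` only so the sketch runs without aiohttp installed; in real use it would be a `ClientResponse`.

```python
import codecs

# Longer BOMs are checked first: the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def bom_charset_resolver(resp: object, body: bytes) -> str:
    """Hypothetical resolver: pick a codec from a leading BOM, else UTF-8."""
    for bom, codec_name in _BOMS:
        if body.startswith(bom):
            # These codecs consume the BOM themselves while decoding.
            return codec_name
    # Same default as the PR's built-in resolver.
    return "utf-8"
```

Assuming the parameter behaves as documented in this PR, it would be passed as `ClientSession(fallback_charset_resolver=bom_charset_resolver)`.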
52 changes: 26 additions & 26 deletions aiohttp/client_reqrep.py
@@ -14,6 +14,7 @@
from typing import (
TYPE_CHECKING,
Any,
Callable,
Dict,
Iterable,
List,
@@ -73,11 +74,6 @@
ssl = None # type: ignore[assignment]
SSLContext = object # type: ignore[misc,assignment]

try:
import cchardet as chardet
except ImportError: # pragma: no cover
import charset_normalizer as chardet


__all__ = ("ClientRequest", "ClientResponse", "RequestInfo", "Fingerprint")

@@ -686,7 +682,7 @@ class ClientResponse(HeadersMixin):
_raw_headers: RawHeaders = None # type: ignore[assignment]

_connection = None # current connection
_source_traceback = None
_source_traceback: Optional[traceback.StackSummary] = None
# set up by ClientRequest after ClientResponse object creation
# post-init stage allows to not change ctor signature
_closed = True # to allow __del__ for non-initialized properly response
@@ -725,6 +721,15 @@ def __init__(
self._loop = loop
# store a reference to session #1985
self._session: Optional[ClientSession] = session
# Save reference to _resolve_charset, so that get_encoding() will still
# work after the response has finished reading the body.
if session is None:
# TODO: Fix session=None in tests (see ClientRequest.__init__).
self._resolve_charset: Callable[
["ClientResponse", bytes], str
] = lambda *_: "utf-8"
else:
self._resolve_charset = session._resolve_charset
if loop.get_debug():
self._source_traceback = traceback.extract_stack(sys._getframe(1))

@@ -1012,27 +1017,22 @@ def get_encoding(self) -> str:

encoding = mimetype.parameters.get("charset")
if encoding:
try:
codecs.lookup(encoding)
except LookupError:
encoding = None
if not encoding:
if mimetype.type == "application" and (
mimetype.subtype == "json" or mimetype.subtype == "rdap"
):
# RFC 7159 states that the default encoding is UTF-8.
# RFC 7483 defines application/rdap+json
encoding = "utf-8"
elif self._body is None:
raise RuntimeError(
"Cannot guess the encoding of " "a not yet read body"
)
else:
encoding = chardet.detect(self._body)["encoding"]
if not encoding:
encoding = "utf-8"
with contextlib.suppress(LookupError):
return codecs.lookup(encoding).name

if mimetype.type == "application" and (
mimetype.subtype == "json" or mimetype.subtype == "rdap"
):
# RFC 7159 states that the default encoding is UTF-8.
# RFC 7483 defines application/rdap+json
return "utf-8"

if self._body is None:
raise RuntimeError(
"Cannot compute fallback encoding of a not yet read body"
)

return encoding
return self._resolve_charset(self, self._body)

async def text(self, encoding: Optional[str] = None, errors: str = "strict") -> str:
"""Read response payload and decode."""
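The lookup order that the new `get_encoding()` body follows can be sketched as a stand-alone function. This is a simplified illustration, not the code in the PR: `pick_encoding` and its parameters are hypothetical, and the fallback here takes only the body, whereas the PR's resolver also receives the response object.

```python
import codecs
import contextlib
from typing import Callable, Optional


def pick_encoding(
    charset: Optional[str],
    mime_type: str,
    mime_subtype: str,
    body: Optional[bytes],
    fallback: Callable[[bytes], str] = lambda b: "utf-8",
) -> str:
    """Hypothetical mirror of the resolution order in get_encoding()."""
    # 1. A charset from Content-Type wins when Python knows the codec.
    if charset:
        with contextlib.suppress(LookupError):
            return codecs.lookup(charset).name

    # 2. JSON and RDAP bodies default to UTF-8.
    if mime_type == "application" and mime_subtype in ("json", "rdap"):
        return "utf-8"

    # 3. The fallback inspects the body, so the body must have been read.
    if body is None:
        raise RuntimeError(
            "Cannot compute fallback encoding of a not yet read body"
        )
    return fallback(body)
```

One consequence visible in the diff: a charset taken from the header is returned as the codec's canonical name (`codecs.lookup(...).name`), so `windows-1252` comes back as `cp1252` rather than as the header's literal spelling.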
5 changes: 0 additions & 5 deletions docs/_snippets/cchardet-unmaintained-admonition.rst

This file was deleted.

30 changes: 30 additions & 0 deletions docs/client_advanced.rst
@@ -740,3 +740,33 @@ HTTP Pipelining
---------------

aiohttp does not support HTTP/HTTPS pipelining.


Character Set Detection
-----------------------

If you encounter a :exc:`UnicodeDecodeError` when using :meth:`ClientResponse.text()`

Member

Does this actually work? The correct syntax AFAIK doesn't include ().

Member

Ah, sorry, I actually copied from the section above, which makes the same mistake with .close(). I wonder why Sphinx doesn't produce a warning...

Member

Dunno. Maybe because of this ignore https://github.com/aio-libs/aiohttp/blob/6755796/docs/conf.py#L384 or some Sphinx bug, or a corner case. Somebody needs to go through all those ignores and fix them.

Member

@Dreamsorcerer so it looks like we don't run the normal docs build @ GHA, and the spellcheck that we do run doesn't emit those warnings: https://github.com/aio-libs/aiohttp/actions/runs/6018473896/job/16326723444#step:13:18.

OTOH, RTD does have those warnings in the log but isn't set up to turn them into errors: https://readthedocs.org/projects/aiohttp/builds/21761448/.

I think the CI was set up to fail on warnings at some point. Maybe that got removed, or I'm just confusing the repos…

Member

Well, other warnings seem to fail CI in the docs spelling step (I'm not aware of another part that produces warnings):
https://github.com/aio-libs/aiohttp/actions/runs/6017616744/job/16324167311#step:13:86

this may be because the response does not include the charset needed
to decode the body.

If you know the correct encoding for a request, you can simply specify
the encoding as a parameter (e.g. ``resp.text("windows-1252")``).

Alternatively, :class:`ClientSession` accepts a ``fallback_charset_resolver`` parameter which
can be used to introduce charset guessing functionality. When a charset is not found
in the Content-Type header, this function will be called to get the charset encoding. For
example, this can be used with the ``chardetng_py`` library.::

   from chardetng_py import detect

   def charset_resolver(resp: ClientResponse, body: bytes) -> str:
       tld = resp.url.host.rsplit(".", maxsplit=1)[-1]
       return detect(body, allow_utf8=True, tld=tld)

   ClientSession(fallback_charset_resolver=charset_resolver)

Or, if ``chardetng_py`` doesn't work for you, then ``charset-normalizer`` is another option::

   from charset_normalizer import detect

   ClientSession(fallback_charset_resolver=lambda r, b: detect(b)["encoding"] or "utf-8")
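When no detection library is wanted, a resolver can also be built from the standard library alone by trial-decoding against a short list of candidate encodings. The sketch below is an assumption-laden illustration, not something proposed in this PR: `make_trial_resolver` and its default candidate list are hypothetical, and `resp` is typed as `object` only so the example runs without aiohttp.

```python
from typing import Callable, Sequence


def make_trial_resolver(
    candidates: Sequence[str] = ("utf-8", "windows-1252"),
) -> Callable[[object, bytes], str]:
    """Hypothetical factory: return a resolver that tries each candidate."""

    def resolver(resp: object, body: bytes) -> str:
        for name in candidates:
            try:
                # Strict decode; the result is discarded, only success matters.
                body.decode(name)
            except UnicodeDecodeError:
                continue
            return name
        # latin-1 maps every byte to a code point, so it never fails.
        return "latin-1"

    return resolver
```

The trade-off is that the body is decoded once here and again in `text()`, which may matter for large responses. It would be wired up as `ClientSession(fallback_charset_resolver=make_trial_resolver())`, assuming the parameter behaves as documented above.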
59 changes: 22 additions & 37 deletions docs/client_reference.rst
@@ -51,7 +51,8 @@ The client session supports the context manager protocol for self closing.
read_bufsize=2**16, \
requote_redirect_url=True, \
trust_env=False, \
trace_configs=None)
trace_configs=None, \
fallback_charset_resolver=lambda r, b: "utf-8")

The class for creating client sessions and making requests.

@@ -208,6 +209,16 @@ The client session supports the context manager protocol for self closing.
disabling. See :ref:`aiohttp-client-tracing-reference` for
more information.

:param Callable[[ClientResponse,bytes],str] fallback_charset_resolver:
A :term:`callable` that accepts a :class:`ClientResponse` and the
:class:`bytes` contents, and returns a :class:`str` which will be used as
the encoding parameter to :meth:`bytes.decode()`.

This function will be called when the charset is not known (e.g. not specified in the
Content-Type header). The default function simply defaults to ``utf-8``.

.. versionadded:: 3.8.6

.. attribute:: closed

``True`` if the session has been closed, ``False`` otherwise.
@@ -1406,12 +1417,8 @@ Response object
Read response's body and return decoded :class:`str` using
specified *encoding* parameter.

If *encoding* is ``None`` content encoding is autocalculated
using ``Content-Type`` HTTP header and *charset-normalizer* tool if the
header is not provided by server.

:term:`cchardet` is used with fallback to :term:`charset-normalizer` if
*cchardet* is not available.
If *encoding* is ``None`` content encoding is determined from the
Content-Type header, or using the ``fallback_charset_resolver`` function.

Close underlying connection if data reading gets an error,
release connection otherwise.
@@ -1420,35 +1427,21 @@ Response object
``None`` for encoding autodetection
(default).

:return str: decoded *BODY*

:raise LookupError: if the encoding detected by cchardet is
unknown by Python (e.g. VISCII).

.. note::
:raises: :exc:`UnicodeDecodeError` if decoding fails. See also
:meth:`get_encoding`.

If response has no ``charset`` info in ``Content-Type`` HTTP
header :term:`cchardet` / :term:`charset-normalizer` is used for
content encoding autodetection.

It may hurt performance. If page encoding is known passing
explicit *encoding* parameter might help::

await resp.text('ISO-8859-1')
:return str: decoded *BODY*

.. method:: json(*, encoding=None, loads=json.loads, \
content_type='application/json')
:async:

Read response's body as *JSON*, return :class:`dict` using
specified *encoding* and *loader*. If data is not still available
a ``read`` call will be done,
a ``read`` call will be done.

If *encoding* is ``None`` content encoding is autocalculated
using :term:`cchardet` or :term:`charset-normalizer` as fallback if
*cchardet* is not available.

if response's `content-type` does not match `content_type` parameter
If response's `content-type` does not match `content_type` parameter
:exc:`aiohttp.ContentTypeError` get raised.
To disable content type check pass ``None`` value.

@@ -1480,17 +1473,9 @@ Response object

.. method:: get_encoding()

Automatically detect content encoding using ``charset`` info in
``Content-Type`` HTTP header. If this info is not exists or there
are no appropriate codecs for encoding then :term:`cchardet` /
:term:`charset-normalizer` is used.

Beware that it is not always safe to use the result of this function to
decode a response. Some encodings detected by cchardet are not known by
Python (e.g. VISCII). *charset-normalizer* is not concerned by that issue.

:raise RuntimeError: if called before the body has been read,
for :term:`cchardet` usage
Retrieve content encoding using ``charset`` info in ``Content-Type`` HTTP header.
If no charset is present or the charset is not understood by Python, the
``fallback_charset_resolver`` function associated with the ``ClientSession`` is called.

.. versionadded:: 3.0

16 changes: 0 additions & 16 deletions docs/glossary.rst
@@ -45,22 +45,6 @@
Any object that can be called. Use :func:`callable` to check
that.

charset-normalizer

The Real First Universal Charset Detector.
Open, modern and actively maintained alternative to Chardet.

https://pypi.org/project/charset-normalizer/

cchardet

cChardet is high speed universal character encoding detector -
binding to charsetdetect.

https://pypi.python.org/pypi/cchardet/

.. include:: _snippets/cchardet-unmaintained-admonition.rst

gunicorn

Gunicorn 'Green Unicorn' is a Python WSGI HTTP Server for
24 changes: 3 additions & 21 deletions docs/index.rst
@@ -33,15 +33,6 @@ Library Installation

$ pip install aiohttp

You may want to install *optional* :term:`cchardet` library as faster
replacement for :term:`charset-normalizer`:

.. code-block:: bash

$ pip install cchardet

.. include:: _snippets/cchardet-unmaintained-admonition.rst

For speeding up DNS resolving by client API you may install
:term:`aiodns` as well.
This option is highly recommended:
@@ -53,9 +44,9 @@ This option is highly recommended:
Installing all speedups in one command
--------------------------------------

The following will get you ``aiohttp`` along with :term:`cchardet`,
:term:`aiodns` and ``Brotli`` in one bundle. No need to type
separate commands anymore!
The following will get you ``aiohttp`` along with :term:`aiodns` and ``Brotli`` in one
bundle.
No need to type separate commands anymore!

.. code-block:: bash

@@ -157,17 +148,8 @@ Dependencies
============

- *async_timeout*
- *charset-normalizer*
- *multidict*
- *yarl*
- *Optional* :term:`cchardet` as faster replacement for
:term:`charset-normalizer`.

Install it explicitly via:

.. code-block:: bash

$ pip install cchardet

- *Optional* :term:`aiodns` for fast DNS resolving. The
library is highly recommended.