[BUG] identifies UTF16LE for a pair of ascii punctuation characters #509

Closed
@GavinHuttley

Description

Describe the bug
Passing a short byte string of conventional ASCII punctuation to charset_normalizer.detect returns UTF-16LE as the encoding.

To Reproduce

import chardet
import charset_normalizer

# charset_normalizer misidentifies two ASCII bytes (also happens with b"(;")
charset_normalizer.detect(b");")
# returns {'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}

# chardet reports ASCII for the same input
chardet.detect(b");")
# returns {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

Expected behavior

These are standard ASCII characters, so I expect a UTF-8 (or ASCII) encoding to be reported.

Desktop:

  • macOS 14.5
  • Python version 3.12.1 (anaconda build)
  • charset_normalizer version 3.3.2

Additional context
Evaluating any of b"(", b")", b";" or b"()" produces the expected result. Other combinations of punctuation characters produce the same error, e.g. b".;".
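
To see what each two-byte input looks like when read as UTF-16LE, the sketch below decodes the samples from this report and prints the resulting code points. The helper name as_utf16le is made up for illustration. Whether the detector's choice actually depends on which Unicode block the decoded character falls in is an assumption, not something confirmed here.

```python
import unicodedata


def as_utf16le(data: bytes) -> list[tuple[str, str]]:
    """Decode data as UTF-16LE and return (code point, Unicode name) pairs.

    Hypothetical helper for inspection only; not part of charset_normalizer.
    """
    text = data.decode("utf-16-le")
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The three failing samples and the one passing two-byte sample from this report.
for sample in (b");", b"(;", b".;", b"()"):
    print(sample, as_utf16le(sample))
```

Each two-byte input decodes to a single UTF-16 code unit. The three failing samples land in the CJK Unified Ideographs Extension A range (U+3400 to U+4DBF), while b"()" decodes to U+2928, which is outside that range. That may be why the detector treats the two cases differently.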

I understand this is a very small string, but perhaps detection could default to the minimal character set in such cases?
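
As a caller-side workaround, the "default to the minimal character set" idea could be sketched as a wrapper that checks for pure ASCII before consulting any detector. The function name detect_prefer_ascii and the detector parameter are hypothetical, and the returned dictionary only mirrors the result shape shown above. This is not how charset_normalizer works internally.

```python
from typing import Callable


def detect_prefer_ascii(data: bytes, detector: Callable[[bytes], dict]) -> dict:
    """Report ASCII when every byte is plain ASCII; otherwise defer to detector.

    Hypothetical wrapper: 'detector' is any callable with the same shape as
    charset_normalizer.detect or chardet.detect.
    """
    # NUL bytes are technically ASCII but are typical of UTF-16/UTF-32 text,
    # so inputs containing them are still handed to the real detector.
    if data.isascii() and b"\x00" not in data:
        return {"encoding": "ascii", "language": "", "confidence": 1.0}
    return detector(data)


# Intended usage (assumes charset_normalizer is installed):
#   import charset_normalizer
#   detect_prefer_ascii(b");", charset_normalizer.detect)
```

The trade-off is that genuine UTF-16 text whose bytes all happen to be non-NUL ASCII values, such as a lone CJK ideograph, would also be reported as ASCII. For very short inputs that seems the more conservative guess.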

Labels: detection (Related to the charset detection mechanism, chaos/mess/coherence)