file encoding detection bug #14434

Closed
@jpraet

  • Gitea version (or commit ref): 1.13.1
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)

Description

I noticed an encoding detection bug:

https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml
This is UTF-8, but it is detected as Content-Type: text/plain; charset=iso-8859-1, which results in incorrect rendering of a special character: <name>FranÃ§oise Tomasetti</name> should be <name>Françoise Tomasetti</name>.
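
For illustration, the garbled rendering can be reproduced by decoding the UTF-8 bytes of the name as if they were ISO-8859-1. This is a minimal sketch; `latin1ToUTF8` is a hypothetical helper written for this example and is not part of Gitea.

```go
package main

import "fmt"

// latin1ToUTF8 interprets every byte as one ISO-8859-1 code point.
// ISO-8859-1 maps byte values 0x00-0xFF directly to U+0000-U+00FF,
// so a plain byte-to-rune conversion is enough.
// Hypothetical helper for illustration only.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// In UTF-8, "ç" is the two bytes 0xC3 0xA7.
	raw := []byte("Françoise")

	// Read as ISO-8859-1, those two bytes become "Ã" (U+00C3) and "§" (U+00A7).
	fmt.Println(latin1ToUTF8(raw)) // FranÃ§oise
}
```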

The encoding detection is done on a buffer consisting of the first 1024 bytes of the file.
The UTF-8 ç character consists of 2 bytes: https://www.fileformat.info/info/unicode/char/00e7/index.htm.
By coincidence, in this file the first byte of that character happens to be the 1024th byte in the file, causing the encoding detection to not recognize this byte buffer as valid UTF-8.

As a test I removed a section from the start of the file, and then it works fine:
https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/ok.xml
