Description
- Gitea version (or commit ref): 1.13.1
- Can you reproduce the bug at https://try.gitea.io:
- Yes (provide example URL)
Description
I noticed an encoding detection bug:
https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml
The file is UTF-8, but it is served with Content-Type: text/plain; charset=iso-8859-1, which results in incorrect rendering of a special character: <name>FranÃ§oise Tomasetti</name> should be <name>Françoise Tomasetti</name>.
The encoding detection is done on a buffer consisting of the first 1024 bytes of the file.
In UTF-8 the ç character is encoded as 2 bytes (0xC3 0xA7): https://www.fileformat.info/info/unicode/char/00e7/index.htm
By coincidence, in this file the first byte of that character happens to be the 1024th byte in the file, so the buffer ends in the middle of the character and the encoding detection does not recognize it as valid UTF-8.
As a test I removed a section from the start of the file, and then it works fine:
https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/ok.xml