file encoding detection bug

- Gitea version (or commit ref): 1.13.1
- Can you reproduce the bug at https://try.gitea.io:
  - [X] Yes (provide example URL)

## Description

I noticed an encoding detection bug:

https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml 
This file is UTF-8, but it is served as `Content-Type: text/plain; charset=iso-8859-1`, which results in incorrect rendering of a special character: `<name>FranÃ§oise Tomasetti</name>` should be `<name>Françoise Tomasetti</name>`.
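To illustrate where the `Ã§` comes from, here is a minimal Go sketch (not Gitea code; the helper name `latin1ToString` is made up for this example). It decodes the UTF-8 bytes of the name as if they were ISO-8859-1, where every byte maps to the code point of the same value:

```go
package main

import "fmt"

// latin1ToString is a hypothetical helper: it interprets each byte as an
// ISO-8859-1 character, i.e. as the Unicode code point with the same value.
func latin1ToString(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// Go string literals are UTF-8, so ç is stored as the bytes 0xC3 0xA7.
	utf8Bytes := []byte("Françoise")

	// Read as ISO-8859-1, 0xC3 becomes Ã and 0xA7 becomes §.
	fmt.Println(latin1ToString(utf8Bytes))
}
```

The two bytes of `ç` each turn into a separate character, which matches the broken rendering seen in the browser.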

The encoding detection is done on a buffer consisting of the first 1024 bytes of the file.
The UTF-8 ç character consists of 2 bytes: https://www.fileformat.info/info/unicode/char/00e7/index.htm.
By coincidence, in this file the first byte of that character happens to be the 1024th byte in the file, so the buffer ends in the middle of a multi-byte sequence and the encoding detection does not recognize it as valid UTF-8.

As a test, I removed a section from the start of the file, and then the encoding is detected correctly:
https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/ok.xml

file encoding detection bug #14434