Error leads to early exit/failed detection in MBCSGroupProbers #162

Open
@Orden4

Input file for reference:
Rocio.txt

UTF-unknown was unable to detect the correct encoding (Shift JIS), while uchardet identified it correctly.
This took a while to figure out, but eventually I discovered that the cause was a few lines like this one (line 98): `victory3 ="Hay que salvar al mundo, ソte uniras a nosotras?"`.
Mugen character files are the stuff of nightmares. I assume this character was made by someone Spanish/Brazilian and then edited by someone Japanese.

After some investigation I found that these probers are practically identical to the ones in uchardet, but there is one discrepancy that causes the results to deviate: UTF-unknown exits early when its state machine reports an error, while uchardet simply continues. As far as I could tell, uchardet never exits as a result of a state machine error, in any prober at all. And indeed, after removing the early exits, I got a correct detection as well.
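
To illustrate the difference, here is a minimal Python sketch of the two behaviours. It is a toy model, not the actual UTF-unknown or uchardet code: the `next_state` and `probe` functions, the `exit_on_error` flag, the simplified Shift JIS byte ranges, and the stray `0xE1` byte followed by a space are all assumptions made up for this example.

```python
# Toy illustration only: NOT the real UTF-unknown / uchardet implementation.
START, TRAIL, ERROR = 0, 1, 2


def next_state(state, byte):
    """Simplified Shift JIS-like state machine (byte ranges are approximate)."""
    if state == START:
        if byte <= 0x7F or 0xA1 <= byte <= 0xDF:
            return START   # single-byte character (ASCII or half-width kana)
        if 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC:
            return TRAIL   # lead byte, a trail byte must follow
        return ERROR
    # state == TRAIL: expecting the second byte of a double-byte character
    if 0x40 <= byte <= 0xFC and byte != 0x7F:
        return START
    return ERROR


def probe(data, exit_on_error):
    """Count valid double-byte characters; None means 'not this encoding'."""
    state = START
    double_byte_chars = 0
    for byte in data:
        previous = state
        state = next_state(state, byte)
        if state == ERROR:
            if exit_on_error:
                return None    # early exit: the prober gives up entirely
            state = START      # continue: skip the bad byte and resynchronise
        elif previous == TRAIL:
            double_byte_chars += 1
    return double_byte_chars


# Hypothetical input: valid Shift JIS text with one stray Latin-1 byte
# (0xE1) followed by a space, which is not a valid trail byte.
japanese = "日本語".encode("shift_jis")
sample = japanese + b"\xe1 " + japanese

early_exit_result = probe(sample, exit_on_error=True)
continue_result = probe(sample, exit_on_error=False)
print(early_exit_result, continue_result)
```

With early exit, a single invalid sequence rules out the encoding no matter how much valid text surrounds it. With the continue behaviour, the prober still sees all six double-byte characters and can score the encoding on them.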
