Detected encoding is wrong with DetectFromBytes, ok with other methods for UTF-8 file containing emoji #38

Closed
@fretman92

Test program, built and run against the latest source:

            string filename = args[0];

            var result = CharsetDetector.DetectFromFile(filename);

            if (result.Detected != null)
            {
                Console.WriteLine("DetectFromFile - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
            }

            byte[] bytes = System.IO.File.ReadAllBytes(filename);
            result = CharsetDetector.DetectFromBytes(bytes);

            if (result.Detected != null)
            {
                Console.WriteLine("DetectFromBytes - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
            }

            // Dispose the stream once detection is done.
            using (System.IO.Stream fileStream = System.IO.File.OpenRead(filename))
            {
                result = CharsetDetector.DetectFromStream(fileStream);
            }

            if (result.Detected != null)
            {
                Console.WriteLine("DetectFromStream - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
            }

Result:
[image: screenshot of the program output]

The file is a UTF-8 encoded HTML file (without BOM) containing a single emoji: 😀
(attached in the zip below)
utf8_with_emoji.zip

Why does the DetectFromBytes method give a different result?
