Description
Test program, built and run against the latest source:
string filename = args[0];

// 1) Detect directly from the file path.
var result = CharsetDetector.DetectFromFile(filename);
if (result.Detected != null)
{
    Console.WriteLine("DetectFromFile - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
}

// 2) Detect from the complete file content held in memory.
byte[] bytes = System.IO.File.ReadAllBytes(filename);
result = CharsetDetector.DetectFromBytes(bytes);
if (result.Detected != null)
{
    Console.WriteLine("DetectFromBytes - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
}

// 3) Detect from a stream; the using block disposes the stream afterwards.
using (System.IO.Stream fileStream = System.IO.File.OpenRead(filename))
{
    result = CharsetDetector.DetectFromStream(fileStream);
    if (result.Detected != null)
    {
        Console.WriteLine("DetectFromStream - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
    }
}
The file is an HTML file encoded as UTF-8 (without BOM) that contains a single emoji: 😀
(attached in the zip below)
utf8_with_emoji.zip
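For anyone who cannot open the attachment, a comparable file can probably be produced with a sketch like the one below. It is not the attached file: the class name `EmojiSample`, the output file name, and the HTML content are assumptions made up for illustration, so detection results may differ from those of the original attachment. It only relies on `System.Text.UTF8Encoding` and `System.IO.File`.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the library under test.
static class EmojiSample
{
    // Writes a small HTML document containing one emoji (U+1F600)
    // as UTF-8 without a byte order mark.
    public static void Write(string path)
    {
        // Assumed content; the attached file may look different.
        string html =
            "<!DOCTYPE html>\n" +
            "<html><head><title>Test</title></head>\n" +
            "<body><p>\U0001F600</p></body></html>\n";

        // UTF8Encoding(false) has an empty preamble, so File.WriteAllText
        // writes no BOM at the start of the file.
        var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        File.WriteAllText(path, html, utf8NoBom);
    }

    public static void Main(string[] args)
    {
        // Assumed default file name, used when no argument is given.
        string path = args.Length > 0 ? args[0] : "utf8_with_emoji_sample.html";
        Write(path);

        // Print the raw bytes so the absence of a BOM and the four-byte
        // emoji sequence can be checked by eye.
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(path)));
    }
}
```

The emoji U+1F600 is encoded in UTF-8 as the four bytes `F0 9F 98 80`; apart from that sequence, the sample consists of ASCII bytes only. Passing the generated file to the test program above should exercise the same three code paths.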
Why does the DetectFromBytes method give a different result?