Fast character set encoding detection using International Components for Unicode C++ library.
require 'open-uri' require 'uchardet' encoding = ICU::UCharsetDetector.detect open('http://google.jp').read encoding # => {:language=>"ja", :encoding=>"Shift_JIS", :confidence=>100}
From command line:
$ uchardet Usage: uchardet [options] file -l, --list Display list of detectable character sets. -s, --strip Strip HTML or XML markup before detection. -e, --encoding Hint the charset detector about possible encoding. -a, --all Show all matching encodings. -h, --help Show this help message. $ uchardet `which uchardet` ISO-8859-1 (confidence 60%)
ICU (International Components for Unicode):
on Mac OS X:
sudo port install icu
on Debian/Ubuntu
sudo apt-get install libicu-dev
sudo gem install uchardet
Copyright © 2009 Dmitri Goutnik, released under the MIT license.