Test files encoded with UTF-8, UTF-16LE and UTF-32LE.
By convention, all UTF-8 files end with .utf8.txt, all UTF-16LE files end with .utf16.txt, and all UTF-32LE files end with .utf32.txt.
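The naming convention above makes it possible to pick a decoder from the file name alone. The following Python sketch illustrates the idea; the names SUFFIX_TO_CODEC, codec_for and read_test_file are hypothetical helpers introduced here, not part of this repository, and the sketch assumes the files should be decoded as-is (a leading byte-order mark, if any, is not stripped).

```python
from pathlib import Path

# Hypothetical mapping from the file-name suffixes described above
# to the corresponding Python codec names.
SUFFIX_TO_CODEC = {
    ".utf8.txt": "utf-8",
    ".utf16.txt": "utf-16-le",
    ".utf32.txt": "utf-32-le",
}


def codec_for(filename: str) -> str:
    """Return the Python codec implied by the file-name suffix."""
    for suffix, codec in SUFFIX_TO_CODEC.items():
        if filename.endswith(suffix):
            return codec
    raise ValueError(f"no known encoding suffix in {filename!r}")


def read_test_file(path: str) -> str:
    """Read a test file as raw bytes and decode it according to its suffix."""
    p = Path(path)
    return p.read_bytes().decode(codec_for(p.name))
```

For example, codec_for("french.utf16.txt") returns "utf-16-le", while a name with none of the three suffixes raises ValueError.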
A small number of files in the wikipedia_mars directory are encoded using Latin 1 (ISO-8859-1): esperanto.latin1.txt, french.latin1.txt, german.latin1.txt and portuguese.latin1.txt.
They are not exactly equivalent to the Unicode files: e.g., it is not possible to reproduce the equivalent Unicode files from the Latin 1 files. However, we provide modified Unicode files recovered from the Latin 1 files, with the suffixes .utflatin8.txt (UTF-8), .utflatin16.txt (UTF-16LE) and .utflatin32.txt (UTF-32LE).
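The relationship between the Latin 1 files and the recovered Unicode files can be sketched in Python. This is only an illustration of the idea, assuming the recovered files are a plain re-encoding of the decoded Latin 1 text; it is not necessarily how the files in this directory were produced, and the function names are hypothetical.

```python
def recover_from_latin1(latin1_bytes: bytes) -> dict:
    """Re-encode Latin 1 bytes into the three Unicode encodings.

    Every Latin 1 byte maps to the Unicode code point with the same
    value, so this direction never fails.
    """
    text = latin1_bytes.decode("latin-1")
    return {
        ".utflatin8.txt": text.encode("utf-8"),
        ".utflatin16.txt": text.encode("utf-16-le"),
        ".utflatin32.txt": text.encode("utf-32-le"),
    }


def is_latin1_representable(text: str) -> bool:
    """Return True if every character of text fits in Latin 1 (U+0000 to U+00FF)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False
```

The second helper shows why the conversion is lossy in the other direction: a character such as the Esperanto letter ĉ (U+0109) has no Latin 1 encoding, so a Latin 1 file cannot carry it and the original Unicode text cannot be reconstructed from it.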
The wikipedia_mars files are derived from the Wikipedia article on Mars in different languages.
Wikipedia is licensed under a Creative Commons license.
The html2text Python program is used to convert them to text by stripping the HTML markup.
The lipsum files come from the simdutf8 package (https://github.com/rusticstuff/simdutf8) by Hans Kratz (licensed under both MIT and Apache).
These files are provided for research purposes.