lemire/unicode_lipsum

unicode_lipsum

Test files encoded with UTF-8, UTF-16LE and UTF-32LE.

By convention, all UTF-8 files end with .utf8.txt while all UTF-16LE files end with .utf16.txt and all UTF-32LE end with .utf32.txt.
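The naming convention above can be turned into a small loader. The following is only a sketch: the names `SUFFIX_TO_CODEC`, `codec_for` and `read_text` are hypothetical helpers, not part of this repository, and the sketch assumes the files can be decoded directly with Python's standard codecs. Whether the files carry a byte order mark is not stated here, so no attempt is made to strip one.

```python
from pathlib import Path

# Hypothetical suffix-to-codec table, following the naming convention above.
SUFFIX_TO_CODEC = {
    ".utf8.txt": "utf-8",
    ".utf16.txt": "utf-16-le",
    ".utf32.txt": "utf-32-le",
}


def codec_for(filename):
    """Return the Python codec name implied by the file suffix."""
    name = Path(filename).name
    for suffix, codec in SUFFIX_TO_CODEC.items():
        if name.endswith(suffix):
            return codec
    raise ValueError(f"no known encoding suffix in {name!r}")


def read_text(path):
    """Read a test file and decode it according to its suffix."""
    return Path(path).read_bytes().decode(codec_for(path))
```

For example, `codec_for("french.utf16.txt")` yields `"utf-16-le"`, while a file name with none of the three suffixes raises `ValueError`.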

A small number of files in the wikipedia_mars directory are encoded using Latin 1 (ISO-8859-1): esperanto.latin1.txt, french.latin1.txt, german.latin1.txt, portuguese.latin1.txt. They are not exactly equivalent to the Unicode files: e.g., it is not possible to reproduce the original Unicode files from the Latin 1 files. However, we provide modified Unicode files recovered from the Latin 1 files, with the suffixes .utflatin8.txt (UTF-8 recovered from Latin 1), .utflatin16.txt (UTF-16LE recovered from Latin 1), .utflatin32.txt (UTF-32LE recovered from Latin 1).
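To illustrate why the Latin 1 files cannot reproduce the original Unicode files, here is a minimal sketch. It assumes that characters outside Latin 1 were replaced during conversion; the actual procedure used to produce the Latin 1 files is not described here, and `recover_from_latin1` is a hypothetical helper, not a tool from this repository.

```python
def recover_from_latin1(latin1_bytes):
    """Re-encode Latin 1 bytes as UTF-8, UTF-16LE and UTF-32LE.

    The dictionary keys mirror the .utflatin8/.utflatin16/.utflatin32
    suffixes; the function itself is only an illustration.
    """
    text = latin1_bytes.decode("latin-1")  # every byte maps to one code point
    return {
        "utflatin8": text.encode("utf-8"),
        "utflatin16": text.encode("utf-16-le"),
        "utflatin32": text.encode("utf-32-le"),
    }


# U+2013 (en dash) has no Latin 1 representation, U+00E9 (e acute) does.
original = "Mars \u2013 caf\u00e9"

# Assumption: unrepresentable characters are replaced (here with "?").
latin1 = original.encode("latin-1", errors="replace")

recovered = latin1.decode("latin-1")
print(recovered == original)  # False: the en dash was lost

files = recover_from_latin1(latin1)
print(files["utflatin8"].decode("utf-8") == recovered)  # True
```

The recovered Unicode variants agree with the Latin 1 content, but not necessarily with the original Unicode text.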

The wikipedia_mars files are derived from the Mars Wikipedia article in different languages. Wikipedia is licensed under a Creative Commons license. The html2text Python program is used to convert the articles to text by stripping the HTML markup.

The lipsum files come from the package https://github.com/rusticstuff/simdutf8 by Hans Kratz (licensed under both MIT and Apache).

These files are provided for research purposes.
