Tags · fstudio/uchardet

v0.0.6

Version 0.0.6 released.

- Improve ASCII and ISO-8859-1 detection.
- Improve language models: Greek, Hungarian.
- Newly supported languages and encodings:
  * Arabic - ISO-8859-6 and Windows-1256.
  * Danish - Windows-1252, ISO-8859-1 and ISO-8859-15.
  * Spanish - ISO-8859-1, ISO-8859-15 and Windows-1252.
  * Vietnamese - VISCII and Windows-1258.
- Improve single-byte encoding detection algorithm by giving more weight
  to "probable" sequences (less frequent than "positive" sequences, yet
  not "negative").
- `uchardet` command line tool improved:
  * exits with non-zero return values on error.
- CMake build improved with more options:
  * Binary can be installed to non-default dir.
  * Allow building static-only builds.
  * Allow not building the command line tool.
  * Add static lib destination.
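
The "probable" sequence weighting mentioned above can be illustrated with a toy score. This is only a sketch of the idea, not uchardet's actual formula: the category names come from the release note, but the weights, the `confidence` function, and the input format are assumptions made up for illustration.

```python
# Hypothetical weights, chosen only for illustration; uchardet's real
# values and formula are not stated in these release notes.
WEIGHTS = {"positive": 1.0, "probable": 0.25, "negative": 0.0}


def confidence(counts):
    """Return a score in [0, 1] from per-category sequence counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Each observed sequence contributes its category weight to the score.
    score = sum(WEIGHTS[category] * n for category, n in counts.items())
    return score / total


if __name__ == "__main__":
    sample = {"positive": 60, "probable": 40, "negative": 0}
    print(confidence(sample))
```

With these made-up weights, a text whose sequences are mostly "probable" still earns partial credit instead of being scored as if those sequences carried no signal.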

Jul 19, 2016
8a8d6b6

v0.0.5

Version 0.0.5 released.

- Revert UTF-16 and UTF-32 label change:
  it was an error to specify endianness for texts with a BOM.
  The Unicode standard explicitly warns against it, and it even
  (partially) breaks conversions.
- Added support for:
    - French: Windows-1252.
    - German: ISO-8859-1 and Windows-1252.
    - Esperanto: ISO-8859-3.
    - Turkish: ISO-8859-3 and ISO-8859-9.
    - Thai: ISO-8859-11 (and TIS-620 model rebuilt).
- Single Byte charset detection algorithm improved:
  detection of control characters lowers confidence.
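
The UTF-16/UTF-32 label revert above is easier to see with a small conversion. The sketch below uses Python's built-in codecs as a stand-in for a converter; it is not uchardet or iconv code, and other converters may differ in details.

```python
import codecs

# "hi" encoded as little-endian UTF-16, preceded by a BOM.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Generic label: the decoder reads the BOM, picks the byte order, and drops it.
generic = data.decode("utf-16")

# Endianness-specific label: the BOM is decoded as U+FEFF and stays in the text.
specific = data.decode("utf-16-le")

if __name__ == "__main__":
    print(repr(generic))
    print(repr(specific))
```

Reporting the generic label for BOM-prefixed input lets the converter consume the BOM itself, whereas the endianness-specific label leaves a stray U+FEFF at the start of the converted text.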

Dec 5, 2015
886e03a

v0.0.4

Version 0.0.4 released.

- Add support for ISO-8859-1 and ISO-8859-15 for French.
- Re-enable Hungarian language models (ISO-8859-2 and Windows-1250)
  which used to conflict with other charsets (should be better now).
- Differentiate ASCII detection and detection failure.
- Improve single-byte charset detection confidence algorithm (fixes for
  instance Windows-1251 Russian text detection).
- "UTF-16" is now reported with endianness information (UTF-16LE/BE).
- Add UTF-32 BOM detection.
- Discard single byte charsets upon illegal codepoint detection.
- Internal redesign of single-byte charmaps with more semantics, and
  variable sample size length (different languages have different sizes
  of grapheme lists).
- A lot more test files (33 unit tests should pass with `make test`).
- Add Python scripts to generate language models from Wikipedia data
  in a single command.
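
The UTF-16 and UTF-32 BOM detection listed above can be sketched as a prefix check. This is an illustrative sketch, not uchardet's implementation: `detect_bom` and the table are hypothetical names, and the returned labels follow the endianness-qualified naming described for this release.

```python
import codecs

# Check longer BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with the
# same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data):
    """Return a label for the BOM at the start of data, or None."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


if __name__ == "__main__":
    print(detect_bom(codecs.BOM_UTF32_LE + "a".encode("utf-32-le")))
    print(detect_bom(b"plain ASCII text"))
```

The ordering matters: testing UTF-16LE before UTF-32LE would mislabel every UTF-32LE input. A UTF-16LE text whose first character is U+0000 remains ambiguous with this check alone.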

Dec 3, 2015
e4260f4

v0.0.3

Version 0.0.3 released.

A quick release after 0.0.2, mostly to fix a bad crash in the command
line tool when charset detection failed (or detected ASCII).

Additionally:

- The build now includes more test files for various language/encoding
  pairs and a `make test` target for unit testing (20 encoding detection
  tests should pass when running it).
- The build has a new BUILD_STATIC option, set to ON by default, which
  can be turned off to skip building the static library when it is not
  needed.
- All encoding names are iconv-compatible, enabling developers to
  directly feed the result of uchardet_get_charset() into libiconv.
- Compilation warnings fixed.

Nov 19, 2015
ff5fd5e

v0.0.2

Version 0.0.2 released.

Nov 16, 2015
d0ccdd5