Notes on versioning:
The project follows semantic versioning 2.0.0. The API covers the following symbols:
- C++
  - `onmt::BPELearner`
  - `onmt::BPE`
  - `onmt::SPMLearner`
  - `onmt::SentencePiece`
  - `onmt::SpaceTokenizer`
  - `onmt::Tokenizer`
  - `onmt::unicode::*`
- Python
  - `pyonmttok.BPELearner`
  - `pyonmttok.SentencePieceLearner`
  - `pyonmttok.Tokenizer`
v1.22.2 (2020-11-12)
- Do not require "none" tokenization mode for SentencePiece vocabulary restriction
v1.22.1 (2020-10-30)
- Fix error when enabling vocabulary restriction with SentencePiece and `spacer_annotate` is not explicitly set
- Fix backward compatibility with Kangxi and Kanbun scripts (see `segment_alphabet` option)
v1.22.0 (2020-10-29)
- [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a `std::shared_ptr` to make it outlive the `Tokenizer` instance.
- Add `set_random_seed` function to make subword regularization reproducible
- [Python] Support serialization of `Token` instances
- [C++] Add `Options` structure to configure tokenization options (`Flags` can still be used for backward compatibility)
- Fix BPE vocabulary restriction when using `joiner_new`, `spacer_annotate`, or `spacer_new` (the previous implementation always assumed `joiner_annotate` was used)
- [Python] Fix `spacer` argument name in `Token` constructor
- [C++] Fix ambiguous subword encoder ownership by using a `std::shared_ptr`
v1.21.0 (2020-10-22)
- Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
- Fix BPE vocabulary restriction when words have a leading or trailing joiner
- Raise an error when using a multi-character joiner and `support_prior_joiner`
- [Python] Implement the `__hash__` method of `pyonmttok.Token` objects to be consistent with the `__eq__` implementation
- [Python] Declare `pyonmttok.Tokenizer` arguments (except `mode`) as keyword-only
- [Python] Improve compatibility with Python 3.9
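A consistent `__hash__`/`__eq__` pair matters because Python requires equal objects to hash equally, otherwise sets and dicts misbehave. A minimal sketch of the pattern using a hypothetical frozen dataclass (illustrative only, not the actual `pyonmttok.Token` implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """Hypothetical token: a frozen dataclass derives __eq__ and
    __hash__ from the same fields, keeping the two consistent."""
    surface: str
    join_left: bool = False
    join_right: bool = False

a = Token("hello", join_right=True)
b = Token("hello", join_right=True)

# Equal tokens hash identically, so they collapse in sets and dicts.
assert a == b and hash(a) == hash(b)
assert len({a, b}) == 1
```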
v1.20.0 (2020-09-24)
- The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
- ICU is now required to improve performance and Unicode support
- SentencePiece is now integrated as a Git submodule and linked statically to the project
- Boost is no longer required, the project now uses cxxopts which is integrated as a Git submodule
- The project is compiled in `Release` mode by default
- Tests are no longer compiled by default (use `-DBUILD_TESTS=ON` to compile them)
- Accept any Unicode script alias in the `segment_alphabet` option
- Update SentencePiece to 0.1.92
- [Python] Improve the capabilities of the `Token` class:
  - Implement the `__repr__` method
  - Allow setting all attributes in the constructor
  - Add a copy constructor
- [Python] Add a copy constructor for the `Tokenizer` class
- [Python] Accept `None` value for the `segment_alphabet` argument
v1.19.0 (2020-09-02)
- Add BPE dropout (Provilkov et al. 2019)
- [Python] Introduce the "Token API": a set of methods that manipulate `Token` objects instead of serialized strings
- [Python] Add `unicode_ranges` argument to the `detokenize_with_ranges` method to return ranges over Unicode characters instead of bytes
- Include "Half-width kana" in Katakana script detection
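BPE dropout randomly skips merges during encoding, so the same word can segment differently across training epochs, which acts as subword regularization. A self-contained sketch of the idea with a toy merge table (a simplification, not OpenNMT's implementation):

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Greedy BPE with merge dropout (Provilkov et al. 2019):
    at each step, every applicable merge is skipped with probability
    `dropout`, yielding varied segmentations of the same word.
    `merges` maps a symbol pair to its priority (lower = earlier)."""
    symbols = list(word)
    while len(symbols) > 1:
        # Collect candidate merges, randomly dropping some of them.
        pairs = [
            (merges[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in merges and rng.random() >= dropout
        ]
        if not pairs:
            break
        _, i = min(pairs)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = {("l", "o"): 0, ("lo", "w"): 1}
print(bpe_encode("low", merges))               # no dropout: ['low']
print(bpe_encode("low", merges, dropout=1.0))  # all merges dropped: ['l', 'o', 'w']
```

With `0 < dropout < 1` the output varies between these two extremes from call to call.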
v1.18.5 (2020-07-07)
- Fix possible crash when applying a case insensitive BPE model on Unicode characters
v1.18.4 (2020-05-22)
- Fix segmentation fault on `cli/tokenize` exit
- Ignore empty tokens during detokenization
- When writing to a file, avoid flushing the output stream on each line
- Update `cli/CMakeLists.txt` to mark Boost.ProgramOptions as required
v1.18.3 (2020-03-09)
- Strip token annotations when calling `SubwordLearner.ingest_token`
v1.18.2 (2020-02-17)
- Speed and memory improvements for BPE learning
v1.18.1 (2020-01-16)
- [Python] Fix memory leak when deleting Tokenizer object
v1.18.0 (2020-01-06)
- Include the `is_placeholder` function in the Python API
- Add an `ingest_token` method to learner objects to allow external tokenization
v1.17.2 (2019-12-06)
- Fix joiner annotation when SentencePiece returns isolated spacers
- Apply `preserve_segmented_tokens` in "none" tokenization mode
- Performance improvements when using `case_feature` or `case_markup`
- Add missing `--no_substitution` flag to the command line client
v1.17.1 (2019-11-28)
- Fix missing case features for isolated joiners or spacers
v1.17.0 (2019-11-13)
- Add `soft_case_regions` flag to minimize the number of uppercase regions when using `case_markup`
- Fix mismatch between subword learning and encoding when using `case_feature`
- [C++] Fix missing default value for a new argument of the `SPMLearner` constructor
v1.16.1 (2019-10-21)
- Fix invalid SentencePiece training file when generated with `SentencePieceLearner.ingest` (newlines were missing)
- Correctly ignore placeholders when using `SentencePieceLearner` without a tokenizer
v1.16.0 (2019-10-07)
- Support keeping the vocabulary generated by SentencePiece with the `keep_vocab` argument
- [C++] Add intermediate method to annotate tokens before detokenization
- Improve detection of file read/write errors
- [Python] Lower the risk of ABI incompatibilities with other pybind11 extensions
v1.15.7 (2019-09-20)
- Do not apply case modifiers on placeholder tokens
v1.15.6 (2019-09-16)
- Fix placeholder tokenization when followed by a combining mark
v1.15.5 (2019-09-16)
- [Python] Downgrade `pybind11` to fix segmentation fault when importing after non-compliant Python wheels
v1.15.4 (2019-09-14)
- [Python] Fix possible runtime error on program exit when using `SentencePieceLearner`
v1.15.3 (2019-09-13)
- Fix possible memory issues when run in multiple threads with ICU
v1.15.2 (2019-09-11)
- [Python] Improve error checking in file based functions
v1.15.1 (2019-09-05)
- Fix regression in space tokenization: characters inside placeholders were incorrectly normalized
v1.15.0 (2019-09-05)
- Add `support_prior_joiners` flag to support tokenizing a pre-tokenized input
- Fix case markup when joiners or spacers are individual tokens
v1.14.1 (2019-08-07)
- Improve error checking
v1.14.0 (2019-07-19)
- [C++] Add a method to detokenize from `AnnotatedToken`s
- [Python] Release the GIL in time-consuming functions (e.g. file tokenization, subword learning)
- Performance improvements
v1.13.0 (2019-06-12)
- [Python] File-based tokenization and detokenization APIs
- Support tokenizing files with multiple threads
- Respect the `NoSubstitution` flag for combining marks applied on spaces
v1.12.1 (2019-05-27)
- Fix Python package
v1.12.0 (2019-05-27)
- Python API for subword learning (BPE and SentencePiece)
- C++ tokenization method to get the intermediate token representation
- Replace Boost.Python by pybind11 for the Python wrapper
- Fix verbose flag for SentencePiece training
- Check and raise possible errors during SentencePiece training
v1.11.0 (2019-02-05)
- Support copy operators on the Python client
- Support returning token locations in detokenized text
- Hide SentencePiece dependency in public headers
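Returning token locations means detokenization reports, for each input token, the span it occupies in the output text. A rough illustration of the idea, using the default joiner character "￭" (this mimics the concept, not the actual pyonmttok algorithm; the inclusive-offset convention here is an assumption):

```python
def detokenize_with_ranges(tokens, joiner="￭"):
    """Sketch of detokenization that reports, per token index, the
    (start, end) character range it occupies in the output. Tokens
    carrying the joiner attach to their neighbor without a space."""
    text = ""
    ranges = {}
    prev_glue = True  # no space before the first token
    for i, tok in enumerate(tokens):
        glue_left = tok.startswith(joiner) or prev_glue
        prev_glue = tok.endswith(joiner)
        surface = tok.strip(joiner)
        if not glue_left:
            text += " "
        start = len(text)
        text += surface
        ranges[i] = (start, len(text) - 1)  # inclusive end offset
    return text, ranges

text, ranges = detokenize_with_ranges(["Hello", "￭,", "world"])
print(text)    # Hello, world
print(ranges)  # {0: (0, 4), 1: (5, 5), 2: (7, 11)}
```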
v1.10.6 (2019-01-15)
- Update SentencePiece to 0.1.8 in the Python package
- Allow naming positional arguments in the Python API
v1.10.5 (2019-01-03)
- Stricter handling of combining marks (fixes #57 and #58)
v1.10.4 (2018-12-18)
- Harden detokenization against invalid case markup combinations
v1.10.3 (2018-11-05)
- Fix case markup for one-letter words
v1.10.2 (2018-10-18)
- Fix compilation errors when SentencePiece is not installed
- Fix DLLs builds using Visual Studio
- Handle rare cases where SentencePiece returns 0 pieces
v1.10.1 (2018-10-08)
- Fix regression for SentencePiece: spacer annotation was not automatically enabled in tokenization mode "none"
v1.10.0 (2018-10-05)
- Add `CaseMarkup` flag to inject case information as new tokens
- Do not break compilation for users with old SentencePiece versions
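Case markup lowercases tokens and injects the original casing as separate marker tokens, so a model sees a smaller vocabulary while case remains recoverable. A simplified round-trip sketch (the `<case:...>` marker names are illustrative, not the exact tokens produced by OpenNMT Tokenizer):

```python
def encode_case(tokens):
    """Sketch of case-markup encoding: lowercase each token and emit
    a marker token before those that carried case information."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ["<case:upper>", tok.lower()]
        elif tok[:1].isupper():
            out += ["<case:capital>", tok.lower()]
        else:
            out.append(tok)
    return out

def decode_case(tokens):
    """Inverse transform: consume marker tokens and restore case."""
    out, pending = [], None
    for tok in tokens:
        if tok == "<case:upper>":
            pending = str.upper
        elif tok == "<case:capital>":
            pending = str.capitalize
        else:
            out.append(pending(tok) if pending else tok)
            pending = None
    return out

tokens = ["Hello", "WORLD", "again"]
encoded = encode_case(tokens)
print(encoded)  # ['<case:capital>', 'hello', '<case:upper>', 'world', 'again']
assert decode_case(encoded) == tokens
```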
v1.9.0 (2018-09-25)
- Vocabulary restriction for SentencePiece encoding
- Improve Tokenizer constructor for subword configuration
v1.8.4 (2018-09-24)
- Expose base methods in the `Tokenizer` class
- Small performance improvements for standard use cases
v1.8.3 (2018-09-18)
- Fix count of Arabic characters in the map of detected alphabets
v1.8.2 (2018-09-10)
- Minor fix to CMakeLists.txt for SentencePiece compilation
v1.8.1 (2018-09-07)
- Support training SentencePiece as a subtokenizer
v1.8.0 (2018-09-07)
- Add learning interface for SentencePiece
v1.7.0 (2018-09-04)
- Add integrated subword learning with initial support for BPE
- Preserve placeholders as independent tokens for all modes
v1.6.2 (2018-08-29)
- Support SentencePiece sampling API
- Additional +30% speedup for BPE tokenization
- Fix BPE not respecting `PreserveSegmentedTokens` (#30)
v1.6.1 (2018-07-31)
- Fix Python package
v1.6.0 (2018-07-30)
- Add `PreserveSegmentedTokens` flag to not attach joiners or spacers to tokens segmented by any `Segment*` flag
- Do not rebuild `bpe_vocab` if already loaded (e.g. when `CacheModel` is set)
v1.5.3 (2018-07-13)
- Fix `PreservePlaceholders` with `JoinerAnnotate` that possibly modified other tokens
v1.5.2 (2018-07-12)
- Fix support of BPE models v0.2 trained with `learn_bpe.py`
v1.5.1 (2018-07-12)
- Do not escape spaces in placeholder values if `NoSubstitution` is enabled
v1.5.0 (2018-07-03)
- Support `apply_bpe.py` 0.3 mode
- Up to 3x faster tokenization and detokenization
v1.4.0 (2018-06-13)
- New character-level tokenization mode `Char`
- Add `SpacerNew` flag to make spacers independent tokens
- Replace spacer tokens by substitutes when found in the input text
- Do not enable spacers by default when SentencePiece is used as a subtokenizer
v1.3.0 (2018-04-07)
- New tokenization mode `None` that simply forwards the input text
- Support SentencePiece, as a tokenizer or subtokenizer
- Add `PreservePlaceholders` flag to not mark placeholders with joiners or spacers
- Revisit Python compilation to support wheels building
v1.2.0 (2018-03-28)
- Add API to retrieve discovered alphabet during tokenization
- Flag to convert joiners to spacers
- Add install target for the Python bindings library
v1.1.1 (2018-01-23)
- Make `Alphabet.h` public
v1.1.0 (2018-01-22)
- Python bindings
- Tokenization flag to disable special characters substitution
- Fix incorrect behavior when `--segment_alphabet` is not set by the client
- Fix alphabet identification
- Fix segmentation fault when tokenizing empty string on spaces
v1.0.0 (2017-12-11)
- New `Tokenizer` constructor requiring bit flags
- Support BPE modes from `learn_bpe.lua`
- Case-insensitive BPE models
- Space tokenization mode
- Alphabet segmentation
- Do not tokenize blocks encapsulated by `⦅` and `⦆`
- Add `segment_numbers` flag to split numbers into digits
- Add `segment_case` flag to split words on case changes
- Add `segment_alphabet_change` flag to split on alphabet change
- Add `cache_bpe_model` flag to cache BPE models for future instances
- Fix `SpaceTokenizer` crash with leading or trailing spaces
- Fix incorrect tokenization around the tabulation character (#5)
- Fix incorrect joiner between numeric and punctuation
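The `segment_numbers` flag splits a number into digits, while joiner annotation records where the pieces reattach during detokenization. A rough sketch of the combination, using the default joiner "￭" (the exact joiner placement convention here is an assumption, and the real tokenizer handles many more cases):

```python
JOINER = "￭"  # default OpenNMT Tokenizer joiner character

def segment_numbers(tokens):
    """Rough sketch of segment_numbers with joiner annotation: each
    multi-digit number becomes one token per digit, chained with the
    joiner so detokenization can glue the digits back together."""
    out = []
    for tok in tokens:
        if tok.isdigit() and len(tok) > 1:
            digits = list(tok)
            # join every digit to the next one: "12" -> ["1￭", "2"]
            out += [d + JOINER for d in digits[:-1]] + [digits[-1]]
        else:
            out.append(tok)
    return out

print(segment_numbers(["year", "2017"]))  # ['year', '2￭', '0￭', '1￭', '7']
```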
v0.2.0 (2017-03-08)
- Add CMake install rule
- Add API option to include separators
- Add static library compilation support
- Rename library to libOpenNMTTokenizer
- Make words features optional in tokenizer API
- Make `unicode` headers private
v0.1.0 (2017-02-14)
Initial release.