Notes on versioning:
The project follows semantic versioning 2.0.0. The API covers the following symbols:
- C++
  - `onmt::BPELearner`
  - `onmt::BPE`
  - `onmt::SPMLearner`
  - `onmt::SentencePiece`
  - `onmt::SpaceTokenizer`
  - `onmt::Tokenizer`
  - `onmt::unicode::*`
- Python
  - `pyonmttok.BPELearner`
  - `pyonmttok.SentencePieceLearner`
  - `pyonmttok.Tokenizer`
v1.22.2 (2020-11-12)
- Do not require "none" tokenization mode for SentencePiece vocabulary restriction
v1.22.1 (2020-10-30)
- Fix error when enabling vocabulary restriction with SentencePiece and `spacer_annotate` is not explicitly set
- Fix backward compatibility with Kangxi and Kanbun scripts (see `segment_alphabet` option)
v1.22.0 (2020-10-29)
- [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a `std::shared_ptr` to make it outlive the `Tokenizer` instance.
- Add `set_random_seed` function to make subword regularization reproducible
- [Python] Support serialization of `Token` instances
- [C++] Add `Options` structure to configure tokenization options (`Flags` can still be used for backward compatibility)
- Fix BPE vocabulary restriction when using `joiner_new`, `spacer_annotate`, or `spacer_new` (the previous implementation always assumed `joiner_annotate` was used)
- [Python] Fix `spacer` argument name in `Token` constructor
- [C++] Fix ambiguous subword encoder ownership by using a `std::shared_ptr`
v1.21.0 (2020-10-22)
- Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
- Fix BPE vocabulary restriction when words have a leading or trailing joiner
- Raise an error when using a multi-character joiner and `support_prior_joiner`
- [Python] Implement the `__hash__` method of `pyonmttok.Token` objects to be consistent with the `__eq__` implementation
- [Python] Declare `pyonmttok.Tokenizer` arguments (except `mode`) as keyword-only
- [Python] Improve compatibility with Python 3.9
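A consistent `__hash__`/`__eq__` pair matters because Python requires equal objects to hash equally, otherwise sets and dicts misbehave. A minimal sketch of the pattern using a hypothetical frozen dataclass (illustrative only, not the actual `pyonmttok.Token` implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """Hypothetical token: a frozen dataclass derives __eq__ and
    __hash__ from the same fields, keeping the two consistent."""
    surface: str
    join_left: bool = False
    join_right: bool = False

a = Token("hello", join_right=True)
b = Token("hello", join_right=True)

# Equal tokens hash identically, so they collapse in sets and dicts.
assert a == b and hash(a) == hash(b)
assert len({a, b}) == 1
```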
v1.20.0 (2020-09-24)
- The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
- ICU is now required to improve performance and Unicode support
- SentencePiece is now integrated as a Git submodule and linked statically to the project
- Boost is no longer required, the project now uses cxxopts which is integrated as a Git submodule
- The project is compiled in `Release` mode by default
- Tests are no longer compiled by default (use `-DBUILD_TESTS=ON` to compile them)
- Accept any Unicode script alias in the `segment_alphabet` option
- Update SentencePiece to 0.1.92
- [Python] Improve the capabilities of the `Token` class:
  - Implement the `__repr__` method
  - Allow setting all attributes in the constructor
  - Add a copy constructor
- [Python] Add a copy constructor for the `Tokenizer` class
- [Python] Accept `None` value for the `segment_alphabet` argument
v1.19.0 (2020-09-02)
- Add BPE dropout (Provilkov et al. 2019)
- [Python] Introduce the "Token API": a set of methods that manipulate `Token` objects instead of serialized strings
- [Python] Add `unicode_ranges` argument to the `detokenize_with_ranges` method to return ranges over Unicode characters instead of bytes
- Include "Half-width kana" in Katakana script detection
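BPE dropout randomly skips merges during encoding, so the same word can segment differently across training epochs, which acts as subword regularization. A self-contained sketch of the idea with a toy merge table (a simplification, not OpenNMT's implementation):

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Greedy BPE with merge dropout (Provilkov et al. 2019):
    at each step, every applicable merge is skipped with probability
    `dropout`, yielding varied segmentations of the same word.
    `merges` maps a symbol pair to its priority (lower = earlier)."""
    symbols = list(word)
    while len(symbols) > 1:
        # Collect candidate merges, randomly dropping some of them.
        pairs = [
            (merges[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in merges and rng.random() >= dropout
        ]
        if not pairs:
            break
        _, i = min(pairs)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = {("l", "o"): 0, ("lo", "w"): 1}
print(bpe_encode("low", merges))               # no dropout: ['low']
print(bpe_encode("low", merges, dropout=1.0))  # all merges dropped: ['l', 'o', 'w']
```

With `0 < dropout < 1` the output varies between these two extremes from call to call.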
v1.18.5 (2020-07-07)
- Fix possible crash when applying a case insensitive BPE model on Unicode characters
v1.18.4 (2020-05-22)
- Fix segmentation fault on `cli/tokenize` exit
- Ignore empty tokens during detokenization
- When writing to a file, avoid flushing the output stream on each line
- Update `cli/CMakeLists.txt` to mark Boost.ProgramOptions as required
v1.18.3 (2020-03-09)
- Strip token annotations when calling `SubwordLearner.ingest_token`
v1.18.2 (2020-02-17)
- Speed and memory improvements for BPE learning
v1.18.1 (2020-01-16)
- [Python] Fix memory leak when deleting Tokenizer object
v1.18.0 (2020-01-06)
- Include the `is_placeholder` function in the Python API
- Add an `ingest_token` method to learner objects to allow external tokenization
v1.17.2 (2019-12-06)
- Fix joiner annotation when SentencePiece returns isolated spacers
- Apply `preserve_segmented_tokens` in "none" tokenization mode
- Performance improvements when using `case_feature` or `case_markup`
- Add missing `--no_substitution` flag to the command line client
v1.17.1 (2019-11-28)
- Fix missing case features for isolated joiners or spacers
v1.17.0 (2019-11-13)
- Add `soft_case_regions` flag to minimize the number of uppercase regions when using `case_markup`
- Fix mismatch between subword learning and encoding when using `case_feature`
- [C++] Fix missing default value for a new argument of the `SPMLearner` constructor
v1.16.1 (2019-10-21)
- Fix invalid SentencePiece training file when generated with `SentencePieceLearner.ingest` (newlines were missing)
- Correctly ignore placeholders when using `SentencePieceLearner` without a tokenizer
v1.16.0 (2019-10-07)
- Support keeping the vocabulary generated by SentencePiece with the `keep_vocab` argument
- [C++] Add intermediate method to annotate tokens before detokenization
- Improve detection of file read/write errors
- [Python] Lower the risk of ABI incompatibilities with other pybind11 extensions
v1.15.7 (2019-09-20)
- Do not apply case modifiers on placeholder tokens
v1.15.6 (2019-09-16)
- Fix placeholder tokenization when followed by a combining mark
v1.15.5 (2019-09-16)
- [Python] Downgrade `pybind11` to fix segmentation fault when importing after non-compliant Python wheels
v1.15.4 (2019-09-14)
- [Python] Fix possible runtime error on program exit when using `SentencePieceLearner`
v1.15.3 (2019-09-13)
- Fix possible memory issues when run in multiple threads with ICU
v1.15.2 (2019-09-11)
- [Python] Improve error checking in file based functions
v1.15.1 (2019-09-05)
- Fix regression in space tokenization: characters inside placeholders were incorrectly normalized
v1.15.0 (2019-09-05)
- Add `support_prior_joiners` flag to support tokenizing a pre-tokenized input
- Fix case markup when joiners or spacers are individual tokens
v1.14.1 (2019-08-07)
- Improve error checking
v1.14.0 (2019-07-19)
- [C++] Add a method to detokenize from `AnnotatedToken`s
- [Python] Release the GIL in time-consuming functions (e.g. file tokenization, subword learning)
- Performance improvements
v1.13.0 (2019-06-12)
- [Python] File-based tokenization and detokenization APIs
- Support tokenizing files with multiple threads
- Respect the `NoSubstitution` flag for combining marks applied on spaces
v1.12.1 (2019-05-27)
- Fix Python package
v1.12.0 (2019-05-27)
- Python API for subword learning (BPE and SentencePiece)
- C++ tokenization method to get the intermediate token representation
- Replace Boost.Python by pybind11 for the Python wrapper
- Fix verbose flag for SentencePiece training
- Check and raise possible errors during SentencePiece training
v1.11.0 (2019-02-05)
- Support copy operators on the Python client
- Support returning token locations in detokenized text
- Hide SentencePiece dependency in public headers
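Returning token locations means detokenization reports, for each input token, the span it occupies in the output text. A rough illustration of the idea, using the default joiner character "￭" (this mimics the concept, not the actual pyonmttok algorithm; the inclusive-offset convention here is an assumption):

```python
def detokenize_with_ranges(tokens, joiner="￭"):
    """Sketch of detokenization that reports, per token index, the
    (start, end) character range it occupies in the output. Tokens
    carrying the joiner attach to their neighbor without a space."""
    text = ""
    ranges = {}
    prev_glue = True  # no space before the first token
    for i, tok in enumerate(tokens):
        glue_left = tok.startswith(joiner) or prev_glue
        prev_glue = tok.endswith(joiner)
        surface = tok.strip(joiner)
        if not glue_left:
            text += " "
        start = len(text)
        text += surface
        ranges[i] = (start, len(text) - 1)  # inclusive end offset
    return text, ranges

text, ranges = detokenize_with_ranges(["Hello", "￭,", "world"])
print(text)    # Hello, world
print(ranges)  # {0: (0, 4), 1: (5, 5), 2: (7, 11)}
```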
v1.10.6 (2019-01-15)
- Update SentencePiece to 0.1.8 in the Python package
- Allow naming positional arguments in the Python API
v1.10.5 (2019-01-03)
- Stricter handling of combining marks (fixes #57 and #58)
v1.10.4 (2018-12-18)
- Harden detokenization against invalid case markup combinations
v1.10.3 (2018-11-05)
- Fix case markup for one-letter words
v1.10.2 (2018-10-18)
- Fix compilation errors when SentencePiece is not installed
- Fix DLLs builds using Visual Studio
- Handle rare cases where SentencePiece returns 0 pieces
v1.10.1 (2018-10-08)
- Fix regression for SentencePiece: spacer annotation was not automatically enabled in tokenization mode "none"
v1.10.0 (2018-10-05)
- Add `CaseMarkup` flag to inject case information as new tokens
- Do not break compilation for users with old SentencePiece versions
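Case markup lowercases tokens and injects the original casing as separate marker tokens, so a model sees a smaller vocabulary while case remains recoverable. A simplified round-trip sketch (the `<case:...>` marker names are illustrative, not the exact tokens produced by OpenNMT Tokenizer):

```python
def encode_case(tokens):
    """Sketch of case-markup encoding: lowercase each token and emit
    a marker token before those that carried case information."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ["<case:upper>", tok.lower()]
        elif tok[:1].isupper():
            out += ["<case:capital>", tok.lower()]
        else:
            out.append(tok)
    return out

def decode_case(tokens):
    """Inverse transform: consume marker tokens and restore case."""
    out, pending = [], None
    for tok in tokens:
        if tok == "<case:upper>":
            pending = str.upper
        elif tok == "<case:capital>":
            pending = str.capitalize
        else:
            out.append(pending(tok) if pending else tok)
            pending = None
    return out

tokens = ["Hello", "WORLD", "again"]
encoded = encode_case(tokens)
print(encoded)  # ['<case:capital>', 'hello', '<case:upper>', 'world', 'again']
assert decode_case(encoded) == tokens
```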
v1.9.0 (2018-09-25)
- Vocabulary restriction for SentencePiece encoding
- Improve Tokenizer constructor for subword configuration
v1.8.4 (2018-09-24)
- Expose base methods in the `Tokenizer` class
- Small performance improvements for standard use cases
v1.8.3 (2018-09-18)
- Fix count of Arabic characters in the map of detected alphabets
v1.8.2 (2018-09-10)
- Minor fix to CMakeLists.txt for SentencePiece compilation
v1.8.1 (2018-09-07)
- Support training SentencePiece as a subtokenizer
v1.8.0 (2018-09-07)
- Add learning interface for SentencePiece
v1.7.0 (2018-09-04)
- Add integrated subword learning with initial support for BPE
- Preserve placeholders as independent tokens for all modes
v1.6.2 (2018-08-29)
- Support SentencePiece sampling API
- Additional +30% speedup for BPE tokenization
- Fix BPE not respecting `PreserveSegmentedTokens` (#30)
v1.6.1 (2018-07-31)
- Fix Python package
v1.6.0 (2018-07-30)
- Add `PreserveSegmentedTokens` flag to not attach joiners or spacers to tokens segmented by any `Segment*` flag
- Do not rebuild `bpe_vocab` if already loaded (e.g. when `CacheModel` is set)
v1.5.3 (2018-07-13)
- Fix `PreservePlaceholders` with `JoinerAnnotate` that possibly modified other tokens
v1.5.2 (2018-07-12)
- Fix support of BPE models v0.2 trained with `learn_bpe.py`
v1.5.1 (2018-07-12)
- Do not escape spaces in placeholder values if `NoSubstitution` is enabled
v1.5.0 (2018-07-03)
- Support `apply_bpe.py` 0.3 mode
- Up to 3x faster tokenization and detokenization
v1.4.0 (2018-06-13)
- New character-level tokenization mode `Char`
- Add `SpacerNew` flag to make spacers independent tokens
- Replace spacer tokens by substitutes when found in the input text
- Do not enable spacers by default when SentencePiece is used as a subtokenizer
v1.3.0 (2018-04-07)
- New tokenization mode `None` that simply forwards the input text
- Support SentencePiece, as a tokenizer or subtokenizer
- Add `PreservePlaceholders` flag to not mark placeholders with joiners or spacers
- Revisit Python compilation to support wheels building
v1.2.0 (2018-03-28)
- Add API to retrieve discovered alphabet during tokenization
- Flag to convert joiners to spacers
- Add install target for the Python bindings library
v1.1.1 (2018-01-23)
- Make `Alphabet.h` public
v1.1.0 (2018-01-22)
- Python bindings
- Tokenization flag to disable special characters substitution
- Fix incorrect behavior when `--segment_alphabet` is not set by the client
- Fix alphabet identification
- Fix segmentation fault when tokenizing empty string on spaces
v1.0.0 (2017-12-11)
- New `Tokenizer` constructor requiring bit flags
- Support BPE modes from `learn_bpe.lua`
- Case-insensitive BPE models
- Space tokenization mode
- Alphabet segmentation
- Do not tokenize blocks encapsulated by `⦅` and `⦆`
- Add `segment_numbers` flag to split numbers into digits
- Add `segment_case` flag to split words on case changes
- Add `segment_alphabet_change` flag to split on alphabet change
- Add `cache_bpe_model` flag to cache BPE models for future instances
- Fix `SpaceTokenizer` crash with leading or trailing spaces
- Fix incorrect tokenization around the tabulation character (#5)
- Fix incorrect joiner between numeric and punctuation
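The `segment_numbers` flag splits a number into digits, while joiner annotation records where the pieces reattach during detokenization. A rough sketch of the combination, using the default joiner "￭" (the exact joiner placement convention here is an assumption, and the real tokenizer handles many more cases):

```python
JOINER = "￭"  # default OpenNMT Tokenizer joiner character

def segment_numbers(tokens):
    """Rough sketch of segment_numbers with joiner annotation: each
    multi-digit number becomes one token per digit, chained with the
    joiner so detokenization can glue the digits back together."""
    out = []
    for tok in tokens:
        if tok.isdigit() and len(tok) > 1:
            digits = list(tok)
            # join every digit to the next one: "12" -> ["1￭", "2"]
            out += [d + JOINER for d in digits[:-1]] + [digits[-1]]
        else:
            out.append(tok)
    return out

print(segment_numbers(["year", "2017"]))  # ['year', '2￭', '0￭', '1￭', '7']
```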
v0.2.0 (2017-03-08)
- Add CMake install rule
- Add API option to include separators
- Add static library compilation support
- Rename library to libOpenNMTTokenizer
- Make words features optional in tokenizer API
- Make `unicode` headers private
v0.1.0 (2017-02-14)
Initial release.