Tokenizer is a C++ implementation of OpenNMT tokenization and detokenization.
Compiling executables requires:
- Boost (program_options)
CMake and a compiler that supports the C++11 standard are required to compile the project.
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=<Release or Debug> ..
make
It will produce the dynamic library libtokenizer.so (or .dylib on Mac OS, .dll on Windows), and the tokenization tools cli/tokenize and cli/detokenize.
- To compile only the library, use the -DLIB_ONLY=ON flag.
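
The build steps and flags above can be wrapped in a small helper script. This is only a sketch: the function names cmake_flags and build_tokenizer are hypothetical, it assumes the script is run from the project root, and it assumes LIB_ONLY is OFF when not specified. Only the two -D flags come from the instructions above.

```shell
#!/bin/sh
# Hypothetical build wrapper; not part of the project.

# Print the CMake flags for a build type and a library-only switch.
cmake_flags() {
  build_type="${1:-Release}"   # Release or Debug
  lib_only="${2:-OFF}"         # ON builds only the library (OFF assumed to be the default)
  printf '%s %s\n' "-DCMAKE_BUILD_TYPE=${build_type}" "-DLIB_ONLY=${lib_only}"
}

# Configure and build in ./build. The subshell keeps the caller's
# working directory unchanged.
build_tokenizer() {
  (
    mkdir -p build &&
    cd build &&
    cmake $(cmake_flags "$1" "$2") .. &&
    make
  )
}
```

After sourcing the script, build_tokenizer Release ON would configure a library-only release build, and build_tokenizer Debug would build the library and the tools in debug mode.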
Run the clients with --help to see the available options and usage. They have the same interface as their Lua counterparts.
This project is also a convenient way to apply OpenNMT tokenization in existing software.
See:
- include/onmt/Tokenizer.h to apply OpenNMT's tokenization and detokenization