Fast and customizable text tokenization library with BPE and SentencePiece support

OpenNMT/Tokenizer

Tokenizer

Tokenizer is a C++ implementation of OpenNMT tokenization and detokenization.

Dependencies

Compiling executables requires:

  • Boost (program_options)

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=<Release or Debug> ..
make

This produces the dynamic library libtokenizer.so (.dylib on macOS, .dll on Windows) and the tokenization tools cli/tokenize and cli/detokenize.

Options

  • To compile only the library, use the -DLIB_ONLY=ON flag.
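For illustration, a library-only build might combine this flag with the build steps above. This is a sketch that assumes a Release build started from the repository root; with LIB_ONLY enabled, the cli/ tools are presumably not built.

```shell
# Configure and build only the shared library (no cli/ executables).
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DLIB_ONLY=ON ..
make
```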

Using

Clients

Run the clients with --help to list their available options and usage. They have the same interface as their Lua counterparts.
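As a rough sketch, the commands below are run from the build directory, where the tools are produced. The last command assumes the clients read text from standard input and write to standard output, as the Lua tools do. The file names input.txt and input.tok are hypothetical.

```shell
# Print the options supported by each client.
cli/tokenize --help
cli/detokenize --help

# Assumption: text is read from stdin and written to stdout, as with the Lua tools.
cli/tokenize < input.txt > input.tok
```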

Library

This project is also a convenient way to apply OpenNMT tokenization in existing software.

See:

  • include/onmt/Tokenizer.h to apply OpenNMT's tokenization and detokenization