[FastTokenizer] Update fast_tokenizer doc #3787

Merged · 15 commits · Nov 17, 2022
Add README for ernie fast tokenizer
joey12300 committed Nov 17, 2022
commit ac822258ca35814ea90d9ff81caa31c04604fafa
20 changes: 20 additions & 0 deletions fast_tokenizer/examples/ernie-3.0/README.md
@@ -0,0 +1,20 @@
# ErnieFastTokenizer Tokenization Example

The FastTokenizer library provides the ErnieFastTokenizer interface for both C++ and Python. Users only need to pass in the model's vocabulary to call the interface and get efficient tokenization. Under the hood, the interface tokenizes with the `WordPiece` algorithm. For `WordPiece`, FastTokenizer implements the `MinMaxMatch`-based `FastWordPiece` algorithm proposed in "Fast WordPiece Tokenization". The original `WordPiece` algorithm runs in time quadratic in the sequence length, so tokenizing long texts is costly; `FastWordPiece` uses the `Aho–Corasick` algorithm to reduce this to time linear in the sequence length, greatly improving tokenization efficiency. Besides the ERNIE models, `ErnieFastTokenizer` also supports other models that tokenize with `WordPiece`, such as `BERT` and `TinyBERT`. The full list of supported models follows, with a short usage sketch after the list:

## Supported Models

- ERNIE
- BERT
- TinyBERT
- ERNIE Gram
- ERNIE ViL
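
To make the interface concrete, here is a minimal Python sketch. The vocabulary path `ernie_vocab.txt` and the exact constructor and `encode` signatures are illustrative assumptions; see the Python example linked below for the authoritative usage.

```python
# Minimal sketch (assumed API): build an ErnieFastTokenizer from a WordPiece
# vocabulary file and tokenize one sentence with the linear-time FastWordPiece
# algorithm described above.
from fast_tokenizer import ErnieFastTokenizer

# "ernie_vocab.txt" is a placeholder for the model's vocabulary file.
tokenizer = ErnieFastTokenizer("ernie_vocab.txt")

output = tokenizer.encode("我爱中国")  # single-sentence encoding
print(output.tokens)  # WordPiece tokens
print(output.ids)     # corresponding vocabulary ids
```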

## Detailed Tokenization Examples

- [C++ tokenization example](./cpp)
- [Python tokenization example](./python)

## References

- Xinying Song, Alex Salcianu, et al. "Fast WordPiece Tokenization", EMNLP, 2021.