Commit 732d2a8

[i18n-ZH] Translated fast_tokenizers.md to Chinese (#26910)
docs: translate fast_tokenizers into Chinese
1 parent eec5a3a commit 732d2a8

2 files changed: +72 -1 lines changed

docs/source/zh/_toctree.yml

Lines changed: 5 additions & 1 deletion
```diff
@@ -9,4 +9,8 @@
 - sections:
   - local: accelerate
     title: Distributed training with 🤗 Accelerate
-  title: Tutorials
+  title: Tutorials
+- sections:
+  - local: fast_tokenizers
+    title: Use tokenizers from 🤗 Tokenizers
+  title: Developer guides
```

docs/source/zh/fast_tokenizers.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@

<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Use tokenizers from 🤗 Tokenizers

[`PreTrainedTokenizerFast`] depends on the [🤗 Tokenizers](https://huggingface.co/docs/tokenizers) library. Tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

Before getting into the specifics, let's first create a dummy tokenizer in a few lines of code:

```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]  # paths to the text files used for training (left elided here)
>>> tokenizer.train(files, trainer)
```

We now have a tokenizer trained on the files we defined. We can either keep using it in the current runtime, or save it to a JSON file for future re-use.
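
For example, the freshly trained tokenizer can already be used on its own through the 🤗 Tokenizers API. This is a minimal sketch; the exact tokens and ids depend entirely on the training files passed above:

```python
>>> # Encode a sentence with the tokenizer trained above (illustrative only).
>>> encoding = tokenizer.encode("Hello, world!")
>>> encoding.tokens  # string tokens produced by the BPE model
>>> encoding.ids     # corresponding integer ids from the learned vocabulary
```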

## Loading directly from the tokenizer object

Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The [`PreTrainedTokenizerFast`] class allows for easy instantiation by accepting the instantiated *tokenizer* object as an argument:

```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```

This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head over to [the tokenizer page](main_classes/tokenizer) for more information.
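
For instance, the wrapped tokenizer can be called like any other 🤗 Transformers tokenizer. A minimal sketch (the resulting ids depend on the vocabulary learned above):

```python
>>> # Calling the fast tokenizer returns the usual dict with input_ids and attention_mask.
>>> batch = fast_tokenizer("Hello, world!")
>>> batch["input_ids"]
>>> batch.tokens()  # string tokens, courtesy of the fast (Rust) backend
```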

## Loading from a JSON file

In order to load a tokenizer from a JSON file, let's first save our tokenizer:

```python
>>> tokenizer.save("tokenizer.json")
```

The path to which we saved this file can be passed to the [`PreTrainedTokenizerFast`] initialization method using the `tokenizer_file` parameter:

```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```

This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head over to [the tokenizer page](main_classes/tokenizer) for more information.
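
As a final illustration, the wrapped tokenizer can also be persisted in the standard 🤗 Transformers format with `save_pretrained`, so it can later be reloaded by path. This is only a sketch; the `my-tokenizer` directory name is a hypothetical example:

```python
>>> from transformers import AutoTokenizer

>>> # Saves tokenizer_config.json, special_tokens_map.json and tokenizer.json to a folder...
>>> fast_tokenizer.save_pretrained("my-tokenizer")
>>> # ...which can then be loaded back with the Auto class (or PreTrainedTokenizerFast.from_pretrained).
>>> reloaded = AutoTokenizer.from_pretrained("my-tokenizer")
```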
