<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Use tokenizers from 🤗 Tokenizers

The [`PreTrainedTokenizerFast`] depends on the [🤗 Tokenizers](https://huggingface.co/docs/tokenizers) library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

Before getting into the specifics, let's first start by creating a dummy tokenizer in a few lines:

```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]
>>> tokenizer.train(files, trainer)
```

We now have a tokenizer trained on the files we defined. We can either continue using it in the current runtime, or save it to a JSON file for future re-use.

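As a quick check in the current runtime (a minimal sketch; the exact tokens and ids depend entirely on the files you trained on), the 🤗 Tokenizers object can already encode text:

```python
>>> # Encode a sample sentence with the freshly trained tokenizer
>>> encoding = tokenizer.encode("Hello, y'all! How are you?")
>>> tokens, ids = encoding.tokens, encoding.ids  # parallel lists of tokens and their vocabulary ids
```
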
## Loading directly from the tokenizer object

Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The [`PreTrainedTokenizerFast`] class allows for easy instantiation by accepting the instantiated *tokenizer* object as an argument:

```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```

This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information.

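For example (a minimal sketch; the actual ids depend on the tokenizer trained above), it can be called like any other 🤗 Transformers tokenizer and its output decoded back into text:

```python
>>> # __call__ returns a BatchEncoding with input_ids, attention_mask, etc.
>>> batch = fast_tokenizer("Hello, y'all! How are you?")
>>> fast_tokenizer.decode(batch["input_ids"])  # back to a string
```
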
## Loading from a JSON file

In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer:

```python
>>> tokenizer.save("tokenizer.json")
```

The path to which we saved this file can be passed to the [`PreTrainedTokenizerFast`] initialization method through the `tokenizer_file` parameter:

```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```

This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information.
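
If you would rather store the tokenizer in the usual 🤗 Transformers layout (a sketch; "my-new-tokenizer" is just an example directory name), `save_pretrained` and `from_pretrained` work as with any other tokenizer:

```python
>>> # Writes tokenizer.json plus the tokenizer configuration files to the directory
>>> fast_tokenizer.save_pretrained("my-new-tokenizer")
>>> reloaded_tokenizer = PreTrainedTokenizerFast.from_pretrained("my-new-tokenizer")
```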