
0.3.0

Tagged 27 Nov 22:16
Summary:
Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1171

The existing GPT2BPETokenizer incorrectly calculates the start and end indices of unicode characters.
For multi-byte characters, the decoded bytes must additionally be run through the byte decoder to recover the original token that was encoded; otherwise the index arithmetic operates on the byte-encoded string, whose length differs from the character length of the original token.
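A minimal sketch of the mechanism (not PyText's actual code; the token `"café"` and variable names are illustrative), using the byte-to-unicode mapping from the original GPT-2 release:

```python
def bytes_to_unicode():
    """Map each byte 0-255 to a printable unicode character, as in GPT-2 BPE."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

token = "café"  # hypothetical input containing a multi-byte character
# BPE operates on the byte-encoded form: one unicode symbol per UTF-8 byte,
# so "é" (two UTF-8 bytes) becomes two symbols.
bpe_token = "".join(byte_encoder[b] for b in token.encode("utf-8"))
assert len(bpe_token) == 5  # index math on this length overcounts by one

# Fix: run the byte decoder over the decoded token to recover the original
# string before computing start/end indices.
original = bytes(byte_decoder[c] for c in bpe_token).decode("utf-8")
assert original == token and len(original) == 4  # correct character count
```

With the byte decoder applied, the recovered token's length gives the correct character offsets for the start and end indices.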

Reviewed By: chenyangyu1988

Differential Revision: D18697646

fbshipit-source-id: 8f4d32a1caa40d8d06e7be31dfd4a6846692531a