0.3.0
Tagged 27 Nov 22:16
Summary:
Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1171

The existing GPT2BPETokenizer incorrectly calculates the start and end indices of Unicode characters. For multi-byte characters, the byte decoder must additionally be applied to the decoded token to recover the original text that was encoded; computing indices directly on the encoded string over-counts each multi-byte character.

Reviewed By: chenyangyu1988

Differential Revision: D18697646

fbshipit-source-id: 8f4d32a1caa40d8d06e7be31dfd4a6846692531a
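
A minimal sketch (not PyText's actual implementation) of why the byte decoder is needed. GPT-2's byte-level BPE maps every UTF-8 byte to a printable symbol, so a multi-byte character occupies several symbols in the encoded token; lengths and offsets taken from the encoded string therefore over-count unless the token is first byte-decoded. The helpers `bytes_to_unicode` and `decode_token` below are illustrative, modeled on the original GPT-2 encoder:

```python
def bytes_to_unicode():
    """GPT-2's reversible byte <-> printable-character mapping."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def decode_token(encoded_token: str) -> str:
    """Map each symbol of an encoded token back to its original byte,
    then decode the byte sequence as UTF-8 to recover the source text."""
    raw = bytes(byte_decoder[ch] for ch in encoded_token)
    return raw.decode("utf-8")

# "é" is one character but two UTF-8 bytes, so its encoded form is two
# symbols; index arithmetic on the encoded string over-counts by one.
encoded = "".join(byte_encoder[b] for b in "é".encode("utf-8"))
assert len(encoded) == 2                 # wrong length if used directly
assert len(decode_token(encoded)) == 1   # correct after byte-decoding
```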