
0.3.0

Tagged 27 Nov 22:16
Summary:
Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1171

The existing GPT2BPETokenizer incorrectly calculates the start and end indices of unicode characters.
For multi-byte characters, the decoded bytes must additionally be run through the byte decoder to recover the original token that was encoded; otherwise the index arithmetic operates on the byte-encoded string, whose length differs from the character length of the original token.
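A minimal sketch of the mechanism (not PyText's actual code; the token `"café"` and variable names are illustrative), using the byte-to-unicode mapping from the original GPT-2 release:

```python
def bytes_to_unicode():
    """Map each byte 0-255 to a printable unicode character, as in GPT-2 BPE."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

token = "café"  # hypothetical input containing a multi-byte character
# BPE operates on the byte-encoded form: one unicode symbol per UTF-8 byte,
# so "é" (two UTF-8 bytes) becomes two symbols.
bpe_token = "".join(byte_encoder[b] for b in token.encode("utf-8"))
assert len(bpe_token) == 5  # index math on this length overcounts by one

# Fix: run the byte decoder over the decoded token to recover the original
# string before computing start/end indices.
original = bytes(byte_decoder[c] for c in bpe_token).decode("utf-8")
assert original == token and len(original) == 4  # correct character count
```

With the byte decoder applied, the recovered token's length gives the correct character offsets for the start and end indices.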

Reviewed By: chenyangyu1988

Differential Revision: D18697646

fbshipit-source-id: 8f4d32a1caa40d8d06e7be31dfd4a6846692531a