Problem in align two tokenizers #5323

I'm assuming this is spacy.gold.align()? This method is not meant for aligning tokens with extra symbols like "##". If you strip the symbols temporarily, you can align the texts:

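import re
from spacy.gold import align

# a and b are the two token lists from the question; strip the leading "##" from a first
temp_a = [re.sub("^##", "", t) for t in a]
print(align(temp_a, b))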

Output:

(24, array([-1, -1, -1,  1,  2, -1, -1, -1, -1, -1, -1, -1,  5,  6,  7,  8, -1,
       -1, 10, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 12],
      dtype=int32), array([ 2,  3,  4,  6, 11, 12, 13, 14, 15, 17, 18, 30, 31], dtype=int32), {0: 0, 1: 0, 2: 0, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 4, 11: 4, 16: 9, 17: 9, 19: 11, 20: 11, 21: 11, 22: 11, 23: 11, 24: 11, 25: 11, 26: 11, 27: 11, 28: 11, 29: 11, 30: 11}, {})

However, this method …
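
To illustrate the "strip the symbols, then align" idea without depending on a particular spaCy version, here is a minimal pure-Python sketch that aligns two tokenizations by character offsets. The helper names `char_spans` and `align_by_chars` are hypothetical and not part of spaCy, and the sketch assumes both token lists spell exactly the same text once the "##" prefixes are removed (no case differences, no unknown-token placeholders).

```python
import re
from typing import List, Tuple


def char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of each token in the concatenated text."""
    spans = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def align_by_chars(a: List[str], b: List[str]) -> List[List[int]]:
    """For each token in `a`, list the indices of tokens in `b` that share characters with it.

    Hypothetical helper: strips a leading "##" from tokens in `a` before comparing.
    """
    a_clean = [re.sub(r"^##", "", t) for t in a]
    if "".join(a_clean) != "".join(b):
        raise ValueError("token sequences do not spell the same text")

    b_spans = char_spans(b)
    result = []
    j = 0
    for start, end in char_spans(a_clean):
        # Skip tokens in b that end before this token of a starts.
        while j < len(b_spans) and b_spans[j][1] <= start:
            j += 1
        # Collect every token in b that starts before this token of a ends.
        overlaps = []
        k = j
        while k < len(b_spans) and b_spans[k][0] < end:
            overlaps.append(k)
            k += 1
        result.append(overlaps)
    return result


wordpieces = ["token", "##izer", "##s", "are", "fun"]
words = ["tokenizers", "are", "fun"]
print(align_by_chars(wordpieces, words))
```

Each inner list holds the indices of the tokens in the second list that overlap the corresponding token in the first, so several wordpieces mapping to the same index indicates a many-to-one alignment.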

Answer selected by ines
Labels
feat / tokenizer Feature: Tokenizer

This discussion was converted from issue #5323 on December 11, 2020 00:19.