Problem in align two tokenizers #5323
Thank you for providing the align function to align two tokenizers. I tried the given function as follows:

import spacy
from transformers import *
nlp = spacy.load('en_core_web_sm')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = '15.10 CRITERIA FOR REMOVABLE PROSTHODONTICS (FULL AND PARTIAL DENTURES) ............53'
a = bert_tokenizer.tokenize(sentence)
b = [token.text for token in nlp(sentence)]
print(a)
print(b)
print(align(a, b))

The results look like this:

['15', '.', '10', 'criteria', 'for', 're', '##movable', 'pro', '##st', '##ho', '##don', '##tics', '(', 'full', 'and', 'partial', 'dent', '##ures', ')', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '53']
['15.10', 'CRITERIA', 'FOR', 'REMOVABLE', 'PROSTHODONTICS', '(', 'FULL', 'AND', 'PARTIAL', 'DENTURES', ')', '............', '53']
(24, array([-1, -1, -1, 1, 2, -1, -1, -1, -1, -1, -1, -1, 5, 6, 7, 8, -1,
-1, 10, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 12],
dtype=int32), array([ 2, 3, 4, -1, -1, 12, 13, 14, 15, -1, 18, -1, 31], dtype=int32), {0: 0, 1: 0, 2: 0}, {})

From the result we can see that the spaCy tokens 'REMOVABLE', 'PROSTHODONTICS', and 'DENTURES' do not align well with the BERT tokens. Is this a bug?
Replies: 1 comment
I'm assuming this is spacy.gold.align()? This method is not meant for aligning tokens with extra symbols like "##". If you strip the symbols temporarily, you can align the texts: …

Output: …

However, this method was intended for internal use for a narrow range of corpus alignment tasks and has some known bugs in some of the one-to-many cases. (For this case it will only work correctly with …) If you need a simple align function, I would recommend the new reimplemented align function, which is currently available behind a flag for testing. To use it, set … Otherwise, depending on what you're doing, you might want to let …
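
As a rough illustration of the "strip the symbols temporarily" idea, here is a minimal pure-Python sketch that works on the two token lists from the question. It is not spacy.gold.align() and does not reproduce its return values. The helper names (strip_wordpiece_markers, char_owners, align_by_chars) are made up for this example. It assumes that both tokenizations spell the same text once "##" is removed and case is ignored.

```python
def strip_wordpiece_markers(tokens, marker="##"):
    """Drop the WordPiece continuation prefix so the pieces concatenate to the raw text."""
    return [t[len(marker):] if t.startswith(marker) else t for t in tokens]


def char_owners(tokens):
    """For each character of the concatenated tokens, record the index of its token."""
    owners = []
    for i, tok in enumerate(tokens):
        owners.extend([i] * len(tok))
    return owners


def align_by_chars(src, dst):
    """For each token in src, list the indices of the dst tokens that share characters with it.

    Hypothetical helper: assumes src and dst spell the same string, ignoring case.
    """
    src_norm = [t.lower() for t in src]
    dst_norm = [t.lower() for t in dst]
    if "".join(src_norm) != "".join(dst_norm):
        raise ValueError("token sequences do not spell the same text")
    mapping = [[] for _ in src]
    for i_src, i_dst in zip(char_owners(src_norm), char_owners(dst_norm)):
        if not mapping[i_src] or mapping[i_src][-1] != i_dst:
            mapping[i_src].append(i_dst)
    return mapping


# Token lists copied from the question (BERT pieces, then spaCy tokens).
a = ['15', '.', '10', 'criteria', 'for', 're', '##movable', 'pro', '##st',
     '##ho', '##don', '##tics', '(', 'full', 'and', 'partial', 'dent',
     '##ures', ')'] + ['.'] * 12 + ['53']
b = ['15.10', 'CRITERIA', 'FOR', 'REMOVABLE', 'PROSTHODONTICS', '(', 'FULL',
     'AND', 'PARTIAL', 'DENTURES', ')', '............', '53']

a_stripped = strip_wordpiece_markers(a)   # 're', 'movable', 'pro', 'st', ...
a2b = align_by_chars(a_stripped, b)       # BERT piece -> spaCy token indices
b2a = align_by_chars(b, a_stripped)       # spaCy token -> BERT piece indices

print(b2a[3])  # 'REMOVABLE' covers BERT pieces [5, 6]
print(b2a[9])  # 'DENTURES' covers BERT pieces [16, 17]
```

Because the indices in a_stripped match those in the original BERT list, the mapping can be applied to the unstripped tokens afterwards. Working at the character level sidesteps the one-to-many cases, at the cost of failing outright when the two tokenizers normalize the text differently (for example, accent stripping or unknown-token replacement).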