Description
I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET.
- tokenizer.Encode should truncate when sequenceLength is reached instead of throwing an exception. It is not always known in advance what the sequence length is going to be.
- tokenizer.Encode takes an array of strings. The Python version returns an array of arrays (long[][]) of InputIds, while your version returns a single flattened long[].
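A rough sketch of the signature I would expect (hypothetical shape, not the library's current API):

// Hypothetical target shape: truncate at maxSequenceLength instead of throwing,
// and return one inner list per input sentence (matching Python's long[][] InputIds).
List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>> Encode(
    int maxSequenceLength,
    params string[] sentences);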
Example. Python code:
from transformers import BertTokenizer
MODEL_NAME = "distilbert-base-uncased"
sentence1 = "George Is the best person ever";
sentence2 = "George Is the best person ever";
sentence3 = "George Is the best person ever";
sentences = [sentence1, sentence2, sentence3]
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(sentences, truncation=True, padding=True, max_length=512)
print(train_encodings)
Output:
{'input_ids': [[101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
C# code:
var tokenizer = new BertUncasedBaseTokenizer();
var sentence1 = "George Is the best persone ever";
var sentence2 = "George Is the best persone ever";
var sentence3 = "George Is the best persone ever";
var sentenses = new string[] { sentence1, sentence2, sentence3 };
var encoded = tokenizer.Encode(30, sentenses);
Result: encoded is a System.Collections.Generic.List<(long InputIds, long TokenTypeIds, long AttentionMask)> with Count = 30.
I would have expected a List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>, with one inner list per sentence.
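As a minimal workaround sketch in the meantime (assuming the BERTTokenizers namespace and only the Encode(int sequenceLength, params string[] texts) call shown above; the EncodePerSentence extension method is a name I made up), one could call Encode once per sentence and collect the results:

using System.Collections.Generic;
using BERTTokenizers; // assumed namespace of BertUncasedBaseTokenizer

public static class TokenizerExtensions
{
    // Calls Encode once per sentence so each sentence keeps its own token list,
    // mirroring the long[][] shape returned by the Python tokenizer.
    public static List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>> EncodePerSentence(
        this BertUncasedBaseTokenizer tokenizer, int sequenceLength, IEnumerable<string> sentences)
    {
        var result = new List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>();
        foreach (var sentence in sentences)
        {
            result.Add(tokenizer.Encode(sequenceLength, sentence));
        }
        return result;
    }
}

Usage:

var perSentence = tokenizer.EncodePerSentence(30, sentences);
// perSentence.Count == 3: one list of (InputIds, TokenTypeIds, AttentionMask) tuples per sentence.

Note that this only addresses the second point; a sentence longer than sequenceLength would presumably still throw, so truncation still needs a change in the library itself.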