Description
I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET.
- tokenizer.Encode should truncate when sequenceLength is reached instead of throwing an exception. It is not always known in advance what the sequence length is going to be.
- tokenizer.Encode takes an array of strings. The Python version returns an array of arrays (long[][]) of InputIds, while your version returns a single flattened long[].
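A rough sketch of the signature I would expect (hypothetical shape, not the library's current API):

// Hypothetical target shape: truncate at maxSequenceLength instead of throwing,
// and return one inner list per input sentence (matching Python's long[][] InputIds).
List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>> Encode(
    int maxSequenceLength,
    params string[] sentences);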
Example. Python code:
from transformers import BertTokenizer
MODEL_NAME = "distilbert-base-uncased"
sentence1 = "George Is the best person ever";
sentence2 = "George Is the best person ever";
sentence3 = "George Is the best person ever";
sentences = [sentence1, sentence2, sentence3]
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(sentences, truncation=True, padding=True, max_length=512)
print(train_encodings)
Output:
{'input_ids': [[101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
C# code:
var tokenizer = new BertUncasedBaseTokenizer();
var sentence1 = "George Is the best persone ever";
var sentence2 = "George Is the best persone ever";
var sentence3 = "George Is the best persone ever";
var sentenses = new string[] { sentence1, sentence2, sentence3 };
var encoded = tokenizer.Encode(30, sentenses);
Result: encoded is a System.Collections.Generic.List<(long InputIds, long TokenTypeIds, long AttentionMask)> with Count = 30.
I would have expected a List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>, with one inner list per sentence.
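As a minimal workaround sketch in the meantime (assuming the BERTTokenizers namespace and only the Encode(int sequenceLength, params string[] texts) call shown above; the EncodePerSentence extension method is a name I made up), one could call Encode once per sentence and collect the results:

using System.Collections.Generic;
using BERTTokenizers; // assumed namespace of BertUncasedBaseTokenizer

public static class TokenizerExtensions
{
    // Calls Encode once per sentence so each sentence keeps its own token list,
    // mirroring the long[][] shape returned by the Python tokenizer.
    public static List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>> EncodePerSentence(
        this BertUncasedBaseTokenizer tokenizer, int sequenceLength, IEnumerable<string> sentences)
    {
        var result = new List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>();
        foreach (var sentence in sentences)
        {
            result.Add(tokenizer.Encode(sequenceLength, sentence));
        }
        return result;
    }
}

Usage:

var perSentence = tokenizer.EncodePerSentence(30, sentences);
// perSentence.Count == 3: one list of (InputIds, TokenTypeIds, AttentionMask) tuples per sentence.

Note that this only addresses the second point; a sentence longer than sequenceLength would presumably still throw, so truncation still needs a change in the library itself.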