
This does not match behavior of Huggingface's Python version  #18

@gevorgter

I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET.

  1. tokenizer.Encode should truncate once sequenceLength is reached instead of throwing an exception. The required sequence length is not always known in advance.
  2. tokenizer.Encode takes an array of strings. For the same input, the Python version returns an array of arrays (long[][]) of InputIds, one per input string. Your version returns a single flattened long[].
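
To make the two expectations above concrete, here is a minimal sketch in plain Python of the batching behavior I am asking for. It is not the Hugging Face implementation and not this library's API: `encode_batch` is a hypothetical helper, it starts from word-piece IDs that are already looked up, and it only assumes the bert-base-uncased special-token IDs (101 for [CLS], 102 for [SEP], 0 for [PAD]).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in the bert-base-uncased vocabulary


def encode_batch(token_id_lists, max_length):
    """Hypothetical helper: one row per sentence, truncating instead of raising."""
    rows = []
    for ids in token_id_lists:
        # Reserve two slots for [CLS] and [SEP], then cut off the overflow.
        kept = ids[: max_length - 2]
        rows.append([CLS_ID] + kept + [SEP_ID])

    # Pad to the longest row in the batch, not to max_length.
    width = max(len(row) for row in rows)
    input_ids = [row + [PAD_ID] * (width - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    token_type_ids = [[0] * width for _ in rows]
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Word-piece IDs for "george is the best person ever", without special tokens.
word_ids = [2577, 2003, 1996, 2190, 2711, 2412]

batch = encode_batch([word_ids, word_ids, word_ids], max_length=512)
short = encode_batch([word_ids], max_length=5)  # too long for 5: truncated, no exception
```

Here `batch["input_ids"]` has three rows, one per sentence, and `short["input_ids"]` is cut down to five IDs while still ending in [SEP].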

Example. Python code:
from transformers import BertTokenizer
MODEL_NAME = "distilbert-base-uncased"

sentence1 = "George Is the best person ever"
sentence2 = "George Is the best person ever"
sentence3 = "George Is the best person ever"
sentences = [sentence1, sentence2, sentence3]

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(sentences, truncation=True, padding=True, max_length=512)
print(train_encodings)

Output:
{'input_ids': [[101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

C# code:
var tokenizer = new BertUncasedBaseTokenizer();
var sentence1 = "George Is the best person ever";
var sentence2 = "George Is the best person ever";
var sentence3 = "George Is the best person ever";
var sentences = new string[] { sentence1, sentence2, sentence3 };
var encoded = tokenizer.Encode(30, sentences);

typeof(encoded) = System.Collections.Generic.List<(long InputIds, long TokenTypeIds, long AttentionMask)>(Count = 30)
I would have expected List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>, with one inner list per sentence.
