[FT] Faster generation with TransformersModel by using less padding #531

Closed
@rolshoven

Description

Issue encountered

I noticed that the greedy_until function in TransformersModel uses excessive padding. In my case, the largest input in my test set has 27k tokens, but most inputs are under 8k tokens. The current implementation passes max_context_continuation_size_allowed as the tokenizer's max_length, which corresponds to the token count of the largest sample in the entire dataset plus the maximum number of output tokens. Every batch is therefore padded to that dataset-wide maximum, which unnecessarily increases evaluation time.
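To make the overhead concrete, here is a rough back-of-the-envelope sketch using the numbers above; the generation budget of 256 tokens is just an illustrative placeholder, not a value taken from the code:

# Rough illustration of the padding overhead for a typical batch, assuming an
# 8k-token context, a 27k-token dataset-wide maximum, and a placeholder
# generation budget of 256 new tokens.
sample_len = 8_000          # typical context length in the batch
dataset_max_len = 27_000    # longest context in the whole dataset
max_new_tokens = 256        # hypothetical generation budget

current_padded_len = dataset_max_len + max_new_tokens   # padded to the dataset-wide maximum
proposed_padded_len = sample_len + max_new_tokens       # padded to the batch-local maximum

print(current_padded_len / proposed_padded_len)  # ~3.3x more token positions per sample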

Solution/Feature

Instead of using max_context_continuation_size_allowed when tokenizing the batch contexts, it would be better to use something like this (untested):

# Length of the first sample's tokenized context (the longest in this batch,
# since samples are sorted by length).
largest_sample_in_batch = len(batch[0].tokenized_context)
# Generation budget: the requested generation_size, or whatever room is left
# up to the model's maximum length.
max_generation_size = batch[0].generation_size if batch[0].generation_size else self.max_length - largest_sample_in_batch
# Pad to the batch-local maximum instead of the dataset-wide maximum.
max_length = min(largest_sample_in_batch + max_generation_size, self.max_length)

tokenized = self.tokenizer(
    ...
    max_length=max_length   # Only this needs to change
    ...
).to(self.device)

The calculation is essentially the same as the one already in the code; the only difference is that max_length is derived from the first sample of the current batch rather than the first sample of the entire dataset (see the sketch below for a variant that does not rely on batch ordering).
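If the batching logic does not guarantee that the first sample of a batch is also its longest one, the same idea can be written with an explicit max over the batch. This is only a sketch; batch_max_length is a hypothetical helper, and it assumes each request exposes tokenized_context and generation_size as in the snippet above:

def batch_max_length(batch, model_max_length):
    # Longest tokenized context in this batch, computed explicitly rather than
    # relying on the batch being sorted by length.
    largest_sample_in_batch = max(len(item.tokenized_context) for item in batch)
    # Same generation budget logic as above; generation_size is assumed to be
    # identical for all samples in the batch.
    generation_size = batch[0].generation_size
    max_generation_size = generation_size if generation_size else model_max_length - largest_sample_in_batch
    # Never exceed the model's maximum sequence length.
    return min(largest_sample_in_batch + max_generation_size, model_max_length)

The result would then be passed as max_length to the tokenizer call, exactly as in the snippet above.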

If you think this makes sense, I could open a pull request.
