Improve performance of synthetic dataset prompt creation to match a given number of tokens #187

Open
@markurtz

Description

Currently, the synthetic dataset prompt creation process uses a binary search to converge on the desired prompt length, which sends a large number of requests to the processor's relatively expensive tokenize function. Recently, #162 added a significantly cheaper way to generate a prompt of a given length: grab a prompt of a given word count, tokenize it, truncate the token array to the desired length, and decode it back to text. Provided we start with a reasonable multiplier (likely 3 or 4 tokens per word, which should be double-checked against the average for current tokenizers, or calculated dynamically as prompts are generated), most prompts will require only a single tokenization call. If we also add a safety factor that multiplies the target word count by a reasonable constant whenever the draft falls short of the token length constraint after tokenizing, we can guarantee convergence. A sketch of this flow follows below.
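
A minimal sketch of the proposed flow, assuming a Hugging Face tokenizer. `create_prompt`, `TOKENS_PER_WORD`, and `SAFETY_FACTOR` are illustrative names, not guidellm's actual API, and the starting ratio here is a conservative placeholder that would still need to be validated against real tokenizers (or computed dynamically) as described above:

```python
from transformers import AutoTokenizer

# Illustrative constants, not tuned values: the starting ratio should be
# double-checked against current tokenizers or computed dynamically.
TOKENS_PER_WORD = 1.0
SAFETY_FACTOR = 2.0

tokenizer = AutoTokenizer.from_pretrained("gpt2")


def create_prompt(source_words: list[str], target_tokens: int) -> str:
    """Build a prompt of target_tokens tokens, ideally with one encode call."""
    num_words = max(1, round(target_tokens / TOKENS_PER_WORD))
    while True:
        draft = " ".join(source_words[:num_words])
        token_ids = tokenizer.encode(draft)
        if len(token_ids) >= target_tokens or num_words >= len(source_words):
            # Truncate the token array and decode it back to text.
            return tokenizer.decode(token_ids[:target_tokens])
        # Safety: grow the word budget by a constant factor and retry;
        # with a sane starting ratio this branch is rarely taken.
        num_words = max(num_words + 1, round(num_words * SAFETY_FACTOR))
```

Note that decoding a truncated token array is not guaranteed to re-encode to exactly the same token count, which is part of why checking the counts reported by vLLM (below) matters.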

The test for this will be to ensure that:

- the ratio of tokenize calls to generated prompts is approximately one;
- the average number of tokens truncated, relative to the desired number of tokens, is reasonably small (within 20%);
- the prompt token counts reported after running a benchmark through vLLM match the desired number of prompt tokens.
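
A hedged sketch of how the first criterion could be measured; `CountingTokenizer` and `check_call_ratio` are hypothetical helpers, not existing guidellm classes:

```python
class CountingTokenizer:
    """Wraps a real tokenizer and counts encode calls (hypothetical helper)."""

    def __init__(self, inner):
        self.inner = inner
        self.encode_calls = 0

    def encode(self, text):
        self.encode_calls += 1
        return self.inner.encode(text)

    def decode(self, ids):
        return self.inner.decode(ids)


def check_call_ratio(counting_tokenizer, num_prompts, tolerance=1.2):
    # Criterion: the ratio of tokenize (encode) calls to prompts is ~1.
    ratio = counting_tokenizer.encode_calls / num_prompts
    assert ratio <= tolerance, f"{ratio:.2f} tokenize calls per prompt"
```

The truncation overhead (second criterion) could be tracked the same way by recording `len(token_ids) - target_tokens` inside the prompt-creation loop, and the third criterion checked against the prompt token counts vLLM reports for the benchmark run.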
