Description
Currently, the synthetic dataset prompt creation process uses a binary search to converge on the desired prompt length, which sends a large number of requests to the relatively expensive tokenize function for the processor. Recently, #162 enabled a significantly cheaper way to generate a prompt of a given length: grab a prompt of an estimated length, tokenize it, truncate the token array to the desired length, and re-encode it back to text. Provided we use a reasonable start multiplier (likely 3 or 4 tokens per word, which should be double-checked against the average for current tokenizers, or calculated dynamically as prompts are generated), most prompts will require only a single tokenization call. If we add a safety step that multiplies the target word count by a reasonable constant whenever the result does not meet the token length constraint after tokenizing, then we can guarantee convergence.
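
A minimal sketch of this flow, assuming a Hugging Face tokenizer and a hypothetical `generate_text(num_words)` helper that returns source text with roughly that many words; the parameter names and defaults are illustrative, and the start ratio in particular should be validated against the actual tokenizer:

```python
from typing import Callable

from transformers import PreTrainedTokenizerBase


def create_prompt(
    tokenizer: PreTrainedTokenizerBase,
    generate_text: Callable[[int], str],  # hypothetical helper: word count -> text
    target_tokens: int,
    tokens_per_word: float = 3.0,  # issue's suggested start ratio; double-check per tokenizer
    safety_multiplier: float = 2.0,  # growth factor applied when an attempt undershoots
) -> str:
    """Generate text expected to cover target_tokens, truncate, and re-encode."""
    num_words = max(1, round(target_tokens / tokens_per_word))
    while True:
        text = generate_text(num_words)
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        if len(token_ids) >= target_tokens:
            # Truncate the token array to the desired length and decode back to text.
            return tokenizer.decode(token_ids[:target_tokens])
        # Safety net: grow the word count and retry, guaranteeing convergence.
        num_words = round(num_words * safety_multiplier) + 1
```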
The test for this will be to ensure that the ratio of tokenize calls to prompts is approximately one, that the average number of truncated tokens relative to the desired token count is reasonably small (under 20%), and that the number of prompt tokens reported after running a benchmark through vLLM matches the desired number of prompt tokens.
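
A hedged sketch of the first two checks, reusing the `create_prompt` and `generate_text` names from the sketch above; the call-counting wrapper and thresholds are illustrative, and the vLLM token-count check would be a separate end-to-end benchmark run rather than a unit test:

```python
def test_prompt_generation_efficiency(tokenizer, generate_text):
    targets = [64, 128, 256, 512] * 25  # 100 prompts across a few lengths
    encode_calls = 0
    pre_truncation_lengths = []

    original_encode = tokenizer.encode

    def counting_encode(text, **kwargs):
        nonlocal encode_calls
        encode_calls += 1
        token_ids = original_encode(text, **kwargs)
        pre_truncation_lengths.append(len(token_ids))
        return token_ids

    tokenizer.encode = counting_encode
    try:
        excess_fractions = []
        for target in targets:
            create_prompt(tokenizer, generate_text, target)
            # The last encode before returning holds the pre-truncation length.
            excess_fractions.append((pre_truncation_lengths[-1] - target) / target)
    finally:
        tokenizer.encode = original_encode

    # Roughly one tokenize call per prompt.
    assert encode_calls / len(targets) <= 1.1
    # On average, fewer than 20% of generated tokens are thrown away.
    assert sum(excess_fractions) / len(excess_fractions) <= 0.20
```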