Conversation

ekagra-ranjan
Contributor

@ekagra-ranjan ekagra-ranjan commented Aug 26, 2025

I have been looking for datasets where Ngram beats EAGLE, to explore the idea of combining Ngram and EAGLE (#18633). InstructCoder, being an editing task, was the go-to dataset in vLLM for Ngram until I found that fixing the prompt made EAGLE quite strong, and better than Ngram, on InstructCoder. An ideal dataset would be one where the overlap between input and output is high. The Blazedit dataset is promising because it allows observing the acceptance length (AL) of Ngram across different levels of input-output overlap.
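To illustrate why high input-output overlap favors Ngram, here is a minimal sketch of n-gram (prompt-lookup) drafting. It is a simplification for illustration and not vLLM's actual proposer; the function name `ngram_propose` and the choice to use the most recent earlier match are assumptions.

```python
from typing import List


def ngram_propose(tokens: List[int], n: int, k: int) -> List[int]:
    """Propose up to k draft tokens by matching the last n tokens
    against an earlier occurrence in the same context.

    Hypothetical helper for illustration only.
    """
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards, skipping the suffix itself, so that the most
    # recent earlier occurrence wins (an assumption of this sketch).
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # The tokens that followed the earlier match become the draft.
            return tokens[start + n:start + n + k]
    return []
```

When the output largely copies the input, as in editing tasks, the continuation found this way tends to match what the target model would generate, so more draft tokens are accepted per step.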

  • This PR adds the Blazedit dataset.
  • Compared to the InstructCoder dataset that we used for Ngram, it is longer, and each sample is annotated with the normalized Levenshtein distance in [0.0, 1.0], which helps in observing how the gain of Ngram varies with the overlap between input and output.
  • Source
  • This needs a model that supports a sequence length >3k for the 5k-char variant of the dataset. Llama 3.1 8B only supports 2048, so I couldn't run this dataset completely, but it will be useful for models that support longer sequence lengths.
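For readers unfamiliar with the metric, the following is a rough sketch of a normalized Levenshtein distance. The dataset's exact normalization is not stated here; dividing by the length of the longer string is an assumption, and both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0.0, 1.0].

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest
```

Under this definition, 0.0 means the output is identical to the input and 1.0 means nothing is shared, so lower distances correspond to the high-overlap cases where Ngram should do best.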

Sample Cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method eagle --num_spec_tokens 3 --tp 1 --dataset-name hf --dataset-path vdaita/edit_5k_char --num-prompts 90 --hf-output-len 2048 --blazedit-min-distance 0.01 --blazedit-max-distance 0.99 --print-output
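The `--blazedit-min-distance` and `--blazedit-max-distance` flags select samples by their normalized distance. A minimal sketch of such a filter is below; the record field name `norm_distance`, the helper name, and the inclusive bounds are assumptions for illustration, not the actual `BlazeditDataset` code.

```python
from typing import Any, Dict, List


def filter_by_distance(
    records: List[Dict[str, Any]],
    min_distance: float,
    max_distance: float,
    key: str = "norm_distance",  # hypothetical field name
) -> List[Dict[str, Any]]:
    """Keep records whose normalized distance lies in [min, max].

    Inclusive bounds are an assumption of this sketch.
    """
    return [r for r in records if min_distance <= r[key] <= max_distance]
```

Narrowing the range, for example to low distances only, would restrict the benchmark to samples where the output mostly repeats the input.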

@mergify mergify bot added the performance Performance-related issues label Aug 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR adds support for the Blazedit dataset for benchmarking. The changes correctly add command-line arguments, integrate the new dataset class into the factory function, and implement the dataset loading and sampling logic. My review focuses on cleaning up some leftover debugging code and unused variables in the new BlazeditDataset implementation to improve code quality and ensure the dataset is not unnecessarily filtered.

@ekagra-ranjan ekagra-ranjan changed the title [Spec Dec][Benchmark] Add Blitzedit dataset [Spec Decode][Benchmark] Add Blitzedit dataset Sep 3, 2025
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 3, 2025
@ywang96 ywang96 enabled auto-merge (squash) September 3, 2025 23:26
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
auto-merge was automatically disabled September 4, 2025 15:51

Head branch was pushed to by a user without write access

mergify bot commented Sep 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ekagra-ranjan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 4, 2025
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Sep 4, 2025
@LiuXiaoxuanPKU
Collaborator

Can you share some numbers for InstructCoder, if any?

@ekagra-ranjan
Contributor Author

I have some numbers here: #18971

@ywang96 ywang96 merged commit cd08636 into vllm-project:main Sep 8, 2025
38 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Labels
performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed
3 participants