Repurpose Scheduler Spec Dec metric for testing correctness #1
base: main
Conversation
num_draft_tokens = scheduler_stats.spec_decoding_stats.num_draft_tokens
num_accepted_tokens = scheduler_stats.spec_decoding_stats.num_accepted_tokens
num_spec_proposal = num_draft_tokens / args.num_spec_tokens
mean_accepted_tokens = 1 + num_accepted_tokens / num_spec_proposal
num_spec_proposal is the number of times the SD call was made.

mean_accepted_tokens = (sum of generated tokens over num_spec_proposal calls) / num_spec_proposal
                     = (num_spec_proposal + sum of accepted tokens over num_spec_proposal calls) / num_spec_proposal
                     = 1 + num_accepted_tokens / num_spec_proposal
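The derivation above can be sketched as a small helper. This is a hypothetical illustration, not code from the PR; the function name is invented, and `num_spec_tokens` plays the role of `args.num_spec_tokens` (the draft length per SD call):

```python
def mean_accepted_length(num_draft_tokens: int,
                         num_accepted_tokens: int,
                         num_spec_tokens: int) -> float:
    """Average tokens generated per speculative-decoding call (AL).

    Each SD call proposes num_spec_tokens draft tokens, so the number of
    proposals is num_draft_tokens / num_spec_tokens. Every call yields one
    target-model token plus however many draft tokens were accepted.
    """
    num_spec_proposal = num_draft_tokens / num_spec_tokens
    return 1 + num_accepted_tokens / num_spec_proposal


# Example: 100 proposals of 4 draft tokens each (400 drafts total),
# with 250 drafts accepted -> AL = 1 + 250 / 100 = 3.5
print(mean_accepted_length(400, 250, 4))
```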
# spec_decoding_stats: Optional[SpecDecodingStats] = None
spec_decoding_stats = self.spec_decoding_stats
Cache the spec_decoding_stats so that it keeps a running metric instead of being reinitialized every engine step.
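The caching pattern this comment describes can be sketched as follows. These are simplified stand-in classes, not vLLM's actual `SpecDecodingStats` or scheduler implementation; only the field and attribute names follow the snippet:

```python
from dataclasses import dataclass


@dataclass
class SpecDecodingStats:
    num_draft_tokens: int = 0
    num_accepted_tokens: int = 0

    def observe(self, num_draft: int, num_accepted: int) -> None:
        # Accumulate rather than overwrite, so repeated observe() calls
        # across engine steps build a running total.
        self.num_draft_tokens += num_draft
        self.num_accepted_tokens += num_accepted


class Scheduler:
    def __init__(self) -> None:
        # Created once and reused across steps, instead of being
        # reinitialized at the start of every engine step.
        self.spec_decoding_stats = SpecDecodingStats()

    def step(self, num_draft: int, num_accepted: int) -> SpecDecodingStats:
        spec_decoding_stats = self.spec_decoding_stats  # reuse cached object
        spec_decoding_stats.observe(num_draft, num_accepted)
        return spec_decoding_stats
```

With the cached object, two steps accumulate into one running metric; with per-step reinitialization, each step would see only its own single observe() call.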
@@ -48,7 +48,8 @@ def main():
    args = parser.parse_args()

    model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
    eagle_dir = "abhigoyal/EAGLE-LLaMA3-Instruct-8B-vllm"
    # eagle_dir = "yuhuili/EAGLE-LLaMA3-Instruct-8B"
    eagle_dir = "lmsys/sglang-EAGLE-LLaMA3-Instruct-8B"
Using the sglang model so that the results are comparable with the previous SGL benchmark: https://docs.google.com/document/d/18ETJLsnxR88Qq3VDk5Mq-Hb7vuE9o3VNZ-hhz-OqAXk/edit?usp=sharing
Hmm, we discussed this on Slack shortly after you submitted this PR
a new SpecDecodingStats should only be created once per
Your response, for reference:
Hi @markmc - yup, we are good. I am still using this hacky PR whenever I want to quickly find the AL for my evals, since vllm-project#16367 is still not merged.
I was looking into SD metrics in V1 and found that spec_decoding_stats is reinitialized every time we do an engine step, yet we use an observe function that, from the name, is supposed to aggregate over multiple observe calls. Since it is reinitialized every step, there is always exactly one observe call and no aggregation. To enable AL computation for checking correctness, this PR aggregates the metrics across steps in EngineCoreOutputs.scheduler_stats.