[Bugfix] Update profile example to new add request interface + fix profiler not picking up kernels within cudagraphs #332
Conversation
Hey @LucasWilkinson,
I see 4 GEMMs under one aggregate entry. I am okay with landing this as-is and debugging this further.

In eager mode we can uniquely identify GEMMs by using the calling module, but under CUDA graphs they will be lumped into one aggregate line. You can see this by the number of invocations: for some of the GEMMs they are doubled. Unfortunately I have not figured out a way to correlate kernel calls from within a cudagraph back to a specific module, so it's hard to separate these GEMMs out by layer when using cudagraphs.

This is OK for what we need to do right now, fortunately.
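The lumping described above can be sketched with a toy aggregation (the scope and kernel names below are hypothetical placeholders, not actual profiler output): in eager mode each kernel launch carries its calling module, while under CUDA graph replay every kernel attributes to the single replay call site.

```python
from collections import Counter

def aggregate(events):
    """Count kernel invocations per attributed (scope, kernel) pair."""
    counts = Counter()
    for scope, kernel in events:
        counts[(scope, kernel)] += 1
    return counts

# Eager mode: the calling module is known for every launch,
# so per-layer GEMMs stay distinguishable.
eager_events = [
    ("model.layers.0.mlp", "gemm"),
    ("model.layers.1.mlp", "gemm"),
]

# CUDA graph mode: the same kernels replay from one call site,
# so they collapse into one aggregate entry with a higher count.
graph_events = [
    ("CUDAGraphRunner.replay", "gemm"),
    ("CUDAGraphRunner.replay", "gemm"),
]

print(aggregate(eager_events))  # two distinct per-layer entries
print(aggregate(graph_events))  # one lumped entry, invocation count 2
```

This is why the invocation counts double up in the cudagraph profile: distinct per-layer launches all land in the same bucket.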
examples/offline_profile.py
Outdated
128,  # 128 to skip over special tokens
llm.llm_engine.model_config.get_vocab_size() // 2,
Is there a standard, less hacky way to do this?
Not that I know of, but it looks like there is an `ignore_eos` option in `SamplingParams` which can be set so we don't have to worry about special tokens (basically we were getting premature EOSes ending the profiling, and restricting the vocab to exclude special tokens fixed it, but `ignore_eos` is a more robust way of solving this).
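The premature-EOS problem can be illustrated with a toy decode loop (pure-Python stand-in, not the vLLM sampler; names are hypothetical): without an `ignore_eos`-style flag, an early EOS truncates the run and the profile covers fewer decode steps than intended.

```python
EOS = 2  # placeholder EOS token id

def decode_steps(sampled_tokens, max_tokens, ignore_eos=False):
    """Count how many decode steps actually run before stopping."""
    steps = 0
    for tok in sampled_tokens[:max_tokens]:
        steps += 1
        if tok == EOS and not ignore_eos:
            break  # premature stop truncates the profile
    return steps

tokens = [5, 9, EOS, 7, 3, 8]
print(decode_steps(tokens, max_tokens=6))                   # stops early: 3
print(decode_steps(tokens, max_tokens=6, ignore_eos=True))  # full budget: 6
```

In vLLM this corresponds to passing `ignore_eos=True` in `SamplingParams`, which keeps decoding for the full token budget regardless of what gets sampled.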
This PR updates `examples/offline_profile.py` to use the new `add_requests` interface. In addition, this PR fixes the profiler not picking up kernels run from within a cudagraph (i.e. when profiling with `--allow-cuda-graphs`). The main fix is in `model_runner.py`: converting `CUDAGraphRunner` to an `nn.Module`, which allows the profiler to pick it up.

Before the PR:

After the PR:
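The idea behind the `nn.Module` conversion can be sketched with pure-Python stand-ins (not `torch.nn`; class names here are illustrative): a module-walking profiler only descends into objects it recognizes as modules, so a graph runner implemented as a plain callable is invisible to it, while a `Module` subclass is picked up.

```python
class Module:
    """Minimal stand-in for a module base class."""
    def children(self):
        # Child modules are attributes that are themselves Modules.
        return [v for v in vars(self).values() if isinstance(v, Module)]

def visible_modules(root):
    """Collect every Module reachable from root, as a profiler hook would."""
    found = [type(root).__name__]
    for child in root.children():
        found.extend(visible_modules(child))
    return found

class PlainGraphRunner:             # not a Module: the walk skips it
    def __call__(self):
        pass

class GraphRunnerModule(Module):    # after the fix: a Module subclass
    pass

class ModelRunner(Module):
    def __init__(self, runner):
        self.runner = runner

print(visible_modules(ModelRunner(PlainGraphRunner())))   # runner missing
print(visible_modules(ModelRunner(GraphRunnerModule())))  # runner visible
```

Subclassing the module base is a small change, but it is what lets module-scoped profiling hooks attribute the cudagraph replay (and the kernels inside it) to a named component instead of dropping them entirely.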