This is my submission based on the Discord message "To apply, submit a PR to this GitHub file which improves the tokens per seconds attainable on a H100 box with the model configurations from this script"
I tested this submission on a Vast AI PyTorch box with a H100 PCIe running Ubuntu 24.04 (noble). To configure the workspace, I used the following commands:
```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install deepspeed-mii
```

To run the script, I used the following command:

```bash
deepspeed --num_gpus 1 deep.py
```

I attached the truncated output of the program with no changes in first.txt.
To optimize without trading off accuracy or script features, I tried to stick to reasonable assumptions and Pareto improvements only.
I noticed the script only does inference but is configured for training, which causes substantial overhead from allocating the optimizer and sharding. Not only is the optimizer explicitly assigned (to a wildcard), but the script uses a no-grad context manager and gives no indication of intent to be used for RL, etc. So I swapped it for DeepSpeed inference. I specified bf16 instead of fp16, as it is generally considered a Pareto improvement and the H100 supports it natively. This yielded a substantial speedup. I attached the truncated output of the program in second.txt.
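For reference, here is a minimal sketch of the kind of swap described above (not the exact diff from deep.py); the checkpoint id and loading code are placeholders:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; deep.py builds its own model configurations.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)

# Inference-only engine: no optimizer allocation and no ZeRO sharding,
# unlike the training-oriented deepspeed.initialize(...) path.
engine = deepspeed.init_inference(
    model,
    dtype=torch.bfloat16,              # bf16 instead of fp16; native on H100
    replace_with_kernel_inject=False,  # see the note below
)
```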
I specified `replace_with_kernel_inject=False` because DeepSpeed would otherwise break; it doesn't support SwiGLU activations yet, and I didn't want to make architectural changes. I tried swapping the attention kernel for a quick win, but the performance was actually a bit worse. On further thought, given that we are using Llama as a base, I can infer Meta engineers spent quite a bit of time making generally performant kernels, and I probably won't make substantial algorithmic wins of this nature (this is also why I didn't write a custom kernel, etc.). If I wanted to improve on their kernels on the specified hardware, it would probably make more sense to profile with the PyTorch profiler and tackle the largest bottlenecks. But I only have like $3 of Vast AI credits, lol.

However, Meta engineers didn't have the specification "improve the tokens per second attainable on a H100 box". So I compiled and auto-tuned the model so it would have block sizes and thread/warp mappings tuned for my specific GPU, using torch.compile with max-autotune mode. I attached the truncated output of the program in third.txt.
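As a rough illustration of the compile step (again a sketch, not the exact code from deep.py; the forward-compile pattern and generation arguments are my own assumptions, and `engine` refers to the object from the previous sketch):

```python
import torch
from transformers import AutoTokenizer

# Placeholder tokenizer matching the placeholder checkpoint above.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Compile the underlying module's forward so generation runs through the
# tuned kernels; max-autotune benchmarks block sizes and warp configurations
# on the resident GPU at the first call.
model = engine.module
model.forward = torch.compile(model.forward, mode="max-autotune")

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```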
Overall, this yielded a ~10x improvement on the smallest model and a ~5x improvement on the largest model. And I'm out of Vast AI credits.