inference optimizations #17

TheJDen wants to merge 1 commit into unconst:pipe from TheJDen:pipe

Conversation


TheJDen commented Oct 6, 2025

This is my submission based on the Discord message "To apply, submit a PR to this GitHub file which improves the tokens per seconds attainable on a H100 box with the model configurations from this script"

I tested this submission on a Vast AI PyTorch box with an H100 PCIe running Ubuntu 24.04 (noble). To configure the workspace, I used the following commands:
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install deepspeed-mii

To run the script, I used the following command:
deepspeed --num_gpus 1 deep.py

I attached the truncated output of the unmodified program in first.txt.

To optimize without trading off accuracy or script features, I restricted myself to reasonable assumptions and Pareto improvements.

I noticed the script only does inference but is configured for training, which adds substantial overhead from allocating the optimizer and sharding. Not only is the optimizer explicitly assigned a wildcard, but the script wraps generation in a no-grad context manager and gives no indication it is meant for RL or other training. So I swapped the training engine for DeepSpeed inference. I also specified bf16 instead of fp16, since bf16 is generally considered a Pareto improvement and the H100 supports it natively. This yielded a substantial speedup. I attached the truncated output of the program in second.txt.
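
Roughly, the swap looks like this (a sketch, not the exact diff; the TinyLlama checkpoint, prompt, and generate arguments are placeholders to keep the snippet self-contained, not what deep.py actually uses):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint just to make the sketch runnable;
# deep.py builds its own Llama configurations.
name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

# Inference engine instead of a training engine: no optimizer allocation,
# no ZeRO sharding, just forward passes.
engine = deepspeed.init_inference(
    model,
    dtype=torch.bfloat16,              # bf16 instead of fp16; native on H100
    replace_with_kernel_inject=False,  # see the note below
)

inputs = tokenizer("Hello", return_tensors="pt").to(engine.module.device)
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```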

I specified replace_with_kernel_inject=False because kernel injection would break the model: DeepSpeed doesn't support SwiGLU activations yet, and I didn't want to make architecture changes. I tried swapping the attention kernel for a quick win, but performance was actually a bit worse. On reflection, since we are using Llama as a base, Meta's engineers have already spent quite a bit of time on generally performant kernels, so I'm unlikely to find substantial algorithmic wins of this nature (which is also why I didn't write a custom kernel, etc.). If I wanted to improve on their kernels on the specified hardware, it would make more sense to profile the script with the PyTorch profiler and attack the largest bottlenecks. But I only have like $3 of Vast AI credits lol.
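
For reference, this is roughly the profiling pass I had in mind (a sketch; `engine` and `inputs` are the placeholder objects from the snippet above, not the real ones in deep.py):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one generation call on both CPU and CUDA.
with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    engine.module.generate(**inputs, max_new_tokens=32)

# Largest CUDA-time consumers first; these are the kernels worth attacking.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```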

However, Meta's engineers didn't have the specification "improve the tokens per second attainable on a H100 box". So I compiled and auto-tuned the model with torch.compile in max-autotune mode, so that block sizes and thread/warp mappings are tuned for my specific GPU. I attached the truncated output of the program in third.txt.
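
The change amounts to something like this (again a sketch; compiling `forward` is just one way to make `generate` pick up the compiled graph, and `engine`/`inputs` are still the placeholders from above):

```python
import torch

# Compile the model's forward so Inductor/Triton autotunes block sizes and
# thread/warp mappings for this specific GPU. Assigning to .forward means
# generate() uses the compiled version on every decode step.
engine.module.forward = torch.compile(engine.module.forward, mode="max-autotune")

with torch.no_grad():
    # The first call pays the (slow) autotuning cost; later calls reuse the
    # cached, tuned kernels, which is where the tokens/s win comes from.
    engine.module.generate(**inputs, max_new_tokens=32)
```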

Overall, this yielded a ~10x improvement on the smallest model and a ~5x improvement on the largest model. And now I'm out of Vast AI credits.

