[Bugfix] Fix marlin kernel crash on H100 #4218

Merged on Apr 24, 2024 (1 commit)

alexm-neuralmagic
Collaborator

This PR fixes the Marlin kernel crash on H100 that was reported in neuralmagic#187.
The crash was caused by the inline PTX assembly that performed async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy, without the fractional L2 policy for "evict_first". There is no performance difference between the standard async_copy PTX and the previous version.
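
For readers unfamiliar with the instruction involved, here is a minimal sketch of what a "standard" async copy looks like in inline PTX. It is not the diff from this PR: the names `cp_async16`, `cp_async_wait_all`, and `copy_through_smem` are hypothetical, and the tiny kernel exists only to exercise the copy. The removed variant presumably also created a cache policy (PTX `createpolicy.fractional.L2::evict_first`) and passed it to `cp.async` through the `.L2::cache_hint` qualifier; the sketch simply leaves that operand out. It assumes a GPU with compute capability 8.0 or newer (for example, compiled with `nvcc -arch=sm_80`).

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: plain 16-byte async copy from global to shared memory,
// with no L2 cache-policy operand. Requires compute capability 8.0+.
__device__ inline void cp_async16(void* smem_ptr, const void* glob_ptr) {
  const int BYTES = 16;
  // cp.async takes a 32-bit shared-memory address, not a generic pointer.
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile("cp.async.cg.shared.global [%0], [%1], %2;\n"
               :
               : "r"(smem), "l"(glob_ptr), "n"(BYTES)
               : "memory");
}

// Hypothetical helper: commit the outstanding copies as one group and wait
// until every committed group has completed.
__device__ inline void cp_async_wait_all() {
  asm volatile("cp.async.commit_group;\n" ::: "memory");
  asm volatile("cp.async.wait_group 0;\n" ::: "memory");
}

constexpr int kThreads = 32;

// Each thread stages one 16-byte element through shared memory.
__global__ void copy_through_smem(const int4* in, int4* out) {
  __shared__ int4 tile[kThreads];
  cp_async16(&tile[threadIdx.x], &in[threadIdx.x]);
  cp_async_wait_all();
  __syncthreads();
  out[threadIdx.x] = tile[threadIdx.x];
}

int main() {
  int4 host_in[kThreads], host_out[kThreads];
  for (int i = 0; i < kThreads; ++i) {
    host_in[i] = make_int4(i, 2 * i, 3 * i, 4 * i);
    host_out[i] = make_int4(-1, -1, -1, -1);
  }

  int4* dev_in = nullptr;
  int4* dev_out = nullptr;
  cudaMalloc(&dev_in, sizeof(host_in));
  cudaMalloc(&dev_out, sizeof(host_out));
  cudaMemcpy(dev_in, host_in, sizeof(host_in), cudaMemcpyHostToDevice);

  copy_through_smem<<<1, kThreads>>>(dev_in, dev_out);
  cudaError_t err = cudaDeviceSynchronize();
  cudaMemcpy(host_out, dev_out, sizeof(host_out), cudaMemcpyDeviceToHost);

  // Verify that every element made the round trip unchanged.
  bool ok = (err == cudaSuccess);
  for (int i = 0; i < kThreads && ok; ++i) {
    ok = host_out[i].x == host_in[i].x && host_out[i].y == host_in[i].y &&
         host_out[i].z == host_in[i].z && host_out[i].w == host_in[i].w;
  }
  printf("%s\n", ok ? "copy verified" : "copy failed");

  cudaFree(dev_in);
  cudaFree(dev_out);
  return ok ? 0 : 1;
}
```

The `.cg` qualifier caches the data in L2 only, bypassing L1, which is the usual choice for 16-byte copies of data that each thread block reads once.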

mgoin (Collaborator) left a comment:

Is there any way to keep the cache hint? Seems pretty useful but if you measured no difference then it might be alright

alexm-neuralmagic (Collaborator, Author) commented:

I tried various modifications to the PTX to keep the cache hint, but none of them worked.

pcmoritz (Collaborator) left a comment:

Thanks for fixing this, I validated the fix with the reproduction in neuralmagic#187. Always great to see fixes that make things simpler ❤️

@pcmoritz pcmoritz merged commit aae0824 into vllm-project:main Apr 24, 2024
47 checks passed
mgoin added a commit to neuralmagic/nm-vllm that referenced this pull request May 16, 2024
Ported from dense marlin:
vllm-project#4218
alexanderwh commented:

Is there any data showing that the L2 cache hint makes no difference in performance? If it makes none, why was this method introduced in the original Marlin paper?

alexanderwh commented:

> Is there any way to keep the cache hint? Seems pretty useful but if you measured no difference then it might be alright

Hello, is there any data showing that the L2 cache hint makes no difference in performance? I ran bench.py from the Marlin repo on an A100, and it shows that the L2 cache hint improves performance on models other than Llama 7B. That said, bench.py only runs the matmul operations as a simulation, not a full inference procedure.
