
[FIX] Fix kernel bug #1959

Merged: 11 commits into vllm-project:main on Jan 3, 2024

Conversation

jeejeelee
Contributor

While evaluating the RMSNorm operator, I observed that executing it on a non-zero GPU (any device other than cuda:0) raised 'RuntimeError: CUDA error: an illegal memory access was encountered.'
Upon further investigation through debugging, I found the cause to be the absence of device guards; most of the CUDA kernels have the same issue.
I have addressed this by adding device guards to all kernels. Additionally, I have extended the kernel tests to cover the device id, as in the updated test_activation.
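
For context, the device-guard pattern looks roughly like the sketch below. The op and kernel names (scale, scale_kernel) are hypothetical stand-ins rather than code from this PR; the relevant part is at::cuda::OptionalCUDAGuard from c10/cuda/CUDAGuard.h, which makes the tensors' device the current device for the duration of the launch:

#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical elementwise kernel standing in for the real vLLM kernels.
__global__ void scale_kernel(float* out, const float* in, float s, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * s;
}

void scale(torch::Tensor& out, const torch::Tensor& input, float s) {
  const int n = static_cast<int>(input.numel());
  // The fix: pin the current CUDA device to the tensors' device for the
  // lifetime of this scope. Without the guard, the launch goes to whatever
  // device happens to be current (typically 0), and dereferencing pointers
  // that live on another GPU triggers the illegal memory access above.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(
      out.data_ptr<float>(), input.data_ptr<float>(), s, n);
}

The guard restores the previous current device when it goes out of scope, so wrapping every launch this way is also safe on single-GPU setups.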

@jeejeelee
Contributor Author

jeejeelee commented Dec 7, 2023

I have completed all the kernel tests on two A800 GPUs, and all kernels pass. Could you review this PR? @WoosukKwon

@WoosukKwon self-requested a review on Dec 7, 2023, 16:56
@WoosukKwon self-assigned this on Dec 7, 2023
@jeejeelee
Contributor Author

@WoosukKwon Are there any issues with this PR?

@WoosukKwon
Collaborator

Hi @jeejeelee, thanks for submitting the PR. We had not noticed this bug because the device id is always 0 in vLLM. However, I agree that this change makes the kernels more portable.

csrc/activation_kernels.cu (resolved)
Comment on lines 1 to 4
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include "dispatch_utils.h"
Collaborator

style nit:

Suggested change
- #include <torch/extension.h>
- #include <ATen/cuda/CUDAContext.h>
- #include <torch/extension.h>
- #include <c10/cuda/CUDAGuard.h>
- #include "dispatch_utils.h"
+ #include <torch/extension.h>
+ #include <ATen/cuda/CUDAContext.h>
+ #include <c10/cuda/CUDAGuard.h>
+ #include "dispatch_utils.h"

tests/kernels/conftest.py (outdated; resolved)
tests/kernels/test_activation.py (outdated; resolved)
@jeejeelee
Contributor Author

jeejeelee commented Dec 28, 2023

@WoosukKwon Thank you for your review. I have made the following changes:

  1. Reverted the C++ include formatting.
  2. Tested the kernels on both device 0 and device 1 (a standalone illustration follows below).
  3. Added the requested commas in the Python code.

Please review again.
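
The tests themselves are pytest-based (tests/kernels/conftest.py and tests/kernels/test_activation.py, parameterized over device ids), but the cross-device check they add boils down to something like this standalone sketch; the add_one kernel is a hypothetical stand-in:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel; any of the guarded vLLM kernels plays this role.
__global__ void add_one(float* x, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;
}

int main() {
  const int n = 1024;
  int count = 0;
  cudaGetDeviceCount(&count);
  // Run the same check on every visible GPU, not just device 0 --
  // running only on device 0 is what hid the missing device guards.
  for (int dev = 0; dev < count; ++dev) {
    cudaSetDevice(dev);
    float* x = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    add_one<<<(n + 255) / 256, 256>>>(x, n);
    // cudaDeviceSynchronize surfaces any illegal memory access as an error.
    const cudaError_t err = cudaDeviceSynchronize();
    printf("device %d: %s\n", dev, cudaGetErrorString(err));
    cudaFree(x);
  }
  return 0;
}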

@WoosukKwon
Collaborator

@jeejeelee LGTM! Thanks for the PR and apologies for the delayed review. We got many PRs last month and didn't have enough bandwidth due to the holidays. 😅

@WoosukKwon merged commit 77af974 into vllm-project:main on Jan 3, 2024
2 checks passed
jedibrillo pushed a commit to jedibrillo/vllm that referenced this pull request Jan 5, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024