[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel #7766

Merged


dsikka
Contributor

@dsikka dsikka commented Aug 22, 2024

Summary

  • Expands weight loading to support grouped and per-channel weight quantization. Cleans up fp8 MoE to use the updated weight loading
  • Adds Marlin Fused MoE Kernel for w4a16 by @ElizaWszola
  • Adds CompressedTensorsMoEMethod to support MoE w4a16 models from llm-compressor and compressed-tensors
  • Tested with Mixtral at tensor-parallel (TP) sizes 2 and 4
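
To illustrate the difference between per-channel and grouped weight quantization mentioned in the summary, here is a minimal pure-Python sketch. It assumes symmetric int4 quantization of a single output channel; the function names `quantize_column` and `dequantize_column` are hypothetical and are not taken from vLLM, llm-compressor, or compressed-tensors. Per-channel quantization keeps one scale for the whole channel, while grouped quantization keeps one scale per `group_size` consecutive weights.

```python
def quantize_column(weights, group_size=None):
    """Symmetric int4 quantization of one output channel (illustrative only).

    group_size=None means per-channel: one scale for the whole column.
    Otherwise one scale is kept per group of `group_size` weights.
    """
    n = len(weights)
    g = n if group_size is None else group_size
    if n % g != 0:
        raise ValueError("column length must be a multiple of group_size")
    quantized, scales = [], []
    for start in range(0, n, g):
        group = weights[start:start + g]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to the int4 value 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)
    return quantized, scales


def dequantize_column(quantized, scales):
    """Recover approximate weights from int4 values and their scales."""
    g = len(quantized) // len(scales)
    return [q * scales[i // g] for i, q in enumerate(quantized)]


weights = [0.7, -0.1, 7.0, -3.0]
q_channel, s_channel = quantize_column(weights)               # one scale
q_group, s_group = quantize_column(weights, group_size=2)     # two scales
```

In this toy column the small weights in the first group lose most of their precision under the single per-channel scale, but keep it when they get their own group scale. The cost is that a loader has to handle a scale tensor with one entry per group rather than one per channel.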

Next Steps:

  • To keep this PR focused on the kernel and the updated weight loading, CompressedTensorsMoEMethod does not yet use the existing compressed-tensors scheme structure. A follow-up will update it to use the scheme structure

co-authored by @ElizaWszola, from Neural Magic


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@dsikka dsikka marked this pull request as ready for review August 22, 2024 21:29
@dsikka
Contributor Author

dsikka commented Aug 23, 2024

/ready

@github-actions github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Aug 23, 2024
@halexan

halexan commented Aug 27, 2024

In W4A16, is the "4" int4 or fp4?

@dsikka
Contributor Author

dsikka commented Aug 27, 2024

> In W4A16, is the "4" int4 or fp4?

Int4
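
As a rough illustration of what 4-bit integer weights mean for storage, the sketch below packs eight signed int4 values into one 32-bit word and unpacks them again. The function names, the offset-by-8 encoding, and the low-nibble-first order are assumptions made for this example. The Marlin kernel uses its own weight layout, which this sketch does not reproduce.

```python
def pack_int4(values):
    """Pack signed int4 values (-8..7) into 32-bit words, eight per word.

    Illustrative layout only: each value is stored as value + 8 in one
    nibble, lowest nibble first.
    """
    if len(values) % 8 != 0:
        raise ValueError("need a multiple of 8 values")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[start:start + 8]):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in int4")
            word |= (v + 8) << (4 * j)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the signed int4 values."""
    values = []
    for word in words:
        for j in range(8):
            values.append(((word >> (4 * j)) & 0xF) - 8)
    return values


original = [-8, 7, 0, 1, -1, 2, 3, -4]
packed = pack_int4(original)
assert unpack_int4(packed) == original
```

Eight weights fit in four bytes here, against sixteen bytes for the same weights in a 16-bit format. That saving is the motivation for keeping weights in 4 bits while activations stay in 16 bits.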

dsikka and others added 22 commits August 27, 2024 12:55
FILL IN THE PR DESCRIPTION HERE

FIX #xxxx (*link existing issues this PR will resolve*)

**BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE
DESCRIPTION ABOVE**

---

<details>
<!-- inside this <details> section, markdown rendering does not work, so
we use raw html here. -->
<summary><b> PR Checklist (Click to Expand) </b></summary>

<p>Thank you for your contribution to vLLM! Before submitting the pull
request, please ensure the PR meets the following criteria. This helps
vLLM maintain the code quality and improve the efficiency of the review
process.</p>

<h3>PR Title and Classification</h3>
<p>Only specific types of PRs will be reviewed. The PR title is prefixed
appropriately to indicate the type of change. Please use one of the
following:</p>
<ul>
    <li><code>[Bugfix]</code> for bug fixes.</li>
<li><code>[CI/Build]</code> for build or continuous integration
improvements.</li>
<li><code>[Doc]</code> for documentation fixes and improvements.</li>
<li><code>[Model]</code> for adding a new model or improving an existing
model. Model name should appear in the title.</li>
<li><code>[Frontend]</code> for changes to the vLLM frontend (e.g.,
OpenAI API server, <code>LLM</code> class, etc.).</li>
<li><code>[Kernel]</code> for changes affecting CUDA kernels or other
compute kernels.</li>
<li><code>[Core]</code> for changes in the core vLLM logic (e.g.,
<code>LLMEngine</code>, <code>AsyncLLMEngine</code>,
<code>Scheduler</code>, etc.)</li>
<li><code>[Hardware][Vendor]</code> for hardware-specific changes.
Vendor name should appear in the prefix (e.g.,
<code>[Hardware][AMD]</code>).</li>
<li><code>[Misc]</code> for PRs that do not fit the above categories.
Please use this sparingly.</li>
</ul>
<p><strong>Note:</strong> If the PR spans more than one category, please
include all relevant prefixes.</p>

<h3>Code Quality</h3>

<p>The PR needs to meet the following code quality standards:</p>

<ul>
<li>We adhere to <a
href="https://google.github.io/styleguide/pyguide.html">Google Python
style guide</a> and <a
href="https://google.github.io/styleguide/cppguide.html">Google C++
style guide</a>.</li>
<li>Pass all linter checks. Please use <a
href="https://github.com/vllm-project/vllm/blob/main/format.sh"><code>format.sh</code></a>
to format your code.</li>
<li>The code needs to be well-documented to ensure future contributors
can easily understand the code.</li>
<li>Include sufficient tests to ensure the project stays correct and
robust. This includes both unit tests and integration tests.</li>
<li>Please add documentation to <code>docs/source/</code> if the PR
modifies the user-facing behaviors of vLLM. It helps vLLM users
understand and utilize the new features or changes.</li>
</ul>

<h3>Notes for Large Changes</h3>
<p>Please keep the changes as concise as possible. For major
architectural changes (>500 LOC excluding kernel/data/config/test), we
would expect a GitHub issue (RFC) discussing the technical design and
justification. Otherwise, we will tag it with <code>rfc-required</code>
and might not review the PR.</p>

<h3>What to Expect for the Reviews</h3>

<p>The goal of the vLLM team is to be a <i>transparent reviewing
machine</i>. We would like to make the review process transparent and
efficient and make sure no contributor feels confused or frustrated.
However, the vLLM team is small, so we need to prioritize some PRs over
others. Here is what you can expect from the review process: </p>

<ul>
<li> After the PR is submitted, the PR will be assigned to a reviewer.
Every reviewer will pick up the PRs based on their expertise and
availability.</li>
<li> After the PR is assigned, the reviewer will provide a status update
every 2-3 days. If the PR is not reviewed within 7 days, please feel
free to ping the reviewer or the vLLM team.</li>
<li> After the review, the reviewer will put an <code>
action-required</code> label on the PR if there are changes required.
The contributor should address the comments and ping the reviewer to
re-review the PR.</li>
<li> Please respond to all comments within a reasonable time frame. If a
comment isn't clear or you disagree with a suggestion, feel free to ask
for clarification or discuss the suggestion.
 </li>
</ul>

<h3>Thank You</h3>

<p> Finally, thank you for taking the time to read these guidelines and
for your interest in contributing to vLLM. Your contributions make vLLM
a great tool for everyone! </p>

</details>
@dsikka dsikka force-pushed the compressed-tensors-moe-updated branch from e59a1bd to eb72c6a August 27, 2024 13:42
Comment on lines +16 to 19
compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-quantized, main
compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-channel-quantized, main
awq, casperhansen/mixtral-instruct-awq, main
awq_marlin, casperhansen/mixtral-instruct-awq, main
Member

Future work: these Mixtral models seem quite large for this test; maybe we should have separate small and large tests
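
The lines under review appear to be comma-separated triples. A minimal sketch of reading such a list follows, assuming the three fields are quantization method, model name, and revision; that interpretation, the `WeightLoadingCase` name, and the `parse_cases` helper are assumptions made for this example, not code from the PR.

```python
from typing import NamedTuple


class WeightLoadingCase(NamedTuple):
    quantization: str  # assumed meaning of the first field
    model: str         # assumed meaning of the second field
    revision: str      # assumed meaning of the third field


def parse_cases(text):
    """Parse 'quantization, model, revision' lines, skipping blanks and comments."""
    cases = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            raise ValueError(f"expected 3 comma-separated fields: {line!r}")
        cases.append(WeightLoadingCase(*parts))
    return cases


LINES = """
compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-quantized, main
compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-channel-quantized, main
awq, casperhansen/mixtral-instruct-awq, main
awq_marlin, casperhansen/mixtral-instruct-awq, main
"""
cases = parse_cases(LINES)
```

Splitting a list like this into small and large groups, as the comment suggests, would then be a matter of filtering `cases` by model name or keeping two separate files.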

@mgoin mgoin changed the title from "[Kerne] Expand MoE weight loading + Add Fused Marlin MoE Kernel" to "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel" Aug 27, 2024
@WoosukKwon WoosukKwon merged commit fc91188 into vllm-project:main Aug 27, 2024
60 of 64 checks passed
@Maximilianxu

@dsikka @mgoin Hi, is there a unit test for this kernel? I noticed that its API differs from the Triton-based FusedMoE.

kushanam pushed a commit to kushanam/vllm that referenced this pull request Aug 28, 2024
kushanam pushed a commit to kushanam/vllm that referenced this pull request Aug 28, 2024
@B-201
Contributor

B-201 commented Aug 29, 2024

@dsikka Hi! Thanks for your work. Do you have plans to support gptq models in the future?

@dsikka
Contributor Author

dsikka commented Aug 29, 2024

> @dsikka Hi! Thanks for your work. Do you have plans to support gptq models in the future?

Yup, this is in scope to be worked on.

@binxuan

binxuan commented Aug 29, 2024

Hi, thanks for the work! I am wondering whether this MoE kernel works on A100 GPUs.

@mgoin
Member

mgoin commented Aug 29, 2024

@binxuan Yes, it is supported on A100 (SM 8.0 and up).
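
A minimal sketch of the "SM 8.0 and up" check described above. The helper name `meets_sm80` is hypothetical and is not a vLLM API. It takes a `(major, minor)` compute capability tuple, which is the form PyTorch's `torch.cuda.get_device_capability()` returns.

```python
def meets_sm80(capability):
    """Return True if a (major, minor) compute capability is SM 8.0 or newer.

    Hypothetical helper for illustration; tuple comparison handles the
    major/minor ordering.
    """
    return tuple(capability) >= (8, 0)


# Example capabilities: A100 is (8, 0), L40 is (8, 9), H20 is (9, 0).
for name, cap in [("A100", (8, 0)), ("L40", (8, 9)), ("H20", (9, 0)), ("T4", (7, 5))]:
    print(name, meets_sm80(cap))
```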

@binxuan

binxuan commented Aug 29, 2024

Thanks for confirming. I tried the mainline code, but got the following error. I think the device name returned from get_device_name was somehow bytes instead of str.

device_name = current_platform.get_device_name().replace(" ", "_")
TypeError: a bytes-like object is required, not 'str'

After fixing this issue, I got a second error from Triton:

AssertionError: fp8e4nv data type is not supported on CUDA arch < 89
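
The first error is what Python raises when `.replace(" ", "_")` is called on a `bytes` object with `str` arguments. A minimal sketch of a defensive workaround follows, assuming the device name can arrive as either `bytes` or `str`. The helper names are hypothetical and this is not the fix applied in vLLM. The second helper only restates the threshold from the Triton assertion message (compute capability 8.9).

```python
def normalize_device_name(name):
    """Return the device name as str with spaces replaced by underscores.

    Hypothetical helper: decodes bytes first, since bytes.replace() with
    str arguments raises "a bytes-like object is required, not 'str'".
    """
    if isinstance(name, bytes):
        name = name.decode("utf-8")
    return name.replace(" ", "_")


def fp8e4nv_supported(capability):
    """Threshold taken from the assertion text: CUDA arch must be >= 89."""
    return tuple(capability) >= (8, 9)


# Reproduce the reported TypeError with a bytes device name.
try:
    b"NVIDIA A100".replace(" ", "_")
except TypeError as exc:
    message = str(exc)
```

By that threshold an A100 (SM 8.0) is below the level at which Triton accepts the `fp8e4nv` data type, which matches the second error reported above.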

@fengyang95

fengyang95 commented Aug 30, 2024

If I want to use it with deepseek-v2, I saw that it uses fused_moe by default. Do I need to swap it out to get it running?

@dsikka
Copy link
Contributor Author

dsikka commented Aug 30, 2024

> If I want to use it with deepseek-v2, I saw that it uses fused_moe by default. Do I need to swap it out to get it running?

Hi @fengyang95 - are you trying to run a W4A16 deepseek-v2 model?

@fengyang95

fengyang95 commented Aug 31, 2024

> > If I want to use it with deepseek-v2, I saw that it uses fused_moe by default. Do I need to swap it out to get it running?
>
> Hi @fengyang95 - are you trying to run a W4A16 deepseek-v2 model?

@dsikka Yes. I am using the latest code, which appears to use marlin_moe, but I encountered the following error:

  final_hidden_states = self.quant_method.apply(
  File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 275, in apply
    return fused_marlin_moe(x,
  File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 519, in fused_marlin_moe
    sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
  File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
    ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
  File "/usr/local/lib/python3.9/dist-packages/vllm/_custom_ops.py", line 29, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/vllm/_custom_ops.py", line 538, in moe_align_block_size
    torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
  File "/usr/local/lib/python3.9/dist-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
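
For readers unfamiliar with the step that fails in this traceback, here is a rough pure-Python sketch of aligning tokens to a block size. It rests on an assumption about what such a step does: group the flattened token indices by expert, then pad each expert's group to a multiple of the block size. The function name, return values, and padding value are chosen for illustration and do not reproduce the CUDA op in the traceback.

```python
def align_block_size(topk_ids, block_size, num_experts):
    """Group flattened token slots by expert and pad each group to block_size.

    Illustrative only. Returns (sorted_ids, expert_ids, padded_total):
      sorted_ids   - slot indices grouped by expert, padded with pad_id
      expert_ids   - the expert that owns each block of block_size slots
      padded_total - total number of slots after padding
    """
    flat = [expert for row in topk_ids for expert in row]
    pad_id = len(flat)  # an index that matches no real slot
    buckets = [[] for _ in range(num_experts)]
    for slot, expert in enumerate(flat):
        if not 0 <= expert < num_experts:
            raise ValueError(f"expert id {expert} out of range")
        buckets[expert].append(slot)
    sorted_ids, expert_ids = [], []
    for expert, slots in enumerate(buckets):
        padded_len = -(-len(slots) // block_size) * block_size  # ceiling
        sorted_ids.extend(slots + [pad_id] * (padded_len - len(slots)))
        expert_ids.extend([expert] * (padded_len // block_size))
    return sorted_ids, expert_ids, len(sorted_ids)


# Three tokens, each routed to two of three experts.
result = align_block_size([[0, 1], [1, 2], [0, 1]], block_size=2, num_experts=3)
```

The sketch raises a clear error for an out-of-range expert id. Whether something similar explains the `invalid argument` failure reported here is not established in this thread.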

@fengyang95

It starts normally on H20 (sm_90); is the failure because the kernel is currently not compatible with L40 (sm_89)? @dsikka

triple-Mu pushed a commit to triple-Mu/vllm_official that referenced this pull request Sep 4, 2024
Jeffwan pushed a commit to aibrix/vllm that referenced this pull request Sep 19, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
…m-project#7766)

Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Labels
ready ONLY add when PR is ready to merge/full CI is needed
9 participants