Support Mixtral quantization using HQT #67
Conversation
…use of matmul class
Wrapped the habana static_fused_moe function using a class.

    final_hidden_states += current_hidden_states_static
    return final_hidden_states.view(-1, D)

class MoeMatmul(nn.Module):
Better to call it MoeLinear, as it acts more like a linear layer than a matmul.
or just use Linear without bias
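For context, here is a minimal sketch of the kind of wrapper being discussed, assuming the per-expert weight is passed in at call time. The class name, forward signature, and tensor shapes below are illustrative, not necessarily the exact code in this PR.

```python
import torch
import torch.nn as nn


class MoeMatmul(nn.Module):
    """Wraps the per-expert matmul in an nn.Module so that module-patching
    quantization flows (e.g. HQT/FP8) can intercept it the same way they
    intercept Linear layers."""

    def forward(self, state: torch.Tensor, expert_w: torch.Tensor) -> torch.Tensor:
        # Behaves like a bias-free linear layer applied with the weight of
        # the currently selected expert: state @ expert_w.T
        return torch.matmul(state, expert_w.transpose(0, 1))


if __name__ == "__main__":
    # Shapes are assumptions for the sketch: (tokens, hidden) and (ffn, hidden).
    state = torch.randn(4, 16)
    expert_w = torch.randn(32, 16)
    out = MoeMatmul()(state, expert_w)
    assert out.shape == (4, 32)
```

The alternative suggested above, a bias-free nn.Linear whose weight is assigned from the selected expert, computes the same x @ W.T while reusing a stock module type that quantization backends already recognize.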
Conflicts: vllm/hpu/ops.py
Initial FP8 support
Force-pushed from 5d9f4de to e814a4a
It causes OOM on 70b
Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>
Force-pushed from e814a4a to f4f3437
Force-pushed from f4f3437 to 87d95ad
remove expert_max hard code (#47)
vLLM-Ext: Full enabling of ALiBi (#34)
Add version inference via setuptools-scm (#58)
Revert "vLLM-Ext: Full enabling of ALiBi (#34)" (#59)
Remove punica_hpu.py from vllm_hpu_extension (#66)
Removed previous (not-pipelined) pa implementation (#72)
Add flag to enable running softmax in fp32 (#71)
Update calibration readme link (#73)
allow lm_head quantization in calibration process (#65)
Pad to bmin if value is less (#67)
Update pyproject.toml (#75)
---------
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
* fix server crash when the client uses random seed sampling
* fix lint
No description provided.