Support Mixtral quantization using HQT #67
Conversation
…use of matmul class
Wrapped the habana static_fused_moe function using a class.

    final_hidden_states += current_hidden_states_static
    return final_hidden_states.view(-1, D)

class MoeMatmul(nn.Module):
Better to call it MoeLinear, as it acts more like a linear layer than a matmul.
or just use Linear without bias
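For context, here is a minimal sketch of the kind of wrapper being discussed, assuming the per-expert weight is passed in at call time. The class name, forward signature, and tensor shapes below are illustrative, not necessarily the exact code in this PR.

```python
import torch
import torch.nn as nn


class MoeMatmul(nn.Module):
    """Wraps the per-expert matmul in an nn.Module so that module-patching
    quantization flows (e.g. HQT/FP8) can intercept it the same way they
    intercept Linear layers."""

    def forward(self, state: torch.Tensor, expert_w: torch.Tensor) -> torch.Tensor:
        # Behaves like a bias-free linear layer applied with the weight of
        # the currently selected expert: state @ expert_w.T
        return torch.matmul(state, expert_w.transpose(0, 1))


if __name__ == "__main__":
    # Shapes are assumptions for the sketch: (tokens, hidden) and (ffn, hidden).
    state = torch.randn(4, 16)
    expert_w = torch.randn(32, 16)
    out = MoeMatmul()(state, expert_w)
    assert out.shape == (4, 32)
```

The alternative suggested above, a bias-free nn.Linear whose weight is assigned from the selected expert, computes the same x @ W.T while reusing a stock module type that quantization backends already recognize.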
Conflicts: vllm/hpu/ops.py
Initial FP8 support
Force-pushed from 5d9f4de to e814a4a
It causes OOM on 70b
Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>
Force-pushed from e814a4a to f4f3437
Force-pushed from f4f3437 to 87d95ad
remove expert_max hard code (#47)
vLLM-Ext: Full enabling of ALiBi (#34)
Add version inference via setuptools-scm (#58)
Revert "vLLM-Ext: Full enabling of ALiBi (#34)" (#59)
Remove punica_hpu.py from vllm_hpu_extension (#66)
Removed previous (not-pipelined) pa implementation (#72)
Add flag to enable running softmax in fp32 (#71)
Update calibration readme link (#73)
allow lm_head quantization in calibration process (#65)
Pad to bmin if value is less (#67)
Update pyproject.toml (#75)
---------
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
* fix server crash when the client uses random seed sampling
* fix lint
No description provided.