removing quant and kv-cache fp8 from deepseek run instructions #509

Open: arakowsk-amd wants to merge 2 commits into main

Conversation

arakowsk-amd (Author):

No description provided.

shajrawi (Collaborator) left a comment:

Please add a description of why you are proposing this.

@@ -377,7 +377,7 @@ python3 /app/vllm/benchmarks/benchmark_serving.py \
 # Offline throughput
 python3 /app/vllm/benchmarks/benchmark_throughput.py --model deepseek-ai/DeepSeek-V3 \
     --input-len <> --output-len <> --tensor-parallel-size 8 \
-    --quantization fp8 --kv-cache-dtype fp8 --dtype float16 \
+    --dtype float16 \
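
For context, the documented offline-throughput example after this change would read roughly as follows; the <> placeholders and any flags beyond this hunk are left exactly as in the existing instructions:

# Offline throughput
python3 /app/vllm/benchmarks/benchmark_throughput.py --model deepseek-ai/DeepSeek-V3 \
    --input-len <> --output-len <> --tensor-parallel-size 8 \
    --dtype float16 \
    ...  # remaining flags unchanged (not shown in this hunk)
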
Collaborator:

Can you specify why?

arakowsk-amd (Author):

Raises an error:

export VLLM_MLA_DISABLE=0
export VLLM_USE_AITER=1
export VLLM_USE_TRITON_FLASH_ATTN=1
python3 /app/vllm/benchmarks/benchmark_throughput.py --model /data/DeepSeek-R1/ --input-len 128 --output-len 128 --tensor-parallel-size 8 --quantization fp8 --kv-cache-dtype fp8 --dtype bfloat16 --max-model-len 32768 --block-size=1 --trust-remote-code
 
 
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/triton_mla.py", line 63, in __init__
[rank0]:     raise NotImplementedError(
[rank0]: NotImplementedError: TritonMLA with FP8 KV cache not yet supported
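
For reference, a minimal sketch of the same repro with the two FP8 flags dropped, which is effectively what this change does to the documented command; untested here, and all other flags and environment variables are kept from the failing invocation above:

export VLLM_MLA_DISABLE=0
export VLLM_USE_AITER=1
export VLLM_USE_TRITON_FLASH_ATTN=1
# Same benchmark invocation, minus --quantization fp8 and --kv-cache-dtype fp8,
# so the TritonMLA backend is not asked for an FP8 KV cache.
python3 /app/vllm/benchmarks/benchmark_throughput.py --model /data/DeepSeek-R1/ \
    --input-len 128 --output-len 128 --tensor-parallel-size 8 \
    --dtype bfloat16 --max-model-len 32768 --block-size=1 --trust-remote-code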

Collaborator:

Why is Triton MLA being used with AITER? cc @qli88

Another comment:

@arakowsk-amd are you using the latest version? If you'd like we can discuss through Teams.
