5 changes: 3 additions & 2 deletions examples/offline_inference/basic/basic_hpu.py
@@ -40,9 +40,10 @@
model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-V2-Lite-NVFP4-autoround"
# model_path = "/software/users/yiliu4/deepseek-ai/DeepSeek-R1-NVFP4-OFFLINE"
model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN"

model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-MXFP4-llmc"
# model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-W4A4-RTN-GLOBAL-SCALE-WW"

model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-MXFP4-RTN"
Comment on lines 40 to +46

medium

This block contains multiple assignments to model_path, many of which are immediately overwritten. This appears to be for local testing and should be cleaned up. Please consolidate this to a single default model_path and rely on the command-line argument --model_path to specify different models for testing.
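A minimal sketch of the suggested usage, assuming basic_hpu.py wires --model_path through argparse as described above (the path shown is simply the last one assigned in this block):

# keep one default in the script; select other checkpoints from the CLI
python examples/offline_inference/basic/basic_hpu.py \
    --model_path /software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-MXFP4-RTN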


import os

47 changes: 21 additions & 26 deletions examples/offline_inference/basic/start_vllm.sh
@@ -2,9 +2,7 @@

# VLLM_HPU_LOG_HPU_GRAPH=1 VLLM_DISABLE_INPUT_QDQ=0 bash start_vllm.sh --dummy-run
# VLLM_HPU_LOG_HPU_GRAPH=1 VLLM_DISABLE_INPUT_QDQ=0 bash start_vllm.sh --skip-warmup
# bash start_vllm.sh --skip-warmup --ds-nvfp4
# bash start_vllm.sh --skip-warmup --ds-nvfp4 --dummy-run
# bash start_vllm.sh --skip-warmup --ds-nvfp4 --dummy-run --skip-warmup --next_token
# bash start_vllm.sh --skip-warmup --ds-nvfp4
# bash start_vllm.sh --skip-warmup --ds-nvfp4 --skip-warmup --next_token

model_path=/mnt/disk3/yiliu4/DeepSeek-R1-G2-INC-424-Converter207/
@@ -13,9 +11,14 @@ model_path=/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-MXFP8-RTN
v2_model_path=/software/users/yiliu4/HF_HOME/Yi30/Yi30/DeepSeek-V2-Lite-MXFP8-llmc
mxfp4_model_path=/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-MXFP4-RTN
mxfp4_model_path=/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-bf16-MXFP4-autoround
nvfp4_model_path=/software/users/yiliu4/deepseek-ai/DeepSeek-R1-NVFP4-OFFLINE

nvfp4_model_path=/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround/
nvfp4_model_path="/software/users/yiliu4/deepseek-ai/DeepSeek-R1-nvfp4-fix-723"
nvfp4_model_path="/software/users/yiliu4/deepseek-ai/DeepSeek-R1-nvfp4-fix-723-skip-atten"
nvfp4_model_path=/software/users/yiliu4/deepseek-ai/DeepSeek-R1-NVFP4-OFFLINE
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN"
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN"
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround"
Comment on lines 15 to +21

medium

There are multiple assignments to nvfp4_model_path, including a duplicate. This appears to be for local testing and should be cleaned up to avoid confusion. Please retain only the necessary model path assignments.
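A minimal consolidation sketch, keeping the last assignment as the default and allowing an override via a hypothetical NVFP4_MODEL_PATH environment variable instead of stacking reassignments that shadow one another:

# single default; override with NVFP4_MODEL_PATH=... when testing other checkpoints
nvfp4_model_path="${NVFP4_MODEL_PATH:-/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround}"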

tp_size=8

num_samples=128
@@ -81,6 +84,7 @@ done
# Debugging: Print the values of the variables
echo "USE_FP8_KV=$USE_FP8_KV"
echo "USE_NATIVE_SCALING=$USE_NATIVE_SCALING"
echo "model_path=$model_path"
echo "NEXT_TOKEN=$NEXT_TOKEN"


@@ -106,23 +110,26 @@ block_size=128
# DO NOT change ends...

# memory footprint tunning params
export VLLM_GPU_MEMORY_UTILIZATION=0.65
export VLLM_GPU_MEMORY_UTILIZATION=0.45
export VLLM_GRAPH_RESERVED_MEM=0.4
export VLLM_GRAPH_PROMPT_RATIO=0
export VLLM_MLA_DISABLE_REQUANTIZATION=0
export VLLM_DELAYED_SAMPLING="true"
#export VLLM_MOE_SLICE_LENGTH=20480


if [ "$NEXT_TOKEN" = true ]; then
echo "Enabling next token prediction"
export VLLM_DELAYED_SAMPLING="false"
task_name="mmlu"
else
echo "Disabling next token prediction"
export VLLM_DELAYED_SAMPLING="true"
fi
#export VLLM_MOE_SLICE_LENGTH=20480

# params
CONST_LEN=4096
CONST_LEN=16384
max_model_len=$CONST_LEN
max_num_batched_tokens=$CONST_LEN
max_num_seqs=32
@@ -252,7 +259,8 @@ fi
# add --max-num-prefill-seqs for next token prediction
if [ "$NEXT_TOKEN" = true ]; then
echo "Enabling next token prediction"
CMD="$CMD --max-num-prefill-seqs 1"
#CMD="$CMD --max-num-prefill-seqs 2"
CMD="$CMD --enforce-eager "
else
echo "Disabling next token prediction"
fi
@@ -278,13 +286,13 @@ echo "Server started with PID: ${pid}"

#===========================================================
# RUN BENCHMARK
#===========================================================
#===============================a============================
export no_proxy=localhost,127.0.0.1


model_base_name=$(basename $model_path)

EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output_${task_name}_bs${batch_size}__${timestamp}"
EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output__bs${batch_size}__${timestamp}"

medium

The task_name variable is no longer included in EVAL_LOG_NAME, but it is still used in the echo command on line 297 and the lm_eval command on line 303. This creates a mismatch between the log message and the actual log file name. For better traceability, consider adding task_name back to the log file name.

Suggested change
EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output__bs${batch_size}__${timestamp}"
EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output_${task_name}_bs${batch_size}__${timestamp}"


echo "Running lm_eval with model: ${model_path}, task: ${task_name}, batch size: ${batch_size}, num samples: ${num_samples}"

@@ -296,30 +304,17 @@ lm_eval --model local-completions \
--model_args model=${model_path},base_url=http://127.0.0.1:8688/v1/completions,max_concurrent=1 \
--batch_size 32 \
--confirm_run_unsafe_code \
--limit $num_samples \
--log_samples \
--output_path "benchmark_logs/$EVAL_LOG_NAME" \
2>&1 | tee "benchmark_logs/${EVAL_LOG_NAME}.log"





end_time=$(date +%s)
echo "Benchmark completed in $((end_time - start_time)) seconds"

# Clean up
echo "Stopping vLLM server"
kill ${pid}
echo "Script execution completed"
sleep 10



# lm_eval --model local-completions \
# --tasks "$task_name" \
# --model_args model=${model_path},base_url=http://127.0.0.1:8688/v1/completions,max_concurrent=1 \
# --batch_size 32 \
# --confirm_run_unsafe_code \
# --limit $num_samples \
# --log_samples
# echo "Stopping vLLM server"
#kill ${pid}
#echo "Script execution completed"
#sleep 10
Comment on lines +317 to +320

high

The server cleanup logic (kill ${pid}) is commented out. This will leave the vLLM server process running after the script completes, which can consume resources unnecessarily. If this was commented out for debugging, please re-enable the cleanup.

Suggested change
# echo "Stopping vLLM server"
#kill ${pid}
#echo "Script execution completed"
#sleep 10
echo "Stopping vLLM server"
kill ${pid}
echo "Script execution completed"
sleep 10
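Beyond restoring these lines, an alternative worth considering (a sketch, not part of this PR) is an exit trap registered once ${pid} is set, so the server is killed even if lm_eval exits early:

# register right after the server launch, once ${pid} is set
trap 'echo "Stopping vLLM server"; kill ${pid}' EXIT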

1 change: 0 additions & 1 deletion pyproject.toml
@@ -6,7 +6,6 @@ requires = [
"packaging>=24.2",
"setuptools>=77.0.3,<80.0.0",
"setuptools-scm>=8.0",
"torch == 2.7.0",
"wheel",
"jinja2",
]
1 change: 0 additions & 1 deletion requirements/build.txt
@@ -4,6 +4,5 @@ ninja
packaging>=24.2
setuptools>=77.0.3,<80.0.0
setuptools-scm>=8
torch==2.7.0
wheel
jinja2>=3.1.6
3 changes: 2 additions & 1 deletion requirements/common.txt
@@ -6,6 +6,7 @@ requests >= 2.26.0
tqdm
blake3
py-cpuinfo
datasets == 3.6.0
transformers == 4.53.2
huggingface-hub[hf_xet] >= 0.30.0 # Required for Xet downloads.
tokenizers >= 0.21.1 # Required for fast incremental detokenization.
@@ -37,7 +38,6 @@ six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that need
setuptools>=77.0.3,<80; python_version > '3.11' # Setuptools is used by triton, we need to ensure a modern version is installed for 3.12+ so that it does not try to import distutils, which was removed in 3.12
einops # Required for Qwen2-VL.
# compressed-tensors == 0.10.2 # required for compressed-tensors
torchao @ git+https://github.com/yiliu30/torchao-fork.git@mxfp8
compressed-tensors @ git+https://github.com/yiliu30/compressed-tensors-fork.git@mxfp4
depyf==0.18.0 # required for profiling and debugging with compilation config
cloudpickle # allows pickling lambda functions in model_executor/models/registry.py
@@ -49,3 +49,4 @@ opentelemetry-sdk>=1.26.0,<1.27.0 # vllm.tracing
opentelemetry-api>=1.26.0,<1.27.0 # vllm.tracing
opentelemetry-exporter-otlp>=1.26.0,<1.27.0 # vllm.tracing
opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0 # vllm.tracing
torchao @ git+https://github.com/yiliu30/torchao-fork.git@mxfp8
11 changes: 0 additions & 11 deletions requirements/cpu.txt
@@ -2,19 +2,8 @@
-r common.txt

# Dependencies for CPUs
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.7.0+cpu; platform_machine == "x86_64"
torch==2.7.0; platform_system == "Darwin"
torch==2.7.0; platform_machine == "ppc64le" or platform_machine == "aarch64"
torch==2.7.0.dev20250304; platform_machine == "s390x"

# required for the image processor of minicpm-o-2_6, this must be updated alongside torch
torchaudio; platform_machine != "ppc64le" and platform_machine != "s390x"
torchaudio==2.7.0; platform_machine == "ppc64le"

# required for the image processor of phi3v, this must be updated alongside torch
torchvision; platform_machine != "ppc64le" and platform_machine != "s390x"
torchvision==0.22.0; platform_machine == "ppc64le"
datasets # for benchmark scripts

# cpu cannot use triton 3.3.0
5 changes: 0 additions & 5 deletions requirements/cuda.txt
@@ -6,9 +6,4 @@ numba == 0.61.2; python_version > '3.9'

# Dependencies for NVIDIA GPUs
ray[cgraph]>=2.43.0, !=2.44.* # Ray Compiled Graph, required for pipeline parallelism in V1.
torch==2.7.0
torchaudio==2.7.0
# These must be updated alongside torch
torchvision==0.22.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
# https://github.com/facebookresearch/xformers/releases/tag/v0.0.30
xformers==0.0.30; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch >= 2.7
2 changes: 0 additions & 2 deletions requirements/docs.txt
@@ -11,5 +11,3 @@ commonmark # Required by sphinx-argparse when using :markdownhelp:

# packages to install to build the documentation
cachetools
-f https://download.pytorch.org/whl/cpu
torch
1 change: 0 additions & 1 deletion requirements/neuron.txt
@@ -4,5 +4,4 @@
# Dependencies for Neuron devices
packaging>=24.2
setuptools>=77.0.3,<80.0.0
torch-neuronx >= 2.5.0
neuronx-cc
1 change: 0 additions & 1 deletion requirements/nightly_torch_test.txt
@@ -29,5 +29,4 @@ lm-eval[api]==0.4.8 # required for model evaluation test
bitsandbytes>=0.45.3

# required for minicpmo_26 test
vector_quantize_pytorch
vocos
4 changes: 0 additions & 4 deletions requirements/rocm-build.txt
@@ -1,10 +1,6 @@
# Common dependencies
-r common.txt

--extra-index-url https://download.pytorch.org/whl/rocm6.2.4
torch==2.7.0
torchvision==0.22.0
torchaudio==2.7.0

triton==3.2
cmake>=3.26,<4