
Commit fd2cc1b

SidaoY, yx0716, and MengqingCao authored
[Docs] Add Tutorials for Online Serving on Multi Machine (vllm-project#120)
Add Tutorials for Online Serving on Multi Machine

---------

Signed-off-by: SidaoY <1024863041@qq.com>
Co-authored-by: yx0716 <jinyx1007@foxmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
1 parent 3a4ce2a commit fd2cc1b

File tree

1 file changed: +102 -0 lines changed


docs/source/tutorials.md

Lines changed: 102 additions & 0 deletions
@@ -207,3 +207,105 @@ If you run this script successfully, you can see the info shown below:
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

## Online Serving on Multi Machine

Run the docker container on each machine:

```shell
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:v0.7.1rc1 bash
```
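
You can optionally confirm that the devices mapped above are visible inside each container before going further. This check is not part of the original steps; it simply uses the `npu-smi` tool that the command above mounts into the container:

```shell
# Inside the container: list the visible NPUs.
# All eight davinci devices mapped above should appear as healthy.
npu-smi info
```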

Choose one machine as the head node and the others as worker nodes, then start Ray on each machine:

:::{note}
Check your `nic_name` with the `ip addr` command.
:::

```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head --num-gpus=8

# Worker node
export HCCL_IF_IP={local_ip}
export ASCEND_PROCESS_LOG_PATH={plog_save_path}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
```
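
Once Ray is running on every machine, an optional sanity check (not part of the original instructions) is to inspect the cluster from the head node:

```shell
# Run on the head node: the resource summary should list both nodes,
# with 16 devices in total (registered as GPUs because of --num-gpus).
ray status
```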

Start the vLLM server on the head node:

```shell
export VLLM_HOST_IP={head_node_ip}
export HCCL_CONNECT_TIMEOUT=120
export ASCEND_PROCESS_LOG_PATH={plog_save_path}
export HCCL_IF_IP={head_node_ip}

if [ -d "{plog_save_path}" ]; then
    rm -rf {plog_save_path}
    echo ">>> remove {plog_save_path}"
fi

LOG_FILE="multinode_$(date +%Y%m%d_%H%M).log"
export VLLM_TORCH_PROFILER_DIR=./vllm_profile
python -m vllm.entrypoints.openai.api_server \
--model="Deepseek/DeepSeek-V2-Lite-Chat" \
--trust-remote-code \
--enforce-eager \
--max-model-len {max_model_len} \
--distributed-executor-backend "ray" \
--tensor-parallel-size 16 \
--disable-log-requests \
--disable-log-stats \
--disable-frontend-multiprocessing \
--port {port_num}
```
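
Loading the model across both machines can take a while. As an optional readiness check (assuming the server listens on the default host), you can list the served models once the API is up:

```shell
# Returns a model list that includes "Deepseek/DeepSeek-V2-Lite-Chat" once the server is ready.
curl http://127.0.0.1:{port_num}/v1/models
```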

Once your server is started, you can query the model with input prompts:

```shell
curl -X POST http://127.0.0.1:{port_num}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Deepseek/DeepSeek-V2-Lite-Chat",
"prompt": "The future of AI is",
"max_tokens": 24
}'
```
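
Because the served model is a chat model, you can also exercise the OpenAI-compatible chat endpoint; a minimal sketch, assuming the same host and `{port_num}`:

```shell
curl -X POST http://127.0.0.1:{port_num}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Deepseek/DeepSeek-V2-Lite-Chat",
"messages": [{"role": "user", "content": "What is the future of AI?"}],
"max_tokens": 24
}'
```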

If the query succeeds, you will see a response like the one below (client side):

```
{"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
```

Logs of the vLLM server:

```
INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-19 17:37:35 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
```
