
How to use multiple GPUs? #581

Closed
xxm1668 opened this issue Jul 26, 2023 · 31 comments
Labels
usage How to use vllm

Comments

@xxm1668

xxm1668 commented Jul 26, 2023

No description provided.

@nkfnn

nkfnn commented Jul 27, 2023

model = LLM(model=base_model, tensor_parallel_size=2), i.e. two GPUs
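
A minimal runnable sketch of the suggestion above (the model name and prompt are placeholders, not taken from this thread):

```python
# Minimal sketch: shard one model across two GPUs with tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder; any HF model path works
    tensor_parallel_size=2,                 # split the weights across 2 GPUs
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```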

@xxm1668
Author

xxm1668 commented Jul 27, 2023

(Worker pid=2861953) [W socket.cpp:601] [c10d] The IPv6 network addresses of (internal_head, 19352) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
(Worker pid=2861953) [W socket.cpp:601] [c10d] The IPv6 network addresses of (internal_head, 19352) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). [repeated 10x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(Worker pid=2861953) [W socket.cpp:601] [c10d] The IPv6 network addresses of (internal_head, 19352) cannot be retrieved (gai error: -3 - Temporary failure in name resolution). [repeated 10x across cluster]

@xxm1668
Author

xxm1668 commented Jul 27, 2023

export NCCL_IGNORE_DISABLED_P2P=1 fixes it. But one question: why is running on multiple GPUs slower than on a single GPU?
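
A hedged sketch of applying this workaround from Python rather than the shell; the environment variables have to be set before vLLM starts its workers, and the model name is a placeholder:

```python
# Sketch: apply the NCCL/Ray workarounds before constructing the engine.
import os

os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"  # same effect as `export NCCL_IGNORE_DISABLED_P2P=1`
os.environ["RAY_DEDUP_LOGS"] = "0"            # optional: show every repeated Ray log line (see the log above)

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=2)  # placeholder model
```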

@nkfnn

nkfnn commented Jul 27, 2023

export NCCL_IGNORE_DISABLED_P2P=1 fixes it. But one question: why is running on multiple GPUs slower than on a single GPU?

Quick question: how much of a speedup did you get with vLLM? I'm using the WizardCoder-15B-V1.0 model, and vLLM is about 2x faster than HF, but everything online advertises a 27x speedup, so I suspect I've misconfigured something.

@Sanster
Contributor

Sanster commented Jul 27, 2023

export NCCL_IGNORE_DISABLED_P2P=1 fixes it. But one question: why is running on multiple GPUs slower than on a single GPU?

Quick question: how much of a speedup did you get with vLLM? I'm using the WizardCoder-15B-V1.0 model, and vLLM is about 2x faster than HF, but everything online advertises a 27x speedup, so I suspect I've misconfigured something.

I saw in another issue that the 27x figure refers to the speedup for batched inference.

@nkfnn

nkfnn commented Jul 27, 2023

Oh I see, thanks.

@xxm1668
Author

xxm1668 commented Jul 27, 2023

I got roughly a 27x speedup; with 2 GPUs it was actually slower, only about 14x. I don't know why multi-GPU is slower.

@xxm1668
Author

xxm1668 commented Jul 27, 2023

[screenshot of benchmark timings]

@gesanqiu
Contributor

Make sure you already understand the theory and the goal of TP (tensor parallelism).
TP is usually used to get around a memory bottleneck; for a small model there is no need for TP, and running multiple independent instances is better than using TP. When you apply TP to a small model you hit the compute bottleneck of the GPU itself, and once you factor in the communication cost you may only see a 40-50% improvement with 2 GPUs compared to 1 GPU.
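
A sketch of the "multiple instances instead of TP" setup described above, assuming vLLM's OpenAI-compatible server entrypoint; the model name, GPU count, and ports are placeholders:

```python
# Sketch: one independent single-GPU vLLM server per GPU instead of tensor parallelism.
import os
import subprocess

MODEL = "meta-llama/Llama-2-7b-chat-hf"                        # placeholder model

procs = []
for gpu_id in range(2):                                        # one instance per GPU
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}  # pin this instance to one GPU
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL,
         "--port", str(8000 + gpu_id)],                        # one port per instance
        env=env,
    ))

for p in procs:
    p.wait()
```

Requests can then be spread across the ports with any reverse proxy or a simple client-side round robin.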

@xxm1668
Author

xxm1668 commented Jul 27, 2023

Thanks, I understand.

@nkfnn

nkfnn commented Jul 27, 2023

I got roughly a 27x speedup; with 2 GPUs it was actually slower, only about 14x. I don't know why multi-GPU is slower.

Was the time really reduced 27x? Could you share which model you used, the time with HF, and the time after the speedup?

@xxm1668
Author

xxm1668 commented Jul 27, 2023

Isn't it in the screenshot above?

@guozhiyao

I got roughly a 27x speedup; with 2 GPUs it was actually slower, only about 14x. I don't know why multi-GPU is slower.

@xxm1668
Quick question: with two GPUs, are you running two separate inference instances, or a single instance with TP?

@zlh1992

zlh1992 commented Jul 31, 2023

batch

What does batched inference mean?

@Kevinddddddd

Make sure you already understand the theory and the goal of TP (tensor parallelism). TP is usually used to get around a memory bottleneck; for a small model there is no need for TP, and running multiple independent instances is better than using TP. When you apply TP to a small model you hit the compute bottleneck of the GPU itself, and once you factor in the communication cost you may only see a 40-50% improvement with 2 GPUs compared to 1 GPU.

Hi, I have a question. Do you know how to set the number of instances in vLLM?

@xxm1668
Author

xxm1668 commented Aug 3, 2023

@guozhiyao I ran both single-GPU and dual-GPU; for dual-GPU I used tensor_parallel_size=2.

@xxm1668
Author

xxm1668 commented Aug 3, 2023

@zlh1992 What batch?

@zhuohan123
Member

hi @xxm1668

vLLM has some random Python overheads that make TP slower than single GPU execution. We are investigating this and may eventually tackle this issue by providing C++ model implementations.

@wut0n9

wut0n9 commented Sep 25, 2023

@zhuohan123 @XX
Has the multi-GPU inference slowdown been resolved?

@John-Ge

John-Ge commented Oct 9, 2023

Hello, I have a question. If I have 8 GPUs and want to run the Llama 2 7B chat model, it is better to run 8 separate programs. With vLLM today I would have to deploy 8 programs on 8 different ports. Do you have plans to develop some kind of "automatic data parallel" feature?
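
As of this thread there is no automatic data parallel feature, so the usual workaround is one single-GPU server per GPU (as sketched earlier) plus round robin on the client side. A hedged sketch of that client, assuming the OpenAI-compatible /v1/completions endpoint and placeholder ports and model name:

```python
# Sketch: round-robin prompts across 8 single-GPU vLLM servers.
import itertools

import requests

PORTS = range(8000, 8008)                 # 8 servers, one per GPU (placeholder ports)
next_port = itertools.cycle(PORTS)

def generate(prompt: str) -> str:
    port = next(next_port)                # pick the next server in round-robin order
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={"model": "meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
              "prompt": prompt,
              "max_tokens": 64},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

print(generate("Hello, world"))
```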

@runzeer

runzeer commented Dec 15, 2023

Why does my multi-GPU inference hang at "started a local Ray instance"? Has anyone else run into this?

@xxm1668
Author

xxm1668 commented Dec 15, 2023

@runzeer Try export NCCL_IGNORE_DISABLED_P2P=1

@ArlanCooper

@guozhiyao I ran both single-GPU and dual-GPU; for dual-GPU I used tensor_parallel_size=2.

Does setting this parameter load a full copy of the model onto each GPU? I set it to 2 and both GPUs show memory usage equal to the full model size. How can I get the model sharded across the two GPUs?

@wenyangchou

@guozhiyao I ran both single-GPU and dual-GPU; for dual-GPU I used tensor_parallel_size=2.

Does setting this parameter load a full copy of the model onto each GPU? I set it to 2 and both GPUs show memory usage equal to the full model size. How can I get the model sharded across the two GPUs?

vLLM eagerly pre-allocates GPU memory, so the usage you see is most likely not the model weights but the memory reserved by the gpu_memory_utilization setting.

@ArlanCooper

@guozhiyao I ran both single-GPU and dual-GPU; for dual-GPU I used tensor_parallel_size=2.

Does setting this parameter load a full copy of the model onto each GPU? I set it to 2 and both GPUs show memory usage equal to the full model size. How can I get the model sharded across the two GPUs?

vLLM eagerly pre-allocates GPU memory, so the usage you see is most likely not the model weights but the memory reserved by the gpu_memory_utilization setting.

Ah, got it, thanks. So I can control how much GPU memory is reserved by setting gpu_memory_utilization?
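
Yes: gpu_memory_utilization caps the fraction of each GPU that vLLM pre-allocates (the default is around 0.9). A minimal sketch with illustrative values and a placeholder model:

```python
# Sketch: reserve roughly half of each GPU instead of the ~90% default.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    tensor_parallel_size=2,
    gpu_memory_utilization=0.5,             # fraction of each GPU's memory vLLM may pre-allocate
)
```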

@gaojing8500

Why does my multi-GPU inference hang at "started a local Ray instance"? Has anyone else run into this?

With 8 RTX 4090 cards it also hangs there; I have no idea why.

@strongliu110

Could someone explain how to run on multiple GPUs? I tried tensor_parallel_size=2 and export NCCL_IGNORE_DISABLED_P2P=1, but I still get OOM: GPU 0 runs out of memory while GPU 1 still has plenty free.

@hwb96

hwb96 commented Apr 17, 2024

tensor_parallel_size=2, and remember that the model's layer count has to be evenly divisible by this parameter so the layers can be placed evenly across the GPUs; otherwise an error is raised.

@DaoD

DaoD commented May 7, 2024

tensor_parallel_size=2 means splitting the model across two GPUs rather than running two full copies of the model.

@fengshansi

How can I use multiple A800 GPUs to serve several small models? An Nginx reverse proxy?

@DarkLight1337 DarkLight1337 added the usage How to use vllm label May 31, 2024
@FrankMinions

Could someone explain how to run on multiple GPUs? I tried tensor_parallel_size=2 and export NCCL_IGNORE_DISABLED_P2P=1, but I still get OOM: GPU 0 runs out of memory while GPU 1 still has plenty free.

Set the max_model_len parameter.
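
A hedged sketch of that suggestion: lowering max_model_len shrinks the KV cache vLLM tries to reserve, which is often enough to avoid the OOM (values are illustrative, model name is a placeholder):

```python
# Sketch: cap the context length so the pre-allocated KV cache fits in memory.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    tensor_parallel_size=2,
    max_model_len=4096,                     # smaller context window -> smaller KV cache
)
```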
