How to use multiple GPUs? #581
Comments
model = LLM(model=base_model, tensor_parallel_size=2) on two GPUs
(Worker pid=2861953) [W socket.cpp:601] [c10d] The IPv6 network addresses of (internal_head, 19352) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
export NCCL_IGNORE_DISABLED_P2P=1 fixes it. One question though: why does multi-GPU run even slower than a single GPU?
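For reference, a minimal sketch of the setup being discussed, applying the NCCL workaround from Python instead of the shell; the model path is a placeholder:

```python
import os

# Work around the disabled-P2P NCCL issue mentioned above; set before vLLM initializes NCCL.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model weights across two GPUs.
llm = LLM(model="/path/to/base_model", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Write a hello-world function in Python."], params)
print(outputs[0].outputs[0].text)
```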
Quick question: how much faster is vLLM for you? I am running the WizardCoder-15B-V1.0 model, and vLLM is about 2x faster than HF for me, but online it is advertised as 27x faster. I suspect I am missing some setting.
I saw in another issue that the 27x figure refers to the speedup from batched inference.
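For context, a minimal sketch of what batched inference with vLLM looks like: many prompts go into a single generate() call and are scheduled together, which is where the large advertised speedups come from (the model name is illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="WizardLM/WizardCoder-15B-V1.0")  # illustrative model name
params = SamplingParams(temperature=0.0, max_tokens=256)

# All prompts are submitted at once; vLLM batches them internally.
prompts = [f"# Task {i}: write a function that reverses a string" for i in range(64)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text[:80])
```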
Oh, I see. Thanks.
I got about a 27x speedup; with 2 GPUs it was actually slower, only around 14x. Not sure why multi-GPU is slower.
Make sure you already understand the theory and goal of tensor parallelism (TP).
Thanks, I understand.
Was the time reduced by 27x? Could you share which model you used, the time with HF, and the time after the speedup?
Isn't that shown in the screenshot?
@xxm1668 |
What does batch inference mean?
Hi, I have a question. Do you know how to set the number of instances in vLLM?
@guozhiyao I have both single-GPU and dual-GPU results;
@zlh1992 What batch?
hi @xxm1668 vLLM has some random Python overheads that make TP slower than single GPU execution. We are investigating this and may eventually tackle this issue by providing C++ model implementations. |
@zhuohan123 @XX |
Hello, I have a question. If I have 8 GPUs and want to run the Llama 2 7B chat model, is it better to run 8 separate programs? Right now with vLLM I have to deploy 8 programs on 8 ports. Do you have plans to develop some kind of "automatic data parallel" feature?
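As of this thread there is no built-in data-parallel mode mentioned; the usual workaround is exactly what the comment describes: one server per GPU, each on its own port, behind any load balancer. A hedged sketch of that pattern (the vllm.entrypoints.api_server module and its --model/--port flags are assumptions based on the simple API server vLLM shipped at the time; the model name is illustrative):

```python
import os
import subprocess

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model name

procs = []
for gpu in range(8):
    # Pin each server process to one GPU and give it its own port.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    cmd = [
        "python", "-m", "vllm.entrypoints.api_server",
        "--model", MODEL,
        "--port", str(8000 + gpu),
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```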
Why does my multi-GPU inference hang at "started a local Ray instance"? Has anyone else run into this?
@runzeer Try export NCCL_IGNORE_DISABLED_P2P=1.
Does setting that parameter split the model across the two GPUs? I set it to 2, but each GPU is occupied by the full model size. How can I get the model loaded in a distributed way across the two GPUs?
vLLM allocates GPU memory eagerly, so the memory you see is most likely not the model weights but the amount reserved by the gpu_memory_utilization setting.
Ah, thanks. So setting gpu_memory_utilization adjusts how much GPU memory gets reserved?
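For reference, a small sketch of that knob: gpu_memory_utilization caps the fraction of each GPU that vLLM preallocates for weights plus KV cache (the model path is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="/path/to/base_model",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.5,  # reserve ~50% of each GPU instead of the default (~0.9)
)
```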
|
Can anyone share how to run on multiple GPUs? I tried tensor_parallel_size=2 and export NCCL_IGNORE_DISABLED_P2P=1, but I still get OOM: GPU 0 runs out of memory while GPU 1 still has plenty free.
Set tensor_parallel_size=2, and remember that the model's layer count should be evenly divisible by this value so the layers can be placed evenly across the GPUs; otherwise an error is reported.
tensor_parallel_size=2 means splitting the model across two GPUs rather than running two full copies of the model.
How can I use multiple A800s to load several small models? An Nginx reverse proxy?
Set the max_model_len parameter.
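If the goal is several small models with one instance per GPU, a common pattern is to pin each process to a GPU with CUDA_VISIBLE_DEVICES and, as suggested above, cap max_model_len so less KV-cache memory is preallocated. A sketch under those assumptions (paths and values are illustrative):

```python
import os

# Pin this process to GPU 0; set before vLLM/torch touch CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from vllm import LLM

llm = LLM(
    model="/path/to/small_model",   # placeholder path
    max_model_len=2048,             # cap context length to shrink the preallocated KV cache
    gpu_memory_utilization=0.4,     # leave headroom if other processes share the card
)
```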