Running on multiple GPUs #16

Hi, I tried running the model on multiple GPUs but have not been able to get it to work. Do you have any suggestions?

Comments
Hi, could you share the script you are running, or the error message?
You could try setting an environment variable when launching the script: CUDA_VISIBLE_DEVICES=0,1,2,3 python script.py
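If setting the variable on the command line is awkward (for example inside a notebook), it can also be set from the script itself before CUDA is initialized. A minimal sketch; the GPU ids are only an example:

```python
import os

# Must be set before CUDA is initialized, i.e. before importing torch
# or loading the model, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch
print(torch.cuda.device_count())  # should now report 4 visible GPUs
```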
For a multi-GPU Python script you can follow the ChatGLM-6B approach, for example:

```python
# script.py
import os
from typing import Dict, Union, Optional

from torch.nn import Module
from transformers import AutoTokenizer, AutoModel


def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # transformer.word_embeddings counts as 1 layer
    # transformer.final_layernorm and lm_head together count as 1 layer
    # transformer.layers contributes 28 layers
    # 30 layers in total, distributed across num_gpus GPUs
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus

    # bugfix: on Linux, the weight and input passed to torch.embedding could end
    # up on different devices, causing a RuntimeError.
    # On Windows, model.device is set to transformer.word_embeddings.device;
    # on Linux, model.device is set to lm_head.device.
    # chat/stream_chat place input_ids on model.device, so if
    # transformer.word_embeddings.device differs from model.device a RuntimeError follows.
    # Therefore transformer.word_embeddings, transformer.final_layernorm and
    # lm_head are all placed on the first GPU.
    device_map = {'transformer.word_embeddings': 0,
                  'transformer.final_layernorm': 0, 'lm_head': 0}
    used = 2
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.layers.{i}'] = gpu_target
        used += 1
    return device_map


def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
                       device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
    if num_gpus < 2 and device_map is None:
        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
    else:
        from accelerate import dispatch_model
        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()
        if device_map is None:
            device_map = auto_configure_device_map(num_gpus)
        model = dispatch_model(model, device_map=device_map)
    return model


if __name__ == '__main__':
    model_url = "/data/minio01/model_file/fuzi_model"
    tokenizer = AutoTokenizer.from_pretrained(model_url, trust_remote_code=True)
    # model = AutoModel.from_pretrained(model_url, device_map="auto", trust_remote_code=True).half().cuda()
    model = load_model_on_gpus(model_url, num_gpus=4)
    response, history = model.chat(tokenizer, "你好", history=[])
    print(response)
    response, history = model.chat(tokenizer, "你能做什么", history=history)
    print(response)
```
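Once the model is loaded this way, it can be worth confirming that the layers really landed on different GPUs. A minimal sketch, assuming `model` and `auto_configure_device_map` from the script above are in scope (the per-device parameter count is just a sanity check, not part of the original script):

```python
from collections import Counter

# Inspect the planned layer-to-GPU mapping directly.
print(auto_configure_device_map(4))

# After dispatch_model, count parameter tensors per physical device;
# with num_gpus=4 there should be entries for cuda:0 .. cuda:3.
per_device = Counter(str(p.device) for p in model.parameters())
for device, count in sorted(per_device.items()):
    print(f"{device}: {count} parameter tensors")
```

Watching `nvidia-smi` during generation is another quick way to see whether all four GPUs are actually holding memory.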
Hi, multi-GPU model parallelism (splitting the model across GPUs) is mainly a way to work around insufficient memory on a single GPU, not a way to speed things up. Because it involves extra communication between the GPUs, running across multiple cards is actually slower than running on a single one. When a single GPU has enough memory, multi-GPU execution is generally unnecessary.
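As a rough way to act on that advice: check how much memory a single GPU has before deciding whether to split the model at all. A minimal sketch, assuming PyTorch is available; the ~14 GiB threshold is only an illustrative estimate for a 6B-parameter fp16 model, not an exact requirement:

```python
import torch

# Total memory of GPU 0, in GiB.
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3

# A 6B-parameter model in fp16 needs roughly 12-13 GiB for the weights alone,
# plus headroom for activations and the KV cache during generation.
needed_gib = 14  # illustrative estimate; adjust for your model and sequence lengths

num_gpus = 1 if total_gib >= needed_gib else 4
print(f"GPU 0 has {total_gib:.1f} GiB; loading on {num_gpus} GPU(s)")
```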
Thank you very much.