LmDeploy Inference Acceleration and Deployment

lmdeploy GitHub: https://github.com/InternLM/lmdeploy

For the multimodal models that support lmdeploy inference acceleration, see the list of supported models.

Environment Setup

GPU devices: any of A10, 3090, V100, or A100 will work.

# Set a global pip mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

# lmdeploy releases are tied to specific CUDA versions; follow `https://github.com/InternLM/lmdeploy#installation` when installing
pip install lmdeploy
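
Because the lmdeploy package has to match the local CUDA setup, it can help to confirm what pip actually installed before running the examples. The following is a minimal sketch using only the standard library. `installed_version` is a hypothetical helper, and the distribution names `ms-swift`, `lmdeploy`, and `torch` are assumed to be the names pip registers in your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a pip distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == '__main__':
    # Distribution names are assumptions; adjust them to your environment.
    for name in ('ms-swift', 'lmdeploy', 'torch'):
        print(f'{name}: {installed_version(name) or "not installed"}')
```

A missing entry here usually means the corresponding `pip install` step above did not complete in the active environment.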

Inference Acceleration

Using Python

OpenGVLab/InternVL2-2B

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# from swift.hub import HubApi
# _api = HubApi()
# _api.login('<your-sdk-token>')  # https://modelscope.cn/my/myaccesstoken

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.internvl2_2b
model_id_or_path = None
lmdeploy_engine = get_lmdeploy_engine(model_type, model_id_or_path=model_id_or_path)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': '<image>描述图片', 'images': ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']},
                {'query': '你是谁？'},
                {'query': (
                    '<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>'
                    '<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>'
                    'What is the difference bewteen the two images?'
                )}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
request_list = [{'query': '<video>描述视频', 'videos': ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(generation_info)

"""
query: <image>描述图片
response: 这张图片展示了四只卡通风格的羊，它们并排排列在一片绿色的草地上。草地上似乎还有轻微的阴影，显示了光源从左上方照射下来的效果。

从左到右，第一只羊头上顶着一团厚厚的羊毛，第二只羊稍微有点低，眼睛大大的，看起来非常友善；第三只羊和第四只羊头高且耳朵竖起，它们看起来似乎更威严和独立。

背景是一片绿色、蓝色的天空中有着一些白色的云朵，远处的山峦线条明显，使得整个画面充满自然的美感。

总体来说，这幅画作表达了一种和谐、宁静的自然氛围，四只羊的形象也很生动可爱，似乎传递出一种温柔、质朴的感觉。
query: 你是谁？
response: 我是InternVL，是由上海人工智能实验室的通用视觉团队（OpenGVLab）和商汤科技联合开发的模型。
query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>What is the difference bewteen the two images?
response: I'm unable to identify or compare images. However, if this image were to be classified based on its design or layout, it might demonstrate:

- A change in the display order or arrangement of elements within the image.
- An evolution in artistic style or technique.
- Different elements added or cut out to create a variation.

I'd need more specific details to make an accurate comparison.
{'num_prompt_tokens': 8095, 'num_generated_tokens': 253, 'num_samples': 3, 'runtime': 4.090330162958708, 'samples/s': 0.7334371262172084, 'tokens/s': 61.8531976443179}
query: <video>描述视频
response: 这个视频展示了一个小女孩在房间里读书的场景。镜头从一个特定的角度捕捉到小女孩专注于书本的情况。这个小女孩是金发，她穿着蓝色的无袖上衣，还戴着一副黑色的眼镜。小女孩的注意力集中在手中的一个白色封皮的书上，她轻轻翻开书页，显示出对书的兴趣和热爱。她的手偶尔会抚摸和翻看书页，表现出一种探索和专注的态度。

背景中可以看到一个木制的婴儿床，房间的地面铺满了温馨的米色毯子。房间的装饰温暖而家庭化，有一个带灯的台灯在床边，旁边似乎还放了一些玩具。

视频中，小女孩的右侧，她的左手上可以看到另一本书。这本书已经翻开到一页，她将目光转向这本书，用手指轻轻触碰到书本。她的身体稍微向前倾，显示出一种沉浸在阅读中的状态。

整个视频的色调温暖而柔和，背景中的物品和她的穿着形成了和谐的家庭氛围，给人一种舒适和温馨的感觉。小女孩的举止显得非常自然和放松，书页的翻动似乎也传递了她对这个故事的着迷。她用双手轻轻翻动书页的动作显得非常可爱和童真。

总之，这个视频完美捕捉到一个小女孩在家庭环境中享受阅读乐趣的瞬间
{'num_prompt_tokens': 6247, 'num_generated_tokens': 257, 'num_samples': 1, 'runtime': 3.0897628950187936, 'samples/s': 0.32364943006214636, 'tokens/s': 83.17790352597162}
"""
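
The streaming loop above slices each yielded response with `response[print_idx:]`, which treats every yielded item as the cumulative text generated so far and prints only the new suffix. The sketch below isolates that pattern without a GPU. `fake_stream` and `iter_deltas` are hypothetical names, and `fake_stream` is only a stand-in for the real generator.

```python
def fake_stream(text, step=4):
    """Stand-in for a streaming generator: yields the cumulative response so far."""
    for end in range(step, len(text) + step, step):
        yield [{'response': text[:end]}]


def iter_deltas(gen):
    """Convert cumulative responses into the newly generated pieces."""
    printed = 0
    for resp_list in gen:
        response = resp_list[0]['response']
        yield response[printed:]   # only the part not printed yet
        printed = len(response)


if __name__ == '__main__':
    for delta in iter_deltas(fake_stream('four sheep on a green meadow')):
        print(delta, end='', flush=True)
    print()
```

Joining all deltas reproduces the final response, which is why the loop never has to clear or reprint earlier output.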

Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

# ModelType.qwen_vl_chat, ModelType.deepseek_vl_1_3b_chat, ModelType.minicpm_v_v2_5_chat
model_type = ModelType.internlm_xcomposer2_5_7b_chat
model_id_or_path = None
lmdeploy_engine = get_lmdeploy_engine(model_type, model_id_or_path=model_id_or_path)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# Interface similar to `transformers.GenerationConfig`
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': '<image>描述图片', 'images': ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']},
               ]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history0 = resp_list[0]['history']
request_list = [{'query': '有几只羊', 'history': history0, 'images': ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: <image>描述图片
response: 在图片中，有四只卡通风格的羊站在一片翠绿的草地中间。这些羊以简洁而不失真挚的形象出现，它们的躯干由白色和棕色的形状组成，而四肢则是纯粹的黑色。头部设计简洁，白色与棕色的搭配与整体协调一致。图中有四只羊，最突出的是一只最大的羊，它似乎处于图片中央，可能是画面的焦点。另外三只羊环绕在它的周围，形成一种对称感。这些羊们没有穿上任何衣物，它们在阳光下显得格外耀眼。天空是明亮的蓝色，背景中的山峰柔和地与天空相接，形成了一种宁静的田园景象。
{'num_prompt_tokens': 2206, 'num_generated_tokens': 132, 'num_samples': 1, 'runtime': 2.793646134901792, 'samples/s': 0.3579551423878365, 'tokens/s': 47.25007879519442}
query: 有几只羊
response: 图片中一共有四只羊。
history: [['<image>描述图片', '在图片中，有四只卡通风格的羊站在一片翠绿的草地中间。这些羊以简洁而不失真挚的形象出现，它们的躯干由白色和棕色的形状组成，而四肢则是纯粹的黑色。头部设计简洁，白色与棕色的搭配与整体协调一致。图中有四只羊，最突出的是一只最大的羊，它似乎处于图片中央，可能是画面的焦点。另外三只羊环绕在它的周围，形成一种对称感。这些羊们没有穿上任何衣物，它们在阳光下显得格外耀眼。天空是明亮的蓝色，背景中的山峰柔和地与天空相接，形成了一种宁静的田园景象。'], ['有几只羊', '图片中一共有四只羊。']]
{'num_prompt_tokens': 2352, 'num_generated_tokens': 6, 'num_samples': 1, 'runtime': 0.635085433954373, 'samples/s': 1.5745913014781, 'tokens/s': 9.447547808868599}
"""
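
The `generation_info` dictionaries printed above appear to derive their rate fields from the raw counts: the logged `tokens/s` and `samples/s` are consistent with dividing `num_generated_tokens` and `num_samples` by `runtime`. The sketch below recomputes them from the first logged dictionary of this example. `throughput` is a hypothetical helper, and the formula is inferred from the logged numbers rather than taken from the library source.

```python
def throughput(generation_info):
    """Recompute the rate fields from the raw counts and the runtime in seconds."""
    runtime = generation_info['runtime']
    return {
        'samples/s': generation_info['num_samples'] / runtime,
        'tokens/s': generation_info['num_generated_tokens'] / runtime,
    }


# Counts copied from the first logged generation_info of this example.
info = {
    'num_prompt_tokens': 2206,
    'num_generated_tokens': 132,
    'num_samples': 1,
    'runtime': 2.793646134901792,
}
rates = throughput(info)
print(rates)
```

Note that prompt tokens are not part of `tokens/s` under this reading, so long image or video prompts do not inflate the reported generation speed.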

Tensor parallelism (TP):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

if __name__ == '__main__':
    model_type = ModelType.glm4v_9b_chat
    model_id_or_path = None
    lmdeploy_engine = get_lmdeploy_engine(model_type, model_id_or_path=model_id_or_path, tp=2)
    template_type = get_default_template_type(model_type)
    template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
    # Interface similar to `transformers.GenerationConfig`
    lmdeploy_engine.generation_config.max_new_tokens = 256
    generation_info = {}

    request_list = [{'query': '<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>描述图片'},
                    {'query': '<image>描述图片', 'images': ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']},
                    {'query': '你是谁？'}]
    resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
    for request, resp in zip(request_list, resp_list):
        print(f"query: {request['query']}")
        print(f"response: {resp['response']}")
    print(generation_info)

    # stream
    history0 = resp_list[0]['history']
    request_list = [{'query': '有几只羊', 'history': history0}]
    gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
    query = request_list[0]['query']
    print_idx = 0
    print(f'query: {query}\nresponse: ', end='')
    for resp_list in gen:
        resp = resp_list[0]
        response = resp['response']
        delta = response[print_idx:]
        print(delta, end='', flush=True)
        print_idx = len(response)
    print()

    history = resp_list[0]['history']
    print(f'history: {history}')
    print(generation_info)

"""
query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>描述图片
response: 这张图片展示了一群羊站在草地上。具体来说，图片中共有四只羊，它们的身体都是白色的，头部也是白色的，但是它们的耳朵颜色和脸部的细节各不相同。从左到右，第一只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的；第二只羊的耳朵也是棕色的，脸部是白色的，鼻子是粉色的；第三只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的；第四只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的。四只羊站在绿色的草地上，草地呈现出不同的绿色阴影，显示出草地的起伏。在图片的背景中，可以看到蓝色的天空和几朵白云，以及连绵起伏的山脉。
query: <image>描述图片
response: 这张图片展示了一群羊站在草地上。具体来说，图片中共有四只羊，它们的身体都是白色的，头部也是白色的，但是它们的耳朵颜色和脸部的细节各不相同。从左到右，第一只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的；第二只羊的耳朵也是棕色的，脸部是白色的，鼻子是粉色的；第三只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的；第四只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的。四只羊站在绿色的草地上，草地呈现出不同的绿色阴影，显示出草地的起伏。在图片的背景中，可以看到蓝色的天空和几朵白云，以及连绵起伏的山脉。
query: 你是谁？
response: 我是人工智能助手智谱清言（ChatGLM），是基于智谱 AI 公司于 2023 年训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。
{'num_prompt_tokens': 3226, 'num_generated_tokens': 352, 'num_samples': 3, 'runtime': 9.829129087971523, 'samples/s': 0.3052152406535462, 'tokens/s': 35.81192157001609}
query: 有几只羊
response: 图中共有四只羊。
history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>描述图片', '这张图片展示了一群羊站在草地上。具体来说，图片中共有四只羊，它们的身体都是白色的，头部也是白色的，但是它们的耳朵颜色和脸部的细节各不相同。从左到右，第一只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的；第二只羊的耳朵也是棕色的，脸部是白色的，鼻子是粉色的；第三只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的；第四只羊的耳朵是棕色的，脸部是白色的，鼻子是粉色的。四只羊站在绿色的草地上，草地呈现出不同的绿色阴影，显示出草地的起伏。在图片的背景中，可以看到蓝色的天空和几朵白云，以及连绵起伏的山脉。'], ['有几只羊', '图中共有四只羊。']]
{'num_prompt_tokens': 1772, 'num_generated_tokens': 7, 'num_samples': 1, 'runtime': 1.6001809199806303, 'samples/s': 0.6249293361228834, 'tokens/s': 4.374505352860184}
"""
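
Both multi-turn examples above build the follow-up request by hand: they take `history` from the previous response and place it in the next request dictionary. A small helper can make that step explicit. This is a sketch on plain dictionaries; `follow_up_request` is a hypothetical name, and whether images need to be passed again is left optional because the two examples above differ on that point.

```python
def follow_up_request(prev_resp, query, images=None):
    """Build the next-turn request dict from a previous response dict."""
    request = {'query': query, 'history': prev_resp['history']}
    if images:
        # Re-attach images only when the caller asks for it.
        request['images'] = list(images)
    return request


# Illustrative previous response; real ones come from the inference call.
prev = {
    'response': 'There are four sheep.',
    'history': [['<image>How many sheep are there?', 'There are four sheep.']],
}
print(follow_up_request(prev, 'What color are they?'))
```

The returned dictionary can be placed in `request_list` for the next call, in the same way the examples above do.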

Using the CLI

CUDA_VISIBLE_DEVICES=0 swift infer --model_type deepseek-vl-1_3b-chat --infer_backend lmdeploy

CUDA_VISIBLE_DEVICES=0 swift infer --model_type internvl2-2b --infer_backend lmdeploy

# TP
CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type qwen-vl-chat \
    --infer_backend lmdeploy --tp 2

CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internlm-xcomposer2_5-7b-chat \
    --infer_backend lmdeploy --tp 2
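
In the TP commands above, `--tp 2` is paired with two visible devices. When launching from Python, that pairing can be kept in one place. The sketch below only assembles the command and environment; `build_infer_command` is a hypothetical helper, and setting `--tp` to the number of visible devices is an assumption drawn from these examples, not a documented requirement.

```python
import os
import shlex


def build_infer_command(model_type, devices):
    """Return (env, argv) for `swift infer` with the lmdeploy backend."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=','.join(str(d) for d in devices))
    argv = ['swift', 'infer', '--model_type', model_type, '--infer_backend', 'lmdeploy']
    if len(devices) > 1:
        # Assumption: one tensor-parallel rank per visible device.
        argv += ['--tp', str(len(devices))]
    return env, argv


env, argv = build_infer_command('qwen-vl-chat', [0, 1])
print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {shlex.join(argv)}")
```

The pair can then be handed to `subprocess.run(argv, env=env)` if the command should be started from a script.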

Deployment

Server:

CUDA_VISIBLE_DEVICES=0 swift deploy --model_type deepseek-vl-1_3b-chat --infer_backend lmdeploy

CUDA_VISIBLE_DEVICES=0 swift deploy --model_type internvl2-2b --infer_backend lmdeploy

# TP
CUDA_VISIBLE_DEVICES=0,1 swift deploy --model_type qwen-vl-chat \
    --infer_backend lmdeploy --tp 2

CUDA_VISIBLE_DEVICES=0,1 swift deploy --model_type internlm-xcomposer2_5-7b-chat \
    --infer_backend lmdeploy --tp 2
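
Loading a model can take a while, so a client script may want to wait until the server answers before sending requests. The sketch below polls the model-listing endpoint with the standard library. It assumes the server is reachable at `http://localhost:8000/v1` (the base URL used by the client example in this document) and that the listing follows the OpenAI-style `{"data": [{"id": ...}]}` shape. `models_url` and `wait_for_server` are hypothetical helpers.

```python
import json
import time
import urllib.request


def models_url(base_url):
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip('/') + '/models'


def wait_for_server(base_url, timeout=60.0, interval=2.0):
    """Poll until the server lists its models; return the ids, or None on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(models_url(base_url), timeout=interval) as resp:
                data = json.load(resp)
            return [m['id'] for m in data.get('data', [])]
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: retry later.
            time.sleep(interval)
    return None
```

Calling `wait_for_server('http://localhost:8000/v1')` after starting one of the deploy commands would return the served model ids once the server is up, or `None` if it does not respond in time.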

Client:

The following example shows how to call internvl2-2b from a client:

from openai import OpenAI
client = OpenAI(
    api_key='EMPTY',
    base_url='http://localhost:8000/v1',
)
model_type = client.models.list().data[0].id
print(f'model_type: {model_type}')

# use base64
# import base64
# with open('baby.mp4', 'rb') as f:
#     vid_base64 = base64.b64encode(f.read()).decode('utf-8')
# video_url = f'data:video/mp4;base64,{vid_base64}'

# use local_path
# from swift.llm import convert_to_base64
# video_url = convert_to_base64(images=['baby.mp4'])['images'][0]
# video_url = f'data:video/mp4;base64,{video_url}'

# use url
video_url = 'https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4'

query = '描述这段视频'
messages = [{
    'role': 'user',
    'content': [
        {'type': 'video_url', 'video_url': {'url': video_url}},
        {'type': 'text', 'text': query},
    ]
}]
resp = client.chat.completions.create(
    model=model_type,
    messages=messages,
    temperature=0)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = '图中有几只羊'
image_url = 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png'
messages = [{
    'role': 'user',
    'content': [
        {'type': 'image_url', 'image_url': {'url': image_url}},
        {'type': 'text', 'text': query},
    ]
}]
stream_resp = client.chat.completions.create(
    model=model_type,
    messages=messages,
    stream=True,
    temperature=0)

print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
    print(chunk.choices[0].delta.content or '', end='', flush=True)
print()

"""
model_type: internvl2-2b
query: 描述这段视频
response: 这段视频展示了一个小女孩坐在床上，专注地阅读一本书。她戴着一副黑框眼镜，穿着浅绿色的无袖上衣，头发梳成马尾辫。视频中，小女孩的注意力完全集中在书本上，她用双手捧着书，时而翻页，时而抬头看向镜头。

背景中可以看到一个木制的婴儿床，床上铺着花纹的床单，旁边还有一些衣物和玩具。房间的墙壁上挂着一些装饰品，显得温馨而舒适。

视频中，小女孩的动作非常自然，她时而翻页，时而用手指轻轻拨动书页，显得非常专注和投入。她的表情平静而专注，似乎完全沉浸在书中的内容中。

整个视频给人一种温馨、宁静的感觉，小女孩的专注和认真让人感到非常温暖。视频中的每一个细节都展示了小女孩的纯真和好奇心，让人不禁想要和她一起探索书中的世界。
query: 图中有几只羊
response: 图中有四只羊。
"""
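
The commented-out base64 variant in the client example builds a `data:video/mp4;base64,...` URL by hand. The same step can be wrapped in a small function that works for images and videos alike. This is a sketch using only the standard library; `to_data_url` is a hypothetical helper, and the MIME type is guessed from the file extension unless passed explicitly.

```python
import base64
import mimetypes


def to_data_url(path, mime=None):
    """Encode a local file as a data URL of the form data:<mime>;base64,<payload>."""
    mime = mime or mimetypes.guess_type(path)[0] or 'application/octet-stream'
    with open(path, 'rb') as f:
        payload = base64.b64encode(f.read()).decode('utf-8')
    return f'data:{mime};base64,{payload}'
```

The result can be used in place of `video_url` or `image_url` in the messages above, for example `to_data_url('baby.mp4', 'video/mp4')` for a local copy of the video.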

For more ways to call the service from a client, see the MLLM deployment documentation.
