docs(project): add alpaca and chatglm profiling

MegEngine · Jun 14, 2023 · a12dee9
1 parent 1033809
commit a12dee9

Showing 4 changed files with 54 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -38,8 +38,7 @@ If it is executed locally, execute `./chatglm -m chatglm-q4.bin -t 4` directly.
 - Android is Xiaomi 9, Qualcomm SM8150 Snapdragon 855
 ![android running](./assets/arm-mi9.gif)
 
-### supported model
+### Supported model
 Now InferLLM supports [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B), [llama](https://github.com/facebookresearch/llama), [alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) models.
 ### License
 InferLLM is licensed under the Apache License, Version 2.0
-
diff --git a/README_Chinese.md b/README_Chinese.md
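@@ -5,14 +5,14 @@ InferLLM is a very lightweight LLM model inference framework that mainly references and borrows from
 - Simple structure that is easy to get started with for development and learning; the framework part is decoupled from the kernel part
 - Efficient execution: most of the kernels in llama.cpp have been ported
 - Defines a dedicated KVstorage type for convenient caching and management
-- Compatible with multiple model formats (currently only the Chinese and English alpaca int4 models are supported)
+- Compatible with multiple model formats (the Chinese and English alpaca int4 models are supported)
 - Currently supports only CPU, mainly the Arm and x86 platforms; it can be deployed on mobile phones at an acceptable speed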
 
 In short, InferLLM is a simple and efficient CPU inference framework for LLMs that can deploy quantized LLM models locally with decent inference speed.
 
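 ## How to use
 ### Download the model
-InferLLM currently uses the same models as llama.cpp, so models can be downloaded from the llama.cpp project. Models can also be downloaded directly from [kewin4933/InferLLM-Model](https://huggingface.co/kewin4933/InferLLM-Model/tree/main) on Hugging Face; two alpaca models have been uploaded to that project so far: one is the Chinese int4 model and the other is the English int4 model.
+InferLLM currently uses the same models as llama.cpp, so models can be downloaded from the llama.cpp project. Models can also be downloaded directly from [kewin4933/InferLLM-Model](https://huggingface.co/kewin4933/InferLLM-Model/tree/main) on Hugging Face; two alpaca models have been uploaded to that project so far: one is a Chinese int4 model and the other is an English int4 model.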
 ### Compile InferLLM
 #### Local compilation
 ```shell

diff --git a/docs/profile.md b/docs/profile.md
@@ -0,0 +1,49 @@
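+# Model speed benchmarks
+
+## Test method
+
+Because alpaca and LLaMA share the same inference process, and the Chinese version differs only in its weights, we test only the Chinese alpaca model and ChatGLM; the results also apply to the English models.
+
+## alpaca results
+
+1. Hardware: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
+
+    | Model | Generation speed (token/s) | Threads |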
+    | :-: | :-: | :-: |
+    | chinese-alpaca-7b-q4 | 3.2 | 1 |
+    | chinese-alpaca-7b-q4 | 9.2 | 4 |
+    | chinese-alpaca-7b-q4 | 10 | 8 |
+    | chinese-alpaca-7b-q4 | 9.8 | 16 |
+
+2. Hardware: [AMD EPYC 7742 64-Core @ 2.25GHz](https://www.amd.com/zh-hant/products/cpu/amd-epyc-7742)
+
+    | Model | Generation speed (token/s) | Threads |
+    | :-: | :-: | :-: |
+    | chinese-alpaca-7b-q4 | 2.3 | 1 |
+    | chinese-alpaca-7b-q4 | 7.3 | 4 |
+    | chinese-alpaca-7b-q4 | 10.5 | 8 |
+    | chinese-alpaca-7b-q4 | 10.7 | 16 |
+    | chinese-alpaca-7b-q4 | 11.2 | 32 |
+    | chinese-alpaca-7b-q4 | 12.7 | 64 |
+
+## ChatGLM results
+
+1. Hardware: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
+
+    | Model | Generation speed (token/s) | Threads |
+    | :-: | :-: | :-: |
+    | chatglm-q4 | 3.2 | 1 |
+    | chatglm-q4 | 8.0 | 4 |
+    | chatglm-q4 | 8.9 | 8 |
+    | chatglm-q4 | 7.3 | 16 |
+
+2. Hardware: AMD EPYC 7742 64-Core @ 2.25GHz
+
+    | Model | Generation speed (token/s) | Threads |
+    | :-: | :-: | :-: |
+    | chatglm-q4 | 2.4 | 1 |
+    | chatglm-q4 | 5.8 | 4 |
+    | chatglm-q4 | 8.9 | 8 |
+    | chatglm-q4 | 9.1 | 16 |
+    | chatglm-q4 | 11.6 | 32 |
+    | chatglm-q4 | 11.7 | 64 |
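
One way to read the tables in `docs/profile.md` is as speedup over the single-thread run and as parallel efficiency (speedup divided by the thread-count ratio). The sketch below is illustrative only and not part of this commit: the `Sample` and `Scaling` types and the `scaling` and `print_scaling` functions are hypothetical names, and the caller is assumed to pass the single-thread row first.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical types for illustration; not part of InferLLM.
struct Sample {
    int threads;          // thread count used for the run
    double tokens_per_s;  // measured generation speed
};

struct Scaling {
    double speedup;     // throughput relative to the baseline run
    double efficiency;  // speedup divided by the thread-count ratio
};

// Compare one measurement against a baseline (usually the 1-thread row).
Scaling scaling(const Sample& base, const Sample& s) {
    double speedup = s.tokens_per_s / base.tokens_per_s;
    double efficiency = speedup * base.threads / s.threads;
    return {speedup, efficiency};
}

// Print one line per measurement, using the first entry as the baseline.
void print_scaling(const std::vector<Sample>& samples) {
    if (samples.empty()) return;
    const Sample& base = samples.front();
    for (const Sample& s : samples) {
        Scaling r = scaling(base, s);
        std::printf("%2d threads: %5.1f token/s, speedup %.2fx, efficiency %.0f%%\n",
                    s.threads, s.tokens_per_s, r.speedup, r.efficiency * 100.0);
    }
}
```

For example, calling `print_scaling({{1, 2.4}, {4, 5.8}, {8, 8.9}, {16, 9.1}, {32, 11.6}, {64, 11.7}})` with the chatglm-q4 rows measured on the EPYC 7742 shows how quickly efficiency falls off as threads are added, which matches the flattening visible in the raw numbers.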
diff --git a/src/core/model_imp.cpp b/src/core/model_imp.cpp
@@ -136,5 +136,7 @@ std::string ModelImp::decode_summary() const {
     ret += "Total Model Compute Token:  " + std::to_string(m_past) + "\n";
     ret += "Average Token Compute Time: " +
            std::to_string(m_time_cost * 1000 / m_past) + "ms\n";
+    ret += "Average Token Generation Speed: " +
+           std::to_string(m_past / m_time_cost) + "token/s\n";
     return ret;
 }
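
The two summary lines in `decode_summary()` are reciprocals of each other: milliseconds per token is `m_time_cost * 1000 / m_past`, and tokens per second is `m_past / m_time_cost`. Below is a minimal standalone sketch of the same arithmetic, assuming the token count is an unsigned integer and the elapsed time is a `double` in seconds (the diff does not show how `m_past` and `m_time_cost` are declared). The helper name `format_speed` is hypothetical, and the zero guard is an addition that the commit itself does not contain.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of InferLLM: formats average latency and
// throughput from a token count and the elapsed wall time in seconds.
std::string format_speed(uint32_t tokens, double seconds) {
    if (tokens == 0 || seconds <= 0.0) {
        // Nothing decoded yet, or no measurable time: avoid dividing by zero.
        return "Average Token Generation Speed: n/a\n";
    }
    double ms_per_token = seconds * 1000.0 / tokens;  // latency per token
    double tokens_per_s = tokens / seconds;           // throughput
    std::string ret;
    ret += "Average Token Compute Time: " + std::to_string(ms_per_token) + " ms\n";
    ret += "Average Token Generation Speed: " + std::to_string(tokens_per_s) +
           " token/s\n";
    return ret;
}
```

With 100 tokens decoded in 10 seconds, this reports 100 ms per token and 10 token/s. `std::to_string` on a `double` always prints six decimal places, so a real implementation might prefer `snprintf` with an explicit precision.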