Inference of a range of models, from under 3B to over 45B parameters, for real-time chatting on your computer (CPU); pure C++ implementation based on @ggerganov's ggml:
- LlaMA-like (`LlamaForCausalLM`):
  - All LlaMA-1 models
  - LlaMA-2: Chat-7B, etc.
  - CodeLlaMA: Instruct-7B
  - DeepSeek: Chat-7B, Coder-6.7B 🔥
  - Yi: Chat-6B, Chat-34B
  - WizardLM: LM 7B, LM 13B, Coder Python-7B
  - TigerBot: Chat-7B, Chat-13B
- Baichuan (`BaichuanForCausalLM`)
- ChatGLM (`ChatGLMModel`):
  - ChatGLM: 6B
  - ChatGLM2 family: ChatGLM2 6B, CodeGeeX2 6B, ChatGLM3 6B

    Tip on CodeGeeX2: code completion only, no context. Use the system prompt to specify the language, e.g. `-s "# language: python"` (see the example invocations after this list).
- InternLM (`InternLMForCausalLM`)
- Mistral (`MistralForCausalLM`, `MixtralForCausalLM`):
  - Mistral: Instruct-7B, 7B
  - OpenChat: 3.5 🔥

    Tip: use the system prompt to select a mode: `-s GPT4` (default mode), `-s Math` (mathematical reasoning mode).
  - WizardLM: Math 7B
  - Mixtral: Instruct-8x7B 🔥

    Two implementations of sliding-window attention are provided (see `SlidingWindowAttentionImpl`):
    - Full cache: more RAM is needed (default).
    - Ring cache (i.e. rolling cache): less RAM, but the current implementation is naive (slow). 💣
  - NeuralBeagle14: 7B
- Phi (`PhiForCausalLM`):
  - Phi-2 🔥

    Tip: `--temp 0` is recommended. Don't forget to try `--format qa`.
- QWenLM (`QWenLMHeadModel`)
- BlueLM (`BlueLMForCausalLM`)
- Stable-LM (`StableLMEpochModel`)

  Note: this is an autocompletion model, not a chat/instruction model, so please use `--format completion`.
- Accelerated, memory-efficient CPU inference with int4/int8 quantization, an optimized KV cache, and parallel computing;
- Use of OOP to address the similarities between different Transformer-based models;
- Streaming generation with a typewriter effect;
- Continuous chatting (content length is virtually unlimited)

  Two methods are available: Restart and Shift. See the `--extending` options (a sketch is given after this list).
- LoRA;
- Python binding, web demo, and more possibilities.
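As an illustration of the tips above, here is a minimal sketch of how the flags mentioned in this list might be combined with `main`. This is only a sketch: the `*.bin` file names are placeholders for models you have converted yourself (see "Quantize Model" below), and the values accepted by `--extending` are an assumption (verify with `./build/bin/main -h`):

# the model file names below are placeholders, not files shipped with the project
./build/bin/main -m codegeex2.bin -s "# language: python"   # CodeGeeX2: pin the completion language
./build/bin/main -m openchat.bin -s Math                    # OpenChat 3.5: mathematical reasoning mode
./build/bin/main -m phi2.bin --temp 0 --format qa           # Phi-2: greedy decoding + Q&A format
./build/bin/main -m stablelm.bin --format completion        # Stable-LM: autocompletion only
# assumption: --extending selects the context-extension method by name
./build/bin/main -m quantized.bin -i --extending shift
./build/bin/main -m quantized.bin -i --extending restart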
Preparation
Clone the ChatLLM.cpp repository to your local machine:
git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp
If you forgot the `--recursive` flag when cloning the repository, run the following command in the `chatllm.cpp` folder:
git submodule update --init --recursive
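If you are unsure whether the submodules were fetched correctly, a quick check (standard git, not specific to this project) lists each submodule and its current commit:

# a clean (unprefixed) commit hash per submodule indicates it is checked out
git submodule status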
Quantize Model
Use `convert.py` to transform models into the quantized GGML format. For example, to convert an fp16 base model to a q8_0 (quantized int8) GGML model, run:
# DeepSeek LLM Chat models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DeepSeek
# DeepSeek Coder models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DeepSeekCoder
# CodeLlaMA models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA
# Yi models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a Yi
# WizardCoder models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardCoder
# WizardLM models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardLM
# WizardMath models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardMath
# OpenChat models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a OpenChat
# TigerBot models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a TigerBot
# Dolphin (based on Phi-2) models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DolphinPhi2
# For other models, such as ChatGLM-6B, ChatGLM2-6B, InternLM, LlaMA, LlaMA-2, Baichuan-2, etc.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin
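The commands above produce int8 (`q8_0`) models. The feature list also mentions int4 quantization; assuming the converter accepts an int4 type named `q4_0` (check which types `convert.py` actually supports in your checkout), a smaller model could be produced like this:

# assumption: q4_0 is the converter's int4 quantization type
python3 convert.py -i path/to/model -t q4_0 -o quantized.bin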
Note: only the HF format is supported; the format of the generated `.bin` files is different from the GGUF format used by `llama.cpp`.
Build & Run
Compile the project using CMake:
cmake -B build
# On Linux, WSL:
cmake --build build -j
# On Windows with MSVC:
cmake --build build -j --config Release
Now you may chat with a quantized model by running:
./build/bin/main -m chatglm-ggml.bin # ChatGLM-6B
# Hello 👋! I am the AI assistant ChatGLM-6B. Nice to meet you. Feel free to ask me any question.
./build/bin/main -m chatglm2-ggml.bin --top_p 0.8 --temp 0.8 # ChatGLM2-6B
# Hello 👋! I am the AI assistant ChatGLM2-6B. Nice to meet you. Feel free to ask me any question.
./build/bin/main -m llama2.bin --seed 100 # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....
To run the model in interactive mode, add the `-i` flag. For example:
# On Windows
.\build\bin\Release\main -m model.bin -i
# On Linux (or WSL)
rlwrap ./build/bin/main -m model.bin -i
In interactive mode, your chat history serves as the context for the next round of the conversation.
Run `./build/bin/main -h` to explore more options!
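The options shown in this section can also be combined; for instance, an interactive session with custom sampling and a fixed seed might be started like this (`quantized.bin` is a placeholder for your converted model):

# interactive chat combining the sampling options and seed shown above
rlwrap ./build/bin/main -m quantized.bin -i --top_p 0.8 --temp 0.8 --seed 100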
- This project started as a refactoring of ChatGLM.cpp, without which it would not have been possible.
- Thanks to those who have released the sources and checkpoints of their models.

This is a hobby project for learning DL & GGML, and it is under active development. PRs for new features will not be accepted, while PRs for bug fixes are warmly welcome.