
ChatLLM.cpp

License: MIT

Inference of a bunch of models, ranging from less than 3B to more than 45B parameters, for real-time chatting on your computer (CPU). Pure C++ implementation based on @ggerganov's ggml:

  • LlaMA-like (LlamaForCausalLM):

  • Baichuan (BaichuanForCausalLM)

  • ChatGLM (ChatGLMModel):

    • ChatGLM: 6B

    • ChatGLM2 family: ChatGLM2 6B, CodeGeeX2 6B, ChatGLM3 6B

      Tip on CodeGeeX2: Code completion only, no context. Use the system prompt to specify the language, e.g. -s "# language: python" (example commands follow this list).

  • InternLM (InternLMForCausalLM)

  • Mistral (MistralForCausalLM, MixtralForCausalLM)

    • Mistral: Instruct-7B, 7B

    • OpenChat: 3.5 🔥

      Tip: Use the system prompt to select a mode: -s GPT4 (default mode), -s Math (mathematical reasoning mode).

    • WizardLM: Math 7B

    • Mixtral: Instruct-8x7B 🔥

      Two implementations of sliding-window attention (see SlidingWindowAttentionImpl):

      • Full cache: more RAM is needed (default).
      • Ring cache (i.e. rolling cache): less RAM, but current implementation is naive (slow). 💣
    • NeuralBeagle14: 7B

  • Phi (PhiForCausalLM)

    • Phi-2 🔥

      Tip: --temp 0 is recommended. Don't forget to try --format qa.

    • Dolphin Phi-2 🐬

  • QWenLM (QWenLMHeadModel)

  • BlueLM (BlueLMForCausalLM)

  • Stable-LM (StableLMEpochModel)

    • Code-3B

      Note: This model is an autocompletion model, not a chat/instruction model, so please use --format completion.
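
Putting the per-model tips above into complete commands (the model file names here are just placeholders for your own quantized .bin files, and the main binary is built in the Build & Run section below):

# CodeGeeX2: specify the target language via the system prompt
./build/bin/main -m codegeex2.bin -s "# language: python"

# OpenChat: switch to the mathematical reasoning mode
./build/bin/main -m openchat.bin -s Math

# Phi-2: greedy decoding and the QA format
./build/bin/main -m phi2.bin --temp 0 --format qa

# StableLM Code-3B: plain autocompletion
./build/bin/main -m stablelm-code.bin --format completion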

Features

  • Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache and parallel computing;

  • Use OOP to address the similarities between different Transformer-based models;

  • Streaming generation with typewriter effect;

  • Continuous chatting (content length is virtually unlimited)

    Two methods are available: Restart and Shift. See the --extending options (a sketch follows this list).

  • LoRA;

  • Python binding, web demo, and more possibilities.
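
A sketch of the continuous-chatting feature mentioned above (the value shift is an assumption here; run ./build/bin/main -h for the values actually accepted by --extending):

# assumed value for --extending; check ./build/bin/main -h for the accepted methods
./build/bin/main -m model.bin -i --extending shift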

Preparation

Clone the ChatLLM.cpp repository to your local machine:

git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatllm.cpp folder:

git submodule update --init --recursive

Quantize Model

Use convert.py to transform models into quantized GGML format. For example, to convert an fp16 model to a q8_0 (int8-quantized) GGML model, run:

# DeepSeek LLM Chat models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DeepSeek

# DeepSeek Coder models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DeepSeekCoder

# CodeLlaMA models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA

# Yi models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a Yi

# WizardCoder models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardCoder

# WizardLM models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardLM

# WizardMath models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardMath

# OpenChat models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a OpenChat

# TigerBot models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a TigerBot

# Dolphin (based on Phi-2) models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DolphinPhi2

# For other models, such as ChatGLM-6B, ChatGLM2-6B, InternLM, LlaMA, LlaMA-2, Baichuan-2, etc.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin

Note: Only the HF format is supported; the format of the generated .bin files is different from the GGUF format used by llama.cpp.
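
The Features section also mentions int4 quantization. A sketch, assuming -t q4_0 selects int4 quantization analogously to q8_0 above (check convert.py's help for the types it actually supports):

# int4 (q4_0) quantization, assumed type name; verify against convert.py's help
python3 convert.py -i path/to/model -t q4_0 -o quantized.bin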

Build & Run

Compile the project using CMake:

cmake -B build
# On Linux, WSL:
cmake --build build -j
# On Windows with MSVC:
cmake --build build -j --config Release

Now you may chat with a quantized model by running:

./build/bin/main -m chatglm-ggml.bin                            # ChatGLM-6B
# Hello 👋! I am the AI assistant ChatGLM-6B. Nice to meet you; feel free to ask me anything.
./build/bin/main -m chatglm2-ggml.bin --top_p 0.8 --temp 0.8    # ChatGLM2-6B
# Hello 👋! I am the AI assistant ChatGLM2-6B. Nice to meet you; feel free to ask me anything.
./build/bin/main -m llama2.bin  --seed 100                      # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....

To run the model in interactive mode, add the -i flag. For example:

# On Windows
.\build\bin\Release\main -m model.bin -i

# On Linux (or WSL)
rlwrap ./build/bin/main -m model.bin -i

In interactive mode, your chat history serves as the context for the next round of conversation.

Run ./build/bin/main -h to explore more options!

Acknowledgements

  • This project started as a refactoring of ChatGLM.cpp, without which this project would not have been possible.

  • Thanks to those who have released their model sources and checkpoints.

Note

This is a hobby project for learning DL & GGML, and it is under active development. PRs for new features won't be accepted, while PRs for bug fixes are warmly welcome.
