
ChatLLM.cpp

License: MIT

Inference of a bunch of models, ranging from less than 3B to more than 45B parameters, for real-time chatting on your computer (CPU). Pure C++ implementation based on @ggerganov's ggml:

  • LlaMA-like (LlamaForCausalLM):

  • Baichuan (BaichuanForCausalLM)

  • ChatGLM (ChatGLMModel):

    • ChatGLM: 6B

    • ChatGLM2 family: ChatGLM2 6B, CodeGeeX2 6B, ChatGLM3 6B

      Tip on CodeGeeX2: Code completion only, no context. Use the system prompt to specify the language, e.g. -s "# language: python" (example commands follow this list).

  • InternLM (InternLMForCausalLM)

  • Mistral (MistralForCausalLM, MixtralForCausalLM)

    • Mistral: Instruct-7B, 7B

    • OpenChat: 3.5 🔥

      Tip: Use the system prompt to select a mode: -s GPT4 (default mode), -s Math (mathematical reasoning mode).

    • WizardLM: Math 7B

    • Mixtral: Instruct-8x7B 🔥

      Two implementations of sliding-window attention (see SlidingWindowAttentionImpl):

      • Full cache: more RAM is needed (default).
      • Ring cache (i.e. rolling cache): less RAM, but current implementation is naive (slow). 💣
    • NeuralBeagle14: 7B

  • Phi (PhiForCausalLM)

    • Phi-2 🔥

      Tip: --temp 0 is recommended. Don't forget to try --format qa.

    • Dolphin Phi-2 🐬

  • QWenLM (QWenLMHeadModel)

  • BlueLM (BlueLMForCausalLM)

  • Stable-LM (StableLMEpochModel)

    • Code-3B

      Note: This model is an autocompletion model, not a chat/instruction model, so please use --format completion.
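
Putting the per-model tips above into complete commands (the model file names here are just placeholders for your own quantized .bin files, and the main binary is built in the Build & Run section below):

# CodeGeeX2: specify the target language via the system prompt
./build/bin/main -m codegeex2.bin -s "# language: python"

# OpenChat: switch to the mathematical reasoning mode
./build/bin/main -m openchat.bin -s Math

# Phi-2: greedy decoding and the QA format
./build/bin/main -m phi2.bin --temp 0 --format qa

# StableLM Code-3B: plain autocompletion
./build/bin/main -m stablelm-code.bin --format completion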

Features

  • Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache and parallel computing;

  • Use OOP to address the similarities between different Transformer-based models;

  • Streaming generation with typewriter effect;

  • Continuous chatting (content length is virtually unlimited)

    Two methods are available: Restart and Shift. See the --extending options (a sketch follows this list).

  • LoRA;

  • Python binding, web demo, and more possibilities.
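
A sketch of the continuous-chatting feature mentioned above (the value shift is an assumption here; run ./build/bin/main -h for the values actually accepted by --extending):

# assumed value for --extending; check ./build/bin/main -h for the accepted methods
./build/bin/main -m model.bin -i --extending shift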

Preparation

Clone the ChatLLM.cpp repository to your local machine:

git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatllm.cpp folder:

git submodule update --init --recursive

Quantize Model

Use convert.py to transform models into quantized GGML format. For example, to convert an fp16 model to a q8_0 (int8-quantized) GGML model, run:

# DeepSeek LLM Chat models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DeepSeek

# DeepSeek Coder models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DeepSeekCoder

# CodeLlaMA models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA

# Yi models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a Yi

# WizardCoder models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardCoder

# WizardLM models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardLM

# WizardMath models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a WizardMath

# OpenChat models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a OpenChat

# TigerBot models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a TigerBot

# Dolphin (based on Phi-2) models
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a DolphinPhi2

# For other models, such as ChatGLM-6B, ChatGLM2-6B, InternLM, LlaMA, LlaMA-2, Baichuan-2, etc.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin

Note: Only the HF format is supported; the format of the generated .bin files is different from the GGUF format used by llama.cpp.
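
The Features section also mentions int4 quantization. A sketch, assuming -t q4_0 selects int4 quantization analogously to q8_0 above (check convert.py's help for the types it actually supports):

# int4 (q4_0) quantization, assumed type name; verify against convert.py's help
python3 convert.py -i path/to/model -t q4_0 -o quantized.bin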

Build & Run

Compile the project using CMake:

cmake -B build
# On Linux, WSL:
cmake --build build -j
# On Windows with MSVC:
cmake --build build -j --config Release

Now you may chat with a quantized model by running:

./build/bin/main -m chatglm-ggml.bin                            # ChatGLM-6B
# Hello 👋! I am the AI assistant ChatGLM-6B. Nice to meet you; feel free to ask me anything.
./build/bin/main -m chatglm2-ggml.bin --top_p 0.8 --temp 0.8    # ChatGLM2-6B
# Hello 👋! I am the AI assistant ChatGLM2-6B. Nice to meet you; feel free to ask me anything.
./build/bin/main -m llama2.bin  --seed 100                      # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....

To run the model in interactive mode, add the -i flag. For example:

# On Windows
.\build\bin\Release\main -m model.bin -i

# On Linux (or WSL)
rlwrap ./build/bin/main -m model.bin -i

In interactive mode, your chat history serves as the context for the next round of conversation.

Run ./build/bin/main -h to explore more options!

Acknowledgements

  • This project started as a refactoring of ChatGLM.cpp, without which this project would not have been possible.

  • Thanks to those who have released their model sources and checkpoints.

Note

This is a hobby project for learning DL & GGML, and it is under active development. PRs for new features won't be accepted, while PRs for bug fixes are warmly welcome.
