This project is a fork of cyllama and provides a Python wrapper for @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine.
The following table provides an overview of the current implementations / features:
| Implementations / features | xllamacpp | llama-cpp-python |
|---|---|---|
| Wrapper type | Cython | ctypes |
| API | Server & Params API | Llama API |
| Server implementation | C++ | Python, via the wrapped Llama API |
| Continuous batching | yes | no |
| Thread safe | yes | no |
It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!
As the intent is to provide a very thin wrapping layer and play to the strengths of both the original C++ library and Python, the approach to wrapping adopts the following guidelines (a hypothetical sketch follows the list):
- In general, key structs are implemented as Cython extension classes, with related functions implemented as methods of those classes.
- Be as consistent as possible with llama.cpp's naming of its API elements, except when it makes sense to shorten function names that are used as methods.
- Minimize non-wrapper Python code.
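To make the first two guidelines concrete, here is a hypothetical sketch of how a llama.cpp function could surface in the wrapper. The `Model` class and `n_params` method are illustrative assumptions, not the published xllamacpp API; the `tests` directory shows the real names:

```python
# Hypothetical illustration of the wrapping guidelines above; the class and
# method names are assumptions, not the actual xllamacpp API.
#
# llama.cpp C API:
#   uint64_t llama_model_n_params(const struct llama_model * model);
#
# Wrapped: the struct becomes a Cython extension class and the function a
# method, with the now-redundant llama_model_ prefix shortened away.
from xllamacpp import Model  # assumed class name

model = Model("models/Llama-3.2-1B-Instruct-Q8_0.gguf")
print(model.n_params())  # would call llama_model_n_params() internally
```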
- From PyPI for `CPU` or `Mac`:

```sh
pip install -U xllamacpp
```

- From the GitHub-hosted package index for `CUDA` (use `--force-reinstall` to replace the installed CPU version):

```sh
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
```
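In either case, you can sanity-check the install by importing the package:

```sh
python -c "import xllamacpp; print('xllamacpp imported OK')"
```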
To build `xllamacpp`:

- A recent version of `python3` (tested on Python 3.12).
- Git clone the latest version of `xllamacpp`:

```sh
git clone git@github.com:xorbitsai/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update
```

- Install the `cython`, `setuptools`, and `pytest` dependencies for testing:

```sh
pip install -r requirements.txt
```

- Type `make` in the terminal.
The `tests` directory in this repo provides extensive examples of using `xllamacpp`. However, as a first step, you should download a smallish LLM in the `.gguf` format from Hugging Face. A good model to start with, and the one assumed by the tests, is `Llama-3.2-1B-Instruct-Q8_0.gguf`. `xllamacpp` expects models to be stored in a `models` folder in the cloned `xllamacpp` directory. To create the `models` directory if it doesn't exist and download this model, just type:

```sh
make download
```
This basically just does:
```sh
cd xllamacpp
mkdir models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
```
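If you'd rather script the download from Python, the `huggingface_hub` client (installed separately via `pip install huggingface_hub`) can fetch the same file; this is just an optional alternative to `make download`:

```python
# Optional alternative to `make download` using the huggingface_hub client.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q8_0.gguf",
    local_dir="models",  # drop the file into the models/ folder
)
print(path)
```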
Now you can test it using `llama-cli` or `llama-simple`:

```sh
bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
    -p "Is mathematics discovered or invented?"
```
You can also run the test suite with `pytest`, by typing `pytest` or:

```sh
make test
```
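If you want a quick check of your own alongside the suite, a minimal smoke test could look like this (the file name and assertions are illustrative only; the real tests in `tests/` are far more thorough):

```python
# tests/test_smoke.py — an illustrative smoke test, not a file in the repo.
import os

import xllamacpp  # fails loudly if the extension was not built correctly


def test_model_file_present():
    # The suite assumes this model was downloaded, e.g. via `make download`.
    assert os.path.exists("models/Llama-3.2-1B-Instruct-Q8_0.gguf")
```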