SlicedLlama is AI text generation software powered by exllamav2. It runs large language models (LLMs) and offers an interface to generate text, adjust model parameters, and modify the model on the fly by rearranging layers. A webUI is included, but it also works as a backend server for other LLM GUIs.
- Text Completion WebUI
- partly OpenAI-compatible API (this is a work in progress)
- Layer Slicing: Basically instant Franken-self-merges. You don't even need to reload the model (just the cache).
- Top Logprobs: See the top probabilities for each chosen token. This might help with adjusting sampler parameters.
- Make sure Python is installed, along with a GPU driver compatible with CUDA or ROCm (Linux only).
- Clone or download this repository.
- Use the setup script. This creates a venv and installs dependencies.
```
git clone --depth=1 https://github.com/silphendio/sliced_llama
cd sliced_llama
python ./setup.py
```
DISCLAIMER: I haven't tested it on Windows at all.
On Linux, just run it with
```
./sliced_llama_server.py
```
On Windows, click `start.bat` instead (it invokes `.venv\Scripts\python sliced_llama_server.py`).
This starts the inference server and the webUI. There, you can load models, adjust parameters and do inference. You can also use command line arguments, e.g.:
```
./sliced_llama_server.py --model ~/path/to/llm-model-exl2/ --context-size 2048 --slices "0-24, 8-32"
```
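To clarify what the slices mean, here is a minimal sketch of how such a spec could expand into a layer order. The half-open interpretation of each `a-b` range (layers `a` through `b-1`, ranges concatenated) is an assumption on my part; check the project source for the exact semantics.

```python
def expand_slices(spec: str) -> list[int]:
    """Expand a slice spec like "0-24, 8-32" into an explicit layer order.

    Assumption (not taken from the project source): each "a-b" range is
    half-open, selecting layers a..b-1, and the ranges are concatenated.
    """
    layers = []
    for part in spec.split(","):
        start, end = (int(x) for x in part.strip().split("-"))
        layers.extend(range(start, end))
    return layers

# A 32-layer model sliced as "0-24, 8-32" becomes a 48-layer self-merge:
print(expand_slices("0-24, 8-32"))  # [0, 1, ..., 23, 8, 9, ..., 31]
```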
Currently, only exl2 models are supported. You can get them from Hugging Face.
Make sure that the model fits into VRAM, with some extra memory depending on `(context size)² * (number of layers)`, where the context size is the number of tokens the model can remember.
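As a back-of-envelope illustration of that scaling, here is a hypothetical helper; the `bytes_per_cell` constant is made up for illustration only, since the real overhead also depends on the model's hidden size, cache quantization, and exllamav2 internals:

```python
def extra_mem_estimate(context_size: int, num_layers: int,
                       bytes_per_cell: float = 2.0) -> float:
    """Illustrates the scaling rule above: extra memory grows with
    (context size)^2 * (number of layers).

    bytes_per_cell is a placeholder constant, not a measured value.
    """
    return context_size ** 2 * num_layers * bytes_per_cell

# Doubling the context size quadruples the estimated overhead:
print(extra_mem_estimate(2048, 32) / 2**30)  # 0.25 (GiB, under these assumptions)
print(extra_mem_estimate(4096, 32) / 2**30)  # 1.0
```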
The WebUI currently only supports text completion, so you need to do the prompt formatting yourself. Each model has its preferred prompt format, so look it up.
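For example, a model tuned on the ChatML format expects prompts like the one below (built as a Python string here just for illustration; consult your model's card for its actual template):

```python
# ChatML-style prompt; other models use different delimiters,
# so always check the model card.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "What is layer slicing?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```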
As an alternative to the webUI, the server can also connect to OpenAI-compatible GUIs like Mikupad or SillyTavern.
- For SillyTavern, select chat completion, and use `http://127.0.0.1:57593/v1` as custom endpoint. This won't give you many options, but if you change parameters in the WebUI, the inference server should remember them. You can select different chat templates in the WebUI, and add more to the `chat_templates` folder.
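You can also script against the server directly. Below is a minimal sketch using only the Python standard library; it assumes the server follows the OpenAI-style `/v1/completions` route and response shape, which may not hold for every parameter since the API is still a work in progress:

```python
import json
import urllib.request

# Assumes an OpenAI-style completions endpoint; unsupported
# parameters may simply be ignored by the server.
req = urllib.request.Request(
    "http://127.0.0.1:57593/v1/completions",
    data=json.dumps({
        "prompt": "Once upon a time",
        "max_tokens": 64,
        "temperature": 0.8,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```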
In no particular order:
- configuration file
- LoRA support
- Classifier Free Guidance
- OpenAI API:
  - chat completion currently only works with streaming
  - `presence_penalty` and `frequency_penalty` aren't supported
- authentication
- usage statistics
- compatibility with TabbyAPI (For better SillyTavern integration)
- merging different models together
- different merging methods