SlicedLlama is AI text generation software powered by exllamav2. It runs large language models (LLMs) and offers an interface to generate text, adjust model parameters, and modify the model on the fly by rearranging layers. A webUI is included, but it also works as a backend server for other LLM GUIs.
- Text Completion WebUI
- partly OpenAI-compatible API (this is a work in progress)
- Layer Slicing: Basically instant Franken-self-merges. You don't even need to reload the model (just the cache).
- Top Logprobs: See the top probabilities for each chosen token. This might help with adjusting sampler parameters.
- Make sure Python is installed, along with a GPU driver compatible with CUDA or ROCm (Linux only).
- Clone or download this repository.
- Use the setup script. This creates a venv and installs dependencies.
```
git clone --depth=1 https://github.com/silphendio/sliced_llama
cd sliced_llama
python ./setup.py
```
DISCLAIMER: I haven't tested it on Windows at all.
On Linux, just run it with
```
./sliced_llama_server.py
```
On Windows, click `start.bat` instead (it invokes `.venv\Scripts\python sliced_llama_server.py`).
This starts the inference server and the webUI. There, you can load models, adjust parameters and do inference. You can also use command line arguments, e.g.:
```
./sliced_llama_server.py --model ~/path/to/llm-model-exl2/ --context-size 2048 --slices "0-24, 8-32"
```
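To clarify what the slices mean, here is a minimal sketch of how such a spec could expand into a layer order. The half-open interpretation of each `a-b` range (layers `a` through `b-1`, ranges concatenated) is an assumption on my part; check the project source for the exact semantics.

```python
def expand_slices(spec: str) -> list[int]:
    """Expand a slice spec like "0-24, 8-32" into an explicit layer order.

    Assumption (not taken from the project source): each "a-b" range is
    half-open, selecting layers a..b-1, and the ranges are concatenated.
    """
    layers = []
    for part in spec.split(","):
        start, end = (int(x) for x in part.strip().split("-"))
        layers.extend(range(start, end))
    return layers

# A 32-layer model sliced as "0-24, 8-32" becomes a 48-layer self-merge:
print(expand_slices("0-24, 8-32"))  # [0, 1, ..., 23, 8, 9, ..., 31]
```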
Currently, only exl2 models are supported. You can get them from Hugging Face.
Make sure that the model fits into VRAM, with some extra memory depending on `(context size)² * (number of layers)`, where the context size is the number of tokens the model can remember.
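As a back-of-envelope illustration of that scaling, here is a hypothetical helper; the `bytes_per_cell` constant is made up for illustration only, since the real overhead also depends on the model's hidden size, cache quantization, and exllamav2 internals:

```python
def extra_mem_estimate(context_size: int, num_layers: int,
                       bytes_per_cell: float = 2.0) -> float:
    """Illustrates the scaling rule above: extra memory grows with
    (context size)^2 * (number of layers).

    bytes_per_cell is a placeholder constant, not a measured value.
    """
    return context_size ** 2 * num_layers * bytes_per_cell

# Doubling the context size quadruples the estimated overhead:
print(extra_mem_estimate(2048, 32) / 2**30)  # 0.25 (GiB, under these assumptions)
print(extra_mem_estimate(4096, 32) / 2**30)  # 1.0
```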
The WebUI currently only supports text completion, so you need to do the prompt formatting yourself. Each model has its preferred prompt format, so look it up.
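For example, a model tuned on the ChatML format expects prompts like the one below (built as a Python string here just for illustration; consult your model's card for its actual template):

```python
# ChatML-style prompt; other models use different delimiters,
# so always check the model card.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "What is layer slicing?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```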
As an alternative to the webUI, the server can also connect to OpenAI-compatible GUIs like Mikupad or SillyTavern.
- For SillyTavern, select chat completion, and use `http://127.0.0.1:57593/v1` as custom endpoint. This won't give you many options, but if you change parameters in the WebUI, the inference server should remember them. You can select different chat templates in the WebUI, and add more to the `chat_templates` folder.
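You can also script against the server directly. Below is a minimal sketch using only the Python standard library; it assumes the server follows the OpenAI-style `/v1/completions` route and response shape, which may not hold for every parameter since the API is still a work in progress:

```python
import json
import urllib.request

# Assumes an OpenAI-style completions endpoint; unsupported
# parameters may simply be ignored by the server.
req = urllib.request.Request(
    "http://127.0.0.1:57593/v1/completions",
    data=json.dumps({
        "prompt": "Once upon a time",
        "max_tokens": 64,
        "temperature": 0.8,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```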
In no particular order:
- configuration file
- LoRA support
- Classifier Free Guidance
- OpenAI API:
  - chat completion currently only works with streaming
  - `presence_penalty` and `frequency_penalty` aren't supported
- authentication
- usage statistics
- compatibility with TabbyAPI (For better SillyTavern integration)
- merging different models together
- different merging methods