llm is an ecosystem of Rust libraries for working with large language models. It is built on top of the fast, efficient GGML library for machine learning.
Image by @darthdeus, using Stable Diffusion
The primary entrypoint for developers is the llm crate, which wraps llm-base and the supported model crates.
For end-users, there is a CLI application, llm-cli, which provides a convenient interface for interacting with supported models. Text generation can be done as a one-off based on a prompt, or interactively through REPL or chat modes. The CLI can also be used to serialize (print) decoded models, quantize GGML files, or compute the perplexity of a model. It can be downloaded from the latest GitHub release or installed from crates.io.
llm is powered by the ggml tensor library, and aims to bring the robustness and ease of use of Rust to the world of large language models. At present, inference is only on the CPU, but we hope to support GPU inference in the future through alternate backends.
Currently, the following models are supported:
- BLOOM
- GPT-2
- GPT-J
- GPT-NeoX: GPT-NeoX, StableLM, RedPajama, Dolly v2
- LLaMA: LLaMA, Alpaca, Vicuna, Koala, GPT4All v1, GPT4-X, Wizard
- MPT
This project depends on Rust v1.65.0 or above and a modern C toolchain.
The llm crate exports llm-base and the model crates (e.g. bloom, gpt2, llama).
To use llm, add it to your Cargo.toml:
[dependencies]
llm = "0.2"
NOTE: To improve debug performance, exclude llm from being built in debug mode:
[profile.dev.package.llm]
opt-level = 3
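With the dependency in place, the crate can be used to load a GGML model and generate text directly from Rust. The following is a minimal sketch modeled on the crate's published example; the exact signatures of llm::load, InferenceRequest, and the output callback have changed between releases, so treat the parameter lists here as assumptions and consult the crate documentation and the repository's examples directory for the current API. It also assumes rand is declared as a dependency; the model path and prompt are placeholders.

```rust
use std::io::Write;

use llm::Model;

fn main() {
    // Load a LLaMA-architecture GGML model from disk. The path is a placeholder;
    // the second argument is the default llm::ModelParameters and the third is a
    // callback that prints loading progress to stdout.
    let llama = llm::load::<llm::models::Llama>(
        std::path::Path::new("/path/to/ggml-model.bin"),
        Default::default(),
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));

    // Start an inference session and generate text from a prompt, streaming each
    // token to stdout as it is produced.
    let mut session = llama.start_session(Default::default());
    let res = session.infer::<std::convert::Infallible>(
        &llama,
        &mut rand::thread_rng(),
        &llm::InferenceRequest {
            prompt: "Rust is a cool programming language because",
            ..Default::default()
        },
        // llm::OutputRequest
        &mut Default::default(),
        // Output callback: print tokens as they arrive.
        |t| {
            print!("{t}");
            std::io::stdout().flush().unwrap();
            Ok(())
        },
    );

    // The Ok variant carries inference statistics (token counts, timings).
    match res {
        Ok(stats) => println!("\n\n{stats}"),
        Err(err) => println!("\n{err}"),
    }
}
```

The same pattern should apply to the other model architectures by swapping the type parameter passed to llm::load for the corresponding type under llm::models.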
Follow these steps to build the command line application, which is named llm:
To install llm to your Cargo bin directory, which rustup is likely to have added to your PATH, run:
cargo install llm-cli
The CLI application can then be run through llm.
Alternatively, clone the repository and then build it with:
git clone --recurse-submodules git@github.com:rustformers/llm.git
cargo build --release
The resulting binary will be at target/release/llm[.exe].
It can also be run directly through Cargo, with:
cargo run --release -- $ARGS
GGML files are easy to acquire. For a list of models that have been tested, see the known-good models.
Certain older GGML formats are not supported by this project, but the goal is to maintain feature parity with the upstream GGML project. For problems relating to loading models, or to request support for additional GGML model types, please open an Issue.
Hugging Face 🤗 is a leader in open-source machine learning and hosts hundreds of GGML models. Search for GGML models on Hugging Face 🤗.
This Reddit community maintains a wiki related to GGML models, including well-organized lists of links for acquiring GGML models (mostly from Hugging Face 🤗).
Once the llm executable has been built or is in a $PATH directory, try running it. Here's an example that uses the open-source GPT4All language model:
llm llama infer -m ggml-gpt4all-j-v1.3-groovy.bin -p "Rust is a cool programming language because"
For more information about the llm CLI, use the --help parameter.
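Subcommands generally accept --help as well; for example, assuming the same subcommand layout as the invocation above:

```sh
llm llama infer --help
```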
There is also a simple inference example that is helpful for debugging:
cargo run --release --example inference llama ggml-gpt4all-j-v1.3-groovy.bin $OPTIONAL_PROMPT
Python v3.9 or v3.10 is needed to convert a raw model to a GGML-compatible format (note that Python v3.11 is not supported):
python3 util/convert-pth-to-ggml.py $MODEL_HOME/$MODEL/7B/ 1
The output of the above command can be used by llm to create a quantized model:
cargo run --release llama quantize $MODEL_HOME/$MODEL/7B/ggml-model-f16.bin $MODEL_HOME/$MODEL/7B/ggml-model-q4_0.bin q4_0
In future, we hope to provide a more streamlined way of converting models.
Note: The llama.cpp repository has additional information on how to obtain and run specific models.
Yes, llm can be used for chat, but certain fine-tuned models (e.g. Alpaca, Vicuna, Pygmalion) are more suited to chat use-cases than so-called "base models". Here's an example of using the llm CLI in REPL (Read-Eval-Print Loop) mode with an Alpaca model; note that the provided prompt format is tailored to the model that is being used:
llm llama repl -m ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt
There is also a Vicuna chat example that demonstrates how to create a custom chatbot:
cargo run --release --example vicuna-chat llama ggml-vicuna-7b-q4.bin
Sessions can be loaded (--load-session) or saved (--save-session) to file. To automatically load and save the same session, use --persist-session. This can be used to cache prompts to reduce load time, too.
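As an illustration, here is a hypothetical invocation that reuses the prompt and Alpaca model file from the examples above and persists the session to an arbitrary path; the session file name is a placeholder:

```sh
llm llama infer -m ggml-alpaca-7b-q4.bin -p "Rust is a cool programming language because" --persist-session ./alpaca.session
```

On a second run with the same flags, the cached session is loaded, so the prompt does not need to be fed through the model again.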
The llm Dockerfile is in the util directory, as is a Flake manifest and lockfile.
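A hypothetical build command, assuming the file is named util/Dockerfile and the command is run from the repository root with an arbitrary image tag:

```sh
docker build -f util/Dockerfile -t llm .
```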
Absolutely! Contributions are welcome; please see the contributing guide.
- llmcord: Discord bot for generating messages using llm.
- local.ai: Desktop app for hosting an inference API on your local machine using llm.
- llm-chain: Build chains in large language models for text summarization and completion of more complex tasks.