
Commit e07c6a5

tinglou authored and ggerganov committed

readme : update the usage section with examples (ggml-org#10596)

* readme : update the usage section with examples
* readme : more examples
1 parent 617d880 commit e07c6a5


README.md

Lines changed: 202 additions & 74 deletions
@@ -42,9 +42,9 @@ The `llama.cpp` project is the main playground for developing new features for t
 
 Typically finetunes of the base models below are supported as well.
 
-Instructions for adding support for new models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md)
+Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)
 
-**Text-only:**
+#### Text-only
 
 - [X] LLaMA 🦙
 - [x] LLaMA 2 🦙🦙
@@ -99,7 +99,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](./docs/deve
 - [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
 - [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
 
-**Multimodal:**
+#### Multimodal
 
 - [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
 - [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
@@ -213,27 +213,27 @@ Instructions for adding support for new models: [HOWTO-add-model.md](./docs/deve
 
 | Backend | Target devices |
 | --- | --- |
-| [Metal](./docs/build.md#metal-build) | Apple Silicon |
-| [BLAS](./docs/build.md#blas-build) | All |
-| [BLIS](./docs/backend/BLIS.md) | All |
-| [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
-| [MUSA](./docs/build.md#musa) | Moore Threads MTT GPU |
-| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
-| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
-| [Vulkan](./docs/build.md#vulkan) | GPU |
-| [CANN](./docs/build.md#cann) | Ascend NPU |
-
-## Building and usage
+| [Metal](docs/build.md#metal-build) | Apple Silicon |
+| [BLAS](docs/build.md#blas-build) | All |
+| [BLIS](docs/backend/BLIS.md) | All |
+| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
+| [MUSA](docs/build.md#musa) | Moore Threads MTT GPU |
+| [CUDA](docs/build.md#cuda) | Nvidia GPU |
+| [hipBLAS](docs/build.md#hipblas) | AMD GPU |
+| [Vulkan](docs/build.md#vulkan) | GPU |
+| [CANN](docs/build.md#cann) | Ascend NPU |
+
+## Building the project
 
 The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](include/llama.h).
 The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:
 
-- Clone this repository and build locally, see [how to build](./docs/build.md)
-- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](./docs/install.md)
-- Use a Docker image, see [documentation for Docker](./docs/docker.md)
+- Clone this repository and build locally, see [how to build](docs/build.md)
+- On MacOS or Linux, install `llama.cpp` via [brew, flox or nix](docs/install.md)
+- Use a Docker image, see [documentation for Docker](docs/docker.md)
 - Download pre-built binaries from [releases](https://github.com/ggerganov/llama.cpp/releases)
 
-### Obtaining and quantizing models
+## Obtaining and quantizing models
 
 The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:
 
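For reference, a minimal local build with CMake looks like the following (a sketch of the default CPU workflow from [docs/build.md](docs/build.md); backend-specific options such as `-DGGML_CUDA=ON` are covered there):

```bash
# clone and build the default (CPU) configuration
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# the resulting binaries (llama-cli, llama-server, ...) are placed in build/bin
```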
@@ -251,79 +251,204 @@ The Hugging Face platform provides a variety of online tools for converting, qua
 - Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF meta data in the browser (more info: https://github.com/ggerganov/llama.cpp/discussions/9268)
 - Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggerganov/llama.cpp/discussions/9669)
 
-To learn more about model quantization, [read this documentation](./examples/quantize/README.md)
+To learn more about model quantization, [read this documentation](examples/quantize/README.md)
 
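As a concrete sketch of that workflow, converting a local Hugging Face checkpoint and quantizing it to 4 bits could look like this (assuming the `convert_hf_to_gguf.py` script and the `llama-quantize` tool; the file names are placeholders):

```bash
# convert a Hugging Face model directory to a GGUF file
python convert_hf_to_gguf.py ./my-hf-model --outfile model-f16.gguf

# quantize the result down to 4-bit (Q4_K_M)
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```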
-### Using the `llama-cli` tool
+## [`llama-cli`](examples/main)
 
-Run a basic text completion:
+#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.
 
-```bash
-llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
+- <details open>
+    <summary>Run simple text completion</summary>
 
-# Output:
-# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
-```
+    ```bash
+    llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128
 
-See [this page](./examples/main/README.md) for a full list of parameters.
+    # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
+    ```
 
-### Conversation mode
+    </details>
 
-Run `llama-cli` in conversation/chat mode by passing the `-cnv` parameter:
+- <details>
+    <summary>Run in conversation mode</summary>
 
-```bash
-llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
+    ```bash
+    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv
 
-# Output:
-# > hi, who are you?
-# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
-#
-# > what is 1+1?
-# Easy peasy! The answer to 1+1 is... 2!
-```
+    # > hi, who are you?
+    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
+    #
+    # > what is 1+1?
+    # Easy peasy! The answer to 1+1 is... 2!
+    ```
 
-By default, the chat template will be taken from the input model. If you want to use another chat template, pass `--chat-template NAME` as a parameter. See the list of [supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
+    </details>
 
-```bash
-llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
-```
+- <details>
+    <summary>Run with custom chat template</summary>
 
-You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
+    ```bash
+    # use the "chatml" template
+    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
 
-```bash
-llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
-```
+    # use a custom template
+    llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
+    ```
 
-### Constrained output with grammars
+    [Supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
 
-`llama.cpp` can constrain the output of the model via custom grammars. For example, you can force the model to output only JSON:
+    </details>
 
-```bash
-llama-cli -m your_model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
-```
+- <details>
+    <summary>Constrain the output with a custom grammar</summary>
 
-The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).
+    ```bash
+    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
 
-For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
+    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
+    ```
 
-### Web server (`llama-server`)
+    The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md).
 
-The [llama-server](./examples/server/README.md) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
+    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
 
-Example usage:
+    </details>
 
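A grammar of your own can be a single rule. A minimal sketch, assuming the GBNF syntax from the [GBNF Guide](grammars/README.md) (the file name `yesno.gbnf` is made up):

```bash
# write a grammar that only accepts the strings "yes" or "no"
cat > yesno.gbnf <<'EOF'
root ::= "yes" | "no"
EOF

# constrain llama-cli with it
llama-cli -m model.gguf --grammar-file yesno.gbnf -p "Is the sky blue? Answer:"
```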
-```bash
-llama-server -m your_model.gguf --port 8080
 
-# Basic web UI can be accessed via browser: http://localhost:8080
-# Chat completion endpoint: http://localhost:8080/v1/chat/completions
-```
+## [`llama-server`](examples/server)
 
-### Perplexity (measuring model quality)
+#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.
 
-Use the `llama-perplexity` tool to measure perplexity over a given prompt (lower perplexity is better).
-For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
+- <details open>
+    <summary>Start a local HTTP server with default configuration on port 8080</summary>
+
+    ```bash
+    llama-server -m model.gguf --port 8080
+
+    # Basic web UI can be accessed via browser: http://localhost:8080
+    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
+    ```
+
+    </details>
+
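Once the server is up, the chat completion endpoint shown above can be exercised with any OpenAI-compatible client; a plain `curl` request is the quickest sanity check (request shape per the OpenAI chat API; see [examples/server/README.md](examples/server/README.md) for the full set of endpoints):

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is 1+1?"}
  ]
}'
```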
+- <details>
+    <summary>Support multiple users and parallel decoding</summary>
+
+    ```bash
+    # up to 4 concurrent requests, each with 4096 max context
+    llama-server -m model.gguf -c 16384 -np 4
+    ```
+
+    </details>
+
+- <details>
+    <summary>Enable speculative decoding</summary>
+
+    ```bash
+    # the draft.gguf model should be a small variant of the target model.gguf
+    llama-server -m model.gguf -md draft.gguf
+    ```
+
+    </details>
+
+- <details>
+    <summary>Serve an embedding model</summary>
+
+    ```bash
+    # use the /embedding endpoint
+    llama-server -m model.gguf --embedding --pooling cls -ub 8192
+    ```
+
+    </details>
+
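To verify the embedding setup, the `/embedding` endpoint referenced in the comment above can be queried directly; a sketch, assuming it accepts a JSON body with a `content` field (see [examples/server/README.md](examples/server/README.md) for the authoritative request format):

```bash
curl http://localhost:8080/embedding -H "Content-Type: application/json" -d '{
  "content": "Hello, world!"
}'
```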
+- <details>
+    <summary>Serve a reranking model</summary>
+
+    ```bash
+    # use the /reranking endpoint
+    llama-server -m model.gguf --reranking
+    ```
+
+    </details>
+
+- <details>
+    <summary>Constrain all outputs with a grammar</summary>
+
+    ```bash
+    # custom grammar
+    llama-server -m model.gguf --grammar-file grammar.gbnf
+
+    # JSON
+    llama-server -m model.gguf --grammar-file grammars/json.gbnf
+    ```
+
+    </details>
+
+
+## [`llama-perplexity`](examples/perplexity)
+
+#### A tool for measuring the perplexity [^1][^2] (and other quality metrics) of a model over a given text.
+
+- <details open>
+    <summary>Measure the perplexity over a text file</summary>
+
+    ```bash
+    llama-perplexity -m model.gguf -f file.txt
+
+    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
+    # Final estimate: PPL = 5.4007 +/- 0.67339
+    ```
+
+    </details>
+
+- <details>
+    <summary>Measure KL divergence</summary>
+
+    ```bash
+    # TODO
+    ```
+
+    </details>
+
+[^1]: [examples/perplexity/README.md](examples/perplexity/README.md)
+[^2]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
+
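Until the TODO above is filled in, a plausible invocation can be pieced together from [examples/perplexity/README.md](examples/perplexity/README.md): first save the logits of a full-precision base model, then compare a quantized variant against them. The `--kl-divergence*` flags and file names below are assumptions based on that document:

```bash
# save the base model's logits over the evaluation text (assumed flags)
llama-perplexity -m model-f16.gguf -f file.txt --kl-divergence-base logits.kld

# measure the KL divergence of a quantized model against the saved logits
llama-perplexity -m model-q4_0.gguf -f file.txt --kl-divergence-base logits.kld --kl-divergence
```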
+## [`llama-bench`](examples/llama-bench)
+
+#### Benchmark the performance of the inference for various parameters.
+
+- <details open>
+    <summary>Run default benchmark</summary>
+
+    ```bash
+    llama-bench -m model.gguf
+
+    # Output:
+    # | model               |       size |     params | backend    | threads |          test |                  t/s |
+    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
+    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
+    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
+    #
+    # build: 3e0ba0e60 (4229)
+    ```
+
+    </details>
+
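Beyond the default run, specific workloads can be benchmarked by varying the test sizes; a sketch, assuming the `-p` (prompt tokens), `-n` (generated tokens) and `-t` (thread counts) parameters from the tool's `--help` output:

```bash
# benchmark 512-token prompt processing and 128-token generation at 4 and 8 threads
llama-bench -m model.gguf -p 512 -n 128 -t 4,8
```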
+
+## [`llama-simple`](examples/simple)
+
+#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.
+
+- <details>
+    <summary>Basic text completion</summary>
+
+    ```bash
+    llama-simple -m model.gguf
+
+    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
+    ```
+
+    </details>
 
-To learn more how to measure perplexity using llama.cpp, [read this documentation](./examples/perplexity/README.md)
 
 ## Contributing
 
@@ -338,19 +463,19 @@ To learn more how to measure perplexity using llama.cpp, [read this documentatio
 
 ## Other documentation
 
-- [main (cli)](./examples/main/README.md)
-- [server](./examples/server/README.md)
-- [GBNF grammars](./grammars/README.md)
+- [main (cli)](examples/main/README.md)
+- [server](examples/server/README.md)
+- [GBNF grammars](grammars/README.md)
 
-**Development documentation**
+#### Development documentation
 
-- [How to build](./docs/build.md)
-- [Running on Docker](./docs/docker.md)
-- [Build on Android](./docs/android.md)
-- [Performance troubleshooting](./docs/development/token_generation_performance_tips.md)
+- [How to build](docs/build.md)
+- [Running on Docker](docs/docker.md)
+- [Build on Android](docs/android.md)
+- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
 - [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
 
-**Seminal papers and background on the models**
+#### Seminal papers and background on the models
 
 If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
 - LLaMA:
@@ -361,3 +486,6 @@ If your issue is with model generation quality, then please at least scan the fo
 - GPT-3.5 / InstructGPT / ChatGPT:
   - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
   - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
+
+#### References
+