Commit 504f9e7

Merge pull request #103 from bernardladenthin/claude/review-readme-docs-biT1j
Expand README with features section and API documentation
2 parents fffa979 + 524428a commit 504f9e7

2 files changed

Lines changed: 79 additions & 17 deletions

README.md

Lines changed: 78 additions & 16 deletions
@@ -7,21 +7,29 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 
 **You are welcome to contribute**
 
-1. [Quick Start](#quick-start)
-1.1 [No Setup required](#no-setup-required)
-1.2 [Setup required](#setup-required)
-2. [Documentation](#documentation)
-2.1 [Example](#example)
-2.2 [Inference](#inference)
-2.3 [Infilling](#infilling)
-3. [Android](#importing-in-android)
-
-> [!NOTE]
-> Now with support for Gemma 3 and Gemma 4
-
-## Download
-
-[![](https://img.shields.io/badge/download-class.jar-blue)](dist/llama-4.2.0.jar)
+1. [Features](#features)
+2. [Quick Start](#quick-start)
+2.1 [No Setup required](#no-setup-required)
+2.2 [Setup required](#setup-required)
+3. [Documentation](#documentation)
+3.1 [Example](#example)
+3.2 [Inference](#inference)
+3.3 [Chat Completion](#chat-completion)
+3.4 [Infilling](#infilling)
+3.5 [Embeddings & Reranking](#embeddings--reranking)
+3.6 [Raw JSON Endpoints](#raw-json-endpoints)
+4. [Android](#importing-in-android)
+
+## Features
+
+- Text completion (blocking and streaming) with full control over sampling parameters.
+- OpenAI-compatible **chat completion** with automatic chat-template application, including streaming and tool/function calling support via the upstream server.
+- **Embeddings** and **reranking** for retrieval pipelines.
+- **Infilling** (fill-in-the-middle) for code models.
+- **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
+- **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
+- **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
+- Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
 
 ## Quick Start
 
@@ -31,7 +39,7 @@ Access this library via Maven:
 <dependency>
     <groupId>net.ladenthin</groupId>
     <artifactId>llama</artifactId>
-    <version>4.2.0</version>
+    <version>5.0.0-SNAPSHOT</version>
 </dependency>
 ```
 
@@ -162,10 +170,64 @@ try (LlamaModel model = new LlamaModel(modelParams)) {
 > freed when the model is no longer needed. This isn't strictly required, but avoids memory leaks if you use different
 > models throughout the lifecycle of your application.
 
+### Chat Completion
+
+For chat models, build a list of role/content pairs and let the library apply the model's chat template.
+`chatComplete()` returns the full response, `generateChat()` streams tokens, and `chatCompleteText()` returns
+just the text content of the assistant message.
+
+```java
+List<Pair<String, String>> messages = new ArrayList<>();
+messages.add(new Pair<>("user", "Write a haiku about Java."));
+
+InferenceParameters inferParams = new InferenceParameters("")
+        .setMessages("You are a helpful assistant.", messages)
+        .setUseChatTemplate(true);
+
+try (LlamaModel model = new LlamaModel(modelParams)) {
+    // Streaming
+    for (LlamaOutput output : model.generateChat(inferParams)) {
+        System.out.print(output);
+    }
+    // Or blocking, returns the OpenAI-compatible JSON envelope
+    String json = model.chatComplete(inferParams);
+    // Or just the assistant text
+    String text = model.chatCompleteText(inferParams);
+}
+```
+
+Reasoning/thinking models can receive custom Jinja template variables via
+`ModelParameters#setChatTemplateKwargs(Map)`.
+
 ### Infilling
 
 You can simply set `InferenceParameters#setInputPrefix(String)` and `InferenceParameters#setInputSuffix(String)`.
 
+### Embeddings & Reranking
+
+Load the model with `enableEmbedding()` (or `enableReranking()`) and call `embed(String)` to get a sentence
+embedding, or `rerank(query, documents...)` to get relevance scores.
+
+```java
+ModelParameters modelParams = new ModelParameters()
+        .setModel("/path/to/embedding-model.gguf")
+        .enableEmbedding();
+try (LlamaModel model = new LlamaModel(modelParams)) {
+    float[] embedding = model.embed("Embed this sentence");
+}
+```
+
+### Raw JSON Endpoints
+
+For direct access to the upstream llama.cpp server API, the following methods take a JSON request and return
+a JSON response, matching the HTTP server's contract:
+
+`handleCompletions`, `handleCompletionsOai`, `handleChatCompletions`, `handleInfill`,
+`handleEmbeddings`, `handleTokenize`, `handleDetokenize`.
+
+Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
+`restoreSlot(int, String)`, and `getModelMeta()`.
+
 ### Model/Inference Configuration
 
 There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder
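
Note on the Embeddings & Reranking addition above: `embed(String)` is shown returning a `float[]`, and a typical next step in a retrieval pipeline is to compare such vectors by cosine similarity. The following is a minimal sketch in plain Java that does not call the library at all. The class name `EmbeddingSimilarity` and the helpers `cosine` and `rank` are hypothetical, and the vectors in `main` are toy values, not real model output.

```java
import java.util.Arrays;
import java.util.Comparator;

public class EmbeddingSimilarity {

    /** Cosine similarity of two equal-length vectors; returns 0 if either vector has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns document indices ordered from most to least similar to the query vector. */
    static Integer[] rank(float[] query, float[][] docs) {
        Integer[] order = new Integer[docs.length];
        for (int i = 0; i < docs.length; i++) {
            order[i] = i;
        }
        // Sort by negated similarity so the highest score comes first.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -cosine(query, docs[i])));
        return order;
    }

    public static void main(String[] args) {
        // Toy vectors standing in for embeddings (assumption: real ones would come from embed()).
        float[] query = {1f, 0f};
        float[][] docs = {{0f, 1f}, {1f, 0.1f}, {-1f, 0f}};
        System.out.println(Arrays.toString(rank(query, docs))); // prints [1, 0, 2]
    }
}
```

In practice the `query` and `docs` arrays would be filled from embedding calls on the same model, since vectors from different models are not comparable.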

pom.xml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 
 <groupId>net.ladenthin</groupId>
 <artifactId>llama</artifactId>
-<version>4.2.0</version>
+<version>5.0.0-SNAPSHOT</version>
 <packaging>jar</packaging>
 
 <name>${project.groupId}:${project.artifactId}</name>
