- Text completion (blocking and streaming) with full control over sampling parameters.
+- OpenAI-compatible **chat completion** with automatic chat-template application, including streaming and tool/function calling support via the upstream server.
+- **Embeddings** and **reranking** for retrieval pipelines.
+- **Infilling** (fill-in-the-middle) for code models.
+- **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
+- **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
- Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
## Quick Start
@@ -31,7 +39,7 @@ Access this library via Maven:
<dependency>
    <groupId>net.ladenthin</groupId>
    <artifactId>llama</artifactId>
-    <version>4.2.0</version>
+    <version>5.0.0-SNAPSHOT</version>
</dependency>
```
@@ -162,10 +170,64 @@ try (LlamaModel model = new LlamaModel(modelParams)) {
> freed when the model is no longer needed. This isn't strictly required, but avoids memory leaks if you use different
> models throughout the lifecycle of your application.
+### Chat Completion
+
+For chat models, build a list of role/content pairs and let the library apply the model's chat template.
+`chatComplete()` returns the full response, `generateChat()` streams tokens, and `chatCompleteText()` returns