llmedge is a lightweight Android library for running GGUF language models fully on-device, powered by llama.cpp.
See the examples repository for sample usage.
Acknowledgments to Shubham Panchal and upstream projects are listed in CREDITS.md.
- Run GGUF models directly on Android using llama.cpp (JNI)
- Download and cache models from Hugging Face
- Minimal on-device RAG (retrieval-augmented generation) pipeline
- Built-in memory usage metrics
- Optional Vulkan acceleration
Clone the repository along with the llama.cpp submodule:
git clone --depth=1 https://github.com/Aatricks/llmedge
cd llmedge
git submodule update --init --recursive
Open the project in Android Studio. If it does not build automatically, use Build > Rebuild Project.
Load a local GGUF file and run a blocking prompt from a background coroutine:
val smol = SmolLM()
CoroutineScope(Dispatchers.IO).launch {
    // Load the GGUF model from local app storage off the main thread.
    val modelFile = File(context.filesDir, "models/tinyllama.gguf")
    smol.load(modelFile.absolutePath)

    // Blocking generation call; returns the full response text.
    val reply = smol.getResponse("Summarize on-device LLMs in one sentence.")

    withContext(Dispatchers.Main) {
        outputView.text = reply
    }
}
Call smol.close() when the instance is no longer needed to free native memory.
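For example, a minimal lifecycle sketch that ties cleanup to the generation call (only the SmolLM calls shown above are used; modelPath is a placeholder):

CoroutineScope(Dispatchers.IO).launch {
    val smol = SmolLM()
    try {
        smol.load(modelPath)
        val reply = smol.getResponse("Hello!")
        // use reply ...
    } finally {
        smol.close() // release the native llama.cpp context
    }
}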
llmedge can download and cache GGUF model weights directly from Hugging Face:
val smol = SmolLM()
val download = smol.loadFromHuggingFace(
    context = context,
    modelId = "unsloth/Qwen3-0.6B-GGUF",
    filename = "Qwen3-0.6B-Q4_K_M.gguf", // optional
    forceDownload = false,
    preferSystemDownloader = true
)
Log.d("llmedge", "Loaded ${download.file.name} from ${download.file.parent}")
- loadFromHuggingFace downloads the weights (if needed) and loads the model immediately afterwards.
- Supports onProgress callbacks and private repositories via token (see the sketch after this list).
- Requests to old mirrors automatically resolve to up-to-date Hugging Face repos.
- Automatically uses the model's declared context window (minimum 1K tokens) and caps it to a heap-aware limit (2K–8K). Override with InferenceParams(contextSize = …) if needed.
- Large downloads use Android's DownloadManager when preferSystemDownloader = true to keep transfers out of the Dalvik heap.
- Advanced users can call HuggingFaceHub.ensureModelOnDisk() to manage caching and quantization manually.
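A minimal sketch of downloading from a private repository with progress logging; the exact onProgress callback shape and the token parameter are assumptions based on the notes above, not a verified signature, and HF_TOKEN is a placeholder for your own Hugging Face access token:

val smol = SmolLM()
smol.loadFromHuggingFace(
    context = context,
    modelId = "unsloth/Qwen3-0.6B-GGUF",
    token = HF_TOKEN,                             // assumed parameter: access token for private repos
    onProgress = { downloadedBytes, totalBytes -> // assumed callback shape
        Log.d("llmedge", "Fetched $downloadedBytes / $totalBytes bytes")
    }
)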
SmolLM lets you disable or re-enable "thinking" traces produced by reasoning-aware models through the ThinkingMode enum and the optional reasoningBudget parameter. The default configuration keeps thinking enabled (ThinkingMode.DEFAULT, reasoning budget -1). To start a session with thinking disabled (equivalent to passing --no-think or --reasoning-budget 0), specify it when loading the model:
val smol = SmolLM()
val params = SmolLM.InferenceParams(
    thinkingMode = SmolLM.ThinkingMode.DISABLED,
    reasoningBudget = 0, // explicit override, optional when the mode is DISABLED
)
smol.load(modelPath, params)
At runtime you can flip the behaviour without reloading the model:
smol.setThinkingEnabled(true) // restore the default
smol.setReasoningBudget(0) // force-disable thoughts again
val budget = smol.getReasoningBudget() // inspect the current budget
val mode = smol.getThinkingMode() // inspect the current mode
Setting the budget to 0 always disables thinking, while -1 leaves it unrestricted. If you omit reasoningBudget, the library chooses 0 when the mode is DISABLED and -1 otherwise. The API also injects the /no_think tag automatically when thinking is disabled, so you do not need to modify prompts manually.
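For example, a short sketch of that defaulting rule, using only the InferenceParams fields shown earlier:

// reasoningBudget omitted with DISABLED mode: defaults to 0 (thinking off, /no_think injected)
val noThinking = SmolLM.InferenceParams(thinkingMode = SmolLM.ThinkingMode.DISABLED)

// reasoningBudget omitted with DEFAULT mode: defaults to -1 (unrestricted thinking)
val withThinking = SmolLM.InferenceParams(thinkingMode = SmolLM.ThinkingMode.DEFAULT)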
The library includes a minimal on-device RAG pipeline, similar to Android-Doc-QA, built with:
- Sentence embeddings (ONNX)
- Whitespace TextSplitter
- In-memory cosine VectorStore with JSON persistence
- SmolLM for context-aware responses
- Download embeddings
From the Hugging Face repository sentence-transformers/all-MiniLM-L6-v2, place:
llmedge/src/main/assets/embeddings/all-minilm-l6-v2/model.onnx
llmedge/src/main/assets/embeddings/all-minilm-l6-v2/tokenizer.json
- Build the library
./gradlew :llmedge:assembleRelease
- Use in your application
val smol = SmolLM()
val rag = RAGEngine(context = this, smolLM = smol)
CoroutineScope(Dispatchers.IO).launch {
    rag.init()
    val count = rag.indexPdf(pdfUri)
    val answer = rag.ask("What are the key points?")
    withContext(Dispatchers.Main) {
        // render answer
    }
}
- Uses com.tom-roush:pdfbox-android for PDF parsing.
- Embeddings library: io.gitlab.shubham0204:sentence-embeddings:v6.
- Scanned PDFs require OCR (e.g., ML Kit or Tesseract) before indexing (see the sketch after these notes).
- ONNX token_type_ids errors are automatically handled; override via EmbeddingConfig if required.
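As a rough sketch of the OCR step with ML Kit text recognition (pageBitmap stands for a page already rendered with android.graphics.pdf.PdfRenderer; how the recognized text is fed back into RAGEngine is an assumption, shown as a hypothetical indexText call):

val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
val visionText = Tasks.await(recognizer.process(InputImage.fromBitmap(pageBitmap, 0)))
// Hypothetical: index the recognized page text instead of calling indexPdf on the scanned file.
// rag.indexText(visionText.text)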
- llama.cpp (C/C++) provides the core inference engine, built via the Android NDK.
- LLMInference.cpp wraps the llama.cpp C API.
- smollm.cpp exposes JNI bindings for Kotlin.
- The SmolLM Kotlin class provides a high-level API for model loading and inference.
- llama.cpp — Core LLM backend
- GGUF — Model format
- Android NDK / JNI — Native bindings
- ONNX Runtime — Sentence embeddings
- Android DownloadManager — Large file downloads
You can measure RAM usage at runtime:
val snapshot = MemoryMetrics.snapshot(context)
Log.d("Memory", snapshot.toPretty(context))
Typical measurement points:
- Before model load
- After model load
- After blocking prompt
- After streaming prompt
- totalPssKb: Total proportional RAM usage; best for overall tracking.
- dalvikPssKb: JVM-managed heap and runtime.
- nativePssKb: Native heap (llama.cpp, ONNX, tensors, KV cache).
- otherPssKb: Miscellaneous memory.
Monitor nativePssKb closely during model loading and inference to understand the LLM memory footprint.
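For example, a minimal sketch that captures snapshots at the measurement points listed above, using only the SmolLM and MemoryMetrics calls already shown in this README (modelPath is a placeholder):

CoroutineScope(Dispatchers.IO).launch {
    Log.d("Memory", "before load: " + MemoryMetrics.snapshot(context).toPretty(context))

    val smol = SmolLM()
    smol.load(modelPath)
    Log.d("Memory", "after load: " + MemoryMetrics.snapshot(context).toPretty(context))

    smol.getResponse("Warm-up prompt")
    Log.d("Memory", "after blocking prompt: " + MemoryMetrics.snapshot(context).toPretty(context))

    smol.close()
}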
- Vulkan SDK may be required; set the VULKAN_SDK environment variable when building with Vulkan.
- Vulkan acceleration can be checked via SmolLM.isVulkanEnabled() (see the snippet below).
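A quick check at startup, assuming the static-style call shown above:

val backend = if (SmolLM.isVulkanEnabled()) "Vulkan" else "CPU"
Log.d("llmedge", "llama.cpp backend: $backend")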
This project builds upon work by Shubham Panchal and ggerganov. See CREDITS.md for full details.