llmedge is a lightweight Android library for running GGUF language models fully on-device, powered by llama.cpp.
See the examples repository for sample usage.
Acknowledgments to Shubham Panchal and upstream projects are listed in CREDITS.md.
- Run GGUF models directly on Android using llama.cpp (JNI)
- Download and cache models from Hugging Face
- Minimal on-device RAG (retrieval-augmented generation) pipeline
- Built-in memory usage metrics
- Optional Vulkan acceleration
Clone the repository along with the llama.cpp submodule:
git clone --depth=1 https://github.com/Aatricks/llmedge
cd llmedge
git submodule update --init --recursive
Open the project in Android Studio. If it does not build automatically, use Build > Rebuild Project.
Load a local GGUF file and run a blocking prompt from a background coroutine:
val smol = SmolLM()
CoroutineScope(Dispatchers.IO).launch {
    // Load the GGUF model from local app storage off the main thread.
    val modelFile = File(context.filesDir, "models/tinyllama.gguf")
    smol.load(modelFile.absolutePath)

    // Blocking generation call; returns the full response text.
    val reply = smol.getResponse("Summarize on-device LLMs in one sentence.")

    withContext(Dispatchers.Main) {
        outputView.text = reply
    }
}
Call smol.close() when the instance is no longer needed to free native memory.
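For example, a minimal lifecycle sketch that ties cleanup to the generation call (only the SmolLM calls shown above are used; modelPath is a placeholder):

CoroutineScope(Dispatchers.IO).launch {
    val smol = SmolLM()
    try {
        smol.load(modelPath)
        val reply = smol.getResponse("Hello!")
        // use reply ...
    } finally {
        smol.close() // release the native llama.cpp context
    }
}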
llmedge can download and cache GGUF model weights directly from Hugging Face:
val smol = SmolLM()
val download = smol.loadFromHuggingFace(
    context = context,
    modelId = "unsloth/Qwen3-0.6B-GGUF",
    filename = "Qwen3-0.6B-Q4_K_M.gguf", // optional
    forceDownload = false,
    preferSystemDownloader = true
)
Log.d("llmedge", "Loaded ${download.file.name} from ${download.file.parent}")
- loadFromHuggingFace downloads the weights (if needed) and loads the model immediately afterwards.
- Supports onProgress callbacks and private repositories via token (see the sketch after this list).
- Requests to old mirrors automatically resolve to up-to-date Hugging Face repos.
- Automatically uses the model's declared context window (minimum 1K tokens) and caps it to a heap-aware limit (2K–8K). Override with InferenceParams(contextSize = …) if needed.
- Large downloads use Android's DownloadManager when preferSystemDownloader = true to keep transfers out of the Dalvik heap.
- Advanced users can call HuggingFaceHub.ensureModelOnDisk() to manage caching and quantization manually.
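A minimal sketch of downloading from a private repository with progress logging; the exact onProgress callback shape and the token parameter are assumptions based on the notes above, not a verified signature, and HF_TOKEN is a placeholder for your own Hugging Face access token:

val smol = SmolLM()
smol.loadFromHuggingFace(
    context = context,
    modelId = "unsloth/Qwen3-0.6B-GGUF",
    token = HF_TOKEN,                             // assumed parameter: access token for private repos
    onProgress = { downloadedBytes, totalBytes -> // assumed callback shape
        Log.d("llmedge", "Fetched $downloadedBytes / $totalBytes bytes")
    }
)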
SmolLM lets you disable or re-enable "thinking" traces produced by reasoning-aware models through the ThinkingMode enum and the optional reasoningBudget parameter. The default configuration keeps thinking enabled (ThinkingMode.DEFAULT, reasoning budget -1). To start a session with thinking disabled (equivalent to passing --no-think or --reasoning-budget 0), specify it when loading the model:
val smol = SmolLM()
val params = SmolLM.InferenceParams(
    thinkingMode = SmolLM.ThinkingMode.DISABLED,
    reasoningBudget = 0, // explicit override, optional when the mode is DISABLED
)
smol.load(modelPath, params)
At runtime you can flip the behaviour without reloading the model:
smol.setThinkingEnabled(true) // restore the default
smol.setReasoningBudget(0) // force-disable thoughts again
val budget = smol.getReasoningBudget() // inspect the current budget
val mode = smol.getThinkingMode() // inspect the current mode
Setting the budget to 0 always disables thinking, while -1 leaves it unrestricted. If you omit reasoningBudget, the library chooses 0 when the mode is DISABLED and -1 otherwise. The API also injects the /no_think tag automatically when thinking is disabled, so you do not need to modify prompts manually.
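For example, a short sketch of that defaulting rule, using only the InferenceParams fields shown earlier:

// reasoningBudget omitted with DISABLED mode: defaults to 0 (thinking off, /no_think injected)
val noThinking = SmolLM.InferenceParams(thinkingMode = SmolLM.ThinkingMode.DISABLED)

// reasoningBudget omitted with DEFAULT mode: defaults to -1 (unrestricted thinking)
val withThinking = SmolLM.InferenceParams(thinkingMode = SmolLM.ThinkingMode.DEFAULT)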
The library includes a minimal on-device RAG pipeline, similar to Android-Doc-QA, built with:
- Sentence embeddings (ONNX)
- Whitespace TextSplitter
- In-memory cosine VectorStore with JSON persistence
- SmolLM for context-aware responses
- Download embeddings
From the Hugging Face repository sentence-transformers/all-MiniLM-L6-v2, place:
llmedge/src/main/assets/embeddings/all-minilm-l6-v2/model.onnx
llmedge/src/main/assets/embeddings/all-minilm-l6-v2/tokenizer.json
- Build the library
./gradlew :llmedge:assembleRelease
- Use in your application
val smol = SmolLM()
val rag = RAGEngine(context = this, smolLM = smol)
CoroutineScope(Dispatchers.IO).launch {
    rag.init()
    val count = rag.indexPdf(pdfUri)
    val answer = rag.ask("What are the key points?")
    withContext(Dispatchers.Main) {
        // render answer
    }
}
- Uses com.tom-roush:pdfbox-android for PDF parsing.
- Embeddings library: io.gitlab.shubham0204:sentence-embeddings:v6.
- Scanned PDFs require OCR (e.g., ML Kit or Tesseract) before indexing (see the sketch after these notes).
- ONNX token_type_ids errors are automatically handled; override via EmbeddingConfig if required.
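As a rough sketch of the OCR step with ML Kit text recognition (pageBitmap stands for a page already rendered with android.graphics.pdf.PdfRenderer; how the recognized text is fed back into RAGEngine is an assumption, shown as a hypothetical indexText call):

val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
val visionText = Tasks.await(recognizer.process(InputImage.fromBitmap(pageBitmap, 0)))
// Hypothetical: index the recognized page text instead of calling indexPdf on the scanned file.
// rag.indexText(visionText.text)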
- llama.cpp (C/C++) provides the core inference engine, built via the Android NDK.
- LLMInference.cpp wraps the llama.cpp C API.
- smollm.cpp exposes JNI bindings for Kotlin.
- The SmolLM Kotlin class provides a high-level API for model loading and inference.
- llama.cpp — Core LLM backend
- GGUF — Model format
- Android NDK / JNI — Native bindings
- ONNX Runtime — Sentence embeddings
- Android DownloadManager — Large file downloads
You can measure RAM usage at runtime:
val snapshot = MemoryMetrics.snapshot(context)
Log.d("Memory", snapshot.toPretty(context))
Typical measurement points:
- Before model load
- After model load
- After blocking prompt
- After streaming prompt
- totalPssKb: Total proportional RAM usage; best for overall tracking.
- dalvikPssKb: JVM-managed heap and runtime.
- nativePssKb: Native heap (llama.cpp, ONNX, tensors, KV cache).
- otherPssKb: Miscellaneous memory.
Monitor nativePssKb closely during model loading and inference to understand the LLM memory footprint.
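For example, a minimal sketch that captures snapshots at the measurement points listed above, using only the SmolLM and MemoryMetrics calls already shown in this README (modelPath is a placeholder):

CoroutineScope(Dispatchers.IO).launch {
    Log.d("Memory", "before load: " + MemoryMetrics.snapshot(context).toPretty(context))

    val smol = SmolLM()
    smol.load(modelPath)
    Log.d("Memory", "after load: " + MemoryMetrics.snapshot(context).toPretty(context))

    smol.getResponse("Warm-up prompt")
    Log.d("Memory", "after blocking prompt: " + MemoryMetrics.snapshot(context).toPretty(context))

    smol.close()
}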
- Vulkan SDK may be required; set the VULKAN_SDK environment variable when building with Vulkan.
- Vulkan acceleration can be checked via SmolLM.isVulkanEnabled() (see the snippet below).
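A quick check at startup, assuming the static-style call shown above:

val backend = if (SmolLM.isVulkanEnabled()) "Vulkan" else "CPU"
Log.d("llmedge", "llama.cpp backend: $backend")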
This project builds upon work by Shubham Panchal and ggerganov. See CREDITS.md for full details.