
llmedge

llmedge is a lightweight Android library for running GGUF language models fully on-device, powered by llama.cpp.

See the examples repository for sample usage.

Acknowledgments to Shubham Panchal and upstream projects are listed in CREDITS.md.


Features

  • Run GGUF models directly on Android using llama.cpp (JNI)
  • Download and cache models from Hugging Face
  • Minimal on-device RAG (retrieval-augmented generation) pipeline
  • Built-in memory usage metrics
  • Optional Vulkan acceleration

Table of Contents

  1. Installation
  2. Usage
  3. Building
  4. Architecture
  5. Technologies
  6. Memory Metrics
  7. Notes

Installation

Clone the repository along with the llama.cpp submodule:

git clone --depth=1 https://github.com/Aatricks/llmedge
cd llmedge
git submodule update --init --recursive

Open the project in Android Studio. If it does not build automatically, use Build > Rebuild Project.

Usage

Quick Start

Load a local GGUF file and run a blocking prompt from a background coroutine:

val smol = SmolLM()

CoroutineScope(Dispatchers.IO).launch {
    val modelFile = File(context.filesDir, "models/tinyllama.gguf")
    smol.load(modelFile.absolutePath)

    val reply = smol.getResponse("Summarize on-device LLMs in one sentence.")
    withContext(Dispatchers.Main) {
        outputView.text = reply
    }
}

Call smol.close() when the instance is no longer needed to free native memory.
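
Because the native context lives outside the Dalvik heap, it helps to tie close() to the lifetime of whatever owns the instance. A minimal sketch using try/finally (the model path is a placeholder):

val modelFile = File(context.filesDir, "models/tinyllama.gguf")

CoroutineScope(Dispatchers.IO).launch {
    val smol = SmolLM()
    try {
        smol.load(modelFile.absolutePath)
        val reply = smol.getResponse("Hello from llmedge")
        Log.d("llmedge", reply)
    } finally {
        smol.close() // releases the native llama.cpp context
    }
}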

Downloading Models

llmedge can download and cache GGUF model weights directly from Hugging Face:

val smol = SmolLM()

val download = smol.loadFromHuggingFace(
    context = context,
    modelId = "unsloth/Qwen3-0.6B-GGUF",
    filename = "Qwen3-0.6B-Q4_K_M.gguf", // optional
    forceDownload = false,
    preferSystemDownloader = true
)

Log.d("llmedge", "Loaded ${download.file.name} from ${download.file.parent}")

Key points:

  • loadFromHuggingFace downloads the model if it is not already cached and loads it immediately afterwards.

  • Supports onProgress callbacks and private repositories via token.

  • Requests to old mirrors automatically resolve to up-to-date Hugging Face repos.

  • Automatically uses the model's declared context window (minimum 1K tokens) and caps it to a heap-aware limit (2K–8K). Override with InferenceParams(contextSize = …) if needed, as sketched after this list.

  • Large downloads use Android's DownloadManager when preferSystemDownloader = true to keep transfers out of the Dalvik heap.

  • Advanced users can call HuggingFaceHub.ensureModelOnDisk() to manage caching and quantization manually.
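
If the heap-aware default does not suit your device, pass explicit inference parameters when loading. A minimal sketch, assuming the model file is already on disk and a 4K context fits in memory:

val smol = SmolLM()

CoroutineScope(Dispatchers.IO).launch {
    // Cap the context window at 4K tokens instead of relying on the heap-aware default.
    val params = SmolLM.InferenceParams(contextSize = 4096)
    smol.load(modelFile.absolutePath, params)
}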

Reasoning Controls

SmolLM lets you disable or re-enable "thinking" traces produced by reasoning-aware models through the ThinkingMode enum and the optional reasoningBudget parameter. The default configuration keeps thinking enabled (ThinkingMode.DEFAULT, reasoning budget -1). To start a session with thinking disabled (equivalent to passing --no-think or --reasoning-budget 0), specify it when loading the model:

val smol = SmolLM()

val params = SmolLM.InferenceParams(
    thinkingMode = SmolLM.ThinkingMode.DISABLED,
    reasoningBudget = 0, // explicit override, optional when the mode is DISABLED
)
smol.load(modelPath, params)

At runtime you can flip the behaviour without reloading the model:

smol.setThinkingEnabled(true)              // restore the default
smol.setReasoningBudget(0)                 // force-disable thoughts again
val budget = smol.getReasoningBudget()     // inspect the current budget
val mode = smol.getThinkingMode()          // inspect the current mode

Setting the budget to 0 always disables thinking, while -1 leaves it unrestricted. If you omit reasoningBudget, the library chooses 0 when the mode is DISABLED and -1 otherwise. The API also injects the /no_think tag automatically when thinking is disabled, so you do not need to modify prompts manually.
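
For illustration, this is how the budget resolves when reasoningBudget is omitted (a minimal sketch of the defaults described above):

// Omitting reasoningBudget with DISABLED resolves to a budget of 0 …
val noThink = SmolLM.InferenceParams(thinkingMode = SmolLM.ThinkingMode.DISABLED)

// … while DEFAULT resolves to -1 (unrestricted thinking).
val withThink = SmolLM.InferenceParams(thinkingMode = SmolLM.ThinkingMode.DEFAULT)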

On-device RAG

The library includes a minimal on-device RAG pipeline, similar to Android-Doc-QA, built with:

  • Sentence embeddings (ONNX)
  • Whitespace TextSplitter
  • In-memory cosine VectorStore with JSON persistence
  • SmolLM for context-aware responses

Setup

  1. Download embeddings

    From the Hugging Face repository sentence-transformers/all-MiniLM-L6-v2, place:

llmedge/src/main/assets/embeddings/all-minilm-l6-v2/model.onnx
llmedge/src/main/assets/embeddings/all-minilm-l6-v2/tokenizer.json

  2. Build the library

./gradlew :llmedge:assembleRelease

  3. Use in your application

    val smol = SmolLM()
    val rag = RAGEngine(context = this, smolLM = smol)

    CoroutineScope(Dispatchers.IO).launch {
        rag.init()
        val count = rag.indexPdf(pdfUri)
        val answer = rag.ask("What are the key points?")
        withContext(Dispatchers.Main) {
            // render answer
        }
    }

Notes:

  • Uses com.tom-roush:pdfbox-android for PDF parsing.
  • Embeddings library: io.gitlab.shubham0204:sentence-embeddings:v6.
  • Scanned PDFs require OCR (e.g., ML Kit or Tesseract) before indexing.
  • ONNX token_type_ids errors are automatically handled; override via EmbeddingConfig if required.

Architecture

  1. llama.cpp (C/C++) provides the core inference engine, built via the Android NDK.
  2. LLMInference.cpp wraps the llama.cpp C API.
  3. smollm.cpp exposes JNI bindings for Kotlin.
  4. The SmolLM Kotlin class provides a high-level API for model loading and inference.

Technologies

  • llama.cpp — Core LLM backend
  • GGUF — Model format
  • Android NDK / JNI — Native bindings
  • ONNX Runtime — Sentence embeddings
  • Android DownloadManager — Large file downloads

Memory Metrics

You can measure RAM usage at runtime:

val snapshot = MemoryMetrics.snapshot(context)
Log.d("Memory", snapshot.toPretty(context))

Typical measurement points (a sketch follows the list):

  • Before model load
  • After model load
  • After blocking prompt
  • After streaming prompt
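
A minimal sketch covering these points with the blocking API (the streaming case is omitted here; the model path and prompt are placeholders):

val smol = SmolLM()

CoroutineScope(Dispatchers.IO).launch {
    Log.d("Memory", "before load: ${MemoryMetrics.snapshot(context).toPretty(context)}")

    smol.load(modelFile.absolutePath)
    Log.d("Memory", "after load: ${MemoryMetrics.snapshot(context).toPretty(context)}")

    smol.getResponse("Summarize on-device LLMs in one sentence.")
    Log.d("Memory", "after blocking prompt: ${MemoryMetrics.snapshot(context).toPretty(context)}")
}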

Key fields:

  • totalPssKb: Total proportional RAM usage. Best for overall tracking.
  • dalvikPssKb: JVM-managed heap and runtime.
  • nativePssKb: Native heap (llama.cpp, ONNX, tensors, KV cache).
  • otherPssKb: Miscellaneous memory.

Monitor nativePssKb closely during model loading and inference to understand the LLM's memory footprint.

Notes

  • Vulkan SDK may be required; set the VULKAN_SDK environment variable when building with Vulkan.
  • Vulkan acceleration can be checked via SmolLM.isVulkanEnabled().
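
For example, to log whether the Vulkan backend is active:

// Check whether the library was built and loaded with Vulkan acceleration.
Log.d("llmedge", "Vulkan enabled: ${SmolLM.isVulkanEnabled()}")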

License and Credits

This project builds upon work by Shubham Panchal and ggerganov. See CREDITS.md for full details.
