An API to query local language models using different backends with a unified interface. LocalLm supports multiple inference engines and provides a consistent way to interact with various local LLM providers.
| Package | Description | Doc |
|---|---|---|
| @locallm/types | The shared data types | Api doc - Readme |
| @locallm/api | Run local language models using different backends | Api doc - Readme |
| @locallm/browser | Run quantized language models inside the browser | Api doc - Readme |
LocalLm provides a unified interface for multiple local language model backends, allowing you to:
- Switch between different inference engines without changing your code
- Access advanced features like streaming, tool calling, and multimodal support
- Work with consistent APIs across different providers
- Get detailed statistics and progress tracking
- Leverage TypeScript support for better development experience
Supported providers:
- Llama.cpp - High-performance inference with a C/C++ backend
- Koboldcpp - Feature-rich inference with GPU support
- Ollama - Easy-to-use local model management
- Wllama - In-browser inference using WebAssembly
- Any OpenAI-compatible endpoint - Connect to custom or cloud OpenAI APIs
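Switching backends mostly comes down to changing the constructor options; the inference call stays the same. A minimal sketch, assuming a Koboldcpp or Ollama server is already running with a model available (the URLs are the defaults used later in this README):

```js
import { Lm } from "@locallm/api";

// Pick a backend: only these options change, the inference call below does not.
// const lm = new Lm({ providerType: "koboldcpp", serverUrl: "http://localhost:5001" });
const lm = new Lm({
  providerType: "ollama",
  serverUrl: "http://localhost:11434",
  onToken: (t) => process.stdout.write(t),
});

const result = await lm.infer("Say hello in one short sentence.", {
  stream: true,
  max_tokens: 60,
});
console.log("\nStats:", result.stats);
```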
Main features:
- Multiple Backend Support: Seamlessly switch between different inference engines
- Streaming Responses: Real-time token streaming for interactive applications
- Tool/Function Calling: Execute functions and tools during inference
- Multimodal Support: Process both text and images (where supported)
- Progress Tracking: Monitor model loading and inference progress
- Detailed Statistics: Get comprehensive performance metrics
- TypeScript Support: Full type definitions for better development
- Error Handling: Robust error handling and recovery mechanisms
Prerequisites:
- Node.js 18 or higher
- One of the supported backends running locally or accessible via network
```bash
# Install the API package
npm install @locallm/api
# Install types package (if needed separately)
npm install @locallm/types
```

Example with a Koboldcpp backend:

```js
import { Lm } from "@locallm/api";
const lm = new Lm({
providerType: "koboldcpp",
serverUrl: "http://localhost:5001",
onToken: (t) => process.stdout.write(t),
});
const template = "<s>[INST] {prompt} [/INST]";
const prompt = template.replace("{prompt}", "List the planets in our solar system");
// Run the inference query
const result = await lm.infer(prompt, {
stream: true,
temperature: 0.2,
max_tokens: 200,
});
console.log("\nResult:", result.text);
console.log("Stats:", result.stats);
```

Example with an OpenAI-compatible endpoint:

```js
import { Lm } from "@locallm/api";
const lm = new Lm({
providerType: "openai",
serverUrl: "http://localhost:8080/v1",
onToken: (t) => process.stdout.write(t),
});
// Handle graceful shutdown
process.on('SIGINT', () => {
lm.abort().then(() => process.exit());
});
const prompt = "Explain quantum computing in simple terms";
const result = await lm.infer(prompt, {
stream: true,
temperature: 0.7,
max_tokens: 300,
});
console.log("\nFull response:", result.text);
```

```js
// Load a specific model with context size
await lm.loadModel("llama3:8b", 8192);
// Check loaded model info
console.log("Current model:", lm.model);
console.log("Available models:", lm.models);
```

```js
// Using built-in templates
const prompt = lm.template.prompt("What is the capital of France?");
// Using custom templates
const customTemplate = "You are a helpful assistant. User: {prompt} Assistant:";
const formattedPrompt = customTemplate.replace("{prompt}", "Explain photosynthesis");
```

Tool calling: define a tool and pass it to infer via the `tools` parameter:

```js
const weatherTool = {
name: "getWeather",
description: "Get current weather for a location",
arguments: {
location: {
description: "The city and state, e.g. San Francisco, CA",
required: true
}
}
};
const result = await lm.infer("What's the weather in London?", {
stream: true,
tools: [weatherTool]
});
// Handle tool calls
if (result.toolCalls) {
for (const toolCall of result.toolCalls) {
console.log("Tool called:", toolCall.name);
console.log("Arguments:", toolCall.arguments);
}
}
```
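The snippet above only logs what the model asked for. One possible next step, sketched here with a hypothetical `toolHandlers` map and a stand-in `getWeather` implementation, is to dispatch each call to a local function:

```js
// Hypothetical handlers keyed by tool name; getWeather is a stand-in,
// replace it with a real lookup.
const toolHandlers = {
  getWeather: async ({ location }) => `Sunny and 21°C in ${location}`,
};

if (result.toolCalls) {
  for (const toolCall of result.toolCalls) {
    const handler = toolHandlers[toolCall.name];
    if (!handler) {
      console.warn("No handler for tool:", toolCall.name);
      continue;
    }
    // Assumes toolCall.arguments is a plain object; parse it first if your
    // provider returns a JSON string instead.
    const args = typeof toolCall.arguments === "string"
      ? JSON.parse(toolCall.arguments)
      : toolCall.arguments;
    console.log(`${toolCall.name} =>`, await handler(args));
  }
}
```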
Multimodal input: convert an image to base64 and pass it with the `images` parameter (for models that support it):

```js
import { convertImageUrlToBase64 } from "@locallm/api";

// Convert image to base64
const imageBase64 = await convertImageUrlToBase64("https://example.com/image.jpg");
const result = await lm.infer("Describe this image", {
stream: true,
images: [imageBase64],
max_tokens: 300
});
```

Passing a conversation history:

```js
const history = [
{ user: "Hello", assistant: "Hi there!" },
{ user: "How are you?", assistant: "I'm doing well, thanks!" }
];
const result = await lm.infer("What's your name?", {
  stream: true,
  history: history,
});
```

Provider configuration options:

```js
const lm = new Lm({
providerType: "ollama", // "llamacpp" | "koboldcpp" | "ollama" | "openai" | "browser"
serverUrl: "http://localhost:11434",
apiKey: "your-api-key-if-required", // Optional for most providers
onToken: (token) => process.stdout.write(token), // Optional: streaming callback
onStartEmit: (stats) => console.log("Started:", stats), // Optional: start callback
onEndEmit: (result) => console.log("Completed:", result), // Optional: completion callback
onError: (error) => console.error("Error:", error), // Optional: error callback
});
```

Inference parameters:

```js
const params = {
stream: true, // Stream response token by token
model: { name: "llama3:8b", ctx: 8192 }, // Model configuration
template: "chatml", // Template name (if supported)
max_tokens: 500, // Maximum tokens to generate
temperature: 0.7, // Randomness (0.0-1.0)
top_p: 0.9, // Nucleus sampling threshold
top_k: 50, // Limit to top K tokens
repeat_penalty: 1.1, // Penalty for repeating tokens
stop: ["</s>", "###"], // Stop sequences
grammar: "root ::= 'hello' 'world';", // GBNF grammar for constrained generation
images: ["base64-image-data"], // For multimodal models
extra: { custom: "parameters" } // Provider-specific parameters
};
```

The examples directory contains comprehensive examples for each provider:
| Example | Description | Provider |
|---|---|---|
| `basic.js` | Basic text generation | All providers |
| `streaming.js` | Streaming responses | All providers |
| `ollama.js` | Ollama-specific features | Ollama |
| `ollama_img.js` | Image input with Ollama | Ollama |
| `ollama_tools.js` | Tool calling with Ollama | Ollama |
| `llamacpp.js` | Llama.cpp basic usage | Llama.cpp |
| `llamacpp_gnbf.js` | Grammar-based generation | Llama.cpp |
| `koboldcpp.js` | Koboldcpp basic usage | Koboldcpp |
| `openai_api.js` | OpenAI compatible endpoint | OpenAI |
| `openai_api_toolcall.js` | Tool calling with OpenAI | OpenAI |
| `openrouter.js` | Using OpenRouter service | OpenAI |
To run the examples from the repository:

```bash
# Clone the repository
git clone https://github.com/synw/locallm
cd locallm
# Install dependencies
npm install
# Build the API package
cd packages/api
npm run build
cd ../..
# Install example dependencies
cd examples
npm install
# Run an example (make sure your LLM server is running)
node llamacpp.js
```

Ollama notes:
- Use `await lm.modelsInfo()` to list available models
- Models are loaded using `await lm.loadModel(modelName, contextSize)`
- Supports multimodal models with the `images` parameter
- Use `raw: true` in extra parameters for raw prompt mode
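For example, a small sketch combining those two calls, assuming an Ollama server on the default port and a model that is actually installed (`llama3:8b` is just the tag used elsewhere in this README):

```js
import { Lm } from "@locallm/api";

const lm = new Lm({
  providerType: "ollama",
  serverUrl: "http://localhost:11434",
  onToken: (t) => process.stdout.write(t),
});

// Fetch the list of models known to the server, then inspect lm.models
await lm.modelsInfo();
console.log("Available models:", lm.models);

// Load one of them with an 8k context window before running inference
await lm.loadModel("llama3:8b", 8192);
console.log("Loaded model:", lm.model);
```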
Llama.cpp notes:
- Compatible with OpenAI-compatible endpoints
- Use the `grammar` parameter for constrained generation
- Stop sequences can be specified with the `stop` parameter
- Server info available via `await lm.info()`
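As an illustration, a sketch of constrained generation where a GBNF grammar (written here for the example) restricts the answer to yes or no; `lm` is assumed to be an `Lm` instance pointed at a llama.cpp server:

```js
// Example GBNF grammar: the model may only output "yes" or "no"
const yesNoGrammar = `root ::= ("yes" | "no")`;

const answer = await lm.infer("Is TypeScript a superset of JavaScript? Answer yes or no.", {
  stream: true,
  grammar: yesNoGrammar,
  max_tokens: 4,
});
console.log("\nConstrained answer:", answer.text);
```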
Koboldcpp notes:
- Template support with the `{prompt}` placeholder
- Uses the `/api/extra/generate/stream` endpoint
- Supports various inference parameters
- Auto-retrieves model info on inference
OpenAI-compatible endpoint notes:
- Works with any OpenAI-compatible endpoint
- Full support for tool/function calling
- System messages via the `system` parameter
- History management for conversations
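A sketch of the last two points, assuming the `system` option is accepted alongside the other inference parameters and reusing the history format shown earlier; treat the exact option names as assumptions:

```js
// `lm` is an Lm instance configured with an OpenAI-compatible provider
const result = await lm.infer("And what about its moons?", {
  stream: true,
  system: "You are a concise astronomy assistant.",
  history: [
    { user: "Tell me about Mars.", assistant: "Mars is the fourth planet from the Sun." },
  ],
  max_tokens: 200,
});
```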
Wrap inference calls in a try/catch to handle failures:

```js
try {
const result = await lm.infer(prompt, params);
console.log("Success:", result.text);
} catch (error) {
console.error("Inference failed:", error.message);
// Handle specific error types
if (error.message.includes("connection")) {
// Handle connection errors
} else if (error.message.includes("model")) {
// Handle model-related errors
}
}
```

```js
// Access detailed statistics
const result = await lm.infer(prompt, { stream: true });
console.log("Inference Statistics:");
console.log("- Total time:", result.stats.totalTime, "ms");
console.log("- Inference time:", result.stats.inferenceTime, "ms");
console.log("- Tokens per second:", result.stats.tokensPerSecond);
console.log("- Total tokens:", result.stats.totalTokens);
console.log("- Server stats:", result.serverStats);
```

Troubleshooting common issues:

Connection Errors
- Ensure your LLM server is running and accessible
- Check the server URL and port
- Verify network connectivity
Model Loading Issues
- Confirm the model name is correct
- Check if the model is available on the server
- Verify sufficient system resources
Performance Issues
- Adjust the `temperature` and `top_p` parameters
- Consider reducing `max_tokens` for faster responses
- Check system resources (CPU, memory, GPU)
Enable debug output for troubleshooting:
```js
const result = await lm.infer(prompt, {
  stream: true,
  debug: true,
});
```

Q: Can I use LocalLm with cloud providers?

A: Yes, the OpenAI-compatible provider works with many cloud services that provide OpenAI-compatible APIs.
Q: How do I add a new provider?
A: See the packages/api/src/providers directory for examples of implementing new providers.
Q: What's the difference between the packages?
A: @locallm/api provides the main interface, @locallm/types contains shared type definitions, and @locallm/browser is for browser-based inference.
This project is licensed under the MIT License - see the LICENSE file for details.