A Model Context Protocol (MCP) server that provides text completion capabilities using local LLama.cpp models. This server exposes a single MCP tool that accepts text prompts and returns AI-generated completions using locally hosted language models.
Byte Vision MCP is a bridge between MCP-compatible clients (like Claude Desktop, IDEs, or other AI tools) and local LLama.cpp language models. It allows you to:
- Use local language models through the MCP protocol
- Configure all model parameters via environment files
- Generate text completions with custom prompts
- Maintain privacy by keeping everything local
- Integrate with MCP-compatible applications
- MCP Protocol Support: Standard MCP server implementation
- Local Model Execution: Uses LLama.cpp for model inference
- Configurable Parameters: All settings controlled via environment file
- GPU Acceleration: Supports CUDA, ROCm, and Metal
- Prompt Caching: Built-in caching for improved performance
- Comprehensive Logging: Detailed logging for debugging and monitoring
- Graceful Shutdown: Proper resource cleanup and error handling
- `github.com/joho/godotenv` v1.5.1 - Environment variable loading from `.env` files
- `github.com/metoro-io/mcp-golang` v0.12.0 - Model Context Protocol (MCP) implementation for Go
- github.com/gin-gonic/gin v1.8.1 - HTTP web framework
- github.com/gin-contrib/sse v0.1.0 - Server-Sent Events support
- Go SDK 1.23+ - Modern Go runtime with latest features
- LLama.cpp - Local language model inference engine
- GGUF Models - Quantized language models in GGUF format
- Go 1.23+ for building the server
- LLama.cpp binaries (see `/llamacpp/README.md` for installation)
- GGUF format models (see `/models/README.md` for sources)
```
git clone <repository-url>
cd byte-vision-mcp
go mod tidy
go build -o byte-vision-mcp
```

Follow the instructions in `/llamacpp/README.md`:
- Download prebuilt binaries, or
- Build from source
See `/models/README.md` for:
- Recommended model sources
- How to download GGUF models
- Model placement instructions
Copy the example configuration:
```
cp example-byte-vision-cfg.env byte-vision-cfg.env
```

Edit `byte-vision-cfg.env` to match your setup:

```
LLamaCliPath=/path/to/your/llama-cli
ModelFullPathVal=/path/to/your/model.gguf
AppLogPath=/path/to/logs/
```
```
./byte-vision-mcp
```
The server will start on http://localhost:8080/mcp-completion by default.
```
byte-vision-mcp/
├── llamacpp/                    # LLama.cpp binaries and installation guide
├── logs/                        # Application and model logs
├── models/                      # GGUF model files
├── prompt-cache/                # Cached prompts for performance
├── main.go                      # Main MCP server implementation
├── model.go                     # Model execution logic
├── types.go                     # Configuration structures
├── byte-vision-cfg.env          # Your configuration (create from example)
└── example-byte-vision-cfg.env  # Example configuration
```
The `byte-vision-cfg.env` file controls all aspects of the server:

- `AppLogPath`: Directory for log files
- `AppLogFileName`: Log file name
- `HttpPort`: Server port (default `:8080`)
- `EndPoint`: MCP endpoint path (default `/mcp-completion`)
- `TimeOutSeconds`: Request timeout (default `300`)
- `LLamaCliPath`: Path to llama-cli executable
- `ModelFullPathVal`: Path to your GGUF model file
- `CtxSizeVal`: Context window size
- `GPULayersVal`: Number of layers to offload to GPU
- `TemperatureVal`: Generation temperature
- `PredictVal`: Maximum tokens to generate
- And many more LLama.cpp parameters...
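As a rough illustration of how these values reach the server, the sketch below reads a few of the keys with the `github.com/joho/godotenv` dependency listed earlier. It is a minimal, hypothetical example, not the project's actual configuration loader in `types.go`/`main.go`.

```
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"

	"github.com/joho/godotenv"
)

func main() {
	// Load byte-vision-cfg.env into the process environment.
	if err := godotenv.Load("byte-vision-cfg.env"); err != nil {
		log.Fatalf("failed to load config: %v", err)
	}

	// Read a few of the keys documented above (names taken from this list).
	cliPath := os.Getenv("LLamaCliPath")
	modelPath := os.Getenv("ModelFullPathVal")
	ctxSize, err := strconv.Atoi(os.Getenv("CtxSizeVal"))
	if err != nil {
		ctxSize = 4096 // arbitrary fallback for this sketch
	}

	fmt.Printf("llama-cli: %s, model: %s, ctx: %d\n", cliPath, modelPath, ctxSize)
}
```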
The server exposes a single MCP tool that accepts both basic and advanced parameters for fine-tuning LLama.cpp execution.
**Input:**
```
{
  "prompt": "Write a short story about a robot learning to paint"
}
```

**Output:**

```
{
  "content": [
    {
      "type": "text",
      "text": "Generated completion text..."
    }
  ]
}
```

**Full Parameter Input:**
```
{
  "prompt": "Explain quantum computing in simple terms",
  "temperature": 0.7,
  "predict": 500,
  "top_k": 40,
  "top_p": 0.9,
  "ctx_size": 4096,
  "threads": 8,
  "gpu_layers": 35
}
```

| Parameter | Type | Description | Example | Default Source |
|---|---|---|---|---|
| `model` | string | Override model path | `"/path/to/model.gguf"` | `ModelFullPathVal` |
| `threads` | int | CPU threads for generation | `8` | `ThreadsVal` |
| `gpu_layers` | int | GPU acceleration layers | `35` | `GPULayersVal` |
| `ctx_size` | int | Context window size | `4096` | `CtxSizeVal` |
| `batch_size` | int | Batch processing size | `512` | `BatchCmdVal` |
| Parameter | Type | Description | Range | Default Source |
|---|---|---|---|---|
| `predict` | int | Number of tokens to generate | 1-8192 | `PredictVal` |
| `temperature` | float | Creativity/randomness control | 0.0-2.0 | `TemperatureVal` |
| `top_k` | int | Top-K sampling | 1-100 | `TopKVal` |
| `top_p` | float | Top-P (nucleus) sampling | 0.0-1.0 | `TopPVal` |
| `repeat_penalty` | float | Repetition penalty | 0.5-2.0 | `RepeatPenaltyVal` |
| Parameter | Type | Description | Example | Default Source |
|---|---|---|---|---|
| `prompt_file` | string | Load prompt from file | `"/path/to/prompt.txt"` | `PromptFileVal` |
| `log_file` | string | Custom log file path | `"/path/to/custom.log"` | `ModelLogFileNameVal` |
**Creative writing:**

```
{
  "prompt": "Write a creative story about time travel",
  "temperature": 1.2,
  "top_p": 0.95,
  "predict": 1000,
  "repeat_penalty": 1.1
}
```

**Technical explanation:**

```
{
  "prompt": "Explain the TCP/IP protocol stack",
  "temperature": 0.3,
  "top_k": 10,
  "predict": 800,
  "ctx_size": 8192
}
```

**Code generation:**

```
{
  "prompt": "Write a Python function to sort a list",
  "temperature": 0.6,
  "top_k": 30,
  "top_p": 0.8,
  "predict": 400
}
```

**Long document processing:**

```
{
  "prompt": "Summarize this document...",
  "ctx_size": 32768,
  "gpu_layers": 40,
  "batch_size": 1024,
  "predict": 500
}
```

**Fast response:**

```
{
  "prompt": "Quick question about Go syntax",
  "threads": 12,
  "gpu_layers": 45,
  "batch_size": 2048,
  "predict": 200,
  "temperature": 0.4
}
```

**Prompt from file:**

```
{
  "prompt_file": "/path/to/complex_prompt.txt",
  "temperature": 0.8,
  "predict": 1500,
  "log_file": "/path/to/custom_generation.log"
}
```

**Temperature:**
- **0.0-0.3**: Highly deterministic, factual responses
- **0.4-0.7**: Balanced creativity and coherence
- **0.8-1.2**: Creative, varied responses
- **1.3-2.0**: Highly creative, potentially chaotic
**Context size (`ctx_size`):**
- **2048-4096**: Short conversations, simple tasks
- **8192-16384**: Medium documents, complex reasoning
- **32768+**: Long documents, extensive context
**GPU layers (`gpu_layers`):**
- **0**: CPU-only processing
- **25-35**: Balanced CPU/GPU (8GB VRAM)
- **40+**: Full GPU acceleration (12GB+ VRAM)
**Predict (tokens to generate):**
- **50-200**: Short answers, code snippets
- **300-800**: Medium explanations, documentation
- **1000+**: Long-form content, stories
**Memory-constrained setup:**

```
{
  "ctx_size": 4096,     // Lower for limited RAM
  "batch_size": 512,    // Smaller batches for stability
  "gpu_layers": 25      // Reduce if GPU memory limited
}
```

**Speed-optimized setup:**

```
{
  "threads": 8,         // Match CPU cores
  "gpu_layers": 45,     // Maximize GPU usage
  "batch_size": 2048,   // Larger batches for throughput
  "predict": 200        // Shorter for quick responses
}
```

**Quality-focused setup:**

```
{
  "temperature": 0.7,     // Balanced creativity
  "top_k": 40,            // Diverse sampling
  "top_p": 0.9,           // Nucleus sampling
  "repeat_penalty": 1.1,  // Reduce repetition
  "ctx_size": 8192        // Larger context for coherence
}
```

The tool provides specific error messages for invalid parameters:
```
{
  "content": [
    {
      "type": "text",
      "text": "Error: Invalid temperature value. Must be between 0.0 and 2.0"
    }
  ]
}
```

- All parameters are optional except `prompt`
- Environment configuration is used when parameters are not specified
- Zero values are ignored (e.g., `temperature: 0` uses the config default)
- Invalid values fall back to configuration defaults
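A minimal sketch of these fallback rules, using a hypothetical helper (not the server's actual code):

```
package main

import "fmt"

// effectiveTemperature illustrates the rules above: a zero value means the
// caller did not set the parameter, an out-of-range value is invalid, and in
// both cases the env-file default is used instead.
func effectiveTemperature(requested, configDefault float64) float64 {
	if requested == 0 {
		return configDefault // zero value ignored, use config default
	}
	if requested < 0 || requested > 2.0 {
		return configDefault // invalid value falls back to config default
	}
	return requested
}

func main() {
	fmt.Println(effectiveTemperature(0, 0.7))   // 0.7 – not set, use default
	fmt.Println(effectiveTemperature(3.5, 0.7)) // 0.7 – out of range, use default
	fmt.Println(effectiveTemperature(1.2, 0.7)) // 1.2 – valid override
}
```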
**JavaScript (MCP client):**

```
const result = await mcpClient.callTool("generate_completion", {
  prompt: "Explain async/await in JavaScript",
  temperature: 0.5,
  predict: 600,
  top_k: 25
});
console.log(result.content[0].text);
```

**Python (MCP client):**

```
response = mcp_client.call_tool("generate_completion", {
    "prompt": "Write a Python class for a binary tree",
    "temperature": 0.6,
    "predict": 800,
    "top_p": 0.8
})
print(response["content"][0]["text"])
```

**curl (direct HTTP):**

```
curl -X POST http://localhost:8080/mcp-completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain Docker containers",
    "temperature": 0.4,
    "predict": 500,
    "ctx_size": 4096
  }'
```
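A Go client can send the same direct HTTP request; the sketch below assumes the request and response bodies match the curl and Output examples above, so adjust the structs if your setup differs.

```
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Same request body as the curl example above.
	body, err := json.Marshal(map[string]any{
		"prompt":      "Explain Docker containers",
		"temperature": 0.4,
		"predict":     500,
		"ctx_size":    4096,
	})
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.Post("http://localhost:8080/mcp-completion",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Response shape assumed from the Output example earlier.
	var out struct {
		Content []struct {
			Type string `json:"type"`
			Text string `json:"text"`
		} `json:"content"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	if len(out.Content) > 0 {
		fmt.Println(out.Content[0].Text)
	}
}
```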
This comprehensive parameter system allows fine-grained control over LLama.cpp behavior while maintaining backward compatibility and ease of use.
**NVIDIA (CUDA):**
- Download CUDA-enabled LLama.cpp binaries
- Set `GPULayersVal=33` (or adjust based on your GPU memory)
- Set `MainGPUVal=0` (or your preferred GPU index)

**AMD (ROCm):**
- Download ROCm-enabled LLama.cpp binaries
- Configure similar to the CUDA setup

**Apple (Metal):**
- Metal support is built-in
- No additional configuration needed
Logs are written to both console and file:
- Application logs: `logs/byte-vision-mcp.log`
- Model logs: `logs/[model-name].log`
- Configurable log levels and verbosity

See `/logs/README.md` for log management details.
- **"llama-cli not found"**
  - Check `LLamaCliPath` in your `.env` file
  - Ensure the binary has execute permissions
- **"Model file not found"**
  - Verify `ModelFullPathVal` points to a valid `.gguf` file
  - Check file permissions
- **Out of memory errors**
  - Reduce `CtxSizeVal`
  - Use a smaller model
  - Increase `GPULayersVal` for GPU offloading
- **Slow generation**
  - Enable GPU acceleration
  - Increase `GPULayersVal`
  - Use quantized models (Q4, Q5, Q8)
- **Server won't start**
  - Check if port is already in use
  - Verify all paths in configuration exist
  - Check logs for detailed error messages
```
go mod tidy
go build -o byte-vision-mcp
```

Run the tests:

```
go test ./...
```
- `github.com/joho/godotenv` - Environment file loading
- `github.com/metoro-io/mcp-golang` - MCP protocol implementation
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the terms of the MIT license.
- Check the individual README files in each subdirectory for specific setup instructions
- Review logs for detailed error information
- Ensure all paths in configuration are absolute and accessible
Author: Kevin Brisson
Email: kbrisso@gmail.com
LinkedIn: Kevin Brisson
Project Link: https://github.com/kbrisso/byte-vision-mcp