A simple Ollama-like tool for running Large Language Models (LLMs) locally using llama.cpp under the hood.
- 🚀 Local LLM Inference: Run models locally using llama.cpp
- 📥 Automatic Downloads: Download models from URLs or Hugging Face
- 💬 Interactive Chat: Chat with models in an interactive terminal
- 📋 Model Management: List, download, and remove models
- ⚙️ Configurable: Customize model parameters and settings
Install Easy Edge from PyPI:
```bash
pip install easy-edge
```

Or, to install the latest version from source:
```bash
git clone https://github.com/criminact/easy-edge.git
cd easy-edge
pip install .
```

After installation, use the `easy-edge` command from your terminal:
```bash
easy-edge pull --repo-id TheBloke/Llama-2-7B-Chat-GGUF --filename llama-2-7b-chat.Q4_K_M.gguf
```

Or download from a Hugging Face URL:
```bash
easy-edge pull --url https://huggingface.co/google/gemma-3-1b-it-qat-q4_0-gguf/resolve/main/gemma-3-1b-it-q4_0.gguf
```

Single prompt:
```bash
easy-edge run gemma-3-1b-it-qat-q4_0-gguf --prompt "Hello, how are you?"
```

Interactive chat:
```bash
easy-edge run gemma-3-1b-it-qat-q4_0-gguf --interactive
```

List downloaded models:

```bash
easy-edge list
```

Remove a model:

```bash
easy-edge remove gemma-3-1b-it-qat-q4_0-gguf
```

The tool stores configuration in `models/config.json`. You can modify settings such as:
- `max_tokens`: Maximum tokens to generate (default: 2048)
- `temperature`: Sampling temperature (default: 0.7)
- `top_p`: Top-p sampling parameter (default: 0.9)
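With the defaults above, the file might look like this (a minimal sketch — the exact schema of `models/config.json` may include additional fields):

```json
{
  "max_tokens": 2048,
  "temperature": 0.7,
  "top_p": 0.9
}
```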
- Python 3.11+
- 8GB+ RAM (for 7B models)
- 16GB+ RAM (for 13B models)
- 4GB+ free disk space per model
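The RAM and disk figures above follow roughly from model size: a quantized GGUF stores only a few bits per weight, and the runtime additionally needs room for the KV cache and overhead. A back-of-the-envelope sketch (the bits-per-weight values are rough averages, not exact GGUF numbers):

```python
def approx_model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in GiB: parameters times bits per weight."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

# Q4_K_M averages roughly 4.5 bits/weight (approximate figure).
print(round(approx_model_size_gib(7e9, 4.5), 1))   # 7B model, ~3.7 GiB
print(round(approx_model_size_gib(13e9, 4.5), 1))  # 13B model, ~6.8 GiB
```

Adding KV cache and runtime overhead on top of these file sizes is what pushes the practical requirement to 8GB+ for 7B models and 16GB+ for 13B models.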
- **"llama-cpp-python not installed"**

  ```bash
  pip install llama-cpp-python
  ```
- **Out of memory errors**
  - Try smaller models (7B instead of 13B)
  - Use more heavily quantized models (Q4_K_M instead of Q8_0)
  - Close other applications to free up RAM
- **Slow inference**
  - The tool uses all CPU cores by default
  - For better performance, consider GPU acceleration (requires CUDA)
For faster inference with NVIDIA GPUs:
```bash
pip uninstall llama-cpp-python
pip install llama-cpp-python --force-reinstall --index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu118
```

Easy Edge supports finetuning LLMs using a Modelfile (Ollama-style) and the Hugging Face Trainer. This lets you create custom models from your own data and use them locally.
A Modelfile describes the base model, training parameters, and example messages for finetuning. Example:
```
HF_TOKEN <your_huggingface_token>
FROM meta-llama/Llama-3.2-1B-Instruct
PARAMETER device cpu
PARAMETER max_length 64
PARAMETER learning_rate 3e-5
PARAMETER epochs 4
PARAMETER batch_size 1
PARAMETER lora true
PARAMETER lora_r 8
PARAMETER lora_alpha 32
PARAMETER lora_dropout 0.05
PARAMETER lora_target_modules q_proj,v_proj
SYSTEM You are a helpful assistant.
MESSAGE user How can I reset my password?
MESSAGE assistant To reset your password, click on 'Forgot Password' at the login screen and follow the instructions.
```
- `HF_TOKEN` is your Hugging Face access token (required for private models).
- `FROM` specifies the base model to finetune.
- `PARAMETER` lines set training options (see above for examples).
- `SYSTEM` and `MESSAGE` blocks provide training data.
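Because the format is line-oriented (one directive keyword per line), a minimal parser is easy to sketch. The snippet below is an illustrative reimplementation, not Easy Edge's actual parsing code:

```python
def parse_modelfile(text: str) -> dict:
    """Parse Modelfile directives into a dict (illustrative sketch only)."""
    spec = {"parameters": {}, "messages": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, _, rest = line.partition(" ")
        if keyword == "FROM":
            spec["base_model"] = rest
        elif keyword == "HF_TOKEN":
            spec["hf_token"] = rest
        elif keyword == "PARAMETER":
            name, _, value = rest.partition(" ")
            spec["parameters"][name] = value
        elif keyword == "SYSTEM":
            spec["system"] = rest
        elif keyword == "MESSAGE":
            role, _, content = rest.partition(" ")
            spec["messages"].append({"role": role, "content": content})
    return spec

spec = parse_modelfile(
    "FROM meta-llama/Llama-3.2-1B-Instruct\n"
    "PARAMETER epochs 4\n"
    "SYSTEM You are a helpful assistant.\n"
    "MESSAGE user How can I reset my password?\n"
)
print(spec["base_model"], spec["parameters"]["epochs"])
```

The `SYSTEM` line and `MESSAGE user`/`MESSAGE assistant` pairs together form one chat-style training example, which is the shape the Hugging Face Trainer consumes after templating.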
Use the finetune command to start training:
```bash
easy-edge finetune --modelfile Modelfile --output my-finetuned-model --epochs 4 --batch-size 1 --learning-rate 3e-5
```

- `--modelfile` is the path to your Modelfile.
- `--output` is where the trained model will be saved.
- You can override epochs, batch size, and learning rate on the command line.
After training, you will see instructions to convert your model to GGUF format for use with llama.cpp:
```bash
python3 convert_hf_to_gguf.py --in my-finetuned-model --out my-finetuned-model.gguf
```

Upload your GGUF file to Hugging Face or use it locally with Easy Edge.
- Finetuning is resource-intensive. For best results, use a machine with a GPU.
- LoRA/PEFT is supported for efficient finetuning.
- See the example Modelfile in the repository for more options.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details.
- llama.cpp - The underlying inference engine
- Ollama - Inspiration for the tool design
- Hugging Face - Model hosting and distribution