Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A
Updated Nov 6, 2023 · Python
Speaker diarization for Python — "who spoke when?" CPU-only, no API keys, Apache 2.0. ~10.8% DER on VoxConverse, 8x faster than real-time.
Running Mixture of Agents on CPU: LFM2.5 Brain (1.2B) + Falcon-R Reasoner (600M) + Tool Caller (90M). CPU-only, 16GB RAM. Lightweight AI Legion.
Wrapper for simplified use of Llama2 GGUF quantized models.
A FastAPI server for querying Google's Gemma Translate AI models for translations
Lightning-fast RAG for AI agents. ONNX-powered, 4-layer fusion, MCP server. No PyTorch.
Privacy-focused RAG chatbot for network documentation. Chat with your PDFs locally using Ollama, Chroma & LangChain. CPU-only, fully offline.
🎤 Voice Studio - Speech recognition and synthesis (ASR & TTS) toolkit with real-time streaming transcription, CPU inference, offline mode, floating desktop microphone, Web UI & CLI
VibeDrift - Run any LLM on your own hardware. Bypass the VRAM wall with CPU/RAM inference, MOE expert offloading, and 4-bit quantization. No Cloud, no Subscription.
Absolute Zero Reasoning Experiments on CPU
Neuro-symbolic inference framework for edge-class hardware. Fuses INT8-quantized neural anomaly detection with formal symbolic reasoning and explainable proof trees. Sub-millisecond latency on AMD Ryzen PRO — no GPU required.
CPU-only AI math storyteller with RAG, SymPy verification, and coherence tracking
NeuroSwift 1.0.0 is the world's most advanced MatMul-Free Hybrid State-Space Model (H-SSM). By integrating Dynamic Depth Scaling (DDS), Selective SSD (Mamba-2), and MLA (DeepSeek), it achieves the intelligence of the world's largest dense models with zero-latency CPU inference.
Interactive GPT-2 inference explorer with token probability visualization, entropy curves, confidence heatmap, and sampling strategy comparison. Built on nanoGPT.
CPU-first, turn-aware local voice assistant with multiprocessing, streaming STT→LLM→TTS, and interruption-safe orchestration.
Lightweight LLM API stack for local or cloud CPU deployment. OpenAI-compatible inference with llama.cpp, managed through Docker Compose with built-in monitoring, alerting, and request logging.
A RAG system for chatting with local documents using Foundry and LLMs on CPU
Evaluate SLMs (Phi-4-mini, Gemma-3-4B, Qwen3-3B) on infrastructure NLP tasks
🤖 AI Text Completion App built with Streamlit and Llama-3.2-1B. Generate creative text completions with an intuitive web interface. GPU & CPU optimized, easy to deploy, perfect for content creation and AI experimentation.
Personal project. Local RAG chatbot using Mistral v0.2/TinyLlama with TF-IDF retrieval. Streamlit interface for CPU-optimized inference; no GPU required.