IPEX-LLM is an LLM acceleration library for Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max), NPU and CPU.[^1]
Note
- IPEX-LLM provides seamless integration with llama.cpp, Ollama, vLLM, HuggingFace transformers, LangChain, LlamaIndex, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc.
- 70+ models have been optimized/verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support; see the complete list here.
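As a quick illustration of the HuggingFace transformers integration and INT4 support mentioned above, here is a minimal sketch, assuming `ipex-llm[xpu]` is installed and an Intel GPU is available; the model id, prompt and `xpu` device placement are placeholders, and the quickstarts later in this README describe the documented usage.

```python
# Minimal sketch, assuming ipex-llm[xpu] is installed and an Intel GPU is available.
# The model id and prompt below are placeholders, not specific recommendations.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM   # drop-in replacement with low-bit support

model_path = "meta-llama/Llama-2-7b-chat-hf"              # placeholder HF model id or local path

# load_in_4bit=True quantizes the weights to INT4 (sym_int4) while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")                                   # move the quantized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer("What is IPEX-LLM?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```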
- [2025/05] You can now run DeepSeek V3/R1 671B and Qwen3MoE 235B models with just 1 or 2 Intel Arc GPUs (such as the A770 or B580) using FlashMoE in ipex-llm.
- [2025/04] We released ipex-llm 2.2.0, which includes Ollama Portable Zip and llama.cpp Portable Zip.
  ⚠️ Warning (for llama.cpp Portable Zip): mmap-based model loading in llama.cpp may leak data via side channels in multi-tenant or shared-host environments. To disable mmap, add the `--no-mmap` flag (see the sketch below).
- [2025/04] We added support of PyTorch 2.6 for Intel GPU.
- [2025/03] We added support for Gemma3 model in the latest llama.cpp Portable Zip.
- [2025/03] We can now run DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 GPUs on Xeon using the latest llama.cpp Portable Zip.
- [2025/02] We added support of llama.cpp Portable Zip for Intel GPU (both Windows and Linux) and NPU (Windows only).
- [2025/02] We added support of Ollama Portable Zip to directly run Ollama on Intel GPU for both Windows and Linux (without the need of manual installations).
- [2025/02] We added support for running vLLM 0.6.6 on Intel Arc GPUs.
- [2025/01] We added the guide for running ipex-llm on Intel Arc B580 GPU.
- [2025/01] We added support for running Ollama 0.5.4 on Intel GPU.
- [2024/12] We added both Python and C++ support for Intel Core Ultra NPU (including 100H, 200V, 200K and 200H series).
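As an illustration of the llama.cpp Portable Zip warning above, the sketch below launches the bundled CLI with mmap disabled; the binary name, working directory and GGUF file are assumptions rather than the zip's documented layout.

```python
# Sketch: run the llama.cpp CLI from the extracted Portable Zip with mmap-based loading disabled.
# The binary name and model file are assumptions; adjust them to your extracted zip.
import subprocess

subprocess.run(
    [
        "./llama-cli",                 # llama.cpp CLI shipped in the Portable Zip (assumed name)
        "-m", "model-Q4_K_M.gguf",     # placeholder GGUF model file
        "-p", "What is IPEX-LLM?",
        "--no-mmap",                   # read the model into private memory instead of mmap-ing it
    ],
    check=True,
)
```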
More updates
- [2024/11] We added support for running vLLM 0.6.2 on Intel Arc GPUs.
- [2024/07] We added support for running Microsoft's GraphRAG using local LLM on Intel GPU; see the quickstart guide here.
- [2024/07] We added extensive support for Large Multimodal Models, including StableDiffusion, Phi-3-Vision, Qwen-VL, and more.
- [2024/07] We added FP6 support on Intel GPU.
- [2024/06] We added experimental NPU support for Intel Core Ultra processors; see the examples here.
- [2024/06] We added extensive support of pipeline parallel inference, which makes it easy to run large LLMs using 2 or more Intel GPUs (such as Arc).
- [2024/06] We added support for running RAGFlow with ipex-llm on Intel GPU.
- [2024/05] ipex-llm now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
- [2024/05] You can now easily run ipex-llm inference, serving and finetuning using the Docker images.
- [2024/05] You can now install ipex-llm on Windows using just "one command".
- [2024/04] You can now run Open WebUI on Intel GPU using ipex-llm; see the quickstart here.
- [2024/04] You can now run Llama 3 on Intel GPU using llama.cpp and ollama with ipex-llm; see the quickstart here.
- [2024/04] ipex-llm now supports Llama 3 on both Intel GPU and CPU.
- [2024/04] ipex-llm now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- [2024/03] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
- [2024/02] ipex-llm now supports directly loading models from ModelScope (魔搭).
- [2024/02] ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- [2024/02] Users can now use ipex-llm through the Text-Generation-WebUI GUI.
- [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU, respectively.
- [2024/02] ipex-llm now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
- [2023/12] ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- [2023/12] ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
- [2023/12] ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- [2023/12] ipex-llm now supports FP8 and FP4 inference on Intel GPU.
- [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
- [2023/11] ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
- [2023/10] ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
- [2023/10] ipex-llm now supports FastChat serving on both Intel CPU and GPU.
- [2023/09] ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- [2023/09] The ipex-llm tutorial is released.
See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.
| Intel Core Ultra iGPU | Intel Core Ultra NPU | 2-Card Intel Arc dGPUs | Intel Xeon + Arc dGPU |
|---|---|---|---|
| Ollama (Mistral-7B, Q4_K) | HuggingFace (Llama3.2-3B, SYM_INT4) | llama.cpp (DeepSeek-R1-Distill-Qwen-32B, Q4_K) | FlashMoE (Qwen3MoE-235B, Q4_K) |
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below[^1] (and refer to [2][3][4] for more details).
You may follow the Benchmarking Guide to run the ipex-llm performance benchmark yourself.
Please see the perplexity results below (tested on the Wikitext dataset using the script here).
| Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 | 
|---|---|---|---|---|---|---|
| Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 | 
| Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 | 
| Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 | 
| Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 | 
| Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 | 
| gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 | 
| Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 | 
| Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 | 
| Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 | 
- Ollama: running Ollama on Intel GPU without the need of manual installations
- llama.cpp: running llama.cpp on Intel GPU without the need of manual installations
- Arc B580: running ipex-llm on Intel Arc B580 GPU for Ollama, llama.cpp, PyTorch, HuggingFace, etc.
- NPU: running ipex-llm on Intel NPU using the Python/C++ APIs or llama.cpp
- PyTorch/HuggingFace: running PyTorch, HuggingFace, LangChain, LlamaIndex, etc. (using Python interface of ipex-llm) on Intel GPU for Windows and Linux
- vLLM: running ipex-llm in vLLM on both Intel GPU and CPU (see the client sketch after this list)
- FastChat: running ipex-llm in FastChat serving on both Intel GPU and CPU
- Serving on multiple Intel GPUs: running ipex-llm serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
- Text-Generation-WebUI: running ipex-llm in oobabooga WebUI
- Axolotl: running ipex-llm in Axolotl for LLM finetuning
- Benchmarking: running (latency and throughput) benchmarks for ipex-llm on Intel CPU and GPU
- GPU Inference in C++: running llama.cpp, ollama, etc., with ipex-llm on Intel GPU
- GPU Inference in Python: running HuggingFace transformers, LangChain, LlamaIndex, ModelScope, etc. with ipex-llm on Intel GPU
- vLLM on GPU: running vLLM serving with ipex-llm on Intel GPU
- vLLM on CPU: running vLLM serving with ipex-llm on Intel CPU
- FastChat on GPU: running FastChat serving with ipex-llm on Intel GPU
- VSCode on GPU: running and developing ipex-llm applications in Python using VSCode on Intel GPU
- GraphRAG: running Microsoft's GraphRAG using local LLM with ipex-llm
- RAGFlow: running RAGFlow (an open-source RAG engine) with ipex-llm
- LangChain-Chatchat: running LangChain-Chatchat (Knowledge Base QA using RAG pipeline) with ipex-llm
- Coding copilot: running Continue (coding copilot in VSCode) with ipex-llm
- Open WebUI: running Open WebUI with ipex-llm
- PrivateGPT: running PrivateGPT to interact with documents with ipex-llm
- Dify platform: running ipex-llm in Dify (production-ready LLM app development platform)
- Windows GPU: installing ipex-llm on Windows with Intel GPU
- Linux GPU: installing ipex-llm on Linux with Intel GPU
- For more details, please refer to the full installation guide
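For the serving paths above (e.g., vLLM or FastChat with ipex-llm), the server exposes an OpenAI-compatible endpoint, so a client-side call can look like the hedged sketch below; the base URL, port, API key and model name are assumptions that depend on how the server is launched per the quickstarts.

```python
# Client-side sketch for an OpenAI-compatible endpoint served with ipex-llm (e.g., via vLLM).
# base_url, api_key and the model name are placeholders; they depend on your server launch command.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Llama-2-7b-chat-hf",        # must match the model the server was started with
    messages=[{"role": "user", "content": "What is IPEX-LLM?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```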
- INT4 inference: INT4 LLM inference on Intel GPU and CPU
- FP8/FP6/FP4 inference: FP8, FP6 and FP4 LLM inference on Intel GPU
- INT8 inference: INT8 LLM inference on Intel GPU and CPU
- INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
- FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
- BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Low-bit models: saving and loading ipex-llm low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.); see the sketch below
- GGUF: directly loading GGUF models into ipex-llm
- AWQ: directly loading AWQ models into ipex-llm
- GPTQ: directly loading GPTQ models into ipex-llm
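For the low-bit save/load entry above, here is a minimal sketch; it assumes the `save_low_bit`/`load_low_bit` helpers of `ipex_llm.transformers`, and the model id and output directory are placeholders.

```python
# Sketch: quantize once, save the low-bit checkpoint, then load it back without re-quantizing.
# Assumes ipex_llm.transformers exposes save_low_bit/load_low_bit; model id and paths are placeholders.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # placeholder model id
    load_in_low_bit="sym_int4",        # e.g. "sym_int4", "fp8", "fp6", "fp4"
    trust_remote_code=True,
)
model.save_low_bit("./llama2-7b-sym-int4")   # persist the quantized weights

# Later, load the already-quantized checkpoint directly
model = AutoModelForCausalLM.load_low_bit("./llama2-7b-sym-int4", trust_remote_code=True)
```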
- Tutorials
Over 70 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
| Model | CPU Example | GPU Example | NPU Example | 
|---|---|---|---|
| LLaMA | link1, link2 | link | |
| LLaMA 2 | link1, link2 | link | Python link, C++ link | 
| LLaMA 3 | link | link | Python link, C++ link | 
| LLaMA 3.1 | link | link | |
| LLaMA 3.2 | link | Python link, C++ link | |
| LLaMA 3.2-Vision | link | ||
| ChatGLM | link | ||
| ChatGLM2 | link | link | |
| ChatGLM3 | link | link | |
| GLM-4 | link | link | |
| GLM-4V | link | link | |
| GLM-Edge | link | Python link | |
| GLM-Edge-V | link | ||
| Mistral | link | link | |
| Mixtral | link | link | |
| Falcon | link | link | |
| MPT | link | link | |
| Dolly-v1 | link | link | |
| Dolly-v2 | link | link | |
| Replit Code | link | link | |
| RedPajama | link1, link2 | ||
| Phoenix | link1, link2 | ||
| StarCoder | link1, link2 | link | |
| Baichuan | link | link | |
| Baichuan2 | link | link | Python link | 
| InternLM | link | link | |
| InternVL2 | link | ||
| Qwen | link | link | |
| Qwen1.5 | link | link | |
| Qwen2 | link | link | Python link, C++ link | 
| Qwen2.5 | link | Python link, C++ link | |
| Qwen-VL | link | link | |
| Qwen2-VL | link | ||
| Qwen2-Audio | link | ||
| Aquila | link | link | |
| Aquila2 | link | link | |
| MOSS | link | ||
| Whisper | link | link | |
| Phi-1_5 | link | link | |
| Flan-t5 | link | link | |
| LLaVA | link | link | |
| CodeLlama | link | link | |
| Skywork | link | ||
| InternLM-XComposer | link | ||
| WizardCoder-Python | link | ||
| CodeShell | link | ||
| Fuyu | link | ||
| Distil-Whisper | link | link | |
| Yi | link | link | |
| BlueLM | link | link | |
| Mamba | link | link | |
| SOLAR | link | link | |
| Phixtral | link | link | |
| InternLM2 | link | link | |
| RWKV4 | link | ||
| RWKV5 | link | ||
| Bark | link | link | |
| SpeechT5 | link | ||
| DeepSeek-MoE | link | ||
| Ziya-Coding-34B-v1.0 | link | ||
| Phi-2 | link | link | |
| Phi-3 | link | link | |
| Phi-3-vision | link | link | |
| Yuan2 | link | link | |
| Gemma | link | link | |
| Gemma2 | link | ||
| DeciLM-7B | link | link | |
| Deepseek | link | link | |
| StableLM | link | link | |
| CodeGemma | link | link | |
| Command-R/cohere | link | link | |
| CodeGeeX2 | link | link | |
| MiniCPM | link | link | Python link, C++ link | 
| MiniCPM3 | link | ||
| MiniCPM-V | link | ||
| MiniCPM-V-2 | link | link | |
| MiniCPM-Llama3-V-2_5 | link | Python link | |
| MiniCPM-V-2_6 | link | link | Python link | 
| MiniCPM-o-2_6 | link | ||
| Janus-Pro | link | ||
| Moonlight | link | ||
| StableDiffusion | link | ||
| Bce-Embedding-Base-V1 | Python link | ||
| Speech_Paraformer-Large | Python link | 
- Please report a bug or raise a feature request by opening a GitHub Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory
[^1]: Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.