Note
OpenArc is under active development. Expect breaking changes.
OpenArc is an inference engine built with Optimum-Intel that leverages hardware acceleration on Intel CPUs, GPUs, and NPUs through the OpenVINO runtime, integrating closely with Hugging Face Transformers.
Under the hood, OpenArc implements a FastAPI layer over a growing collection of Transformers-integrated AutoModel classes from Optimum-Intel. These accelerate inference across a wide range of tasks, models, and source frameworks.
OpenArc currently supports text generation and text generation with vision. Support for speculative decoding, embeddings, speech tasks, image generation, PaddleOCR, and more is planned.
OpenArc enables a workflow similar to Ollama, LM-Studio, or OpenRouter, but with hardware acceleration from the OpenVINO C++ runtime.
Currently implemented:
- OpenAI compatible endpoints (see the example request after this list)
- Validated OpenWebUI support, but it should work elsewhere
- Load multiple vision/text models concurrently on multiple devices for hotswap/multi agent workflows
- Most HuggingFace text generation models
- Growing set of vision-capable LLMs:
  - Qwen2-VL
  - Qwen2.5-VL
  - Gemma 3
- Load models with OpenVINO optimizations
- Build conversion commands
- See loaded models and chosen optimizations
- Unload models and view metadata about them
- Query detected devices
- Query device properties
- View tokenizer data
- View architecture metadata from config.json
- Performance metrics for each request:
  - ttft: time to generate the first token
  - generation_time: time to generate the whole response
  - number of tokens: total tokens generated for that request
  - tokens per second: measures throughput
  - average token latency: helpful for optimizing zero-shot classification tasks
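Because the endpoints are OpenAI compatible, any OpenAI-style client can talk to a loaded model. Below is a minimal sketch using the official openai Python client; the base URL and port match the server launch example later in this document, while the model name and the Bearer-token auth are assumptions to adjust to your setup.

```python
from openai import OpenAI

# Point the client at the OpenArc server instead of api.openai.com.
client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",     # host/port used when launching start_server.py
    api_key="<your OPENARC_API_KEY>",      # assumption: the key set via OPENARC_API_KEY
)

response = client.chat.completions.create(
    model="Hermes-3-Llama-3.2-3B-awq-ov",  # placeholder: any model you have loaded
    messages=[{"role": "user", "content": "Explain OpenVINO in one sentence."}],
)
print(response.choices[0].message.content)
```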
OpenArc is built on top of the OpenVINO runtime; as a result it supports the same range of hardware, but it requires device-specific drivers that this document will not cover in depth.
Supported operating systems differ for each class of device. Please review the system requirements for OpenVINO 2025.0.0 to learn:
- which Windows versions are supported
- which Linux distributions are supported
- which kernel versions are required
  - My system uses version 6.9.4-060904-generic with Ubuntu 24.04 LTS.
- installation commands for different package managers
- other required dependencies for GPU and NPU
If you need help installing drivers:
- Join the Discord
- Open an issue
- Use Linux Drivers
- Use Windows Drivers
CPU
Intel® Core™ Ultra Series 1 and Series 2 (Windows only)
Intel® Xeon® 6 processor (preview)
Intel Atom® Processor X Series
Intel Atom® processor with Intel® SSE4.2 support
Intel® Pentium® processor N4200/5, N3350/5, N3450/5 with Intel® HD Graphics
6th - 14th generation Intel® Core™ processors
1st - 5th generation Intel® Xeon® Scalable Processors
ARM CPUs with armv7a and higher, ARM64 CPUs with arm64-v8a and higher, Apple® Mac with Apple silicon
GPU
Intel® Arc™ GPU Series
Intel® HD Graphics
Intel® UHD Graphics
Intel® Iris® Pro Graphics
Intel® Iris® Xe Graphics
Intel® Iris® Xe Max Graphics
Intel® Data Center GPU Flex Series
Intel® Data Center GPU Max Series
NPU
Intel® Core Ultra Series
This was a bit harder to list out, as the system requirements page does not include an itemized list. However, it is safe to assume that if a device contains an Intel NPU, it is supported.
The Gradio dashboard has tools for querying your device under the Tools tab.
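If you would rather query devices from Python than from the dashboard, the OpenVINO runtime exposes the same information directly. This is a minimal sketch that only assumes the openvino package installed during the setup below:

```python
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU.0', 'NPU'], depending on installed drivers

# Print the full name of each detected device.
for device in core.available_devices:
    print(device, core.get_property(device, "FULL_DEVICE_NAME"))
```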
Create the conda environment:
conda env create -f environment.yaml
Set your API key as an environment variable:
export OPENARC_API_KEY=<you-know-for-search>
Build Optimum-Intel from source to get the latest support:
pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel.git"
- Install Miniconda from here
- Navigate to the directory containing the environment.yaml file and run
conda env create -f environment.yaml
Set your API key as an environment variable:
setx OPENARC_API_KEY "<you-know-for-search>"
Build Optimum-Intel from source to get the latest support:
pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel.git"
Tip
- Avoid setting up the environment from IDE extensions.
- Try not to reuse this environment for other ML projects; uv support is coming soon.
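Before launching the server you can sanity-check the Optimum-Intel build with a short generation. This is a sketch, not part of OpenArc itself; the model ID is assumed to be one of the preconverted OpenVINO models from the repo referenced in the conversion section, and the first run will download several GB.

```python
# Smoke test: load an OpenVINO-converted model with Optimum-Intel and generate a few tokens.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "Echo9Zulu/Hermes-3-Llama-3.2-3B-awq-ov"  # assumption: swap in any OV-converted model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)  # defaults to the CPU device

inputs = tokenizer("OpenVINO is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```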
OpenArc has two components:
- start_server.py - launches the inference server
- start_dashboard.py - launches the dashboard, which manages the server and provides some useful tools
To launch the inference server run
python start_server.py --host 0.0.0.0 --openarc-port 8000
--host: the IP address to bind the server to
--openarc-port: the port used to access the server
To launch the dashboard run
python start_dashboard.py --openarc-port 8000
--openarc-port: the port that requests from the dashboard are sent to
Run these in two different terminals.
Note
Gradio handles ports natively, so the dashboard port does not need to be set. The default is 7860, but it will increment if another Gradio instance is running.
Note
I'm only going to cover the basics of OpenWebUI here. To learn more and set it up, check out the OpenWebUI docs.
- From the Connections menu add a new connection
- Enter the server address and port where OpenArc is running, followed by /v1. Example: http://0.0.0.0:8000/v1
- Here you need to set the API key manually
- When you hit the refresh button, OpenWebUI sends a GET request to the OpenArc server to get the list of models at v1/models
Server-side logs should report:
"GET /v1/models HTTP/1.1" 200 OK
- Load the model you want to use from the dashboard
- Select the connection you just created and use the refresh button to update the list of models
- If you use API keys and have a list of models, these might be towards the bottom
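If the model list still does not appear, you can query the endpoint directly to rule out connection problems. A minimal sketch with requests; the Bearer-token header is an assumption based on the OPENARC_API_KEY setup above, and the response shape follows the usual OpenAI /v1/models convention:

```python
import os
import requests

resp = requests.get(
    "http://0.0.0.0:8000/v1/models",
    headers={"Authorization": f"Bearer {os.environ['OPENARC_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()
print([m["id"] for m in resp.json().get("data", [])])  # IDs of currently available models
```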
Convert to OpenVINO IR
There are a few sources of models which can be used with OpenArc:
- My repo contains preconverted models for a variety of architectures and use cases
  - OpenArc supports almost all of them
  - These get updated regularly, so check back often!
You can easily craft conversion commands using my HF Space, Optimum-CLI-Tool_tool, or the OpenArc Dashboard.
This tool respects the positional arguments defined here; execute the generated commands in the OpenArc environment.
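For illustration, a generated conversion command typically looks like the line below; the model ID and output directory are placeholders, and the weight format should match your hardware and accuracy needs:

optimum-cli export openvino --model <huggingface-model-id> --weight-format int4 <output-directory>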
Models | Compressed Weights |
---|---|
Ministral-3b-instruct-int4_asym-ov | 1.85 GB |
Hermes-3-Llama-3.2-3B-awq-ov | 1.8 GB |
Llama-3.1-Tulu-3-8B-int4_asym-ov | 4.68 GB |
Qwen2.5-7B-Instruct-1M-int4-ov | 4.46 GB |
Meta-Llama-3.1-8B-SurviveV3-int4_asym-awq-se-wqe-ov | 4.68 GB |
Falcon3-10B-Instruct-int4_asym-ov | 5.74 GB |
Echo9Zulu/phi-4-int4_asym-awq-ov | 8.11 GB |
DeepSeek-R1-Distill-Qwen-14B-int4-awq-ov | 7.68 GB |
Phi-4-o1-int4_asym-awq-weight_quantization_error-ov | 8.11 GB |
Mistral-Small-24B-Instruct-2501-int4_asym-ov | 12.9 GB |
Documentation on choosing parameters for conversion is coming soon; we also have a channel in Discord for this topic.
Note
The optimum CLI tool integrates several different APIs from several different Intel projects; it is a better alternative to passing conversion arguments through from_pretrained() methods. It references prebuilt export configurations for each supported model architecture, meaning not all models are supported, but most are. If you use the CLI tool and get an error about an unsupported architecture, follow the link and open an issue referencing the model card; the maintainers will get back to you.
Note
A naming convention for OpenVINO-converted models is coming soon.
Notes on the test:
- No OpenVINO optimization parameters were used
- Fixed input length
- I sent one user message
- Quant strategies for models are not considered
- I converted each of these models myself (I'm working on standardizing model cards to share this information more directly)
- OpenVINO generates a cache on first inference so metrics are on second generation
- Seconds were used for readability
Test System:
CPU: Xeon W-2255 (10 cores, 20 threads) @ 3.7 GHz
GPU: 3x Arc A770 16 GB ASRock Phantom
RAM: 128 GB DDR4 ECC 2933 MHz
Disk: 4 TB IronWolf, 1 TB 970 Evo
OS: Ubuntu 24.04
Kernel: 6.9.4-060904-generic
Model | Prompt Processing (sec) | Throughput (t/sec) | Duration (sec) | Size (GB) |
---|---|---|---|---|
Phi-4-mini-instruct-int4_asym-gptq-ov | 0.41 | 47.25 | 3.10 | 2.3 |
Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov | 0.27 | 64.18 | 0.98 | 1.8 |
Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov | 0.32 | 47.99 | 2.96 | 4.7 |
phi-4-int4_asym-awq-se-ov | 0.30 | 25.27 | 5.32 | 8.1 |
DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov | 0.42 | 25.23 | 1.56 | 8.4 |
Mistral-Small-24B-Instruct-2501-int4_asym-ov | 0.36 | 18.81 | 7.11 | 12.9 |
Model | Prompt Processing (sec) | Throughput (t/sec) | Duration (sec) | Size (GB) |
---|---|---|---|---|
Phi-4-mini-instruct-int4_asym-gptq-ov | 1.02 | 20.44 | 7.23 | 2.3 |
Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov | 1.06 | 23.66 | 3.01 | 1.8 |
Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov | 2.53 | 13.22 | 12.14 | 4.7 |
phi-4-int4_asym-awq-se-ov | 4 | 6.63 | 23.14 | 8.1 |
DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov | 5.02 | 7.25 | 11.09 | 8.4 |
Mistral-Small-24B-Instruct-2501-int4_asym-ov | 6.88 | 4.11 | 37.5 | 12.9 |
Nous-Hermes-2-Mixtral-8x7B-DPO-int4-sym-se-ov | 15.56 | 6.67 | 34.60 | 24.2 |
Learn more about how to leverage your Intel devices for Machine Learning:
OpenArc stands on the shoulders of several other projects:
Thank you for your work!!