Welcome to OpenArc

Discord

Note

OpenArc is under active development. Expect breaking changes.

OpenArc is an inference engine built with Optimum-Intel to leverage hardware acceleration on Intel CPUs, GPUs and NPUs through the OpenVINO runtime, which integrates closely with Hugging Face Transformers.

Under the hood, OpenArc implements a FastAPI layer over a growing collection of Transformers-integrated AutoModel classes from Optimum-Intel. These accelerate inference on a wide range of tasks, models and source frameworks.

OpenArc currently supports text generation and text generation with vision. Support for speculative decoding, embedding generation, speech tasks, image generation, PaddleOCR, and others is planned.

Currently implemented:

OVModelForCausalLM

OVModelForVisualCausalLM

OpenArc enables a workflow similar to Ollama, LM-Studio or OpenRouter, but with hardware acceleration from the OpenVINO C++ runtime.
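
Under the hood, loading and generation go through Optimum-Intel's OVModel classes. A minimal sketch of what that looks like (the model id is a placeholder, not an OpenArc default):

    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    model_id = "<huggingface-model-id-or-local-openvino-ir-path>"  # placeholder

    # export=True converts a Transformers checkpoint to OpenVINO IR on the fly;
    # omit it when loading a model that is already in OpenVINO IR format
    model = OVModelForCausalLM.from_pretrained(model_id, export=True)
    # model.to("GPU")  # optionally move the compiled model to another detected device
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Hello from OpenVINO!", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))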

Features

  • OpenAI compatible endpoints (see the example request after this list)
  • Validated with OpenWebUI, but other OpenAI-compatible clients should work
  • Load multiple vision/text models concurrently on multiple devices for hotswap/multi agent workflows
  • Most HuggingFace text generation models
  • Growing set of vision-capable LLMs:
    • Qwen2-VL
    • Qwen2.5-VL
    • Gemma 3
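
As an example, once a model is loaded any OpenAI-compatible client can talk to the server. A sketch using the openai Python package (the base URL, API key and model name below are placeholders, not OpenArc defaults):

    from openai import OpenAI

    # Point the client at the OpenArc server instead of api.openai.com
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="<OPENARC_API_KEY>")

    response = client.chat.completions.create(
        model="<name-of-a-loaded-model>",  # placeholder: a model loaded via the dashboard
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)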

Gradio management dashboard

  • Load models with OpenVINO optimizations
  • Build conversion commands
  • See loaded models and chosen optimizations
  • Unload models and view their metadata
  • Query detected devices
  • Query device properties
  • View tokenizer data
  • View architecture metadata from config.json

Performance metrics on every completion

  • ttft: time to generate the first token
  • generation_time: time to generate the whole response
  • number of tokens: total tokens generated for the request
  • tokens per second: throughput for the request
  • average token latency: helpful for optimizing zero-shot classification tasks
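
As a rough illustration of how these metrics relate (the variable names below are descriptive, not the exact keys returned by the server):

    # Hypothetical numbers for a single completion
    num_tokens = 128        # total generated tokens
    generation_time = 4.0   # seconds to generate the whole response

    tokens_per_second = num_tokens / generation_time      # throughput: 32.0 t/s
    average_token_latency = generation_time / num_tokens  # 0.03125 s per token

    print(tokens_per_second, average_token_latency)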

System Requirements

OpenArc is built on top of the OpenVINO runtime; as a result it supports the same range of hardware, but requires device-specific drivers that this document will not cover in depth.

Supported operating systems differ for each class of device. Please review the system requirements for OpenVINO 2025.0.0 to learn which:

  • Windows versions are supported
  • Linux distributions are supported
  • kernel versions are required
    • My system uses version 6.9.4-060904-generic with Ubuntu 24.04 LTS.
  • commands for different package managers
  • other required dependencies for GPU and NPU

If you need help installing drivers:

  • Join the Discord
  • Open an issue
  • Use Linux Drivers
  • Use Windows Drivers

CPU

  • Intel® Core™ Ultra Series 1 and Series 2 (Windows only)
  • Intel® Xeon® 6 processor (preview)
  • Intel Atom® Processor X Series
  • Intel Atom® processor with Intel® SSE4.2 support
  • Intel® Pentium® processor N4200/5, N3350/5, N3450/5 with Intel® HD Graphics
  • 6th - 14th generation Intel® Core™ processors
  • 1st - 5th generation Intel® Xeon® Scalable Processors
  • ARM CPUs with armv7a and higher, ARM64 CPUs with arm64-v8a and higher, Apple® Mac with Apple silicon

GPU

  • Intel® Arc™ GPU Series
  • Intel® HD Graphics
  • Intel® UHD Graphics
  • Intel® Iris® Pro Graphics
  • Intel® Iris® Xe Graphics
  • Intel® Iris® Xe Max Graphics
  • Intel® Data Center GPU Flex Series
  • Intel® Data Center GPU Max Series

NPU

  • Intel® Core™ Ultra Series

This was a bit harder to list out as the system requirements page does not include an itemized list. However, it is safe to assume that if a device contains an Intel NPU it will be supported.

The Gradio dashboard has tools for querying your device under the Tools tab.
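
If you want to query devices outside the dashboard, a minimal sketch using the OpenVINO Python API (installed as part of the environment) looks like this:

    from openvino import Core

    core = Core()
    # Detected devices, e.g. ['CPU', 'GPU.0', 'GPU.1', 'NPU']
    print(core.available_devices)

    # Full device name for each detected device
    for device in core.available_devices:
        print(device, core.get_property(device, "FULL_DEVICE_NAME"))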

Ubuntu

Create the conda environment:

conda env create -f environment.yaml

Set your API key as an environment variable:

export OPENARC_API_KEY=<you-know-for-search>

Build Optimum-Intel from source to get the latest support:

pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel.git"
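
To confirm the environment is set up correctly, a quick sanity check:

python -c "import optimum.intel, openvino; print('Optimum-Intel and OpenVINO import OK')"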

Windows

  1. Install Miniconda from here

  2. Navigate to the directory containing the environment.yaml file and run

    conda env create -f environment.yaml

Set your API key as an environment variable:

setx OPENARC_API_KEY "<you-know-for-search>"

Build Optimum-Intel from source to get the latest support:

pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel.git"

Tip

  • Avoid setting up the environment from IDE extensions.
  • Avoid reusing this environment for other ML projects. Support for uv is coming soon.

Usage

OpenArc has two components:

  • start_server.py - launches the inference server
  • start_dashboard.py - launches the dashboard, which manages the server and provides some useful tools

To launch the inference server run

	python start_server.py --host 0.0.0.0 --openarc-port 8000

--host: the IP address to bind the server to

--openarc-port: the port used to access the server

To launch the dashboard run

	python start_dashboard.py --openarc-port 8000

--openarc-port: the port the dashboard uses to send requests to the OpenArc server

Run these in two different terminals.
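
Once the server is running you can sanity-check it by listing models (this assumes the standard OpenAI-style bearer token header and the port chosen above):

curl http://localhost:8000/v1/models -H "Authorization: Bearer $OPENARC_API_KEY"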

Note

Gradio handles ports natively so the dashboard port does not need to be set. The default is 7860, and it will increment if another Gradio instance is already running.

OpenWebUI

Note

I'm only going to cover the basics on OpenWebUI here. To learn more and set it up check out the OpenWebUI docs.

  • From the Connections menu add a new connection

  • Enter the server address and port where OpenArc is running, followed by /v1. Example: http://0.0.0.0:8000/v1

  • Here you need to set the API key manually

  • When you hit the refresh button OpenWebUI sends a GET request to the OpenArc server to get the list of models at v1/models

Serverside logs should report:

"GET /v1/models HTTP/1.1" 200 OK

Usage:

  • Load the model you want to use from the dashboard
  • Select the connection you just created and use the refresh button to update the list of models
  • If you use API keys and have a long list of models, these might be towards the bottom

Convert to OpenVINO IR

There are a few sources of models that can be used with OpenArc:

You can easily craft conversion commands using my HF Space, Optimum-CLI-Tool_tool, or the OpenArc dashboard.

This tool respects the positional arguments defined here; run the generated commands in the OpenArc environment.
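
For reference, a typical conversion command produced by these tools looks like the following (the model id, weight format and output directory are placeholders; use the tools above to build the exact command for your model):

optimum-cli export openvino --model <huggingface-model-id> --weight-format int4 <output-directory>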

Models Compressed Weights
Ministral-3b-instruct-int4_asym-ov 1.85 GB
Hermes-3-Llama-3.2-3B-awq-ov 1.8 GB
Llama-3.1-Tulu-3-8B-int4_asym-ov 4.68 GB
Qwen2.5-7B-Instruct-1M-int4-ov 4.46 GB
Meta-Llama-3.1-8B-SurviveV3-int4_asym-awq-se-wqe-ov 4.68 GB
Falcon3-10B-Instruct-int4_asym-ov 5.74 GB
Echo9Zulu/phi-4-int4_asym-awq-ov 8.11 GB
DeepSeek-R1-Distill-Qwen-14B-int4-awq-ov 7.68 GB
Phi-4-o1-int4_asym-awq-weight_quantization_error-ov 8.11 GB
Mistral-Small-24B-Instruct-2501-int4_asym-ov 12.9 GB

Documentation on choosing parameters for conversion is coming soon; we also have a channel in Discord for this topic.

Note

The optimum CLI tool integrates several different APIs from several different Intel projects; it is a better alternative to using the conversion APIs exposed through from_pretrained() methods. It references prebuilt export configurations for each supported model architecture, meaning not all models are supported, but most are. If you use the CLI tool and get an error about an unsupported architecture, follow the link and open an issue with references to the model card, and the maintainers will get back to you.

Note

A naming convention for openvino converted models is coming soon.

Performance with OpenVINO runtime

Notes on the test:

  • No OpenVINO optimization parameters were used
  • Fixed input length
  • I sent one user message
  • Quant strategies for models are not considered
  • I converted each of these models myself (I'm working on standardizing model cards to share this information more directly)
  • OpenVINO generates a cache on first inference so metrics are on second generation
  • Seconds were used for readability

Test System:

CPU: Xeon W-2255 (10c, 20t) @ 3.7 GHz
GPU: 3x Arc A770 16GB ASRock Phantom
RAM: 128 GB DDR4 ECC 2933 MHz
Disk: 4 TB IronWolf, 1 TB 970 Evo

OS: Ubuntu 24.04
Kernel: 6.9.4-060904-generic

Prompt: "We don't even have a chat template so strap in and let it ride!" max_new_tokens= 128

GPU Performance: 1x Arc A770

Model Prompt Processing (sec) Throughput (t/sec) Duration (sec) Size (GB)
Phi-4-mini-instruct-int4_asym-gptq-ov 0.41 47.25 3.10 2.3
Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov 0.27 64.18 0.98 1.8
Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov 0.32 47.99 2.96 4.7
phi-4-int4_asym-awq-se-ov 0.30 25.27 5.32 8.1
DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov 0.42 25.23 1.56 8.4
Mistral-Small-24B-Instruct-2501-int4_asym-ov 0.36 18.81 7.11 12.9

CPU Performance: Xeon W-2255

Model Prompt Processing (sec) Throughput (t/sec) Duration (sec) Size (GB)
Phi-4-mini-instruct-int4_asym-gptq-ov 1.02 20.44 7.23 2.3
Hermes-3-Llama-3.2-3B-int4_sym-awq-se-ov 1.06 23.66 3.01 1.8
Llama-3.1-Nemotron-Nano-8B-v1-int4_sym-awq-se-ov 2.53 13.22 12.14 4.7
phi-4-int4_asym-awq-se-ov 4 6.63 23.14 8.1
DeepSeek-R1-Distill-Qwen-14B-int4_sym-awq-se-ov 5.02 7.25 11.09 8.4
Mistral-Small-24B-Instruct-2501-int4_asym-ov 6.88 4.11 37.5 12.9
Nous-Hermes-2-Mixtral-8x7B-DPO-int4-sym-se-ov 15.56 6.67 34.60 24.2

Resources


Learn more about how to leverage your Intel devices for Machine Learning:

openvino_notebooks

Inference with Optimum-Intel

Optimum-Intel Transformers

NPU Devices

Acknowledgments

OpenArc stands on the shoulders of several other projects:

Optimum-Intel

OpenVINO

OpenVINO GenAI

Transformers

FastAPI

Thank you for your work!!