An Envoy External Processor (ExtProc) that acts as an external Mixture-of-Models (MoM) router. It directs OpenAI API requests to the most suitable backend model from a defined pool, based on a BERT classifier's semantic understanding of each request's intent. Conceptually similar to Mixture-of-Experts (MoE), which lives within a single model, this system instead selects the best whole model for the task at hand.
As such, overall inference accuracy improves because each request is served by a model from the pool that is better suited to its type of task.
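To make the routing idea concrete, here is a minimal sketch of the decision logic. The model names and the keyword-based `classify_intent` stub are illustrative assumptions only; the real router uses a BERT classifier and the pool defined in `config/config.yaml`.

```python
# Hypothetical sketch of intent-based model routing.
# MODEL_POOL names are placeholders, not the repo's actual configuration.
MODEL_POOL = {
    "math": "math-specialist-model",
    "creative": "creative-writing-model",
    "general": "general-purpose-model",
}

def classify_intent(prompt: str) -> str:
    """Stand-in for the BERT classifier: a crude keyword heuristic."""
    lowered = prompt.lower()
    if any(k in lowered for k in ("integral", "solve", "equation")):
        return "math"
    if any(k in lowered for k in ("poem", "story", "write a")):
        return "creative"
    return "general"

def route(prompt: str) -> str:
    """Pick the backend model for this prompt's classified intent."""
    return MODEL_POOL[classify_intent(prompt)]

print(route("Solve the equation x^2 - 4 = 0"))  # math-specialist-model
print(route("Write a poem about autumn"))       # creative-writing-model
```

The real system makes the same kind of decision, but with a learned classifier rather than keywords, inside Envoy's request path.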
The detailed design doc can be found here.
The screenshot below shows the LLM Router dashboard in Grafana.
The router has two implementations: Golang (with a Rust FFI based on Candle) and Python. Benchmarking will determine which implementation performs best.
Start Envoy, which listens for incoming requests and applies the ExtProc filter:

```shell
make run-envoy
```
Then build the Rust binding and the Go router, and start the ExtProc gRPC server that Envoy communicates with:

```shell
make run-router
```
Once both Envoy and the router are running, you can test the routing logic using predefined prompts:

```shell
make test-prompt
```
This sends curl requests simulating different types of user prompts (Math, Creative Writing, General) to the Envoy endpoint (`http://localhost:8801`). The router should direct each to the appropriate backend model configured in `config/config.yaml`.
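For a manual test, the request shape is a standard OpenAI chat-completions payload. The sketch below builds such a payload in Python; the `/v1/chat/completions` path and the `"auto"` model placeholder are assumptions, not taken from the repo, and the actual send is commented out so it only runs with Envoy and the router up.

```python
import json
from urllib import request

# Assumed endpoint path; only the host:port is given in this README.
ENVOY_URL = "http://localhost:8801/v1/chat/completions"

payload = {
    "model": "auto",  # placeholder; the router may pick the real model
    "messages": [{"role": "user", "content": "Solve 2x + 6 = 0 for x"}],
}

body = json.dumps(payload).encode("utf-8")
req = request.Request(
    ENVOY_URL, data=body, headers={"Content-Type": "application/json"}
)
# Uncomment with both Envoy and the router running:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
print(body.decode("utf-8"))
```

A math prompt like this one should be routed to whichever backend model `config/config.yaml` designates for Math.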