15 changes: 12 additions & 3 deletions README.md
@@ -27,11 +27,11 @@ Easy, advanced inference platform for large language models on Kubernetes
## Feature Overview

- **Ease of Use**: People can quickly deploy an LLM service with minimal configuration.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for high performance, like [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Scaling Efficiency (WIP)**: llmaz works smoothly with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
- **Accelerator Fungibility (WIP)**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference (WIP)**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677) to run on Kubernetes.
- **Various Model Providers**: llmaz automatically loads models from various providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores(aliyun OSS, more on the way).
- **SOTA Inference**: llmaz supports the latest cutting-edge research like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677) (WIP) to run on Kubernetes.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), and object stores (Aliyun OSS, with more on the way). llmaz automatically handles model loading, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 1.

## Quick Start
@@ -110,10 +110,19 @@ curl http://localhost:8080/v1/completions \
## Roadmap

- Gateway support for traffic routing
- Metrics support
- Serverless support for cloud-agnostic users
- CLI tool support
- Model training and fine-tuning in the long term

## Project Structure

```structure
llmaz # root
├── llmaz # where the model loader logic lives
├── pkg # where the main logic for the Kubernetes controllers lives
```

## Contributions

🚀 All kinds of contributions are welcome! Please follow [Contributing](./CONTRIBUTING.md). Thanks to all these contributors.
49 changes: 29 additions & 20 deletions api/core/v1alpha1/model_types.go
@@ -92,9 +92,9 @@ type Flavor struct {
// the requests here will be covered.
// +optional
Requests v1.ResourceList `json:"requests,omitempty"`
// NodeSelector defines the labels to filter specified nodes, like
// cloud-provider.com/accelerator: nvidia-a100.
// NodeSelector will be auto injected to the Pods as scheduling primitives.
// NodeSelector represents the node candidates for Pod placement; if a node doesn't
// meet the nodeSelector, it will be filtered out by the resourceFungibility scheduler plugin.
// If nodeSelector is empty, every node is a candidate.
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
// Params stores other useful parameters and will be consumed by the autoscaling components
@@ -107,39 +107,47 @@ type Flavor struct {

type ModelName string

// ModelClaim represents the references to one model.
// It's a simple config for most of the cases compared to multiModelsClaim.
// ModelClaim represents a claim for a single model; it's the Standard claim mode
// of multiModelsClaim, compared to other modes like SpeculativeDecoding.
type ModelClaim struct {
// ModelName represents a list of models, there maybe multiple models here
// to support state-of-the-art technologies like speculative decoding.
// ModelName represents the name of the Model.
ModelName ModelName `json:"modelName,omitempty"`
// InferenceFlavors represents a list of flavors with fungibility supports
// to serve the model. The flavor names should be a subset of the model
// configured flavors. If not set, will use the model configured flavors.
// InferenceFlavors represents a list of flavors with fungibility support
// to serve the model.
// If set, the flavor names should be a subset of the model's configured flavors.
// If not set, the model's configured flavors will be used by default.
// +optional
InferenceFlavors []FlavorName `json:"inferenceFlavors,omitempty"`
}
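As a rough illustration of this claim mode, here is a minimal Go sketch; the import path and the model and flavor names are assumptions for the example, not values from this repo.

```go
package main

import (
	"fmt"

	// Import path is assumed for illustration; adjust to the real module path.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
)

func main() {
	// Claim a single model and restrict serving to a subset of its configured flavors.
	claim := coreapi.ModelClaim{
		ModelName:        coreapi.ModelName("llama3-8b"),   // hypothetical model name
		InferenceFlavors: []coreapi.FlavorName{"a100"},     // must be a subset of the model's flavors
	}
	fmt.Printf("claiming %q with flavors %v\n", claim.ModelName, claim.InferenceFlavors)
}
```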

// MultiModelsClaim represents the references to multiple models.
// It's an advanced and more complicated config comparing to modelClaim.
type InferenceMode string

const (
Standard InferenceMode = "Standard"
SpeculativeDecoding InferenceMode = "SpeculativeDecoding"
)

// MultiModelsClaim represents a claim for multiple models with different claim modes,
// like Standard or SpeculativeDecoding, to support different inference scenarios.
type MultiModelsClaim struct {
// ModelNames represents a list of models; there may be multiple models here
// to support state-of-the-art technologies like speculative decoding.
// If the inferenceMode is SpeculativeDecoding, the first model is the target model
// and the second model is the draft model.
// +kubebuilder:validation:MinItems=1
ModelNames []ModelName `json:"modelNames,omitempty"`
// InferenceMode represents the paradigm used to serve the model, either in a standard way
// or via an advanced technique like SpeculativeDecoding.
// +kubebuilder:default=Standard
// +kubebuilder:validation:Enum={Standard,SpeculativeDecoding}
// +optional
InferenceMode InferenceMode `json:"inferenceMode,omitempty"`
// InferenceFlavors represents a list of flavors with fungibility support
// to serve the model.
// - If not set, the flavors configured on the 0-index model apply by default.
// - If set, the flavor names will be looked up following the model order.
// +optional
InferenceFlavors []FlavorName `json:"inferenceFlavors,omitempty"`
// Rate works only when multiple claims are declared; it represents the replica ratio of
// the sub-workloads, e.g. when claim1.rate:claim2.rate = 1:2 and the workload defines 3 replicas,
// sub-workload1 will have 1 replica and sub-workload2 will have 2 replicas.
// This is mostly designed for the state-of-the-art technique called Splitwise, where the prefill
// and decode phases are separated and require different accelerators.
// The replicas should be divisible by the sum of the rates.
Rate *int32 `json:"rate,omitempty"`
}
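A hedged sketch of the new claim shape and of the rate arithmetic described in the Rate comment above; the import path, the model names, and the splitByRate helper are illustrative assumptions, not part of this PR.

```go
package main

import (
	"fmt"

	// Assumed import path for the core API types.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
)

// splitByRate mirrors the example in the Rate comment: rates 1:2 with 3 replicas
// yields 1 and 2 replicas for the sub-workloads. It assumes replicas is divisible
// by the sum of the rates.
func splitByRate(replicas int32, rates []int32) []int32 {
	var sum int32
	for _, r := range rates {
		sum += r
	}
	out := make([]int32, len(rates))
	for i, r := range rates {
		out[i] = replicas / sum * r
	}
	return out
}

func main() {
	// Speculative decoding: the target model comes first, the draft model second.
	claim := coreapi.MultiModelsClaim{
		ModelNames:    []coreapi.ModelName{"llama3-70b", "llama3-8b-draft"}, // hypothetical names
		InferenceMode: coreapi.SpeculativeDecoding,
	}
	fmt.Println(claim.InferenceMode, splitByRate(3, []int32{1, 2})) // SpeculativeDecoding [1 2]
}
```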

// ModelSpec defines the desired state of Model
Expand All @@ -151,7 +159,8 @@ type ModelSpec struct {
// the model such as loading from huggingface, OCI registry, s3, host path and so on.
Source ModelSource `json:"source"`
// InferenceFlavors represents the accelerator requirements to serve the model.
// Flavors are fungible following the priority of slice order.
// Flavors are fungible following the priority represented by the slice order.
// +kubebuilder:validation:MaxItems=8
// +optional
InferenceFlavors []Flavor `json:"inferenceFlavors,omitempty"`
}
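To make the flavor fungibility above concrete (including the nodeSelector semantics from the Flavor hunk earlier in this file), here is a small sketch; the import paths, the GPU resource name, and the node labels are assumptions taken from common conventions and from the comment this PR removes.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"

	// Assumed import path for the core API types.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
)

func main() {
	// Two flavors in priority order: A100 nodes are preferred, T4 nodes are the fallback.
	// Nodes not matching a flavor's nodeSelector are filtered out by the
	// resourceFungibility scheduler plugin; an empty nodeSelector matches every node.
	flavors := []coreapi.Flavor{
		{
			Requests:     v1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
			NodeSelector: map[string]string{"cloud-provider.com/accelerator": "nvidia-a100"},
		},
		{
			Requests:     v1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
			NodeSelector: map[string]string{"cloud-provider.com/accelerator": "nvidia-t4"},
		},
	}
	fmt.Printf("%d fungible flavors declared, priority follows slice order\n", len(flavors))
}
```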
5 changes: 0 additions & 5 deletions api/core/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions api/inference/v1alpha1/config_types.go
@@ -39,6 +39,7 @@ type BackendConfig struct {
// +optional
Version *string `json:"version,omitempty"`
// Args represents the arguments passed to the backend.
// You can add new args or overwrite the default args.
// +optional
Args []string `json:"args,omitempty"`
// Envs represents the environments set to the container.
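A brief sketch of adding or overriding backend args; the import path is assumed, and the vLLM flags are only examples of arguments a user might pass.

```go
package main

import (
	"fmt"

	// Assumed import path for the inference API types.
	inferenceapi "github.com/inftyai/llmaz/api/inference/v1alpha1"
)

func main() {
	version := "v0.5.0" // illustrative backend version
	cfg := inferenceapi.BackendConfig{
		Version: &version,
		// Args are merged with the backend defaults: new flags are added, and
		// flags that repeat a default are expected to overwrite it.
		Args: []string{"--max-model-len=8192", "--gpu-memory-utilization=0.90"},
	}
	fmt.Printf("backend %s args: %v\n", *cfg.Version, cfg.Args)
}
```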
18 changes: 8 additions & 10 deletions api/inference/v1alpha1/playground_types.go
@@ -28,19 +28,17 @@ type PlaygroundSpec struct {
// +kubebuilder:default=1
// +optional
Replicas *int32 `json:"replicas,omitempty"`
// ModelClaim represents one modelClaim, it's a simple configuration
// compared to multiModelsClaims only work for one model and one claim.
// ModelClaim and multiModelsClaims are exclusive configured.
// Note: properties (nodeSelectors, resources, e.g.) of the model flavors
// will be applied to the workload if not exist.
// ModelClaim represents a claim for a single model; it's the Standard claim mode
// of multiModelsClaim, compared to other modes like SpeculativeDecoding.
// Most of the time, modelClaim is enough.
// ModelClaim and multiModelsClaim are mutually exclusive.
// +optional
ModelClaim *coreapi.ModelClaim `json:"modelClaim,omitempty"`
// MultiModelsClaims represents multiple modelClaim, which is useful when different
// sub-workload has different accelerator requirements, like the state-of-the-art
// technology called splitwise, the workload template is shared by both.
// ModelClaim and multiModelsClaims are exclusive configured.
// MultiModelsClaim represents a claim for multiple models with different claim modes,
// like Standard or SpeculativeDecoding, to support different inference scenarios.
// ModelClaim and multiModelsClaim are mutually exclusive.
// +optional
MultiModelsClaims []coreapi.MultiModelsClaim `json:"multiModelsClaims,omitempty"`
MultiModelsClaim *coreapi.MultiModelsClaim `json:"multiModelsClaim,omitempty"`
// BackendConfig represents the inference backend configuration
// under the hood, e.g. vLLM, which is the default backend.
// +optional
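A hedged sketch of the two exclusive ways to claim models in a Playground after this change; the import paths and model names are assumptions.

```go
package main

import (
	"fmt"

	// Assumed import paths.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
	inferenceapi "github.com/inftyai/llmaz/api/inference/v1alpha1"
)

func main() {
	replicas := int32(1)

	// Common case: claim a single model via modelClaim and leave multiModelsClaim unset.
	simple := inferenceapi.PlaygroundSpec{
		Replicas:   &replicas,
		ModelClaim: &coreapi.ModelClaim{ModelName: "qwen2-7b"}, // hypothetical model name
	}

	// Advanced case: claim multiple models via multiModelsClaim; modelClaim must then
	// stay nil because the two fields are exclusive.
	advanced := inferenceapi.PlaygroundSpec{
		Replicas: &replicas,
		MultiModelsClaim: &coreapi.MultiModelsClaim{
			ModelNames:    []coreapi.ModelName{"llama3-70b", "llama3-8b-draft"},
			InferenceMode: coreapi.SpeculativeDecoding,
		},
	}

	fmt.Println(simple.ModelClaim.ModelName, advanced.MultiModelsClaim.InferenceMode)
}
```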
11 changes: 3 additions & 8 deletions api/inference/v1alpha1/service_types.go
@@ -27,14 +27,9 @@ import (
// Service controller will maintain multiple flavors of workloads with
// different accelerators for cost or performance considerations.
type ServiceSpec struct {
// MultiModelsClaims represents multiple modelClaim, which is useful when different
// sub-workload has different accelerator requirements, like the state-of-the-art
// technology called splitwise, the workload template is shared by both.
// Most of the time, one modelClaim is enough.
// Note: properties (nodeSelectors, resources, e.g.) of the model flavors
// will be applied to the workload if not exist.
// +kubebuilder:validation:MinItems=1
MultiModelsClaims []coreapi.MultiModelsClaim `json:"multiModelsClaims,omitempty"`
// MultiModelsClaim represents a claim for multiple models with different claim modes,
// like Standard or SpeculativeDecoding, to support different inference scenarios.
MultiModelsClaim coreapi.MultiModelsClaim `json:"multiModelsClaim,omitempty"`
// WorkloadTemplate defines the underlying workload layout and configuration.
// Note: the LWS spec might be tweaked across various LWS instances to support
// accelerator fungibility or other cutting-edge research.
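A minimal sketch of the new single-claim ServiceSpec; the import paths and the model name are assumptions, and WorkloadTemplate is omitted because its definition is not part of this diff.

```go
package main

import (
	"fmt"

	// Assumed import paths.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
	inferenceapi "github.com/inftyai/llmaz/api/inference/v1alpha1"
)

func main() {
	svc := inferenceapi.ServiceSpec{
		// The Service now carries exactly one multiModelsClaim instead of a list.
		MultiModelsClaim: coreapi.MultiModelsClaim{
			ModelNames:    []coreapi.ModelName{"llama3-70b"}, // hypothetical model name
			InferenceMode: coreapi.Standard,
		},
	}
	fmt.Printf("serving %v in %s mode\n", svc.MultiModelsClaim.ModelNames, svc.MultiModelsClaim.InferenceMode)
}
```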
18 changes: 5 additions & 13 deletions api/inference/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

22 changes: 11 additions & 11 deletions client-go/applyconfiguration/core/v1alpha1/multimodelsclaim.go

Some generated files are not rendered by default.

23 changes: 9 additions & 14 deletions client-go/applyconfiguration/inference/v1alpha1/playgroundspec.go

Some generated files are not rendered by default.

21 changes: 8 additions & 13 deletions client-go/applyconfiguration/inference/v1alpha1/servicespec.go

Some generated files are not rendered by default.
