15 changes: 12 additions & 3 deletions README.md
@@ -27,11 +27,11 @@ Easy, advanced inference platform for large language models on Kubernetes
## Feature Overview

- **Ease of Use**: People can quickly deploy an LLM service with minimal configuration.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for high performance, like [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Scaling Efficiency (WIP)**: llmaz works smoothly with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
- **Accelerator Fungibility (WIP)**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference (WIP)**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677) to run on Kubernetes.
- **Various Model Providers**: llmaz automatically loads models from various providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores(aliyun OSS, more on the way).
- **SOTA Inference**: llmaz supports the latest cutting-edge research like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677) (WIP) to run on Kubernetes.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), and object stores (Aliyun OSS, with more on the way). llmaz automatically handles model loading, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 1.

## Quick Start
@@ -110,10 +110,19 @@ curl http://localhost:8080/v1/completions \
## Roadmap

- Gateway support for traffic routing
- Metrics support
- Serverless support for cloud-agnostic users
- CLI tool support
- Model training and fine-tuning in the long term

## Project Structure

```structure
llmaz # root
├── llmaz # where the model loader logic lives
├── pkg # where the main logic for the Kubernetes controllers lives
```

## Contributions

🚀 All kinds of contributions are welcome! Please follow [Contributing](./CONTRIBUTING.md). Thanks to all these contributors.
49 changes: 29 additions & 20 deletions api/core/v1alpha1/model_types.go
@@ -92,9 +92,9 @@ type Flavor struct {
// the requests here will be covered.
// +optional
Requests v1.ResourceList `json:"requests,omitempty"`
// NodeSelector defines the labels to filter specified nodes, like
// cloud-provider.com/accelerator: nvidia-a100.
// NodeSelector will be auto injected to the Pods as scheduling primitives.
// NodeSelector represents the node candidates for Pod placement; if a node doesn't
// meet the nodeSelector, it will be filtered out by the resourceFungibility scheduler plugin.
// If nodeSelector is empty, every node is a candidate.
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
// Params stores other useful parameters and will be consumed by the autoscaling components
@@ -107,39 +107,47 @@ type Flavor struct {

type ModelName string

// ModelClaim represents the references to one model.
// It's a simple config for most of the cases compared to multiModelsClaim.
// ModelClaim represents a claim for a single model; it's the Standard claim mode
// of multiModelsClaim, compared to other modes like SpeculativeDecoding.
type ModelClaim struct {
// ModelName represents a list of models, there maybe multiple models here
// to support state-of-the-art technologies like speculative decoding.
// ModelName represents the name of the Model.
ModelName ModelName `json:"modelName,omitempty"`
// InferenceFlavors represents a list of flavors with fungibility supports
// to serve the model. The flavor names should be a subset of the model
// configured flavors. If not set, will use the model configured flavors.
// InferenceFlavors represents a list of flavors with fungibility support
// to serve the model.
// If set, the flavor names should be a subset of the model's configured flavors.
// If not set, the model's configured flavors will be used by default.
// +optional
InferenceFlavors []FlavorName `json:"inferenceFlavors,omitempty"`
}
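As a rough illustration of this claim mode, here is a minimal Go sketch; the import path and the model and flavor names are assumptions for the example, not values from this repo.

```go
package main

import (
	"fmt"

	// Import path is assumed for illustration; adjust to the real module path.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
)

func main() {
	// Claim a single model and restrict serving to a subset of its configured flavors.
	claim := coreapi.ModelClaim{
		ModelName:        coreapi.ModelName("llama3-8b"),   // hypothetical model name
		InferenceFlavors: []coreapi.FlavorName{"a100"},     // must be a subset of the model's flavors
	}
	fmt.Printf("claiming %q with flavors %v\n", claim.ModelName, claim.InferenceFlavors)
}
```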

// MultiModelsClaim represents the references to multiple models.
// It's an advanced and more complicated config comparing to modelClaim.
type InferenceMode string

const (
Standard InferenceMode = "Standard"
SpeculativeDecoding InferenceMode = "SpeculativeDecoding"
)

// MultiModelsClaim represents a claim for multiple models with different claim modes,
// like Standard or SpeculativeDecoding, to support different inference scenarios.
type MultiModelsClaim struct {
// ModelNames represents a list of models; there may be multiple models here
// to support state-of-the-art technologies like speculative decoding.
// If the inferenceMode is SpeculativeDecoding, the first model is the target model
// and the second model is the draft model.
// +kubebuilder:validation:MinItems=1
ModelNames []ModelName `json:"modelNames,omitempty"`
// InferenceMode represents the paradigm used to serve the model, either in a standard way
// or via an advanced technique like SpeculativeDecoding.
// +kubebuilder:default=Standard
// +kubebuilder:validation:Enum={Standard,SpeculativeDecoding}
// +optional
InferenceMode InferenceMode `json:"inferenceMode,omitempty"`
// InferenceFlavors represents a list of flavors with fungibility support
// to serve the model.
// - If not set, the flavors configured on the 0-index model apply by default.
// - If set, the flavor names will be looked up following the model order.
// +optional
InferenceFlavors []FlavorName `json:"inferenceFlavors,omitempty"`
// Rate works only when multiple claims are declared; it represents the replica ratio of
// the sub-workloads, e.g. when claim1.rate:claim2.rate = 1:2 and the workload defines 3 replicas,
// sub-workload1 will have 1 replica and sub-workload2 will have 2 replicas.
// This is mostly designed for the state-of-the-art technique called Splitwise, where the prefill
// and decode phases are separated and require different accelerators.
// The replicas should be divisible by the sum of the rates.
Rate *int32 `json:"rate,omitempty"`
}
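A hedged sketch of the new claim shape and of the rate arithmetic described in the Rate comment above; the import path, the model names, and the splitByRate helper are illustrative assumptions, not part of this PR.

```go
package main

import (
	"fmt"

	// Assumed import path for the core API types.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
)

// splitByRate mirrors the example in the Rate comment: rates 1:2 with 3 replicas
// yields 1 and 2 replicas for the sub-workloads. It assumes replicas is divisible
// by the sum of the rates.
func splitByRate(replicas int32, rates []int32) []int32 {
	var sum int32
	for _, r := range rates {
		sum += r
	}
	out := make([]int32, len(rates))
	for i, r := range rates {
		out[i] = replicas / sum * r
	}
	return out
}

func main() {
	// Speculative decoding: the target model comes first, the draft model second.
	claim := coreapi.MultiModelsClaim{
		ModelNames:    []coreapi.ModelName{"llama3-70b", "llama3-8b-draft"}, // hypothetical names
		InferenceMode: coreapi.SpeculativeDecoding,
	}
	fmt.Println(claim.InferenceMode, splitByRate(3, []int32{1, 2})) // SpeculativeDecoding [1 2]
}
```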

// ModelSpec defines the desired state of Model
Expand All @@ -151,7 +159,8 @@ type ModelSpec struct {
// the model such as loading from huggingface, OCI registry, s3, host path and so on.
Source ModelSource `json:"source"`
// InferenceFlavors represents the accelerator requirements to serve the model.
// Flavors are fungible following the priority of slice order.
// Flavors are fungible following the priority represented by the slice order.
// +kubebuilder:validation:MaxItems=8
// +optional
InferenceFlavors []Flavor `json:"inferenceFlavors,omitempty"`
}
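To make the flavor fungibility above concrete (including the nodeSelector semantics from the Flavor hunk earlier in this file), here is a small sketch; the import paths, the GPU resource name, and the node labels are assumptions taken from common conventions and from the comment this PR removes.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"

	// Assumed import path for the core API types.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
)

func main() {
	// Two flavors in priority order: A100 nodes are preferred, T4 nodes are the fallback.
	// Nodes not matching a flavor's nodeSelector are filtered out by the
	// resourceFungibility scheduler plugin; an empty nodeSelector matches every node.
	flavors := []coreapi.Flavor{
		{
			Requests:     v1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
			NodeSelector: map[string]string{"cloud-provider.com/accelerator": "nvidia-a100"},
		},
		{
			Requests:     v1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
			NodeSelector: map[string]string{"cloud-provider.com/accelerator": "nvidia-t4"},
		},
	}
	fmt.Printf("%d fungible flavors declared, priority follows slice order\n", len(flavors))
}
```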
5 changes: 0 additions & 5 deletions api/core/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions api/inference/v1alpha1/config_types.go
@@ -39,6 +39,7 @@ type BackendConfig struct {
// +optional
Version *string `json:"version,omitempty"`
// Args represents the arguments passed to the backend.
// You can add new args or overwrite the default args.
// +optional
Args []string `json:"args,omitempty"`
// Envs represents the environments set to the container.
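A brief sketch of adding or overriding backend args; the import path is assumed, and the vLLM flags are only examples of arguments a user might pass.

```go
package main

import (
	"fmt"

	// Assumed import path for the inference API types.
	inferenceapi "github.com/inftyai/llmaz/api/inference/v1alpha1"
)

func main() {
	version := "v0.5.0" // illustrative backend version
	cfg := inferenceapi.BackendConfig{
		Version: &version,
		// Args are merged with the backend defaults: new flags are added, and
		// flags that repeat a default are expected to overwrite it.
		Args: []string{"--max-model-len=8192", "--gpu-memory-utilization=0.90"},
	}
	fmt.Printf("backend %s args: %v\n", *cfg.Version, cfg.Args)
}
```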
18 changes: 8 additions & 10 deletions api/inference/v1alpha1/playground_types.go
@@ -28,19 +28,17 @@ type PlaygroundSpec struct {
// +kubebuilder:default=1
// +optional
Replicas *int32 `json:"replicas,omitempty"`
// ModelClaim represents one modelClaim, it's a simple configuration
// compared to multiModelsClaims only work for one model and one claim.
// ModelClaim and multiModelsClaims are exclusive configured.
// Note: properties (nodeSelectors, resources, e.g.) of the model flavors
// will be applied to the workload if not exist.
// ModelClaim represents a claim for a single model; it's the Standard claim mode
// of multiModelsClaim, compared to other modes like SpeculativeDecoding.
// Most of the time, modelClaim is enough.
// ModelClaim and multiModelsClaim are mutually exclusive.
// +optional
ModelClaim *coreapi.ModelClaim `json:"modelClaim,omitempty"`
// MultiModelsClaims represents multiple modelClaim, which is useful when different
// sub-workload has different accelerator requirements, like the state-of-the-art
// technology called splitwise, the workload template is shared by both.
// ModelClaim and multiModelsClaims are exclusive configured.
// MultiModelsClaim represents a claim for multiple models with different claim modes,
// like Standard or SpeculativeDecoding, to support different inference scenarios.
// ModelClaim and multiModelsClaim are mutually exclusive.
// +optional
MultiModelsClaims []coreapi.MultiModelsClaim `json:"multiModelsClaims,omitempty"`
MultiModelsClaim *coreapi.MultiModelsClaim `json:"multiModelsClaim,omitempty"`
// BackendConfig represents the inference backend configuration
// under the hood, e.g. vLLM, which is the default backend.
// +optional
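A hedged sketch of the two exclusive ways to claim models in a Playground after this change; the import paths and model names are assumptions.

```go
package main

import (
	"fmt"

	// Assumed import paths.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
	inferenceapi "github.com/inftyai/llmaz/api/inference/v1alpha1"
)

func main() {
	replicas := int32(1)

	// Common case: claim a single model via modelClaim and leave multiModelsClaim unset.
	simple := inferenceapi.PlaygroundSpec{
		Replicas:   &replicas,
		ModelClaim: &coreapi.ModelClaim{ModelName: "qwen2-7b"}, // hypothetical model name
	}

	// Advanced case: claim multiple models via multiModelsClaim; modelClaim must then
	// stay nil because the two fields are exclusive.
	advanced := inferenceapi.PlaygroundSpec{
		Replicas: &replicas,
		MultiModelsClaim: &coreapi.MultiModelsClaim{
			ModelNames:    []coreapi.ModelName{"llama3-70b", "llama3-8b-draft"},
			InferenceMode: coreapi.SpeculativeDecoding,
		},
	}

	fmt.Println(simple.ModelClaim.ModelName, advanced.MultiModelsClaim.InferenceMode)
}
```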
11 changes: 3 additions & 8 deletions api/inference/v1alpha1/service_types.go
@@ -27,14 +27,9 @@ import (
// Service controller will maintain multiple flavors of workloads with
// different accelerators for cost or performance considerations.
type ServiceSpec struct {
// MultiModelsClaims represents multiple modelClaim, which is useful when different
// sub-workload has different accelerator requirements, like the state-of-the-art
// technology called splitwise, the workload template is shared by both.
// Most of the time, one modelClaim is enough.
// Note: properties (nodeSelectors, resources, e.g.) of the model flavors
// will be applied to the workload if not exist.
// +kubebuilder:validation:MinItems=1
MultiModelsClaims []coreapi.MultiModelsClaim `json:"multiModelsClaims,omitempty"`
// MultiModelsClaim represents a claim for multiple models with different claim modes,
// like Standard or SpeculativeDecoding, to support different inference scenarios.
MultiModelsClaim coreapi.MultiModelsClaim `json:"multiModelsClaim,omitempty"`
// WorkloadTemplate defines the underlying workload layout and configuration.
// Note: the LWS spec might be tweaked across various LWS instances to support
// accelerator fungibility or other cutting-edge research.
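A minimal sketch of the new single-claim ServiceSpec; the import paths and the model name are assumptions, and WorkloadTemplate is omitted because its definition is not part of this diff.

```go
package main

import (
	"fmt"

	// Assumed import paths.
	coreapi "github.com/inftyai/llmaz/api/core/v1alpha1"
	inferenceapi "github.com/inftyai/llmaz/api/inference/v1alpha1"
)

func main() {
	svc := inferenceapi.ServiceSpec{
		// The Service now carries exactly one multiModelsClaim instead of a list.
		MultiModelsClaim: coreapi.MultiModelsClaim{
			ModelNames:    []coreapi.ModelName{"llama3-70b"}, // hypothetical model name
			InferenceMode: coreapi.Standard,
		},
	}
	fmt.Printf("serving %v in %s mode\n", svc.MultiModelsClaim.ModelNames, svc.MultiModelsClaim.InferenceMode)
}
```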
18 changes: 5 additions & 13 deletions api/inference/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

22 changes: 11 additions & 11 deletions client-go/applyconfiguration/core/v1alpha1/multimodelsclaim.go

Some generated files are not rendered by default.

23 changes: 9 additions & 14 deletions client-go/applyconfiguration/inference/v1alpha1/playgroundspec.go

Some generated files are not rendered by default.

21 changes: 8 additions & 13 deletions client-go/applyconfiguration/inference/v1alpha1/servicespec.go

Some generated files are not rendered by default.
