diff --git a/docs/design/queue.md b/docs/design/queue.md
new file mode 100644
index 0000000000..9ddd6d0960
--- /dev/null
+++ b/docs/design/queue.md
@@ -0,0 +1,113 @@
+# Queue
+
+[@k82cn](http://github.com/k82cn); April 17, 2019
+
+## Motivation
+
+`Queue` was introduced in [kube-batch](http://github.com/kubernetes-sigs/kube-batch) a long time ago as an internal feature; all jobs are submitted to the same queue, named `default`. As more and more users want to share resources with each other through queues, this proposal covers the primary queue features needed to achieve that.
+
+## Function Specification
+
+A queue is a cluster-level resource, so users from different namespaces can share resources within a `Queue`. The following section defines the API of the queue.
+
+### API
+
+```go
+type Queue struct {
+    metav1.TypeMeta `json:",inline"`
+
+    metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
+
+    // Specification of the desired behavior of a queue
+    // +optional
+    Spec QueueSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
+
+    // Current status of Queue
+    // +optional
+    Status QueueStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
+}
+
+type QueueSpec struct {
+    // The weight of the queue, used to share resources among queues.
+    Weight int32 `json:"weight,omitempty" protobuf:"varint,1,opt,name=weight"`
+}
+
+type QueueStatus struct {
+    // The number of jobs in Unknown status
+    Unknown int32 `json:"unknown,omitempty" protobuf:"varint,1,opt,name=unknown"`
+    // The number of jobs in Running status
+    Running int32 `json:"running,omitempty" protobuf:"varint,2,opt,name=running"`
+    // The number of jobs in Pending status
+    Pending int32 `json:"pending,omitempty" protobuf:"varint,3,opt,name=pending"`
+    // The number of jobs in Completed status
+    Completed int32 `json:"completed,omitempty" protobuf:"varint,4,opt,name=completed"`
+    // The number of jobs in Failed status
+    Failed int32 `json:"failed,omitempty" protobuf:"varint,5,opt,name=failed"`
+    // The number of jobs in Aborted status
+    Aborted int32 `json:"aborted,omitempty" protobuf:"varint,6,opt,name=aborted"`
+}
+```
+
+### QueueController
+
+The `QueueController` manages the lifecycle of a queue:
+
+1. It watches `PodGroup`/`Job` objects for status changes
+2. If a `Queue` is deleted, it also deletes all related `PodGroup`/`Job` objects in that queue
+
+### Admission Controller
+
+The admission controller checks the queue of a `PodGroup`/`Job` at creation time:
+
+1. If the queue does not exist, the creation is rejected
+2. If the queue is releasing, the creation is also rejected
+
+### Feature Interaction
+
+#### Customized Job/PodGroup
+
+If a `PodGroup` is created by a customized controller, the `QueueController` counts it under the `Unknown` status, because a `PodGroup` focuses on the scheduling specification, which does not include the customized job's status.
+
+#### cli
+
+The command line is also enhanced for operators. Three sub-commands are introduced as follows:
+
+__create__:
+
+The `create` command creates a queue with a weight; for example, the following command creates a queue named `myqueue` with weight 10.
+
+```shell
+$ vkctl queue create --name myqueue --weight 10
+```
+
+__view__:
+
+The `view` command shows the details of a queue, e.g. creation time; the following command shows the details of queue `myqueue`:
+
+```shell
+$ vkctl queue view myqueue
+```
+
+__list__:
+
+The `list` command shows all queues available to the current user:
+
+```shell
+$ vkctl queue list
+Name       Weight  Total  Pending  Running ...
+myqueue    10      10     5        5
+```
+
+#### Scheduler
+
+* Proportion plugin:
+
+  The proportion plugin shares resources between `Queue`s by weight. The deserved resource of a queue is `(weight/total-weight) * total-resource`. When allocating resources, the plugin will not allocate a queue more than its deserved resources.
+
+* Reclaim action:
+
+  The `reclaim` action goes through all queues and reclaims resources from other queues according to the return value of `ReclaimableFn`; the time complexity is `O(n^2)`. In `ReclaimableFn`, both `proportion` and `gang` take effect: 1. `proportion` makes sure the queue will not be under-used after reclaim, 2. `gang` makes sure the job will not be reclaimed if its `minAvailable` > 1.
+
+* Backfill action:
+
+  When the `allocate` action assigns resources to each queue, resources may sit idle unnecessarily because of the `proportion` plugin ([kube-batch#492]()): for example, two queues each have one pending job, and the deserved resources of each queue cannot meet the requirement of its job. In that case, the `backfill` action ignores the deserved guarantee of the queues and fills idle resources as much as possible. This introduces another potential problem, namely that an incoming smaller job is blocked; that case will be handled by reserved resources for each queue in another project.
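+
+As a worked instance of the proportion formula above, consider a hypothetical cluster with 16 CPUs shared by two queues, `q1` with weight 1 and `q2` with weight 3. The queue names and all numbers here are assumptions chosen for illustration; they do not come from the scheduler itself.
+
+```latex
+\begin{aligned}
+\text{deserved}(q) &= \frac{\text{weight}(q)}{\text{total-weight}} \times \text{total-resource} \\
+\text{deserved}(q_1) &= \frac{1}{1 + 3} \times 16 = 4 \text{ CPUs} \\
+\text{deserved}(q_2) &= \frac{3}{1 + 3} \times 16 = 12 \text{ CPUs}
+\end{aligned}
+```
+
+Under the same assumptions, if the only pending job in `q1` requests 6 CPUs and the only pending job in `q2` requests 14 CPUs, neither request fits within its queue's deserved share, so the CPUs would stay idle under the proportion limit. This is the kind of situation the `backfill` action is meant to resolve.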