Add New Resource API proposal #782

Closed

Contributor

@vikaschoudhary16 vikaschoudhary16 commented Jul 6, 2017

Notes for reviewers

First proposal submitted to the community repo; please advise if anything is wrong with the format, procedure, etc.

cc @aveshagarwal @jeremyeder @derekwaynecarr @vishh @jiayingz

Signed-off-by: vikaschoudhary16 vichoudh@redhat.com

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the labels cncf-cla: no (the PR's author has not signed the CNCF CLA) and cncf-cla: yes (the PR's author has signed the CNCF CLA) and removed cncf-cla: no, Jul 6, 2017
// +patchMergeKey=key
// +patchStrategy=merge
Key string
// Example 0.1, intel etc
Contributor

What is this?

Contributor Author

These are examples of Values. For example, if Key is 'version', Values can contain '0.1'; if Key is 'vendor', Values can contain 'intel'. I will make it clearer.

ResourceSelectorOpExists ResourceSelectorOperator = "Exists"
ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
ResourceSelectorOpGt ResourceSelectorOperator = "Gt"
ResourceSelectorOpLt ResourceSelectorOperator = "Lt"
Contributor

You mentioned Equal To on line 43 but no Eq operator here.

Contributor Author

Will correct line 43 to use "In" in place of "Equal to".

I think using In for Eq causes confusion.
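To make the operator discussion concrete, here is a minimal sketch of how a single match expression might be evaluated against a device's key-value attributes. The `Requirement` type, its `Matches` method, and the attribute names are hypothetical, and the assumption that `Gt` and `Lt` compare one integer value is ours, not the proposal's. The sketch shows that `In` with a single value behaves like an equality test, which is the substitution being debated here.

```go
package main

import (
	"fmt"
	"strconv"
)

// Operator names follow the constants quoted in the diff; "In" is assumed
// to exist alongside them, as the thread implies.
type Operator string

const (
	OpIn           Operator = "In"
	OpExists       Operator = "Exists"
	OpDoesNotExist Operator = "DoesNotExist"
	OpGt           Operator = "Gt"
	OpLt           Operator = "Lt"
)

// Requirement is a hypothetical (key, operator, values) triple.
type Requirement struct {
	Key      string
	Operator Operator
	Values   []string
}

// Matches reports whether a device's attributes satisfy the requirement.
// Assumption: Gt and Lt take exactly one integer value.
func (r Requirement) Matches(attrs map[string]string) bool {
	got, ok := attrs[r.Key]
	switch r.Operator {
	case OpExists:
		return ok
	case OpDoesNotExist:
		return !ok
	case OpIn:
		if !ok {
			return false
		}
		for _, v := range r.Values {
			if v == got {
				return true
			}
		}
		return false
	case OpGt, OpLt:
		if !ok || len(r.Values) != 1 {
			return false
		}
		have, err1 := strconv.ParseInt(got, 10, 64)
		want, err2 := strconv.ParseInt(r.Values[0], 10, 64)
		if err1 != nil || err2 != nil {
			return false
		}
		if r.Operator == OpGt {
			return have > want
		}
		return have < want
	}
	return false
}

func main() {
	gpu := map[string]string{"vendor": "intel", "memoryGB": "8"}
	eq := Requirement{Key: "vendor", Operator: OpIn, Values: []string{"intel"}}
	gt := Requirement{Key: "memoryGB", Operator: OpGt, Values: []string{"4"}}
	fmt.Println(eq.Matches(gpu), gt.Matches(gpu))
}
```

A dedicated `Eq` operator would add nothing functionally over single-value `In`; the question raised in this thread is one of readability.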

* User can view the current usage/availability details about the resource class using kubectl.

### User story
Admin knows what all devices are present on nodes and have deployed corresponding device plugins. Device plugins will make devices appear in node status. Next admin creates resource classes that have generic/portable names and metadata which can select available devices.
Contributor

Slightly rephrased this paragraph.

The administrator has deployed device plugins to support hardware present in the cluster. The Kubelet will update its status indicating the presence of this hardware via those device plugins. To offer this hardware to applications deployed on kubernetes in a portable way, the administrator creates a number of resource classes to represent that hardware. These resource classes will include metadata about the device and selection criteria.

Contributor Author

Thanks, will update!

5. The user deletes the pod or the pod terminates
6. Kubelet reads pod object annotation for devices consumed and calls `Deallocate` on the matching Device Plugins

The scheduler is incharge of both, selecting a node and also selecting a device for requested resource classes.
Contributor

In addition to node selection, the scheduler is also responsible for selecting a device that matches the resource class requested by the user.

* Select device at pod admission while applying predicates and change all api interfaces that are required to pass selected device to container runtime manager.
* Create resource consumption state again at container runtime manager and select device.

None of the above approach seems cleaner than doing device selection at scheduler.
Contributor

Is it worth mentioning here that this decision helps to retain a clean abstraction between the container runtime and kubernetes? ISTR that was a portion of the discussion.


## Future Scope
* RBAC: It can further be explored that how to tie resource classes with RBAC like any other existing API resource objects.
* Nested Resource Classes: In future device plugins and resource classes can be extended to support the nested resource class functionality where one resource class could be comprised of a group of sub-resource classes. For example 'numa-node' resource class comprised of sub-resource classes, 'single-core'.
Contributor

We need to communicate the plans/decisions on how ResourceClasses relate to Opaque Integer Resources.

@ConnorDoyle

Contributor Author

Existing resource discovery tools that update node objects with OIRs will adapt to update node status devAllocatable with devices instead. Will add more details.

@vikaschoudhary16 vikaschoudhary16 force-pushed the resource_class branch 2 times, most recently from a3d30c3 to c7f5820 Compare July 6, 2017 19:05
@derekwaynecarr derekwaynecarr self-assigned this Jul 6, 2017
@RenaudWasTaken

@vikaschoudhary16 I don't see any mention of overlapping resources. Is that something you plan to address in another PR?

@vikaschoudhary16
Contributor Author

@RenaudWasTaken It is explained implicitly by explaining how resource class can select device with k-v metadata and how a resource class can select different devices. Overlapping is essentially the functionality which is providing portability.
I will add more details to make it more visible and explicit.
Thanks!

@RenaudWasTaken

RenaudWasTaken commented Jul 7, 2017

> @RenaudWasTaken It is explained implicitly by explaining how resource class can select device with k-v metadata

I'm also interested in how you solved selecting multiple overlapping resource classes, because it is not a trivial problem.
An example would be:

  • node has 2 GPUs:
    • 1 GPU with 4G
    • 1 GPU with 8G
  • Cluster has 2 resource classes:
    • GPU with memory > 2 (lowMemGPU)
    • GPU with memory > 4 (highMemGPU)
  • User submits pod with 2 containers:
    • The first one requests 1 lowMemGPU
    • The second one requests 1 highMemGPU

In this example if your selection algorithm is a first fit then it is very possible that you won't be able to satisfy the request because you might give the gpu with 8G to the first container.

It seems to me that your only solution is to generate all the possible permutations but that doesn't scale well...
Another solution I thought about was to have multiple algorithms and give the option to either the end user or the cluster admin to select which one he wanted to use.
But it feels like this edge case doesn't have any good solutions...

What do you think ?
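The scenario above can be reproduced with a small sketch. The `device` and `class` types and both allocation functions are hypothetical and only model the "GPU with memory > N" classes from this example; they are not the proposal's scheduler code. With the 8G GPU listed first, first fit hands it to the lowMemGPU request and then cannot satisfy highMemGPU, while a backtracking search finds a valid assignment at exponential worst-case cost.

```go
package main

import "fmt"

type device struct {
	name  string
	memGB int
}

// class is a hypothetical resource class selecting GPUs with memory > minGB.
type class struct {
	name  string
	minGB int
}

func (c class) selects(d device) bool { return d.memGB > c.minGB }

// firstFit gives each request the first free device its class selects.
func firstFit(devs []device, reqs []class) ([]string, bool) {
	used := make([]bool, len(devs))
	out := make([]string, 0, len(reqs))
	for _, r := range reqs {
		found := false
		for i, d := range devs {
			if !used[i] && r.selects(d) {
				used[i], found = true, true
				out = append(out, d.name)
				break
			}
		}
		if !found {
			return nil, false
		}
	}
	return out, true
}

// backtrack tries every assignment, undoing choices that lead to a dead end.
// The worst case is exponential, which is the scaling concern raised above.
func backtrack(devs []device, reqs []class, used []bool) ([]string, bool) {
	if len(reqs) == 0 {
		return []string{}, true
	}
	for i, d := range devs {
		if used[i] || !reqs[0].selects(d) {
			continue
		}
		used[i] = true
		if rest, ok := backtrack(devs, reqs[1:], used); ok {
			return append([]string{d.name}, rest...), true
		}
		used[i] = false
	}
	return nil, false
}

func main() {
	devs := []device{{"gpu-8g", 8}, {"gpu-4g", 4}}
	reqs := []class{{"lowMemGPU", 2}, {"highMemGPU", 4}}

	_, ok := firstFit(devs, reqs)
	fmt.Println("first fit satisfied:", ok)

	got, ok := backtrack(devs, reqs, make([]bool, len(devs)))
	fmt.Println("backtracking satisfied:", ok, got)
}
```

Whether first fit succeeds here depends only on the order in which devices are visited, which is why it is hard to reason about from the user's side.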

@vikaschoudhary16
Contributor Author

@RenaudWasTaken

> Another solution I thought about was to have multiple algorithms and give the option to either the end user or the cluster admin to select which one he wanted to use.

Thanks, I had these details in mind but missed adding them. This proposal implements a first-fit selection process. For better resource usage and scalability, the selection algorithm will be optimized in future along the lines you mentioned.

name: nvidia.high.mem
spec:
resourceSelector:
-

Is this a superfluous line?

Contributor Author

@fabiand this is yaml syntax for nested sequences, https://learn.getgrav.org/advanced/yaml#sequences


Oh obviously …
Would pulling matchExpressions into line 89 look saner?

    - matchExpressions:

Contributor Author

Ah, sure, will update. Thanks!


## Motivation
Compute resources in Kubernetes are represented as a key-value map with the key
being a string and the value being a 'Quantity' which can (optionally) be

Are you referring to OIRs here?

Contributor Author

@fabiand yes.

resourceSelector:
-
matchExpressions:
-

Maybe something missing here as well?

Contributor Author

@fabiand same as above.

@fabiand fabiand mentioned this pull request Jul 7, 2017
@saad-ali
Member

saad-ali commented Jul 7, 2017

A couple of drive-by comments:

  1. What happens when a ResourceClass is deleted before the pod referencing it? Particularly if kubelet has to enforce the limits.
  2. Also consider the case where kubelet and/or scheduler crash and lose in-memory state (and the ResourceClass object is deleted).
  3. Define which, if any, of the fields of ResourceClass are immutable, and if they are mutable what the expected behavior is.
  4. Maybe worth calling out more explicitly that this will be non-namespaced and expected to be created by cluster admins.

@vikaschoudhary16
Contributor Author

@saad-ali
Thanks for taking a look. Please find my responses as follows:

> 1. What happens when a ...

This proposal's scope is bounded by the device plugin proposal's scope, which covers devices only and not resources such as CPU and memory.
Since the device for the pod is selected by the scheduler, if the resource class gets deleted, either the pod will fail the predicate at the scheduler or the pod will already have the device info.

> 2. Also consider the case where kubelet and/or scheduler crash ...

If the scheduler crashes, scheduling will fail at predicate validation for any unscheduled pods that request the deleted resource class. Similarly, the predicate will also fail at the kubelet, because for any new pod the kubelet creates resource consumption state from the beginning.

  3. Once created, a resource class object is immutable; only its status will be updated by the scheduler later on.
  4. Sure, will mention explicitly.

- "1G"
```
Above resource class will select all the hugepages with size greater than
equal to 1 GB.

s/greater than equal to/greater than

values:
- "nic"
key: "speed"
operator: "In"

Did you mean either Eq or Gt instead of In?

- "40GBPS"
```
Above resource class will select all the NICs with speed greater than equal to
40 GBPS.

This should change based on the change above.

2. Iterate over all existing nodes in cache to figure out if there are devices
on these nodes which are selectable by resource class. If found, update the
resource class availability status in local cache.
3. Patch the status of resource class api object with availability state in locyy

s/locyy/local

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet

Maybe change this to "Preferring device selection in the scheduler (instead of the Kubelet)".

device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore to know the availability of a device,

s/to know the availability of a device/to know the quantity of the device available for scheduling/

### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore to know the availability of a device,
will have to calculate current total consumption by iterating over all the admitted
pods running on the node. This is already done today while running predicates for

Maybe rephrase as: the quantity of the device already consumed should be calculated by iterating.
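As a rough illustration of the recomputation this paragraph describes, the sketch below derives the free quantity of each resource by walking every admitted pod, with no cached state. The `pod` type, the `available` function, and the resource name are hypothetical and are not kubelet code.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for an admitted pod: resource name -> quantity.
type pod struct {
	name     string
	requests map[string]int64
}

// available recomputes free capacity from scratch: allocatable minus the sum
// of the requests of every admitted pod. Nothing is kept between calls.
func available(allocatable map[string]int64, admitted []pod) map[string]int64 {
	free := make(map[string]int64, len(allocatable))
	for res, qty := range allocatable {
		free[res] = qty
	}
	for _, p := range admitted {
		for res, qty := range p.requests {
			free[res] -= qty
		}
	}
	return free
}

func main() {
	alloc := map[string]int64{"nvidia-tesla-gpu": 4}
	pods := []pod{
		{name: "a", requests: map[string]int64{"nvidia-tesla-gpu": 1}},
		{name: "b", requests: map[string]int64{"nvidia-tesla-gpu": 2}},
	}
	fmt.Println(available(alloc, pods)["nvidia-tesla-gpu"])
}
```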

consumption state that is created at runtime for each pod, are exactly same,
current api interfaces does not allow to pass selected device to container manager
(where actually device plugin will be invoked from). This problem occurs because
devices are determined internally from resource classes while other resource

Is this not the opposite (i.e., resource classes are determined from devices)?

From the events perspective, handling for the following events will be added/updated:

### Resource Class Creation
1. Init and add resource class info into local cache

Nit: s/Init/Initialize

## Opaque Integer Resources
This API will supercede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach

s/but the special naming scheme that is part of the current OIR approach will no longer be necessary/using device plugins

Contributor Author

@balajismaniam
Not necessarily device plugins; OIRs could also be used the same way they are used today (without device plugins). Using OIRs with device plugins would be the case where a plugin has not adopted device advertisement as per the device plugin proposal and uses OIRs to advertise resources.

Got it. Thanks.

plugins.

## Opaque Integer Resources
This API will supercede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)

Nit: s/supercede/supersede

*Resource Class* is a new type, objects of which provides abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices using `matchExpressions`, a list of
(operator, key, value). A *Resource Class* object selects a device if atleast
Contributor

s/atleast/at least

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
Contributor

This section was a little confusing to read (what exactly does device selection mean?) -- maybe a concrete example would help?

Was the option considered where the scheduler is responsible for mapping resource classes from the container spec to resource names from the node capacityV2, but assigning specific devices is left to the Kubelet? IIRC a reason to delay device binding to Kubelet was to avoid publishing hardware topology to the API server.

Contributor Author

@ConnorDoyle
I have updated several initial sections of the doc to make them clearer. Device selection means selecting a device using the resource class details, which includes applying the different operators explained in the 'Resource Class' section of this document.

> scheduler is responsible for mapping resource classes from the container spec to resource names from the node capacityV2

Yes, and this section notes the challenges of that approach. I have updated it to make it clearer; I hope it is more understandable now.

> Kubelet was to avoid publishing hardware topology to the API server.

This proposal assumes that device details are updated in node status by vendor device plugins, as proposed in the device plugin proposal.

1. Get the requested resource class name and quantity from pod spec.
2. Select nodes by applying predicates according to requested quantity and Resource
class's state present in the cache.
3. On the selected node, select a Device from the stored devices info in cache
Contributor

Would it be better to say something like "concrete resource" here instead of "Device"? There's no limitation that resource classes can only represent devices right?

Contributor Author

Yes, there is a limitation. A resource class can represent only the devices that device plugins advertise in the Device structure format in the node status.

Contributor

@vikaschoudhary16 But devices may have the same name (device-id) across nodes:

  • Node1 has dev-1, which satisfies ResourceClassA
  • Node2 has dev-1, which also satisfies ResourceClassA

How do we cache the device info in ResourceClassA? Using a pattern like Node1-dev-1 and Node2-dev-1?

Contributor Author

A deviceinfo structure will be maintained in schedulercache, so each device will have a list of all the resource classes it satisfies.
Take a look: https://github.com/vikaschoudhary16/kubernetes/pull/3/files#diff-558cb8bde14dca10a3151bfc222a3aae

be addressed by [device plugin proposal](https://github.com/kubernetes/community/pull/695/files)

## Resource Class
*Resource Class* is a new type, objects of which provides abstraction over
Contributor

s/provides/provide


1. A user submits a pod spec requesting 'X' resource classes.
2. The scheduler filters the nodes which do not match the resource requests.
3. scheduler selects a device for each resource class requested and annotates
Contributor

annotates the pod object

Do you have an example of what the annotation might look like?

Contributor Author

@ConnorDoyle
sure, will update. Thanks!

@vikaschoudhary16 vikaschoudhary16 force-pushed the resource_class branch 2 times, most recently from f6d6c95 to 53c2a80 Compare July 11, 2017 08:12
@vikaschoudhary16
Contributor Author

Thanks @ConnorDoyle @saad-ali @balajismaniam @RenaudWasTaken @fabiand for the review comments.
I have updated the doc to address those. PTAL!

extended to support the nested resource class functionality where one resource
class could be comprised of a group of sub-resource classes. For example 'numa-node'
resource class comprised of sub-resource classes, 'single-core'.
* Multiple device selection algorithms, each with a different selection strategy,
@RenaudWasTaken RenaudWasTaken Jul 11, 2017

I think this should be thoroughly discussed and not just a "side note" on the bottom of the design doc.
Maybe a sig-scheduling discussion ?

Contributor Author

@vikaschoudhary16 vikaschoudhary16 Jul 11, 2017

@RenaudWasTaken totally agree. The scheduler section is also updated to describe the "first fit" approach.
The example use case you quoted can be handled by a "best fit" approach, which I planned to cover in a follow-up proposal. Meanwhile, a user who doesn't want "first fit" can request devices using OIRs and bypass resource classes. That's how I am thinking about it. If the community thinks differently, I'm happy to discuss and adapt the proposal accordingly.

@vikaschoudhary16
Contributor Author

/sig scheduling

@derekwaynecarr
Member

I will prioritize reviewing this further when device plugins are agreed upon.

`scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4`
where `scheduler.alpha.kubernetes.io/resClass` is the common prefix for all the
device annotations, `test-res-class` is resource class name,
`nvidia-tesla-gpu` is the selected device name and `4` is the quantity requested.
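A minimal sketch of building and parsing an annotation in the format quoted above. The function names are hypothetical, and the parser assumes that neither the resource class name nor the device name contains an underscore, since `_` is the only separator in the key.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// prefix is the common annotation prefix quoted in the proposal text.
const prefix = "scheduler.alpha.kubernetes.io/resClass"

// deviceAnnotation builds the annotation key and value for one selection.
func deviceAnnotation(class, device string, qty int) (string, string) {
	return prefix + "_" + class + "_" + device, strconv.Itoa(qty)
}

// parseDeviceAnnotation reverses deviceAnnotation. Assumption: class and
// device names never contain an underscore.
func parseDeviceAnnotation(key, value string) (class, device string, qty int, err error) {
	parts := strings.Split(key, "_")
	if len(parts) != 3 || parts[0] != prefix {
		return "", "", 0, fmt.Errorf("not a device annotation: %q", key)
	}
	qty, err = strconv.Atoi(value)
	if err != nil {
		return "", "", 0, fmt.Errorf("bad quantity %q: %v", value, err)
	}
	return parts[1], parts[2], qty, nil
}

func main() {
	k, v := deviceAnnotation("test-res-class", "nvidia-tesla-gpu", 4)
	fmt.Println(k + "=" + v)

	class, device, qty, err := parseDeviceAnnotation(k, v)
	fmt.Println(class, device, qty, err)
}
```

Encoding both names in the key means the kubelet has to parse it back apart, which is one reason to keep the separator out of class and device names.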
Contributor

@vikaschoudhary16 OK, I see here is how the user would request an amount of one of these resources. I still think that combining the device with the resource class in the selector is going to be problematic, though.

Let me ask something... do you envision the user knowing about the devices on nodes that are providing a particular resource class? Put another way... would (should?) the proposed end user that is constructing the pod spec and specifying the resource requirement selector scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4 actually know that the "test-res-class" resources were being provided by the "nvidia-tesla-gpu" device(s)?

My guess is that, often, the end user won't know about the underlying device that is providing some amount of abstracted resources. The deployer of the cloud infrastructure knows that information, of course. But the end user that is consuming some of those resources won't necessarily know the exact vendor/device information and deployers may not actually want the end user to know the device/vendor/model specifics :-)

To respond to your comment from above that we should think of resource classes in the same vein as EC2 instance types, I'd just comment that Amazon has complete and total control over the information it gives users with regards to the amount of resources and the types of capabilities that its instance types comprise. Amazon can change (and has changed in the past) its mind about the quantities of resources and quality/vendor models associated with a particular instance type. If Kubernetes continues to be agnostic to cloud infrastructure providers, I think two things are necessary:

  1. The way that Kubernetes advertises resources and capabilities to end users should hide underlying device implementation details as much as possible. This includes hiding vendor information as much as possible.
  2. All concepts that provide a grouping/coupling mechanism between qualitative and quantitative things should be possible to specify as their individual quantitative and qualitative components.

The second point probably deserves a little more explanation. What I mean is that if you are going to expose a concept like a ResourceClass object -- something that allows the deployer to describe a collection of consumed resource amounts as well as capabilities that describe one or more providers of those consumable resources -- then the end user should be able to request that a pod consume those coupled resources and land on a node with those capabilities without using the coupled object.

In other words, if you have a ResourceClass that looks like this:

kind: ResourceClass
metadata:
  name: gpu.high.mem
spec:
  resourceSelector:
    - matchExpressions:
        - key: "gpu.vendor"
          operator: "In"
          values:
            - "nvidia"
            - "intel"
        - key: "gpu.memory"
          operator: "GtEq"
          values:
            - "4G"

the end user should be able to request a pod where a container in the pod needs 1 gpu.high.mem OR the end user should be able to request a pod where containers in the pod need 4G of resource type gpu.memory and the "gpu.vendor" annotation/selector is "intel" instead of either "intel" OR "nvidia". That's what I mean about breaking the coupling down into its finest-grained representation and allowing the end user to specify that fine-grained request.

Hope that makes sense! I recognize that sometimes, the terminology I use is overlapping and confusing, so I apologize in advance about that. I'm trying to bridge the terminology differences between the OpenStack infrastructure representation of these things with the proposed Kubernetes representation of similar ideas.

Best,
-jay

Contributor Author

@jaypipes

> The way that Kubernetes advertises resources and capabilities to end users should hide underlying device implementation details as much as possible. This includes hiding vendor information as much as possible.

That's up to the deployer. The deployer is free to omit from the resource class any keys that the user should not know about. The proposal does not make it mandatory to create resource classes with vendor-specific details.

> All concepts that provide a grouping/coupling mechanism between qualitative and quantitative things should be possible to specify as their individual quantitative and qualitative components.

I think I understand your point. The problem with this approach is that it becomes an identity mapping. Resource classes aim to create broader abstractions in which arbitrary ranges for resource properties can also be supported. There are two main problems:

  • Portability is lost: with the current approach, a resource class is an allocatable unit. Although it provides a broader abstraction, its consumed, capacity, and remaining units are still countable, so the admin knows what quota is being offered. If the user is free to choose the range of a resource property, such as gpu.memory gt 30, one cluster may support it while another may not. If resource classes are instead treated as a single allocatable unit with immutable properties, it is easier for admins to control their availability as a resource across clusters.
  • We can't expect the end user to know that much about device properties. The chance of misconfiguration would be much higher.

Thoughts?

Contributor

@ScorpioCPH ScorpioCPH left a comment

Hi, I'm interested in the mapping between resource classes and devices; I think it is the key point of this proposal.

cluster. Device plugins, running on nodes, will update node status indicating
the presence of this hardware. To offer this hardware to applications deployed
on kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes will include metadata
Contributor

Thanks for your explanation.

the presence of this hardware. To offer this hardware to applications deployed
on kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes will include metadata
about the devices as selection criteria.
Contributor

Can you give an example of how resource classes know which devices are selectable? AFAIK, as mentioned above, the administrator creates a resource class without any device or node info. Who is responsible for the mapping (ResourceClass --> Devices)?

Contributor Author

The scheduler will watch the apiserver for resource class object creation, and it already has nodeinfo in its cache. For each kind of device, a deviceinfo structure will be instantiated, and the list of deviceinfo will reside in nodeinfo. At each resource class creation, the scheduler will iterate over the deviceinfo list, and if the resource class matches the device, the deviceinfo's list of resource class references is updated.

For more details, take a look at the PoC: https://github.com/vikaschoudhary16/kubernetes/pull/3/files#diff-558cb8bde14dca10a3151bfc222a3aae
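
The cache update described in this reply can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from the PoC: `Device`, `DeviceInfo`, `NodeInfo`, `ResourceClass`, and `OnResourceClassAdded` are hypothetical names, and the selector is simplified to "property key must have one of the allowed values".

```go
package main

import "fmt"

// Device is a hypothetical view of what a device plugin advertises.
type Device struct {
	Name       string
	Properties map[string]string
}

// DeviceInfo pairs a device with the names of the resource classes that select it.
type DeviceInfo struct {
	Device  Device
	Classes []string
}

// NodeInfo is a simplified stand-in for the scheduler's per-node cache entry.
type NodeInfo struct {
	Devices []*DeviceInfo
}

// ResourceClass uses a simplified selector (an assumption for this sketch):
// property key -> list of allowed values.
type ResourceClass struct {
	Name     string
	Selector map[string][]string
}

// matches reports whether every selector key is present on the device
// with one of the allowed values.
func matches(rc ResourceClass, d Device) bool {
	for key, allowed := range rc.Selector {
		got, ok := d.Properties[key]
		if !ok {
			return false
		}
		found := false
		for _, v := range allowed {
			if v == got {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// OnResourceClassAdded walks every cached node and records the class name on
// each matching device. It returns the number of devices that matched.
func OnResourceClassAdded(cache map[string]*NodeInfo, rc ResourceClass) int {
	matched := 0
	for _, node := range cache {
		for _, di := range node.Devices {
			if matches(rc, di.Device) {
				di.Classes = append(di.Classes, rc.Name)
				matched++
			}
		}
	}
	return matched
}

func main() {
	cache := map[string]*NodeInfo{
		"node-a": {Devices: []*DeviceInfo{
			{Device: Device{Name: "gpu0", Properties: map[string]string{"Kind": "nvidia-gpu"}}},
			{Device: Device{Name: "nic0", Properties: map[string]string{"Kind": "nic"}}},
		}},
	}
	rc := ResourceClass{Name: "nvidia.gpu", Selector: map[string][]string{"Kind": {"nvidia-gpu"}}}
	fmt.Println("devices matched:", OnResourceClassAdded(cache, rc))
}
```

The sketch only covers the "resource class created" event; a real cache would also need to handle device and node updates and resource class deletion.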

1. Initialize and add resource class info into local cache
2. Iterate over all existing nodes in cache to figure out if there are devices
on these nodes which are selectable by resource class. If found, update the
resource class availability status in local cache.
Contributor

@ScorpioCPH ScorpioCPH Dec 13, 2017

@vikaschoudhary16 I'm very interested in the details of this. As commented above, how does a Resource Class know which devices are selectable?

Contributor Author

Does the reply above answer this question?

Contributor

Yes, thanks!

on the matching Device Plugins

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.
Contributor

I'm wondering how the scheduler knows which device matches the resource class?

Contributor Author

Sorry, now that the device plugin has been merged, I need to update this proposal to cover the remaining device plugin enhancements for resource classes.
In the ListAndWatch() response, the device plugin will also send device properties in an arbitrary key-value map. Using this, a Device API object will be created by the device manager. The scheduler will keep watching the API server for any new Device object creation and will keep syncing the device info in its cache.

Contributor

@ScorpioCPH ScorpioCPH Dec 13, 2017

Thanks, it sounds like this depends on the device properties exposed by the device plugin. So we can't support any matchExpressions that refer to properties the device plugin does not expose, am I right?

Contributor Author

Right. matchExpressions can only use a subset of the properties exposed by the device plugin.
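
As a rough illustration of that constraint, a validation step could report selector keys that the device plugin never advertised. This is only a sketch: `UnknownKeys` is a hypothetical helper, not part of the proposal or the device plugin API.

```go
package main

import (
	"fmt"
	"sort"
)

// UnknownKeys returns, in sorted order, the selector keys that are not among
// the properties advertised by the device plugin. An empty result means the
// selector only uses a subset of the advertised properties.
func UnknownKeys(selectorKeys []string, advertised map[string]string) []string {
	var unknown []string
	for _, k := range selectorKeys {
		if _, ok := advertised[k]; !ok {
			unknown = append(unknown, k)
		}
	}
	sort.Strings(unknown)
	return unknown
}

func main() {
	advertised := map[string]string{"Kind": "nvidia-gpu", "memory": "30G"}
	// "nvlink" is not advertised here, so a class using it could never match.
	fmt.Println(UnknownKeys([]string{"Kind", "nvlink"}, advertised))
}
```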


## Motivation
Kubernetes system knows only two resource types, 'CPU' and 'Memory'. Any other
resource can be requested by pod using opaque-integer-resource(OIR) mechanism.
Contributor

Maybe we should say ExtendedResource now?

Contributor Author

right. As we discussed on slack, will iterate.

```yaml
kind: ResourceClass
metadata:
  name: fast.nic
```

Just a side note that the device manager today does not support network interfaces. Thus - as resource classes build upon device manager - this would not work.

However, I'd love to see the device manager extended to support this one-off case of network resources as well.

Contributor

This is a key cross-sig deliverable. We have discussed it at length in the RMWG and it was deferred because we wanted it to be led by sig-network (or more likely the new Network Plumbing WG).

Contributor

Is it possible that we start with a simple model to support high-performance NICs with a combination of a CNI plugin and a device plugin? The CNI plugin takes care of network interface setup and management, and can be portable across different container orchestration systems. The device plugin can run as a sidecar container and take care of device initialization, health monitoring, and resource advertising. I know that having the device plugin act as a middleman between CNI and Kubelet may pose certain limitations. We can discuss them and see whether we can solve them by enriching the information passed between the device plugin and Kubelet. However, at least for now, I hope the information passed between the device plugin and Kubelet can stay at the resource level (such as resource name and properties) and the device level (such as device runtime configuration). This way, it is clear at the API level that Kubelet is the central place in charge of resource allocation and container runtime setup, which I feel will be easier to extend to support future features such as cross-resource affinity allocation.

Contributor

I don't think the SIG has enough resources to implement this case. I would like to see this design kept as simple as possible and hopefully merged in the 1.10 cycle.


Ack, thanks for the info.


@jiayingz technically it can also be done closer to CNI. To me it just looks like providing network connectivity via the device manager would be a much cleaner interface. This would effectively formalize how components can add additional NICs to a pod. Today this is unspecified, and everybody is pretty much using CNI in some way to achieve this.
The problem is that different projects make so many different assumptions. Eventually it even implies changing the CNI config of the host, which is not necessarily desirable.

OTOH, if we could use the device manager, we would have a known way to distribute such a plugin and a known way for it to operate, which could eventually lead to more cooperation and less of a wild west.

Member

@timothysc timothysc left a comment

I'd love to see this POC'd as a completely pluggable concept addition.

* If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
type device which has memory greater than or equal to 'X' GB, should be able
to satisfy this request, independent of other device capabilities such as
'version' or 'nvlink locality' etc.
Member

Getting too selective on non-fungible resources can be a dangerous game vs. channeling a lowest common denominator.

For a decade on grid systems we allowed arbitrary matching on any attribute via an expressive configuration language, and users eventually abused it heavily to hoard the prized resources.

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for preferring device selection at the scheduler and not at the kubelet

What if a user requests ER in pods that are not handled by the scheduler (daemonset pods or static pods)?

@resouer
Contributor

resouer commented Jan 31, 2018

@vikaschoudhary16 knock knock :) As @timothysc suggested, maybe we can arrange a PoC? Any free bandwidth?

classes which could select this device in the cache.
5. Patch the resource class objects with new 'Requested' in the `ResourceClassStatus`.
6. Add the pod reference in local DeviceToPod mapping structure in the cache.
7. Patch the pod object with selected device annotation with prefix 'scheduler.alpha.kubernetes.io/resClass'
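
Steps 5 to 7 above can be sketched as in-memory bookkeeping. This is a minimal illustration, not the proposal's implementation: `Cache`, `NewCache`, and `Assign` are hypothetical names, the `<prefix>-<class>` annotation key layout is an assumption (the text only gives the prefix), and the actual API patch calls are left out.

```go
package main

import "fmt"

// AnnotationPrefix is the prefix quoted in step 7.
const AnnotationPrefix = "scheduler.alpha.kubernetes.io/resClass"

// Cache holds the scheduler-side bookkeeping; field names are hypothetical.
type Cache struct {
	Requested   map[string]int    // resource class name -> devices requested
	DeviceToPod map[string]string // device ID -> pod name
}

// NewCache returns an empty cache with initialized maps.
func NewCache() *Cache {
	return &Cache{Requested: map[string]int{}, DeviceToPod: map[string]string{}}
}

// Assign records that pod was given device for requestedClass.
// selectingClasses lists every class that can select the device; each gets
// its 'Requested' count bumped (step 5). The pod reference is stored in the
// DeviceToPod mapping (step 6), and the returned key/value pair is the
// annotation that would be patched onto the pod (step 7).
// The key layout "<prefix>-<class>" is an assumption made for this sketch.
func (c *Cache) Assign(pod, requestedClass, device string, selectingClasses []string) (string, string, error) {
	if owner, taken := c.DeviceToPod[device]; taken {
		return "", "", fmt.Errorf("device %q is already assigned to pod %q", device, owner)
	}
	for _, class := range selectingClasses {
		c.Requested[class]++
	}
	c.DeviceToPod[device] = pod
	return AnnotationPrefix + "-" + requestedClass, device, nil
}

func main() {
	c := NewCache()
	key, value, err := c.Assign("pod-a", "nvidia.high.mem", "gpu0", []string{"nvidia.high.mem"})
	fmt.Println(key, value, err)
}
```

Keeping the bookkeeping separate from the API calls makes it easier to see which state lives only in the scheduler cache and which state ends up on API objects.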
@RenaudWasTaken RenaudWasTaken Feb 8, 2018

I would like to know the opinion of some sig-scheduling folks on the best way to achieve that.
Sending a patch request during host selection for a pod seems a bit different from the current scheduling implementation.

Maybe @bsalamat or @timothysc ?

Contributor

Sending a patch request directly here is not right.

But we can probably update the annotation of assumedPod, and update Pod api object during bind(), which is async.

@bsalamat Make sense?

Member

I generally do not like the idea of scheduler knowing about mapping of devices to resource classes. This does not fit well in scheduler's logic and does not seem to be something that scheduler should be aware of.
It is also not clear why scheduler needs to update the pod with the selected device annotation.

Contributor

@bsalamat The problem I can see here is that the device plugin is, by design, not expected to have any scheduling logic (e.g. selecting devices), and if the scheduler does not want to have it either, the only place left to select devices is the kubelet. But what this proposal describes requires awareness of the whole picture of all device info in the cluster to make a decision, which a single kubelet cannot provide (it only has the device info of its own node).

I am willing to push this design, as it helps a lot to enable GPU topology in Kubernetes and also fixes blockers for other devices like FPGAs. So it would be great to know your ideas for scheduling.

@RenaudWasTaken

RenaudWasTaken commented Feb 9, 2018

Do you mind adding a pod spec example to your design document? I'm fairly certain resource classes are supposed to be requested in the resources field, but it's probably better to have explicit confirmation :)

After discussing it a bit internally and since this is going to be discussed at the Face 2 Face, we think it might be a good idea to discuss device sharing ideas in this proposal as sharing might have a significant impact on the technical implementation.
And in general if we agree that the sharing block should be implemented with this model, it would be a pretty good argument that this design document is a good step forward.

There are two kinds of device sharing:

  • "simple", simple because you only need to express the sharing notion between the containers
  • "complex", complex because it requires expressing more than just the fact that devices are shared, it also needs to express an action by the underlying infrastructure (mainly the device plugin) for the service to properly be used.
    e.g: for GPUs MPS needs a daemon to run

Simple sharing

This seems to be something that could be expressed as a construct on top of the ResourceClass API and built into the podSpec, exactly like other sharing APIs such as volumes.

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  initContainers:
    - name: myInitContainer1
      image: "nvidia/cuda"
      resources:
        limits:
          devices: ["nvidia-gpu"]
  containers:
    - name: myInferenceContainer1
      image: "nvidia/cuda"
      resources:
        limits:
          devices: ["nvidia-gpu"]
    - name: myInferenceContainer2
      image: "nvidia/cuda"
      resources:
        limits:
          devices: ["nvidia-gpu"]
  devices:
    - name: "nvidia-gpu"
      resources:
        nvidia.high.mem: 1
--- 
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
    - matchExpressions:
        - key: "Kind"
          operator: "In"
          values:
            - "nvidia-gpu"
        - key: "memory"
          operator: "GtEq"
          values:
            - "30G"
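
To make the `In` and `GtEq` operators in the ResourceClass above concrete, here is a minimal Go sketch of how such a selector might be evaluated against a device's advertised properties. `Requirement`, `Matches`, and `parseGigabytes` are hypothetical names, and handling only a `G` suffix is a simplifying assumption; none of this is the proposal's API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Requirement mirrors one matchExpressions entry from the YAML above.
type Requirement struct {
	Key      string
	Operator string // "In" or "GtEq" in this sketch
	Values   []string
}

// parseGigabytes turns a quantity such as "30G" into 30.
// Only the "G" suffix is handled; that is an assumption of this sketch.
func parseGigabytes(s string) (int64, error) {
	if !strings.HasSuffix(s, "G") {
		return 0, fmt.Errorf("unsupported quantity %q", s)
	}
	return strconv.ParseInt(strings.TrimSuffix(s, "G"), 10, 64)
}

// Matches reports whether the device properties satisfy every requirement.
func Matches(reqs []Requirement, props map[string]string) (bool, error) {
	for _, r := range reqs {
		got, ok := props[r.Key]
		if !ok {
			return false, nil // property not advertised: cannot match
		}
		switch r.Operator {
		case "In":
			found := false
			for _, v := range r.Values {
				if v == got {
					found = true
					break
				}
			}
			if !found {
				return false, nil
			}
		case "GtEq":
			if len(r.Values) != 1 {
				return false, fmt.Errorf("GtEq needs exactly one value, got %d", len(r.Values))
			}
			have, err := parseGigabytes(got)
			if err != nil {
				return false, err
			}
			want, err := parseGigabytes(r.Values[0])
			if err != nil {
				return false, err
			}
			if have < want {
				return false, nil
			}
		default:
			return false, fmt.Errorf("unknown operator %q", r.Operator)
		}
	}
	return true, nil
}

func main() {
	reqs := []Requirement{
		{Key: "Kind", Operator: "In", Values: []string{"nvidia-gpu"}},
		{Key: "memory", Operator: "GtEq", Values: []string{"30G"}},
	}
	ok, err := Matches(reqs, map[string]string{"Kind": "nvidia-gpu", "memory": "32G"})
	fmt.Println(ok, err)
}
```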

Complex sharing

This could be expressed by adding labels to the devices field, such as the following (which could be advertised by the device plugin):

  devices:
    - name: "nvidia-gpu"
      labels: ["nvidia.com/MPS"]
      resources:
        nvidia.high.mem: 1

1. Discovery, advertisement, allocation/deallocation of devices is expected to
be addressed by [device plugin proposal](https://github.com/kubernetes/community/pull/695/files)

## Resource Class
Member

I skimmed over the doc. I think any resource class proposal should cover both node resources and cluster resources. This proposal does not cover the latter.

@ns-sundar

ns-sundar commented Feb 15, 2018

I have been lurking for a while to see how this shapes up. This proposal seems aimed only at selecting devices based on metadata. There is certainly a need to model resources in K8s, but the current proposal needs to be enhanced in many ways to reflect the variety of devices, usage models and use cases.

Let us start by stating what we would want from a resource model:

  • Many types of resources cannot be shared: once assigned to a pod/
    container, they cannot be assigned to another until the first one
    releases it (or is preempted). E.g. SR-IOV VF. So, we would want
    a way to track the inventory and usage of resources.
  • Some devices such as FPGAs can offer multiple 'regions', which can be
    programmed with different accelerators. If we represent each accelerator
    as a resource, a simple model of 'devices with resource classes' is
    not enough. We instead need a way to express that a region nested
    inside an FPGA contains an accelerator. This calls for a hierarchy.
  • FPGAs (and GPUs) often contain local memory, which is a different kind
    of resource. A user may want, say an ipsec accelerator with 2 GB of
    local memory, both of which need to come from the same device.
    Further, depending on the implementation, some of the memory may be
    dedicated to one region, while others can be shared across regions
    within the device.

To crystallize the ideas above, consider this FPGA card currently in the market [*]. It can be abstracted as below:

      dedicated <-> region <-> common <-> region <-> dedicated
        memory        A        memory       B         memory

How would such a device be represented and handled in this proposal? I think the scope needs to be broadened considerably.
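
For illustration only, the kind of nested model this calls for might look like the following Go sketch. `FPGA`, `Region`, and `FindRegion` are hypothetical names, the memory sizes in `main` are invented, and treating all common memory as available to any single region is a simplification that ignores contention between regions.

```go
package main

import "fmt"

// Region is one programmable region inside an FPGA.
type Region struct {
	Name           string
	Accelerator    string // empty if the region is not programmed
	DedicatedMemGB int    // memory usable only by this region
}

// FPGA models the device as a container of regions plus shared memory.
type FPGA struct {
	Name        string
	CommonMemGB int // memory shared across all regions of this device
	Regions     []Region
}

// FindRegion returns the first region that hosts the requested accelerator
// and can reach at least memGB of memory on the same device
// (dedicated plus common memory).
func (f FPGA) FindRegion(accelerator string, memGB int) (Region, bool) {
	for _, r := range f.Regions {
		if r.Accelerator == accelerator && r.DedicatedMemGB+f.CommonMemGB >= memGB {
			return r, true
		}
	}
	return Region{}, false
}

func main() {
	card := FPGA{
		Name:        "fpga0",
		CommonMemGB: 2,
		Regions: []Region{
			{Name: "A", Accelerator: "ipsec", DedicatedMemGB: 1},
			{Name: "B", Accelerator: "crypto", DedicatedMemGB: 1},
		},
	}
	// An ipsec accelerator with 2 GB of local memory from the same device.
	region, ok := card.FindRegion("ipsec", 2)
	fmt.Println(region.Name, ok)
}
```

A flat list of devices with properties cannot express that the accelerator and the memory must come from the same card, which is the point of the hierarchy.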

[*] This is only an example. I am not affiliated with, or own stock in, the company selling this product.

@XiaoningDing

@RenaudWasTaken In your example of complex sharing, do users have to be aware of MPS? Is it possible for the device plugin to talk to the MPS daemon to set up per-container GPU resource limits, without users knowing about MPS?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: derekwaynecarr

Assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vikaschoudhary16 vikaschoudhary16 force-pushed the resource_class branch 5 times, most recently from a2ff1dc to adb3da2 Compare April 11, 2018 09:36
Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>
@vikaschoudhary16 vikaschoudhary16 changed the title Add Resource Class proposal Add New Resource API proposal Apr 11, 2018
ComputeResource objects are similar to PV objects in storage. Each one is tied to a physical resource that can have a wide range of vendor-specific properties. Kubelet will create or update ComputeResource objects upon any resource availability or property change for node-level resources. Once a node is configured to support the ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields, to avoid counting the resource more than once. The ComputeResource API can be included in NodeStatus to facilitate resource introspection.
For cluster-level resources, a special controller or a scheduler extender can create a ComputeResource and dynamically bind it to a node during or after scheduling.

### ResourceClass API
Contributor

Will users use ResourceClass directly in their pods? How about letting them create ComputeResourceClaims (CRCs) and use CRCs in their pods, just like PV/PVC? Users can express their requirements in CRCs, and the scheduler (or some controller) can match CRCs to CRs.
And ResourceClass could be used for auto-provisioning if necessary?

@vikaschoudhary16
Contributor Author

Created a KEP, #2265, for this, so closing this one.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.