DRA: DRA Driver and ResourceSlices in Composable System #130368

@hase1128

What would you like to be added?

In a composable system, we need to consider how best to design the composable DRA driver, which manages fabric devices, and the vendor DRA driver, which manages node-local devices.
Based on KEP-5007 (kubernetes/enhancements#5012), and in particular (kubernetes/enhancements#5012 (comment)), there are three ideas:
(If I've missed something, please let me know.)

1. Move the device and update the ResourceSlice

2. Implement a device autoscaler in ClusterAutoscaler

3. Make vendor DRA driver aware of fabric devices

  • Basic concept: KEP-5007: DRA Device Binding Conditions enhancements#5012 (comment)
  • Advantages: The scheduler does not need to reschedule pods. The happy path can be used (binding can proceed directly after the device is attached). Devices do not have to be moved between ResourceSlices.
  • Problem: A large feature needs to be added to the vendor DRA driver.
  • New implementations required for the vendor DRA driver:
    • Fabric device recognition
    • Composable-related interfaces (IFs) such as attach
    • Updating ResourceSlices after attach
    • Updating BindingConditions in ResourceClaim
    • Synchronization with vendor DRA drivers on other nodes, etc.
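To make the scope of idea 3 more concrete, here is a minimal in-memory sketch of the flow listed above: the vendor driver recognizes a fabric device, attaches it, publishes it in its node's ResourceSlice, and marks the claim as ready to bind. This is not the real Kubernetes or DRA API. Every name here (`Device`, `ResourceSlice`, `ResourceClaim`, `VendorDriver`, the `"attached"` condition key) is a hypothetical stand-in, and the shared `fabric_pool` dictionary is only an assumed placeholder for whatever mechanism would synchronize drivers on different nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Device:
    name: str
    # Node the device is attached to; None while it is still in the fabric pool.
    attached_to: Optional[str] = None


@dataclass
class ResourceSlice:
    node: str
    devices: List[str] = field(default_factory=list)


@dataclass
class ResourceClaim:
    name: str
    device: str
    # Hypothetical stand-in for the binding conditions discussed in KEP-5007.
    binding_conditions: Dict[str, bool] = field(default_factory=dict)


class VendorDriver:
    """Toy per-node vendor driver that is also aware of fabric devices."""

    def __init__(self, node: str, fabric_pool: Dict[str, Device]) -> None:
        self.node = node
        # Shared view of fabric devices (assumed synchronization mechanism).
        self.fabric_pool = fabric_pool
        self.slice = ResourceSlice(node=node)

    def attach(self, claim: ResourceClaim) -> None:
        # Fabric device recognition: look the device up in the shared pool.
        device = self.fabric_pool[claim.device]
        # Synchronization: refuse a device that another node already holds.
        if device.attached_to not in (None, self.node):
            raise RuntimeError(
                f"{device.name} is already attached to {device.attached_to}"
            )
        # Composable-related interface: attach the device to this node.
        device.attached_to = self.node
        # Update this node's ResourceSlice after the attach.
        if device.name not in self.slice.devices:
            self.slice.devices.append(device.name)
        # Update the claim so that binding can proceed without rescheduling.
        claim.binding_conditions["attached"] = True


pool = {"gpu-0": Device("gpu-0")}
driver_a = VendorDriver("node-a", pool)
driver_b = VendorDriver("node-b", pool)

claim = ResourceClaim(name="claim-1", device="gpu-0")
driver_a.attach(claim)
```

Even in this toy form, the driver on each node has to own attach logic, slice updates, claim updates, and cross-node coordination, which is the "large feature" concern raised for this idea.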

I think idea 1 is good, but I would like to hear from DRA experts on which of these ideas is better, or if there are any better ideas.

/cc @pohly
/cc @klueska
/cc @johnbelamaric
/cc @KobayashiD27

/sig node

Why is this needed?

Composable disaggregated infrastructure needs this for GPUs that are connected to a node on demand:
https://kccnceu2024.sched.com/event/1ZPDw/iown-bof-challenges-of-kubernetes-for-composable-disaggregated-computing-naoki-oguchi-fujitsu-hidetsugu-sugiyama-red-hat-clara-li-intel-ryosuke-kurebayashi-ntt

    Labels

    kind/feature, needs-triage, sig/node