
dra-evolution: partitioning of devices #20

Open
pohly opened this issue Jun 5, 2024 · 4 comments

pohly commented Jun 5, 2024

This was excluded from #14 to limit the scope. It's a stretch goal for 1.31.

/assign @klueska @johnbelamaric


johnbelamaric commented Jun 14, 2024

In the 1.31 KEP, we included the APIs defined in #27, but Mrunal and Tim raised legitimate concerns with the ...verbosity... of that API.

I can think of a few alternatives we can debate, and will propose them as separate PRs in this repo.

cc @thockin @mrunalp @pohly @klueska


johnbelamaric commented Jun 14, 2024

Here are some options. Options 2 and up are built on top of option 1.

I suggest looking at the file dra-evolution/testdata/pools-two-nodes-dgxa100.yaml in each PR. This is example YAML for two 8-GPU servers based on the NVIDIA simulated devices. Real-world pools will be similar, but will probably add MORE attributes.

Option 1

  • As in the 1.31 KEP
  • No attributes shared among devices
  • Flattens (de-normalizes) all physical GPU shared resources into a single array of SharedCapacity, using the capacity name to differentiate (e.g., gpu-0-memory-block-0); see the sketch after this list
  • Minimal changes for partitionable devices in DRA evolution prototype #27
  • YAML for two DGXA100 nodes (each with eight A100 devices)
    • 341k
    • 12661 lines
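
For illustration, a rough YAML sketch of the Option 1 shape (field names are hypothetical and may not match PR #27 exactly): every physical GPU's resources land in one flat list, with the owning GPU encoded in each capacity name, which is why the file grows so large.

```yaml
# Hypothetical sketch only; not the exact prototype API.
sharedCapacity:
- name: gpu-0-memory
  capacity: 40Gi
- name: gpu-0-memory-block-0
  capacity: 1
- name: gpu-0-memory-block-1
  capacity: 1
# ...repeated for every block of every GPU on the node
devices:
- name: gpu-0-mig-1g.5gb-0
  sharedCapacityConsumed:
  - name: gpu-0-memory
    capacity: 5Gi
  - name: gpu-0-memory-block-0
    capacity: 1
```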

Option 2

  • Similar to Kevin's original proposal, as well as to this
  • No attributes shared among devices
  • Shared capacity for each physical GPU is grouped into named sets, rather than embedding the physical GPU name in the capacity name as in Option 1
  • Each device's consumption then names the specific groups from which it consumes (sketched below)
  • Reapply changes for grouped resources #34
  • YAML for two DGXA100 nodes (each with eight A100 devices)
    • 362k
    • 13525 lines
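
For illustration, a rough sketch of the Option 2 idea (hypothetical field names, not necessarily the exact API in PR #34): capacities are grouped per physical GPU, and each partition names the group it draws from.

```yaml
# Hypothetical sketch only; not the exact prototype API.
sharedCapacityGroups:
- name: gpu-0
  capacities:
  - name: memory
    capacity: 40Gi
  - name: memory-block-0
    capacity: 1
devices:
- name: gpu-0-mig-1g.5gb-0
  sharedCapacityConsumed:
  - group: gpu-0
    capacities:
    - name: memory
      capacity: 5Gi
    - name: memory-block-0
      capacity: 1
```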

Option 3

  • Like option 1, but shares common attributes across all devices, overlaid by device-specific ones (sketched below)
  • Partitionable with common attributes #30
  • YAML for two DGXA100 nodes (each with eight A100 devices)
    • 205k
    • 7695 lines
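
For illustration, a rough sketch of the Option 3 overlay idea (hypothetical field names and values): attributes that are identical across all devices appear once, and each device lists only what differs.

```yaml
# Hypothetical sketch only; not the exact prototype API.
commonAttributes:
- name: driverVersion
  string: "550.54.15"        # illustrative value
- name: productName
  string: "NVIDIA A100-SXM4-40GB"
devices:
- name: gpu-0
  attributes:                # overlays the common set
  - name: uuid
    string: "GPU-0"          # illustrative value
```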

Option 4

  • Takes option 3 a step further and defines a common shape for devices, not just common attributes (sketched below)
  • In the non-partitioned case, this amounts to common attributes, so it also compacts things a lot in that case
  • Partitionable model with a common shared shape #31
  • YAML for two DGXA100 nodes (each with eight A100 devices)
    • 29k
    • 1121 lines
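
For illustration, a rough sketch of the Option 4 idea (hypothetical field names, not necessarily the exact API in PR #31): a single shared shape describes the capacities and partitions every physical GPU supports, and each GPU instance just references it.

```yaml
# Hypothetical sketch only; not the exact prototype API.
commonShape:
  sharedCapacity:
  - name: memory
    capacity: 40Gi
  partitions:
  - name: mig-1g.5gb
    sharedCapacityConsumed:
    - name: memory
      capacity: 5Gi
  # ...each MIG profile listed once, not once per GPU
devices:
- name: gpu-0               # instantiates the common shape
- name: gpu-1
```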

Option 5

  • Takes option 4 a step further and allows partitions to be algorithmically generated inside the common shape (addresses the issue of many partitions, not just many devices, a la Mrunal's question of "100 memory blocks"); sketched below
  • Partitionable model with generated partitions #32
  • Complex, but just as expressive as enumeration, because you can still enumerate with this API (though the result is very ugly)
  • The example YAML in this one is hand-crafted, rather than generated from the mock dgxa100.
  • If we do get to 10x the number of memory blocks, and therefore hundreds of possible partitions, it could be much better than explicit enumeration.
  • YAML for two DGXA100 nodes (each with eight A100 devices)
    • 19k
    • 780 lines
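
For illustration, a rough sketch of the Option 5 idea (hypothetical field names, not necessarily the exact API in PR #32): rather than enumerating every partition inside the shape, the shape carries a rule for generating them.

```yaml
# Hypothetical sketch only; not the exact prototype API.
commonShape:
  sharedCapacity:
  - name: memory-blocks
    capacity: 8
  generatedPartitions:
  - namePattern: mig-1g.5gb-%d   # would expand to mig-1g.5gb-0 ... mig-1g.5gb-6
    count: 7
    sharedCapacityConsumed:
    - name: memory-blocks
      capacity: 1
devices:
- name: gpu-0
- name: gpu-1
```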

Option 6

  • A slightly different take, similar to option 4, but flattens the shape into a list of "partition templates" and then references each template from the explicitly listed set of available partitions in the devices list (sketched below)
  • WIP: Add a POC of an alternate partitioning scheme #35
  • YAML for two DGXA100 nodes (each with eight A100 devices)
    • TBD
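
For illustration, a rough sketch of the Option 6 idea (hypothetical field names, not necessarily the exact API in PR #35): partition templates are defined once in a flat list, and each device explicitly lists its available partitions by template reference.

```yaml
# Hypothetical sketch only; not the exact prototype API.
partitionTemplates:
- name: mig-1g.5gb
  sharedCapacityConsumed:
  - name: memory-block
    capacity: 1
devices:
- name: gpu-0
  partitions:
  - name: gpu-0-mig-1g.5gb-0
    template: mig-1g.5gb
  - name: gpu-0-mig-1g.5gb-1
    template: mig-1g.5gb
```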

Option 7

  • A merging of options 4 and 6 that pushes everything that is invariant across nodes into its own object and refers to it from the slice (sketched below). This avoids objects that grow non-linearly as the number of partitions per device increases.
  • See WIP: Add a POC of an alternate partitioning scheme #35 (comment) for an in-depth discussion.
  • PR TBD
  • YAML for two DGXA100 nodes (each with eight A100 devices)
    • TBD
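
For illustration, a rough sketch of the Option 7 idea (hypothetical kinds and field names; no PR exists yet): the node-invariant shape lives in its own object, and the per-node slice only references it, so the slice stays small regardless of how many partitions each device supports.

```yaml
# Hypothetical sketch only; no corresponding PR yet.
kind: DeviceShape              # hypothetical separate, node-invariant object
metadata:
  name: dgxa100-a100-40gb
spec:
  partitionTemplates:
  - name: mig-1g.5gb
    sharedCapacityConsumed:
    - name: memory-block
      capacity: 1
---
kind: ResourceSlice
spec:
  devices:
  - name: gpu-0
    shapeRef: dgxa100-a100-40gb
  - name: gpu-1
    shapeRef: dgxa100-a100-40gb
```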

johnbelamaric commented

FYI, I fixed the accidental merge of the wrong PR and merged Option 1, which matches the KEP (except for the ResourcePool -> ResourceSlice naming).

I then rebased all the other PRs on top of that, so it's easier to see the deltas between the KEP and each of the options 1-4.

k8s-triage-robot commented

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Sep 15, 2024