@vsmalladi
Contributor

Added capabilities endpoint to retrieve TES backend features. Closes #206
Closes #188

@vsmalladi
Contributor Author

The response would look like this:

{
  "cpu_cores": 32,
  "ram_gb": 256,
  "disk_gb": 2000,
  "num_machines": 8,
  "accelerators": [
    {
      "type": "GPU",
      "count": 4
    },
    {
      "type": "TPU",
      "count": 2
    }
  ],
  "chip_type": "x86_64",
  "fpga": true
}
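
For anyone consuming this on the client side, here is a minimal sketch of reading the proposed response into typed objects. The field names come from the example above; the class names `Accelerator` and `Capabilities` and the helper `parse_capabilities` are hypothetical and not part of the TES specification or of this PR.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Accelerator:
    # One entry of the "accelerators" array in the example response.
    type: str
    count: int


@dataclass
class Capabilities:
    # Hypothetical client-side model of the proposed response body.
    cpu_cores: int
    ram_gb: float
    disk_gb: float
    num_machines: int
    chip_type: str
    fpga: bool = False
    accelerators: List[Accelerator] = field(default_factory=list)


def parse_capabilities(payload: str) -> Capabilities:
    """Parse a JSON document shaped like the example response."""
    raw = json.loads(payload)
    return Capabilities(
        cpu_cores=raw["cpu_cores"],
        ram_gb=raw["ram_gb"],
        disk_gb=raw["disk_gb"],
        num_machines=raw["num_machines"],
        chip_type=raw["chip_type"],
        fpga=raw.get("fpga", False),
        accelerators=[Accelerator(**a) for a in raw.get("accelerators", [])],
    )


# The example response from the comment above, copied verbatim.
EXAMPLE = """
{
  "cpu_cores": 32,
  "ram_gb": 256,
  "disk_gb": 2000,
  "num_machines": 8,
  "accelerators": [
    {"type": "GPU", "count": 4},
    {"type": "TPU", "count": 2}
  ],
  "chip_type": "x86_64",
  "fpga": true
}
"""

caps = parse_capabilities(EXAMPLE)
# Sum the counts of all entries whose type is "GPU".
gpu_count = sum(a.count for a in caps.accelerators if a.type == "GPU")
```

Treating `accelerators` and `fpga` as optional is a choice made for this sketch only; the PR does not say which fields are required.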

@vsmalladi
Contributor Author

Add Docker or other container resources.

@andrew-nimbus

I also wonder whether an initial version of the capability should focus on providing GPU metadata, in addition to a baseline CPU/disk/memory inventory. One would probably want things like the vendor, model, GPU memory per device, driver type and version, GPU execution model (e.g., exclusive vs. partitioned), runtime version, and the types of containers supported (which you note in your comment; indicating Singularity support would likely be especially useful), right? I'm thinking about how irritating it can be to match a container to an accelerator environment.

Along these lines, do we want to expand the GPU metadata (potentially targeting an eventual PoC implementation) and drop the FPGA flag for the moment? It's unclear to me how to use this boolean in practice, but I've never used FPGA accelerators before.
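
To make the suggestion concrete, here is a rough sketch of what an expanded GPU entry could carry. Every field and class name below is hypothetical, derived only from the list of attributes in this comment, and all values are made-up placeholders rather than real hardware or driver data.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class GpuInfo:
    # Hypothetical fields mirroring the attributes listed in the comment.
    vendor: str
    model: str
    count: int
    memory_gb_per_device: float
    driver_type: str
    driver_version: str
    execution_model: str  # e.g. "exclusive" or "partitioned"
    runtime_version: str


@dataclass
class AcceleratorCapabilities:
    # GPU inventory plus the container runtimes the backend supports.
    gpus: List[GpuInfo] = field(default_factory=list)
    container_runtimes: List[str] = field(default_factory=list)


# Placeholder values only; none of these describe a real device.
example = AcceleratorCapabilities(
    gpus=[
        GpuInfo(
            vendor="example-vendor",
            model="example-model",
            count=4,
            memory_gb_per_device=40.0,
            driver_type="example-driver",
            driver_version="0.0.0",
            execution_model="exclusive",
            runtime_version="0.0.0",
        )
    ],
    container_runtimes=["docker", "singularity"],
)

# asdict() recursively converts nested dataclasses into plain dicts,
# giving a structure that can be serialized as JSON.
payload = asdict(example)
```

Whether container runtimes belong next to the GPU list or at the top level of the response is an open design question; they are grouped here only to keep the sketch short.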

@andrew-nimbus

On another note, we'll want to make clear in the updated documentation that the initial goal here is to help jobs run on the hardware (especially accelerators); scheduling and optimization are for later.

That is, this is a static inventory (e.g., "is there a GPU accessible to the TES endpoint?"), not a view of current status (e.g., "is the GPU available now?"). Is the implicit assumption that the TES server is effectively managing current state via its set of workers at the moment? I think this is perfectly reasonable, but it would be worth being explicit about it in any updated documentation.
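
As an illustration of the static-inventory reading, a client-side check might look like the sketch below. The function name `hardware_present` is hypothetical, and the inventory dict simply reuses the `accelerators` shape from the example response earlier in this thread. The check answers only "does the backend list this hardware?" and deliberately says nothing about whether it is free right now.

```python
from typing import Any, Dict


def hardware_present(
    inventory: Dict[str, Any], accelerator_type: str, min_count: int = 1
) -> bool:
    """Return True if the static inventory lists at least `min_count`
    accelerators of the given type.

    This is an inventory check only: it does not tell the caller
    whether any of those devices are currently available.
    """
    total = sum(
        entry.get("count", 0)
        for entry in inventory.get("accelerators", [])
        if entry.get("type") == accelerator_type
    )
    return total >= min_count


# Same accelerator list as the example response in this thread.
inventory = {
    "accelerators": [
        {"type": "GPU", "count": 4},
        {"type": "TPU", "count": 2},
    ]
}

has_four_gpus = hardware_present(inventory, "GPU", min_count=4)
has_five_gpus = hardware_present(inventory, "GPU", min_count=5)
```

Under this reading, current availability stays with the TES server and its workers, as the question above assumes.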


Development

Successfully merging this pull request may close these issues.

Add a capabilities endpoint. How do we define the technical capabilities of a given TES API
