[WRK-814] Client side impl Sandbox.resource_usage() #2966


Open · wants to merge 1 commit into main

Conversation

thundergolfer
Contributor

Describe your changes

  • WRK-814

Check these boxes or delete any item (or this section) if not relevant for this PR.

  • Client+Server: this change is compatible with old servers
  • Client forward compatibility: this change ensures client can accept data intended for later versions of itself

Note on protobuf: protobuf message changes in one place may impact multiple
entities (client, server, worker, database). See the points above.


Changelog

  • Adds new Sandbox.resource_usage() method to retrieve billable resource usage (CPU, RAM, GPU time).
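A hypothetical sketch of what the returned usage object might look like, based only on the field names visible in the review snippets quoted later in this thread (`cpu_core_nanosecs`, `mem_gib_nanosecs`, `gpu_nanosecs`); the class name, constructor, and values are illustrative assumptions, not the PR's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical mirror of the dataclass under review. The field names come from
# the diff snippets quoted in this thread; everything else is illustrative.
@dataclass
class ResourceUsage:
    cpu_core_nanosecs: int  # 1 core for 1 second = 1e9 core-ns
    mem_gib_nanosecs: int   # 1 GiB for 1 second = 1e9 GiB-ns
    gpu_nanosecs: int       # GPU time in ns; pair with the GPU type for billing

# e.g. a sandbox that ran 2 core-seconds of CPU and 0.5 GiB-seconds of memory:
usage = ResourceUsage(
    cpu_core_nanosecs=2 * 10**9,
    mem_gib_nanosecs=5 * 10**8,
    gpu_nanosecs=0,
)
print(usage.cpu_core_nanosecs / 1e9)  # core-seconds: 2.0
```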

@thundergolfer force-pushed the jonathon/wrk-814-add-client-api-for-sandboxresource_usage branch 4 times, most recently from c3df351 to 94a1ed7, on March 21, 2025 at 15:47
@thundergolfer requested review from mwaskom and gongy on March 21, 2025 at 15:47
@thundergolfer force-pushed the jonathon/wrk-814-add-client-api-for-sandboxresource_usage branch from 94a1ed7 to 71747b7 on March 21, 2025 at 16:10

@devennavani left a comment:


LGTM. It might be good to clarify in the comments/docstring that the returned CPU and memory values are max(used, requested)

# Memory usage is in gibibyte-nanoseconds. Using 1 GiB for 1 second is 1e9 GiB-ns.
# Using 0.5 GiB for 1 second is 5e8 GiB-ns.
mem_gib_nanosecs: int
# GPU type, e.g. "a10g". Combine with `gpu_nanosecs` to understand billable GPU spend.
A reviewer commented:

Currently we have a many-to-one mapping for some GPU types in configuration (e.g. we accept "A10" and "A10g", plus we're case-insensitive, etc.). How are we defining the schema for GPU type here? We've broken our own internal tooling recently with some migrations; are we now obligated to keep the current schema fixed going forward?
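To make the many-to-one concern concrete, here is an illustrative sketch of case-insensitive alias normalization; the alias table and canonical names below are assumptions for illustration, not Modal's real backend map:

```python
# Hypothetical GPU-type normalization: many accepted spellings map to one
# canonical name. The aliases here are illustrative, not the real config.
_GPU_ALIASES = {
    "a10": "a10g",
    "a10g": "a10g",
    "t4": "t4",
}

def canonical_gpu_type(user_input: str) -> str:
    # Configuration is case-insensitive, so lower-case before lookup.
    key = user_input.lower()
    if key not in _GPU_ALIASES:
        raise ValueError(f"unknown GPU type: {user_input!r}")
    return _GPU_ALIASES[key]

print(canonical_gpu_type("A10"))   # a10g
print(canonical_gpu_type("A10g"))  # a10g
```

The question in the thread is which side of this mapping (the user's spelling or the canonical name) the new field would report, and whether that choice is then frozen.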

@thundergolfer replied:

Hmm, I think it's possible we do modify them. Unlikely but possible.

I'm inclined to drop the field then and make the user maintain a mapping from sandbox ID -> their own GPU type.

A reviewer replied:

We have a canonical map in the backend, could just use that one. But I'm also fine not exposing it here.

Another reviewer replied:

That makes sense to me in general, but do sandboxes have GPU fallbacks? Or can you only request one GPU type?


# CPU usage is in core-nanoseconds. Using 1 core for 1 second is 1e9 core-ns.
# Using 0.5 core for 1 second is 5e8 core-ns.
cpu_core_nanosecs: int
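The unit arithmetic in those comments can be sketched as simple conversion helpers; the function names here are illustrative, not part of the PR:

```python
NANOSECS_PER_SEC = 1_000_000_000

def core_seconds(cpu_core_nanosecs: int) -> float:
    """Convert core-nanoseconds to core-seconds."""
    return cpu_core_nanosecs / NANOSECS_PER_SEC

def gib_seconds(mem_gib_nanosecs: int) -> float:
    """Convert gibibyte-nanoseconds to GiB-seconds."""
    return mem_gib_nanosecs / NANOSECS_PER_SEC

# 0.5 core for 1 second is 5e8 core-ns, matching the comment above:
print(core_seconds(5 * 10**8))  # 0.5
```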
A reviewer commented:

Random, ignorable suggestion, but nanosecs is super non-human-friendly; maybe the dataclass could have a method that presents the underlying data in a more human-readable format, to save people some hassle when they're using it interactively or printing it.

@thundergolfer replied:

Yep good idea, I can do a nice repr.

@thundergolfer (Mar 26, 2025):

Hmm, maybe not repr; I like this documented behavior (from the Python docs for repr()):

> Return a string containing a printable representation of an object. For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval();

@mwaskom (Mar 26, 2025):

Personally I've never thought that notion of repr() was very useful, but yeah, I think you could do it with __str__ instead of __repr__, or some named method that returns a string. I'd probably try to format the CPU/RAM/GPU usage with SI prefixes so it's readable across multiple orders of magnitude.
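The __str__-with-SI-prefixes idea above could look something like this minimal sketch; the class, helper, and formatting choices are assumptions, not the PR's code (note __repr__ is left alone so it stays eval()-friendly, as discussed):

```python
def _si(value: float, unit: str) -> str:
    """Format a value with an SI prefix (illustrative sketch)."""
    for prefix, scale in (("G", 1e9), ("M", 1e6), ("k", 1e3)):
        if abs(value) >= scale:
            return f"{value / scale:.2f} {prefix}{unit}"
    return f"{value:.2f} {unit}"

class ResourceUsage:
    def __init__(self, cpu_core_nanosecs: int, mem_gib_nanosecs: int):
        self.cpu_core_nanosecs = cpu_core_nanosecs
        self.mem_gib_nanosecs = mem_gib_nanosecs

    def __str__(self) -> str:
        # Human-readable summary; __repr__ is untouched and stays precise.
        return (
            f"cpu: {_si(self.cpu_core_nanosecs, 'core-ns')}, "
            f"mem: {_si(self.mem_gib_nanosecs, 'GiB-ns')}"
        )

print(str(ResourceUsage(2 * 10**9, 5 * 10**8)))
# cpu: 2.00 Gcore-ns, mem: 500.00 MGiB-ns
```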

@mwaskom commented (Mar 25, 2025):

> LGTM. It might be good to clarify in the comments/docstring that the returned CPU and memory values are max(used, requested)

Oh, missed that, as I haven't looked closely at the backend PR; but I'd argue that maybe suggests a different name, not just documentation (which is also good to add).

@mwaskom commented (Mar 25, 2025):

I think different people might really want to know both, though (what did it actually use, e.g. for profiling, and what did we bill for, e.g. for passing on costs).

@mwaskom commented (Mar 25, 2025):

And ideally we wouldn’t expose multiple variants of similar information through multiple different ad hoc methods :/. Maybe an as_billed: bool parameter or something?

@thundergolfer commented (Mar 26, 2025):

> Maybe an as_billed: bool parameter or something?

Like resource_usage(raw: bool = True)? If raw == True, pass through the usage; otherwise apply the reservation and regional modifiers.

Hmm, I don't love it. I didn't realize that we applied compute resource multipliers because of region scheduling. I'd otherwise be OK with passing through the reserved value as-is, because I don't think this endpoint can ever serve as profiling data.

I'll leave this for a bit and think about it.

@mwaskom commented (Mar 26, 2025):

> Like resource_usage(raw: bool = True)? If raw==True then pass through the usage, otherwise apply the reservation and regional modifiers.

I am just suggesting that we allow users to query used or max(used, reserved). I don't think this endpoint should reflect any modifiers which I see as a sort of implementation detail.

> I'd otherwise be OK with passing through the reserved value as-is,

If you do that, IMO it's mandatory to name the method differently; how can we describe the billing as applying to max(used, reserved) and then call the resulting value "usage"?

> I don't think this endpoint can ever serve as profiling data

Why not? I'd certainly try to use it for something like that based on how it's currently presented, and I expect other users would too.
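The used-versus-billed distinction being debated above can be sketched in one line; the function name and values are hypothetical, used only to show why "usage" is a misleading name for the billed quantity:

```python
# Sketch of the distinction discussed in this thread: what a sandbox actually
# used, versus the billed quantity max(used, reserved). Names are hypothetical.
def billed_core_nanosecs(used_ns: int, reserved_ns: int) -> int:
    # Billing charges for at least the reservation.
    return max(used_ns, reserved_ns)

# A sandbox that used 0.3 core-seconds against a 1 core-second reservation
# is billed for the full reservation, so "usage" and "billed" diverge:
print(billed_core_nanosecs(used_ns=3 * 10**8, reserved_ns=10**9))  # 1000000000
```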
