
Use Antigravity in README #4

Merged
Benjamin Elder (BenTheElder) merged 1 commit into main from agy
May 20, 2026

@rakyll

No description provided.

LGTM

@BenTheElder Benjamin Elder (BenTheElder) merged commit e24f170 into main May 20, 2026
4 checks passed
@thockin Tim Hockin (thockin) deleted the agy branch May 20, 2026 00:43
@BenTheElder Benjamin Elder (BenTheElder) added the kind/docs Improvements or additions to documentation label May 21, 2026
Davanum Srinivas (dims) added a commit to dims/substrate that referenced this pull request May 27, 2026
…A buffer

The original feat/gpu-passthrough commit (c358dff) wired the CRD,
proto and runsc flags but the demo only got as far as golden actor
Run + Checkpoint; user actor Restore failed with `inconsistent
private memory files on restore: savedMFOwners=[pause:/]` and the
CUDA buffer in the workload was never observed to survive a
substrate suspend/resume cycle.

This commit lands the five additional fixes the demo needed on the
H100 brev box `front-emerald-krill` (driver 570.195.03, gVisor
nightly 2026-05-26). With these, a 1 MiB CUDA buffer set via
cuMemsetD8_v2 to byte 0x63 reads back at the same dev_ptr after a
`kubectl ate suspend` + idle + `kubectl ate resume` cycle.

1. cmd/atelet/oci.go: add spec.Linux.Resources.Devices allow entries
   for every nvidia char-device. Without these the OCI bundle gives
   nvproxy the path but the host's cgroup eBPF device filter denies
   ioctl access in the sandbox boot path.

2. cmd/atelet/main.go: pass `firstGpuSpec(...)` to the pause
   container's prepareOCIDirectory too. Previously only the
   supervisor sub-container got --nvproxy via its OCI spec; runsc
   create pause launched the sandbox kernel with nvproxy disabled
   (`--dev-io-fd=-1` in the runsc debug log), so the dev gofer was
   never wired up and supervisor sub-container ioctls failed inside
   the sandbox with `nvproxy: failed to open device gofer nvidiactl:
   devutil.CtxDevGoferClient is not set`.

3. cmd/atelet/oci.go: bind-mount cuda-checkpoint and
   cuda-checkpoint-wrapper.sh from /run/ateom-gvisor/static-files
   (the shared HostPath volume) into /usr/local/bin inside the
   sandbox, falling back to /usr/local/bin on the atelet host.
   atelet runs inside the kind-control-plane container which doesn't
   have /usr/local/bin/cuda-checkpoint, so the previous os.Stat
   silently skipped both mounts.

4. cmd/ateom-gvisor/runsc.go + main.go: add cmdDrainCUDA and
   cmdUntoggleCUDA helpers that run `runsc exec supervisor
   /usr/local/bin/cuda-checkpoint --toggle --pid 1` before
   CheckpointWorkload and after RestoreWorkload, respectively.
   gVisor's --save-restore-exec-argv flag runs the binary inside the
   container being checkpointed (pause for substrate's root
   sandbox), but pause is the k8s pause image — distroless,
   no /bin/sh — so wrapper scripts with #!/bin/sh shebangs fail with
   `failed to load /usr/local/bin/cuda-checkpoint-wrapper.sh:
   no such file or directory`. Running cuda-checkpoint in the
   supervisor sub-container instead works because libcuda is there
   and the supervisor's PID 1 is the workload Python process.

5. cmd/ateom-gvisor/runsc.go: gpuSaveRestoreFlags now returns nil, and
   its comment explains why. The previous comment claimed that nvproxy
   auto-registers; on the gVisor versions we use it does not (there is
   no auto-registration code anywhere in the source), and explicit
   registration via the CLI flag conflicts with the external drain
   added in item 4 above.

Empirical demo trace (front-emerald-krill, 2026-05-27 15:42 UTC):

    BEAT3   /set?val=99      → {"ok": true, "val": 99}
            /sum             → {"sum": 405504, "sample": 99, ...}
            /info            → {"dev_ptr": "0x7fe846600000", ...}

    BEAT4   kubectl ate suspend actor gpu1   → STATUS_SUSPENDED
    BEAT5   5 s idle
    BEAT6   kubectl ate resume  actor gpu1   → STATUS_RUNNING
            /info            → {"dev_ptr": "0x7fe846600000", ...}
                              ^^^ same address — CUDA context restored
            /sum             → {"sum": 405504, "sample": 99, ...}
                              ^^^ same data  — buffer survived suspend
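
As a sanity check on the numbers in the trace: 0x63 is 99 decimal, and the reported sum 405504 equals 99 × 4096. That would be consistent with /sum adding up a 4096-byte window of the buffer rather than the full 1 MiB, but the window size is an inference from the arithmetic, not something the trace states. A small sketch of the calculation, with hypothetical helper names:

```go
package main

import "fmt"

// sumBytes adds up every byte in buf as an unsigned integer.
func sumBytes(buf []byte) uint64 {
	var total uint64
	for _, b := range buf {
		total += uint64(b)
	}
	return total
}

// filled returns an n-byte slice with every byte set to val.
func filled(n int, val byte) []byte {
	buf := make([]byte, n)
	for i := range buf {
		buf[i] = val
	}
	return buf
}

func main() {
	const val = 0x63 // 99 decimal, the value written by /set?val=99

	// Assumption: /sum covers 4096 bytes; this reproduces the trace value.
	fmt.Println(sumBytes(filled(4096, val))) // 405504

	// For comparison, summing the full 1 MiB buffer would give a larger value.
	fmt.Println(sumBytes(filled(1<<20, val))) // 103809024
}
```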

Two operational notes for the gpu-counter demo (live in the openshell
driver repo):
- the workload image must bake the host's `libcuda.so.<host-driver>`;
  on kind there is no `nvidia-container-cli configure` hook to inject
  it from the host. The 580.x libcuda from the nvidia/cuda:12.6 base
  is rejected by nvproxy 570 with cuInit=NO_DEVICE.
- the runsc binary substrate uses must be the 2026-05-26 nightly or
  later; the release-20260520.0 tag has a multi-container nvproxy
  dev-gofer bug that returns cuInit=NO_DEVICE inside the supervisor
  sub-container even when pause has --nvproxy.

Companion notes:
  - notes/openshell-on-substrate/2026-05-27-gpu-passthrough-impl-log.md
  - notes/openshell-on-substrate/2026-05-25-gpu-passthrough-analysis.md
Eitan Yarmush (EItanya) added a commit to EItanya/substrate that referenced this pull request Jun 2, 2026
Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
Co-authored-by: Peter Jausovec <peter.jausovec@solo.io>
Nina Polshakova (npolshakova) pushed a commit to npolshakova/substrate that referenced this pull request Jun 11, 2026
* enable websockets (agent-substrate#4)

Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
Co-authored-by: Peter Jausovec <peter.jausovec@solo.io>

* feat: allow running with vanilla k8s

- add a helm chart
- allow JWT auth instead of mTLS

* update helm chart images

* fix rbac. note that JWT verification is not cached and might not work on some k8s distributions that do not expose the JWKS

* fix: add chart boilerplate headers

* fix: support jwt helm install on plain kind

* feat: add substrate crds helm chart

* feat: make jwt helm installs standalone

* fix: make helm defaults cloud-neutral

* fix: sync crd chart templates

* fix: use agentgateway in helm chart

* fix: update agentgateway install overlays

* fix: project agentgateway tls key separately

---------

Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Peter Jausovec <peter.jausovec@solo.io>
Jonathan Jamroga (jjamroga) pushed a commit to jjamroga/substrate that referenced this pull request Jun 30, 2026
* enable websockets (agent-substrate#4)

Jonathan Jamroga (jjamroga) pushed a commit to jjamroga/substrate that referenced this pull request Jul 1, 2026
* enable websockets (agent-substrate#4)
