feat(dash,pve): MemoryMap redesign — GTT-first, Proxmox-aware, sidebar + hardware page#370

Merged
thinmintdev merged 16 commits into main from feat/memory-map-redesign
May 28, 2026

@thinmintdev
Contributor

Summary

Rewrite of the dashboard's Memory map so it works correctly on Strix Halo UMA and surfaces Proxmox host pressure.

Before: read only /api/hardware (RAM total/used) + summed each slot's self-reported metrics.mem. On UMA the model bytes live in GTT, not RAM — a 6 GB ROCm slot moved the bar by ~0 GB. Silent on Proxmox host pressure.

After: single <MemoryMap variant="sidebar|expanded" /> driven by useMemoryMapModel. GTT-first attribution. Proxmox auto-detected when /etc/hal0/proxmox.json is missing. Two-tier host bar + tenants legend on the hardware page when configured.

What changed

Backend (pve / api):

  • pve.detect_proxmox_host() — best-effort LXC-on-PVE detection from /proc/version + /proc/1/cgroup. Never raises.
  • /api/stats/hardware host block now carries {detected, detection, hint} when unconfigured-but-detected. Configured path unchanged.
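
A minimal sketch of the unconfigured host block described above, assuming the detection state arrives as a lowercase string. The function name, the placeholder hint text, and the choice to still emit `detected`/`detection` when nothing is detected are assumptions for illustration; the real string lives in the backend's `_PVE_CONFIGURE_HINT`.

```python
from typing import Any

# Placeholder only: the actual wording is whatever _PVE_CONFIGURE_HINT holds.
PVE_CONFIGURE_HINT = "Proxmox host suspected; add /etc/hal0/proxmox.json to see host memory."


def build_unconfigured_host_block(detection: str) -> dict[str, Any]:
    """Build the `host` block for the case where proxmox.json is missing.

    `detection` is assumed to be 'detected', 'uncertain', or 'not_detected'.
    Both 'detected' and 'uncertain' produce detected=True, so the UI nudges
    identically; only the `detection` field tells them apart.
    """
    detected = detection in ("detected", "uncertain")
    block: dict[str, Any] = {
        "configured": False,  # unchanged key, so older clients see no difference
        "detected": detected,
        "detection": detection,
    }
    if detected:
        block["hint"] = PVE_CONFIGURE_HINT
    return block
```

With `detection="uncertain"` this yields the same shape as the live smoke result in the test plan: `configured: false`, `detected: true`, `detection: "uncertain"`, plus a hint.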

Frontend (UI):

  • New hooks: useStatsHardware() (2.5s, live counters incl. gtt_used_mb, npu_status.model_mb, host.*) and useProxmoxSettings() (10s, full tenants[] for the expanded legend that the slim stats endpoint strips).
  • New component ui/src/dash/memory-map.jsx: useMemoryMapModel (all attribution math) + MemoryMap renderer (sidebar + expanded variants).
  • Replaces inline MemoryMap in dashboard.jsx, swaps both call-sites in slots.jsx, adds HardwareMemorySection between HardwareSection and the side cards.

Design choices (settled in spec):

  • Per-slot attribution: NPU evenly split from npu_status.model_mb; GPU shares from gtt_used_mb weighted by metrics.mem; CPU self-reported. ≈ marker only when sharing a pool.
  • 2 GB safety margin baked into headroom.
  • Self-LXC filtered from tenants legend by hostname match against Hardware.name — avoids double-counting against selfShareGb.
  • Backend _PVE_CONFIGURE_HINT constant is single source of truth — tests import it.
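
The per-slot attribution rule can be sketched as follows. The real math lives in the `useMemoryMapModel` hook in the JSX component; this Python version only illustrates the arithmetic. The `Slot` shape, the `device` field, the assumption that `metrics.mem` is in GB, the 1024 MB-per-GB factor, and the even-split fallback when no GPU slot reports a weight are all hypothetical.

```python
from dataclasses import dataclass

MB_PER_GB = 1024  # assumed conversion factor


@dataclass
class Slot:
    name: str
    device: str             # "npu", "gpu", or "cpu" (hypothetical field)
    reported_mem_gb: float  # the slot's self-reported metrics.mem, assumed GB


def attribute(slots: list[Slot], gtt_used_mb: float, npu_model_mb: float):
    """Return {slot name: (share_gb, approximate)}.

    `approximate` mirrors the ≈ marker: True only when the slot shares
    its pool with at least one other slot.
    """
    npu = [s for s in slots if s.device == "npu"]
    gpu = [s for s in slots if s.device == "gpu"]
    shares: dict[str, tuple[float, bool]] = {}

    # NPU: split the loaded model bytes evenly across NPU slots.
    for s in npu:
        shares[s.name] = (npu_model_mb / MB_PER_GB / len(npu), len(npu) > 1)

    # GPU: split measured GTT usage, weighted by each slot's self-report.
    total_weight = sum(s.reported_mem_gb for s in gpu)
    for s in gpu:
        weight = s.reported_mem_gb / total_weight if total_weight > 0 else 1 / len(gpu)
        shares[s.name] = (gtt_used_mb / MB_PER_GB * weight, len(gpu) > 1)

    # CPU: trust the self-reported figure as-is.
    for s in slots:
        if s.device == "cpu":
            shares[s.name] = (s.reported_mem_gb, False)
    return shares
```

The point of weighting against `gtt_used_mb` rather than summing self-reports is that the total always matches what the GTT pool actually holds, which is where model bytes live on UMA.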

Full spec: docs/superpowers/specs/2026-05-28-memory-map-redesign-design.md
Plan: docs/superpowers/plans/2026-05-28-memory-map-redesign.md

Test plan

  • uv run pytest tests/hardware tests/api -v — 488 passed, 3 skipped (pre-existing).
  • npx playwright test --reporter=line — 81 passed, 16 skipped (all 16 skips pre-existing and unrelated).
  • npx playwright test memory-map-v3 — 6/6 new specs PASS, covering off / detected_unconfigured / configured / pool-limited / host-limited / expanded variant.
  • Live smoke on hal0 LXC (10.0.1.142) — branch checked out, pip install -e . + systemctl restart hal0-api. /api/stats/hardware returns host: {configured:false, detected:true, detection:"uncertain", hint:"..."}; dashboard renders, sidebar Memory map visible.
  • Live smoke with proxmox.json configured — pending real PVE token; the configured-mode rendering hasn't been exercised against a live cluster yet. Mock-driven Playwright spec covers it.

Notes / follow-ups (not blocking)

  • Detection on this LXC fires UNCERTAIN, not DETECTED, because /proc/1/cgroup in cgroup-v2 unified mode shows /init.scope rather than /lxc/<vmid>/.... UI behaviour is correct (UNCERTAIN nudges identically to DETECTED). Broaden the cgroup signal to recognise /init.scope + -pve kernel as DETECTED — file as follow-up.
  • cgroup memory.max as a third headroom-binding-constraint candidate (after pool and host). Rare on Strix Halo; deferred.
  • Per-slot memory history sparkline — needs a time-series store; out of scope.
  • Formal --mem-tenant-{1,2,3} palette polish.

🤖 Generated with Claude Code

thinmintdev and others added 16 commits May 28, 2026 04:22
Adds PveDetectionState enum + helper that combines two cheap /proc
signals (kernel '-pve' tag + LXC cgroup shape). Returns DETECTED when
both fire, UNCERTAIN on one, NOT_DETECTED otherwise. Never raises.

Used in PR2 to surface a configure-Proxmox nudge in the memory-map
widget when /etc/hal0/proxmox.json is missing on a hosted LXC.
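
A rough sketch of the two-signal detection described in this commit, not the actual pve.py code. The `proc_root` parameter (added here so the function can be pointed at a fake /proc), the `_read_text` helper, the enum string values other than "uncertain", and the exact substring checks are assumptions.

```python
from enum import Enum
from pathlib import Path


class PveDetectionState(Enum):
    DETECTED = "detected"
    UNCERTAIN = "uncertain"
    NOT_DETECTED = "not_detected"


def _read_text(path: Path) -> str:
    """Return the file's text, or '' if it is missing or unreadable."""
    try:
        # errors="replace" keeps binary content from raising a decode error.
        return path.read_text(errors="replace")
    except OSError:
        return ""


def detect_proxmox_host(proc_root: Path = Path("/proc")) -> PveDetectionState:
    """Combine two cheap /proc signals; never raises."""
    kernel_is_pve = "-pve" in _read_text(proc_root / "version")
    cgroup_is_lxc = "/lxc/" in _read_text(proc_root / "1" / "cgroup")

    hits = int(kernel_is_pve) + int(cgroup_is_lxc)
    if hits == 2:
        return PveDetectionState.DETECTED
    if hits == 1:
        return PveDetectionState.UNCERTAIN
    return PveDetectionState.NOT_DETECTED
```

Under this sketch, a cgroup-v2 container whose `/proc/1/cgroup` shows `/init.scope` matches only the kernel signal and so reports UNCERTAIN.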

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Export detect_proxmox_host + PveDetectionState in __all__.
- Add test_never_raises_on_permission_error covering the OSError
  branch that the prior test's binary-content path didn't reach.
- Correct docstring wording (signals are 'strong' + 'medium', not
  'two strong'); replaced with neutral 'both signals present'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When /etc/hal0/proxmox.json is missing, the host block now carries
`detected: true/false` (+ a one-line hint when DETECTED) so the
dashboard MemoryMap can render a non-blocking 'Configure Proxmox →'
band instead of staying silent on hosted LXCs.

Shape stays backwards-compatible: `host.configured: false` still
holds; old clients that don't read the new keys see no change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the third test in TestHostDetectionInStatsHardware confirming the
configured: true pre-detection branch is untouched by the new code.
Also guards against accidental detect_proxmox_host() calls in the
configured path (raises AssertionError if reached).

Closes spec-reviewer gap noted on bef248a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- UNCERTAIN now also produces detected:true (matches pve.py docstring
  intent — UI nudges on both DETECTED and UNCERTAIN; only the
  detection field distinguishes them).
- Rename intermediate dict to host_block to avoid reusing 'slim' for
  two semantically different shapes.
- Extract _PVE_CONFIGURE_HINT constant so the test asserts against the
  same string the route emits.
- Add UNCERTAIN integration test (4th in TestHostDetectionInStatsHardware).
- Tighten configured-pass-through test to assert == project_slim(full)
  so additive regressions on the slim shape are caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the live-counter sibling of useHardware (static probe). Polls at
2.5s and surfaces gtt_used_mb, npu_status, and the host.* Proxmox
block. Consumed by MemoryMap in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove tenants?: from StatsHardwareHost — pve.project_slim() strips
  it before the response reaches /api/stats/hardware. Future expanded
  MemoryMap pulls tenants from /api/settings/proxmox via a separate
  hook (Task 5).
- Add per_upstream + upstream_names to StatsHardware (always emitted
  by the route's response builder).
- Align queryKey with useLemonade convention: ['stats', 'hardware'].

StatsHardwareTenant kept and exported for reuse by the future
settings-shape hook; docblock explains where it actually appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the normalising hook for the MemoryMap component (renderer lands
in Task 6). Fans in /api/hardware (static probe), /api/stats/hardware
(live counters), /api/slots, and /api/settings/proxmox (for tenants[]
which the stats endpoint slims out).

Per-slot attribution: NPU shares from npu_status.model_mb evenly;
GPU shares from gtt_used_mb, weighted by registry footprint when known.
CPU slots self-report. Other RAM = ram_used - sum(cpu shares).

Headroom = min(pool, host) - 2 GB safety margin; labelled by binding
constraint.
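
The headroom rule above, as a small sketch. The function name, the argument names, the tie-break toward "pool", and clamping the result at zero are assumptions; the hook itself is JavaScript, and this only illustrates the arithmetic.

```python
from typing import Optional

SAFETY_MARGIN_GB = 2.0  # the fixed safety margin from the spec


def headroom(pool_free_gb: float, host_free_gb: Optional[float] = None) -> tuple[float, str]:
    """Return (headroom_gb, binding), where binding is 'pool' or 'host'.

    host_free_gb is None when Proxmox is not configured, in which case
    the pool is the only candidate constraint.
    """
    if host_free_gb is not None and host_free_gb < pool_free_gb:
        free, binding = host_free_gb, "host"
    else:
        free, binding = pool_free_gb, "pool"
    # Assumed: never report negative headroom.
    return max(free - SAFETY_MARGIN_GB, 0.0), binding
```

For example, 20 GB free in the pool but only 10 GB free on the host gives 8 GB of headroom labelled "host".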

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- C1: drop ramTotalGb double-conversion — useHardware already
  returns ram.total in GB; the MB_PER_GB round-trip was a misleading
  no-op.
- I1: replace unifiedGb's wrong ram_used_mb fallback with ram_total_mb
  (which /api/stats/hardware already emits); StatsHardware interface
  updated to type the new field.
- I2: include pveSettings.isLoading in the model's loading flag so
  the renderer doesn't flash 'no tenants' on cold page loads.
- M2: delete dead StatsHardwareTenant export (useProxmoxSettings
  defines its own superset shape).
- M1: comment the stats-vs-settings cadence mismatch in the host
  block builder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure render layer over useMemoryMapModel. Sidebar variant matches the
existing side-card chrome; expanded variant uses the .card pattern
with a two-tier bar (host pool + inside-LXC) and a full legend.
Headroom callout names the binding constraint; PVE nudge appears
when detected_unconfigured.

Component lands unwired — Tasks 9-11 swap in the consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix dead PveNudge link: #settings/proxmox -> #settings (real route
  has no sub-path parser; settings view manages internal section).
- Add --mem-tenant-{1,2,3} CSS tokens to :root so the host bar's
  tenant segment renders with the intended amber-grey rather than
  silently falling back to var(--fg-5).
- data-loading attribute on sidebar variant root (was only on
  expanded), so future loading-shimmer rules apply to both.
- Filter the self-LXC out of the expanded variant's tenants legend
  by matching tenant name against Hardware.name (hostname). Avoids
  double-counting hal0's own LXC alongside selfShareGb.
- Move PveNudge inline styles into .memmap-pve-nudge CSS rule.
- Add scoped .memmap .dim / .memmap-expanded .dim utility rule plus
  .memmap-legend-sub spacing — removes inline marginLeft on LegendRow.
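
The self-LXC filter amounts to a name comparison. A minimal sketch, assuming each tenant is a dict with a `name` key and that the match ignores case and surrounding whitespace (the source only says the tenant name is matched against the hostname in Hardware.name):

```python
def filter_self_tenant(tenants: list[dict], hostname: str) -> list[dict]:
    """Drop the tenant whose name equals this machine's hostname.

    hal0's own LXC is already counted via selfShareGb, so listing it again
    in the tenants legend would double-count it.
    """
    me = hostname.strip().lower()
    return [t for t in tenants if str(t.get("name", "")).strip().lower() != me]
```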

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the three host modes (off / detected_unconfigured / configured),
the binding-constraint headroom label (pool / host), and the expanded
variant's host pool + tenants legend.

All describes are .skip pending the wire-up commits (Tasks 9-11) that
mount MemoryMap into the actual dashboard routes. The mocks are in
place; unskip when the consumers land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deletes the inline MemoryMap that lived in dashboard.jsx and imports
the shared component from ./memory-map. The new component reads from
useSlots() / useHardware() / useStatsHardware() directly — no slots
prop needed. Visually equivalent in the off/detected_unconfigured
host modes; configured mode now surfaces host pressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the two <MemoryMap slots={slots} /> call-sites in slots.jsx
(L713, L805) with the shared component. The slots prop is dropped —
the new MemoryMap pulls from useSlots() directly.

Resolves a latent break left by Task 9: dashboard.jsx no longer
exports MemoryMap to window, so slots.jsx's bare reference would
have resolved to the new memory-map.jsx window export by accident.
This makes the dependency explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renders the full-width MemoryMap (host pool + inside-LXC two-tier bar,
sortable legend, headroom callout) below HardwareSection in the main
column of /dashboard. Unskips the memory-map-v3 spec — all six tests
green against the new wire-up.

Fix: headroom selector tests scoped to .memmap-sidebar to avoid strict
mode violation now that both sidebar and expanded variants render
.memmap-headroom on the same page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…describes

The .skip was removed in 43d481c when Task 11 wired the consumers; the
describe titles still carried the gating hint. Cosmetic only — tests
already running green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>