This project has two parts.
Agent: runs on the GPU machines and collects the GPU metrics.
An HTTP service built with FastAPI that exposes real-time NVIDIA GPU metrics and running-process details through simple JSON endpoints. It reads NVML through the nvitop library (instead of shelling out to nvidia-smi), which keeps overhead low and returns structured data. The agent provides a GPU inventory and per-GPU stats (utilization, temperature, power, memory) plus per-process info, with optional token-header authentication and configurable URL prefixes. It is container-friendly and includes examples for Docker and systemd.
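The optional token check, the configurable URL prefix, and the per-GPU JSON shape can be sketched without FastAPI or a GPU. This is a minimal illustration under stated assumptions, not the agent's actual code: the names `GpuStat`, `is_authorized`, `route`, and `gpus_payload`, as well as the field names and units, are hypothetical.

```python
import hmac
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class GpuStat:
    """Hypothetical per-GPU record; field names and units are assumptions."""
    index: int
    name: str
    utilization_pct: float
    temperature_c: float
    power_w: float
    memory_used_mib: int
    memory_total_mib: int


def is_authorized(expected_token: Optional[str], header_value: Optional[str]) -> bool:
    """Authentication is optional: with no token configured, every request passes."""
    if not expected_token:
        return True
    if header_value is None:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(expected_token.encode(), header_value.encode())


def route(prefix: str, path: str) -> str:
    """Join a configurable URL prefix and an endpoint path with single slashes."""
    cleaned = prefix.strip("/")
    base = "/" + cleaned if cleaned else ""
    return base + "/" + path.lstrip("/")


def gpus_payload(stats: List[GpuStat]) -> str:
    """Serialize the GPU inventory as a JSON document."""
    return json.dumps({"count": len(stats), "gpus": [asdict(s) for s in stats]})
```

In a real deployment the `GpuStat` values would come from nvitop's device queries, and `is_authorized` would run as a request dependency in front of each endpoint.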
Dashboard: runs on the web server and displays the GPU metrics.
A Next.js 14 dashboard that aggregates GPU status from multiple servers through a secure proxy API, then visualizes per-server and per-GPU metrics alongside process tables. It supports optional email-code login with JWT, Redis-backed verification and access logging, and runs as a Dockerized standalone build. Designed for responsive, dark-mode-friendly monitoring of clusters, it reads from the deployed agents and presents smooth charts with configurable UI elements.
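The aggregation step can be illustrated as follows. The dashboard itself is a Next.js application, so this Python sketch only mirrors the idea of polling several agents and tolerating an unreachable one; the function `aggregate`, the injected `fetch` callable, and the result keys are hypothetical and do not describe the dashboard's real proxy code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def aggregate(
    servers: Dict[str, str],
    fetch: Callable[[str], List[dict]],
) -> Dict[str, dict]:
    """Query every agent concurrently and collect one status entry per server.

    `servers` maps a display name to an agent URL. `fetch` is injected so the
    HTTP call (and any token header) stays outside this sketch.
    """

    def query(item: Tuple[str, str]) -> Tuple[str, dict]:
        name, url = item
        try:
            return name, {"online": True, "gpus": fetch(url)}
        except Exception as exc:
            # One failing agent must not hide the others from the dashboard.
            return name, {"online": False, "gpus": [], "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        return dict(pool.map(query, servers.items()))
```

Passing `fetch` as a parameter keeps the sketch testable offline; a caller would supply a function that performs the authenticated request against each agent.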