What
Make the MCP server (rezolus mcp) architecture-aware when analyzing recordings. The analysis tools (anomaly detection, correlation, PromQL queries, describe-metrics) currently work on metric values without factoring in the underlying system topology and configuration.
Why
Many diagnostic conclusions are valid only in the context of the host's architecture. Without that context, the MCP server can produce technically correct but operationally misleading analysis.
Examples where architecture changes the interpretation:
"CPU 12 shows high softirq" — is CPU 12 listed in isolcpus? An SMT sibling of a hot CPU? On the wrong NUMA node for the device whose IRQs land there? The same number means three different things.
"Off-CPU time is high in cgroup X" — is the cgroup hitting its cpu.max quota, or is it just IO-bound? Throttling and scheduling pressure look identical without the cgroup config.
"ENA allowance exceeded" — only meaningful on EC2 Nitro instances; on bare metal those counters never increment.
"Steal time is 5%" — the right verdict depends on the platform: alarming on bare metal, often normal on a shared-tenancy VM (for example during a live-migration window).
"L3 miss rate is 30%" — depends on cache hierarchy size, which differs by CPU generation and NUMA layout.
The patterns documented in docs/patterns.md are explicitly architecture-conditional in many cases — that doc is for human operators; the MCP server should be able to apply the same conditioning automatically.
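To make the conditioning in the examples above concrete, here is a minimal sketch of how one observation could be annotated with topology facts. Everything in it is an assumption for illustration: `HostContext`, `softirq_notes`, and `steal_note` are hypothetical names, and the fields are not the schema of the agent's systeminfo capture.

```python
from dataclasses import dataclass, field


@dataclass
class HostContext:
    """Hypothetical topology summary, invented for this sketch."""
    isolated_cpus: set = field(default_factory=set)   # CPUs listed in isolcpus
    smt_sibling: dict = field(default_factory=dict)   # cpu -> sibling cpu
    numa_node: dict = field(default_factory=dict)     # cpu -> NUMA node
    virtualized: bool = False


def softirq_notes(cpu, hot_cpus, device_node, ctx):
    """Return the topology facts that change how high softirq on `cpu` reads."""
    notes = []
    if cpu in ctx.isolated_cpus:
        notes.append("cpu is isolated: softirq work here is unexpected")
    if ctx.smt_sibling.get(cpu) in hot_cpus:
        notes.append("SMT sibling is hot: both threads share one core")
    node = ctx.numa_node.get(cpu)
    if node is not None and node != device_node:
        notes.append("cpu is on a different NUMA node than the device")
    return notes


def steal_note(steal_pct, ctx):
    """Give the same steal percentage a platform-dependent reading."""
    if steal_pct <= 0:
        return "no steal time observed"
    if ctx.virtualized:
        return "steal on a shared-tenancy VM: often normal"
    return "steal on bare metal: unexpected, investigate"


ctx = HostContext(
    isolated_cpus={12},
    smt_sibling={12: 44, 44: 12},
    numa_node={12: 0, 44: 0},
)
for note in softirq_notes(12, hot_cpus={44}, device_node=1, ctx=ctx):
    print(note)
print(steal_note(5.0, ctx))
```

The point of the sketch is only that the metric value stays the same while the attached notes change with the host; where such logic would live is left open, as the issue states.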
Categories of awareness that would help
CPU topology — cores, sockets, SMT siblings, NUMA nodes, cache hierarchy, frequency capabilities.
Cgroup hierarchy and configuration — cpu.max, memory.high/memory.max, cpuset.cpus, parent/child relationships.
Kernel boot parameters and tunables — isolcpus, nohz_full, rcu_nocbs, governor settings.
Kernel and userspace versioning — kernel version (which tracepoints exist), libc, key driver versions.
Cloud / hypervisor context — bare metal vs VM, cloud provider, instance type/family, hypervisor steal-time semantics.
The agent already captures some of this (systeminfo is in parquet metadata per docs/parquet_metadata.md); the gap is in the MCP tools using it consistently when interpreting metrics.
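Several of the settings named above come as small text formats that a tool would need to parse before it can use them. The sketch below parses two of them: a kernel cpulist string (the format used by cpuset.cpus and isolcpus) and the cgroup v2 cpu.max value. The function names are hypothetical, and how these strings would reach the MCP server is not specified here.

```python
def parse_cpulist(text):
    """Parse a kernel cpulist such as "0-3,8,10-11" into a set of CPU ids.

    Assumption: only plain ids and inclusive ranges are handled. The stride
    syntax and any isolcpus flag prefixes are out of scope for this sketch.
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max ("$MAX $PERIOD") into a limit in CPUs.

    Returns None when the quota is "max", meaning no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


print(sorted(parse_cpulist("0-3,8,10-11")))
print(parse_cpu_max("200000 100000"))
print(parse_cpu_max("max 100000"))
```

With values like these in hand, a tool could tell a cgroup capped at two CPUs from an unlimited one before deciding whether high off-CPU time points at throttling.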
Concrete benefits
Anomaly detection that knows "normal" for the host shape rather than treating every host's metrics as i.i.d.
Correlations that respect topology (don't correlate "all CPUs" when the host has heterogeneous CPU pools, e.g. P-cores vs E-cores).
Suggested diagnostic queries that adapt to the platform — different on AWS Nitro vs bare metal vs Azure.
More confident answers to "is this metric value problematic" — the same value can be benign on one host and a SEV on another.
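As an illustration of the first two benefits, the sketch below flags outlier CPUs by comparing each CPU against its own pool instead of against every CPU on the host. It is a deliberately simple z-score check on made-up utilization numbers, not the anomaly detection Rezolus uses; `pool_outliers`, the pool labels, and the threshold are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def pool_outliers(values, pool_of, threshold=1.5):
    """Return CPUs whose value deviates from their own pool's mean by more
    than `threshold` population standard deviations."""
    pools = defaultdict(list)
    for cpu in values:
        pools[pool_of[cpu]].append(cpu)

    flagged = []
    for cpus in pools.values():
        samples = [values[c] for c in cpus]
        mu, sigma = mean(samples), pstdev(samples)
        if sigma == 0:
            continue  # all CPUs in this pool are identical, nothing to flag
        flagged += [c for c in cpus if abs(values[c] - mu) / sigma > threshold]
    return sorted(flagged)


# Made-up utilization (%) on a host with two heterogeneous CPU pools.
util = {0: 90, 1: 92, 2: 91, 3: 89, 4: 30, 5: 31, 6: 29, 7: 60}
pool_of = {0: "P", 1: "P", 2: "P", 3: "P", 4: "E", 5: "E", 6: "E", 7: "E"}

# Treating the host as one pool hides CPU 7; grouping by pool exposes it.
print(pool_outliers(util, {cpu: "all" for cpu in util}))
print(pool_outliers(util, pool_of))
```

In this data CPU 7 sits near the host-wide mean, so a topology-blind check sees nothing unusual, while the per-pool check shows it running at roughly twice the load of its E-core peers.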
Out of scope
This issue is about the what and why. Specific design choices (where awareness lives, how it's expressed in tool output, schema for the topology context, etc.) should be decided when this is picked up.
Related
docs/parquet_metadata.md — the agent's existing systeminfo capture
docs/patterns.md — diagnostic patterns whose validity is architecture-conditional