mcp: add system architecture awareness to analysis tools #889

@thinkingfish

What

Make the MCP server (rezolus mcp) architecture-aware when analyzing recordings. The analysis tools (anomaly detection, correlation, PromQL queries, describe-metrics) currently operate on metric values without factoring in the underlying system topology and configuration.

Why

Many diagnostic conclusions are valid only in the context of the host's architecture. Without that context, the MCP server can produce technically correct but operationally misleading analysis.

Examples where architecture changes the interpretation:

  • "CPU 12 shows high softirq" — is CPU 12 listed in isolcpus? An SMT sibling of a hot CPU? On the wrong NUMA node for the device whose IRQs land there? The same number means three different things.
  • "Off-CPU time is high in cgroup X" — is the cgroup hitting its cpu.max quota, or is it just IO-bound? Throttling and scheduling pressure look identical without the cgroup config.
  • "ENA allowance exceeded" — only meaningful on EC2 Nitro instances; on bare metal those counters never increment.
  • "Steal time is 5%" — the verdict differs between bare metal (alarming) and a shared-tenancy VM (often normal during a live-migration window).
  • "L3 miss rate is 30%" — depends on cache hierarchy size, which differs by CPU generation and NUMA layout.
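
To make the first example concrete, here is a minimal sketch of how one per-CPU observation could be annotated with topology context. It is an illustration only, not the rezolus implementation. It assumes the host context is available as kernel-style cpulist strings (the format used by the isolcpus= boot parameter and by thread_siblings_list in sysfs) plus a NUMA node per CPU. The names parse_cpulist and softirq_context, and all of their parameters, are hypothetical.

```python
from __future__ import annotations


def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel cpulist such as "0-3,8,10-11" into a set of CPU ids.

    Only plain ids and inclusive ranges are handled; the strided form
    ("0-7:2") is deliberately left out of this sketch.
    """
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def softirq_context(
    cpu: int,
    isolcpus: str,                 # hypothetical: value of the isolcpus= parameter
    siblings_of: dict[int, str],   # hypothetical: cpu -> thread_siblings_list text
    hot_cpus: set[int],            # hypothetical: CPUs already flagged as busy
    cpu_node: dict[int, int],      # hypothetical: cpu -> NUMA node
    device_node: int,              # hypothetical: NUMA node of the IRQ's device
) -> list[str]:
    """Return the topology facts that change how high softirq on `cpu` reads."""
    notes: list[str] = []
    if cpu in parse_cpulist(isolcpus):
        notes.append("isolated")
    siblings = parse_cpulist(siblings_of.get(cpu, "")) - {cpu}
    if siblings & hot_cpus:
        notes.append("smt-sibling-of-hot-cpu")
    node = cpu_node.get(cpu)
    if node is not None and node != device_node:
        notes.append("remote-numa-node")
    return notes
```

With this sketch, the same "CPU 12 shows high softirq" reading carries zero to three qualifiers depending on the host, which is the distinction the bullet describes.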

The patterns documented in docs/patterns.md are explicitly architecture-conditional in many cases — that doc is for human operators; the MCP server should be able to apply the same conditioning automatically.

Categories of awareness that would help

  • CPU topology — cores, sockets, SMT siblings, NUMA nodes, cache hierarchy, frequency capabilities.
  • Cgroup hierarchy and configuration — cpu.max, memory.high/memory.max, cpuset.cpus, parent/child relationships.
  • Kernel and userspace versioning — kernel version (which tracepoints exist), libc, key driver versions.
  • Cloud / hypervisor context — bare metal vs VM, cloud provider, instance type/family, hypervisor steal-time semantics.
  • Block / network device configuration — IO scheduler, queue depth, IRQ affinity, NVMe poll mode, NIC RSS/RPS state, multipath topology.
  • Boot-time isolation posture — isolcpus, nohz_full, rcu_nocbs, governor settings.
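
As one illustration of why cgroup configuration belongs on this list: cgroup v2 exposes the CPU quota in cpu.max (as "$MAX $PERIOD", where $MAX may be the literal "max") and throttling counters in cpu.stat. The sketch below assumes the contents of those two files are available as strings alongside a recording; this issue does not say whether rezolus captures them. The function names and the 5% threshold are hypothetical.

```python
from __future__ import annotations


def parse_cpu_max(text: str) -> float | None:
    """Return the quota in CPUs (e.g. 0.5), or None when the cgroup is unlimited."""
    fields = text.split()
    quota = fields[0]
    # The kernel default period is 100000 microseconds.
    period = int(fields[1]) if len(fields) > 1 else 100_000
    if quota == "max":
        return None
    return int(quota) / period


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the "key value" lines of a cgroup v2 cpu.stat file."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def off_cpu_hint(cpu_max: str, cpu_stat: str, min_throttled_ratio: float = 0.05) -> str:
    """Classify whether high off-CPU time could plausibly be quota throttling.

    min_throttled_ratio is an arbitrary illustrative threshold.
    """
    quota = parse_cpu_max(cpu_max)
    stat = parse_cpu_stat(cpu_stat)
    periods = stat.get("nr_periods", 0)
    throttled = stat.get("nr_throttled", 0)
    if quota is None:
        return "no-quota"
    if periods and throttled / periods >= min_throttled_ratio:
        return "quota-throttled"
    return "quota-not-binding"
```

The cpu.stat counters are cumulative, so a real tool would likely compare two snapshots taken at the edges of the recording window rather than a single read.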

The agent already captures some of this (systeminfo is in parquet metadata per docs/parquet_metadata.md); the gap is in the MCP tools using it consistently when interpreting metrics.

Concrete benefits

  • Anomaly detection that knows "normal" for the host shape rather than treating every host's metrics as i.i.d.
  • Correlations that respect topology (don't correlate "all CPUs" when the host has heterogeneous CPU pools, e.g. P-cores vs E-cores).
  • Suggested diagnostic queries that adapt to the platform — different on AWS Nitro vs bare metal vs Azure.
  • More confident answers to "is this metric value problematic" — the same value can be benign on one host and a SEV on another.
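
The first two benefits can be sketched together: compare each CPU against its own pool rather than against the whole host. The example below is a simplified stand-in for whatever anomaly detection the MCP server actually uses. The function name, the pool labels, and the "2x the pool median" rule are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import median


def outliers_by_pool(
    value_by_cpu: dict[int, float],
    pool_by_cpu: dict[int, str],   # hypothetical labels, e.g. "P" / "E"
    factor: float = 2.0,           # arbitrary illustrative threshold
) -> list[int]:
    """Flag CPUs whose value exceeds `factor` times the median of their own pool.

    CPUs without a pool label fall into a single "unknown" pool, which
    reproduces the topology-blind behaviour.
    """
    pools: dict[str, list[float]] = defaultdict(list)
    for cpu, value in value_by_cpu.items():
        pools[pool_by_cpu.get(cpu, "unknown")].append(value)
    medians = {pool: median(values) for pool, values in pools.items()}

    flagged: list[int] = []
    for cpu, value in sorted(value_by_cpu.items()):
        base = medians[pool_by_cpu.get(cpu, "unknown")]
        if base > 0 and value > factor * base:
            flagged.append(cpu)
    return flagged


# Made-up utilization values: CPUs 0-3 are busy P-cores, CPUs 4-7 are E-cores.
utilization = {0: 0.80, 1: 0.82, 2: 0.78, 3: 0.81, 4: 0.20, 5: 0.22, 6: 0.18, 7: 0.60}
pools = {0: "P", 1: "P", 2: "P", 3: "P", 4: "E", 5: "E", 6: "E", 7: "E"}

print(outliers_by_pool(utilization, pools))  # [7]: unusual for an E-core
print(outliers_by_pool(utilization, {}))     # []: hidden when pools are ignored
```

In this made-up data, CPU 7 is well below the busy P-cores, so a host-wide baseline does not flag it. It only stands out once it is compared with the other E-cores.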

Out of scope

This issue is about the what and why. Specific design choices (where awareness lives, how it's expressed in tool output, schema for the topology context, etc.) should be decided when this is picked up.

Related

  • docs/parquet_metadata.md — the agent's existing systeminfo capture
  • docs/patterns.md — diagnostic patterns whose validity is architecture-conditional
  • docs: capture sampler development methodology and tradeoffs #883 — sampler-development methodology doc, which intersects on the question of "what does rezolus consider important about a system"
