
Prevent noc BGP router snapshots from filling root filesystem #321

Description

@Svaag

Incident context

The monitoring apply against noc failed in CI/CD at:

TASK [monitoring : Install node_exporter (Debian)]
Failed to update apt cache after 5 retries

Manual validation on noc showed the underlying apt error was disk exhaustion:

Error writing to file - write (28: No space left on device)
/dev/xvda1 20G 19G 0 100% /

The main growth source was unbounded router BGP snapshots:

/var/lib/hyrule-mcp/bgp-snapshots ~14G
bgp-router-snapshot.timer          hourly

The snapshot metadata carries an expires_at of 7 days, but nothing on the host enforces that retention. Once / filled, apt index refreshes failed, which in turn broke unrelated Ansible applies.

Immediate mitigation already applied

On noc:

  • Stopped and disabled bgp-router-snapshot.timer until retention is managed.
  • Cleaned apt cache / journals.
  • Verified apt-get update succeeds.
  • Current state after mitigation:
/dev/xvda1 20G 7.4G 12G 40% /
/var/lib/hyrule-mcp/bgp-snapshots 3.7G
bgp-router-snapshot.timer disabled/inactive
apt-get update OK

Engineering Loop task

Implement a permanent, reviewed fix so BGP snapshots cannot fill noc again, then safely re-enable collection.

Preferred Network-Operations-side approach:

  1. Manage bgp-router-snapshot.service and bgp-router-snapshot.timer in Ansible, probably in ansible/roles/hyrule_mcp or the noc play path, instead of leaving them as manual drift.
  2. Add enforced local retention for /var/lib/hyrule-mcp/bgp-snapshots before re-enabling the timer. Acceptable implementations:
    • systemd-tmpfiles policy for age-based deletion; or
    • a small managed cleanup service/timer; or
    • call a hyrule-mcp retention flag if that exists/gets added.
  3. Keep the retention window aligned with snapshot metadata (expires_at currently 7 days), or make it an Ansible default such as hyrule_mcp_bgp_snapshot_retention_days: 7.
  4. Add a monitoring guardrail if the existing one is not sufficient: root filesystem disk checks should alert before apt or an apply breaks.
  5. Re-enable bgp-router-snapshot.timer only after retention is active.

If the cleanest fix requires code changes in AS215932/hyrule-mcp, file/link a child issue or PR there, but this Network-Operations issue is the production rollout tracker.
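As one possible shape for the "small managed cleanup service/timer" option in step 2, here is a minimal sketch of the cleanup logic. The function name prune_snapshots is hypothetical, and the sketch assumes that snapshots are regular files whose mtime reflects capture time, which has not been confirmed for hyrule-mcp. It also assumes GNU find, as shipped on Debian.

```shell
#!/bin/sh
# Hypothetical helper, not part of hyrule-mcp: age-based pruning of a
# snapshot directory. Assumes snapshots are regular files whose mtime
# reflects when they were captured.

prune_snapshots() {
    dir="$1"    # directory to prune
    days="$2"   # retention window in whole days

    # Refuse an empty or missing path so a bad variable can never
    # turn into a delete against some other location.
    [ -n "$dir" ] && [ -d "$dir" ] || return 1

    # Require a plain non-negative integer for the retention window.
    case "$days" in
        ''|*[!0-9]*) return 1 ;;
    esac

    # -mtime +N matches files more than N whole days old.
    # -xdev keeps find on the same filesystem as "$dir".
    find "$dir" -xdev -mindepth 1 -type f -mtime +"$days" -delete

    # Remove directories left empty by the pass above.
    find "$dir" -xdev -mindepth 1 -type d -empty -delete
}
```

A managed oneshot service could then call `prune_snapshots /var/lib/hyrule-mcp/bgp-snapshots 7`, with the 7 coming from hyrule_mcp_bgp_snapshot_retention_days. Because -mtime counts whole days, a file is removed only once it is at least 8 days old, which errs on the side of keeping data slightly longer than expires_at.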

Acceptance criteria

  • Ansible owns the BGP snapshot timer/service or explicitly removes/disables unmanaged copies.
  • Retention is enforced automatically on noc for /var/lib/hyrule-mcp/bgp-snapshots.
  • The retention horizon is documented/configurable and defaults to 7 days.
  • Applying playbooks/noc.yml (or the chosen owning playbook) is idempotent.
  • After apply on noc:
    • systemctl is-active bgp-router-snapshot.timer reports active.
    • snapshot cleanup policy/timer exists and is active, or tmpfiles policy is installed and systemd-tmpfiles --clean succeeds.
    • df -h / shows healthy free space.
    • apt-get update succeeds.
  • The PR summary includes the incident root cause and the manual mitigation that disabled the timer.
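
The "healthy free space" criterion can be turned into a pass/fail check. The sketch below is one way to do that; the function names and the 80 percent threshold in the usage note are assumptions for illustration, not values agreed in this issue.

```shell
#!/bin/sh
# Hypothetical post-apply check: fail when a filesystem is at or above
# a usage threshold. Function names are illustrative.

# Print the Capacity value (digits only) from `df -P` output on stdin.
# `df -P` prints a header line, then one line per filesystem, with the
# usage percentage in the fifth column.
parse_df_percent() {
    awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Return 0 if the filesystem holding path $1 is below $2 percent used.
fs_usage_below() {
    used=$(df -P "$1" | parse_df_percent)
    [ -n "$used" ] && [ "$used" -lt "$2" ]
}
```

For example, `fs_usage_below / 80 || echo "root filesystem at or above 80% used" >&2` would have failed both during the incident (100%) and would pass in the post-mitigation state (40%).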

Suggested validation commands

ansible-playbook ansible/playbooks/noc.yml --tags apply -e '{"noc_apply":true}' --limit noc
ssh noc 'systemctl status bgp-router-snapshot.timer --no-pager'
ssh noc 'systemd-tmpfiles --clean'
ssh noc 'du -sh /var/lib/hyrule-mcp/bgp-snapshots; df -h /; apt-get update'

Metadata

Assignees

No one assigned

    Labels

    agentic-isp, ansible, bug, disk, engineering-handoff, loop:knowledge-gap, monitoring, noc
