Incident context
The monitoring apply against noc failed in CI/CD at:
TASK [monitoring : Install node_exporter (Debian)]
Failed to update apt cache after 5 retries
Manual validation on noc showed the underlying apt error was disk exhaustion:
Error writing to file - write (28: No space left on device)
/dev/xvda1 20G 19G 0 100% /
The main growth source was unbounded router BGP snapshots:
/var/lib/hyrule-mcp/bgp-snapshots ~14G
bgp-router-snapshot.timer hourly
The snapshot metadata sets expires_at to 7 days, but nothing on the host enforces that retention. Once / filled, apt index refreshes failed, and unrelated Ansible applies failed as a result.
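A growth source like this can be located by ranking directories by on-disk size. The following is a minimal sketch, assuming GNU find, du, and sort as shipped on Debian; largest_dirs is a hypothetical helper name, not an existing tool on noc.

```shell
# Hypothetical triage helper: the name largest_dirs is illustrative only.
# Print the N largest immediate subdirectories of DIR, biggest first (sizes in KiB).
largest_dirs() {
    dir=$1
    n=${2:-10}
    # -xdev and du -x keep the scan on one filesystem, so other mounts do not
    # distort the picture of what is filling DIR's filesystem.
    find "$dir" -xdev -mindepth 1 -maxdepth 1 -type d -exec du -xsk {} + 2>/dev/null |
        sort -rn |
        head -n "$n"
}
```

For example, largest_dirs /var/lib 5 ranks the five largest directories directly under /var/lib.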
Immediate mitigation already applied
On noc:
- Stopped and disabled bgp-router-snapshot.timer until retention is managed.
- Cleaned apt cache / journals.
- Verified apt-get update succeeds.
- Current state after mitigation:
/dev/xvda1 20G 7.4G 12G 40% /
/var/lib/hyrule-mcp/bgp-snapshots 3.7G
bgp-router-snapshot.timer disabled/inactive
apt-get update OK
Engineering Loop task
Implement a permanent, reviewed fix so BGP snapshots cannot fill noc again, then safely re-enable collection.
Preferred Network-Operations-side approach:
- Manage bgp-router-snapshot.service and bgp-router-snapshot.timer in Ansible, probably in ansible/roles/hyrule_mcp or the noc play path, instead of leaving them as manual drift.
- Add enforced local retention for /var/lib/hyrule-mcp/bgp-snapshots before re-enabling the timer. Acceptable implementations:
  - systemd-tmpfiles policy for age-based deletion; or
  - a small managed cleanup service/timer; or
  - call a hyrule-mcp retention flag if that exists/gets added.
- Keep the retention window aligned with snapshot metadata (expires_at currently 7 days), or make it an Ansible default such as hyrule_mcp_bgp_snapshot_retention_days: 7.
- Add monitoring/guardrail if not already sufficient: root filesystem disk checks should alert before apt/apply breaks.
- Re-enable bgp-router-snapshot.timer only after retention is active.
If the cleanest fix requires code changes in AS215932/hyrule-mcp, file/link a child issue or PR there, but this Network-Operations issue is the production rollout tracker.
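Of the acceptable implementations listed above, the small managed cleanup service/timer could run something like the following. This is only a sketch under stated assumptions: prune_snapshots is a hypothetical name, snapshots are assumed to be regular files whose mtime reflects capture time, and GNU find (as on Debian) is assumed for -mmin, -mindepth, and -delete.

```shell
# Hypothetical cleanup helper; the function name and defaults are assumptions,
# not part of hyrule-mcp or the existing Ansible roles.
# Usage: prune_snapshots [DIR] [DAYS]
prune_snapshots() {
    snap_dir=${1:-/var/lib/hyrule-mcp/bgp-snapshots}
    keep_days=${2:-7}

    # Refuse to run against a missing path rather than walking something unexpected.
    if [ ! -d "$snap_dir" ]; then
        echo "prune_snapshots: $snap_dir is not a directory" >&2
        return 1
    fi

    # Delete regular files last modified more than keep_days ago.
    # -mmin is used instead of -mtime because -mtime +7 only matches files
    # that are at least 8 whole days old.
    find "$snap_dir" -xdev -type f -mmin "+$((keep_days * 1440))" -print -delete

    # Remove directories left empty by the deletion, but never snap_dir itself.
    find "$snap_dir" -xdev -mindepth 1 -type d -empty -delete
}
```

The 7-day default mirrors the expires_at window and would presumably be templated from hyrule_mcp_bgp_snapshot_retention_days. The systemd-tmpfiles option would express the same policy declaratively, but tmpfiles ages files by atime and ctime as well as mtime by default, so the two approaches may not delete exactly the same files.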
Acceptance criteria
- Retention is enforced on noc for /var/lib/hyrule-mcp/bgp-snapshots.
- playbooks/noc.yml (or the chosen owning playbook) is idempotent.
- On noc:
  - systemctl is-active bgp-router-snapshot.timer is active.
  - systemd-tmpfiles --clean succeeds.
  - df -h / has healthy free space.
  - apt-get update succeeds.
Suggested validation commands
ansible-playbook ansible/playbooks/noc.yml --tags apply -e '{"noc_apply":true}' --limit noc
ssh noc 'systemctl status bgp-router-snapshot.timer --no-pager'
ssh noc 'systemd-tmpfiles --clean || true'
ssh noc 'du -sh /var/lib/hyrule-mcp/bgp-snapshots; df -h /; apt-get update'
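The host-side acceptance checks could also be wrapped into a single pass/fail script. The sketch below is illustrative only: the function names are hypothetical, and the 80% threshold is an assumed stand-in for "healthy free space", which this issue does not define.

```shell
# Hypothetical post-rollout check; function names and the 80% threshold are
# assumptions for illustration, not an existing script in the repository.

# Print the used-space percentage (integer, no % sign) of the filesystem holding $1.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only if usage of the filesystem holding $1 is below $2 percent.
disk_below() {
    used=$(disk_used_pct "$1") || return 1
    [ -n "$used" ] && [ "$used" -lt "$2" ]
}

# Run the host-side checks in order; intended to be run on noc as root.
acceptance_checks() {
    systemctl is-active --quiet bgp-router-snapshot.timer ||
        { echo "bgp-router-snapshot.timer is not active" >&2; return 1; }
    disk_below / 80 ||
        { echo "root filesystem is at or above 80% used" >&2; return 1; }
    apt-get update -qq ||
        { echo "apt-get update failed" >&2; return 1; }
    echo "acceptance checks passed"
}
```

A percentage check like disk_below could also back the root filesystem guardrail, so an alert fires before apt or an apply breaks.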