Prevent noc BGP router snapshots from filling root filesystem

## Incident context

The `monitoring` apply against `noc` failed in CI/CD at:

```text
TASK [monitoring : Install node_exporter (Debian)]
Failed to update apt cache after 5 retries
```

Manual validation on `noc` showed the underlying apt error was disk exhaustion:

```text
Error writing to file - write (28: No space left on device)
/dev/xvda1 20G 19G 0 100% /
```

The main growth source was unbounded router BGP snapshots:

```text
/var/lib/hyrule-mcp/bgp-snapshots ~14G
bgp-router-snapshot.timer          hourly
```

The snapshot metadata includes `expires_at` at 7 days, but nothing on the host enforces retention. Once `/` filled, apt index refreshes failed, and so did unrelated Ansible applies that depend on them.

## Immediate mitigation already applied

On `noc`:

- Stopped and disabled `bgp-router-snapshot.timer` until retention is managed.
- Cleaned apt cache / journals.
- Verified `apt-get update` succeeds.
- Current state after mitigation:

```text
/dev/xvda1 20G 7.4G 12G 40% /
/var/lib/hyrule-mcp/bgp-snapshots 3.7G
bgp-router-snapshot.timer disabled/inactive
apt-get update OK
```

## Engineering Loop task

Implement a permanent, reviewed fix so BGP snapshots cannot fill `noc` again, then safely re-enable collection.

Preferred Network-Operations-side approach:

1. Manage `bgp-router-snapshot.service` and `bgp-router-snapshot.timer` in Ansible, probably in `ansible/roles/hyrule_mcp` or the `noc` play path, instead of leaving them as manual drift.
2. Add enforced local retention for `/var/lib/hyrule-mcp/bgp-snapshots` before re-enabling the timer. Acceptable implementations:
   - `systemd-tmpfiles` policy for age-based deletion; or
   - a small managed cleanup service/timer; or
   - call a `hyrule-mcp` retention flag if that exists/gets added.
3. Keep the retention window aligned with snapshot metadata (`expires_at` currently 7 days), or make it an Ansible default such as `hyrule_mcp_bgp_snapshot_retention_days: 7`.
4. Add a monitoring guardrail if the existing checks are not sufficient: root filesystem disk checks should alert before apt or Ansible applies break.
5. Re-enable `bgp-router-snapshot.timer` only after retention is active.
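
For the "small managed cleanup service/timer" option in step 2, the cleanup could be sketched roughly as below. This is only an illustration, not an existing script: the function name `prune_snapshots` is hypothetical, it assumes GNU `find`, and it assumes snapshots are regular files whose modification time reflects when they were captured. It does not read the `expires_at` metadata.

```bash
#!/usr/bin/env bash
# Sketch: age-based pruning of BGP snapshot files (hypothetical helper).
set -euo pipefail

prune_snapshots() {
  local dir="$1"
  local days="${2:-7}"   # assumed default, mirrors the 7-day expires_at window

  # A missing directory is not an error: there is nothing to prune.
  [ -d "$dir" ] || return 0

  # -mmin counts minutes since last modification, so the cutoff is exactly
  # days * 24h. -xdev keeps find on one filesystem. Only regular files are
  # deleted; directories are left in place.
  find "$dir" -xdev -mindepth 1 -type f -mmin "+$((days * 1440))" -delete
}

# Run only when a directory is passed explicitly, e.g.:
#   prune-snapshots.sh /var/lib/hyrule-mcp/bgp-snapshots 7
if [ "$#" -ge 1 ]; then
  prune_snapshots "$@"
fi
```

If something like this is adopted, the retention argument would presumably come from `hyrule_mcp_bgp_snapshot_retention_days`, and the script would be invoked from an Ansible-managed service/timer pair that is enabled before `bgp-router-snapshot.timer`.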

If the cleanest fix requires code changes in `AS215932/hyrule-mcp`, file/link a child issue or PR there, but this Network-Operations issue is the production rollout tracker.

## Acceptance criteria

- [ ] Ansible owns the BGP snapshot timer/service or explicitly removes/disables unmanaged copies.
- [ ] Retention is enforced automatically on `noc` for `/var/lib/hyrule-mcp/bgp-snapshots`.
- [ ] The retention horizon is documented/configurable and defaults to 7 days.
- [ ] Applying `playbooks/noc.yml` (or the chosen owning playbook) is idempotent.
- [ ] After apply on `noc`:
  - [ ] `systemctl is-active bgp-router-snapshot.timer` is `active`.
  - [ ] snapshot cleanup policy/timer exists and is active, or tmpfiles policy is installed and `systemd-tmpfiles --clean` succeeds.
  - [ ] `df -h /` has healthy free space.
  - [ ] `apt-get update` succeeds.
- [ ] The PR summary includes the incident root cause and the manual mitigation that disabled the timer.
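
The "healthy free space" criterion can be made checkable with a small threshold test. The sketch below is an assumption-laden example rather than an agreed guardrail: the function names and the 80% default limit are hypothetical, and it relies only on POSIX `df -P` output.

```bash
#!/usr/bin/env bash
# Sketch: fail when a filesystem is at or above a usage threshold.
set -euo pipefail

fs_used_percent() {
  # df -P prints one line per filesystem; column 5 is "Capacity" (e.g. 40%).
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_fs_usage() {
  local path="$1"
  local max="${2:-80}"   # assumed threshold, not taken from the incident
  local used
  used="$(fs_used_percent "$path")"

  if [ "$used" -ge "$max" ]; then
    echo "FAIL: $path is ${used}% full (limit ${max}%)" >&2
    return 1
  fi
  echo "OK: $path is ${used}% full (limit ${max}%)"
}

# Run only when a path is passed explicitly, e.g.:
#   check-fs-usage.sh / 80
if [ "$#" -ge 1 ]; then
  check_fs_usage "$@"
fi
```

A non-zero exit makes this usable as a post-apply validation step or as the probe behind an alert, so a filling root filesystem is noticed before `apt-get update` starts failing.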

## Suggested validation commands

```bash
ansible-playbook ansible/playbooks/noc.yml --tags apply -e '{"noc_apply":true}' --limit noc
ssh noc 'systemctl status bgp-router-snapshot.timer --no-pager'
ssh noc 'systemd-tmpfiles --clean'
ssh noc 'du -sh /var/lib/hyrule-mcp/bgp-snapshots; df -h /; apt-get update'
```

