Skip to content

feat: add AS215932 extmon and diagnostics ops support#213

Open
Svaag wants to merge 19 commits into
mainfrom
chore/agentic-diagnostics-pricing-knobs
Open

feat: add AS215932 extmon and diagnostics ops support#213
Svaag wants to merge 19 commits into
mainfrom
chore/agentic-diagnostics-pricing-knobs

Conversation

@Svaag

@Svaag Svaag commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR contains the Network Operations side of the AS215932 / Agentic ISP Support rollout.

It adds operational support for the new Hyrule Cloud BGP/network-intelligence platform and exposes pricing knobs for the new agentic diagnostics APIs.

Major pieces:

  • extmon role expansion for out-of-AS215932 monitoring
    • Routinator
    • BGPalerter
    • BGPStream/PyBGPStream bootstrap
    • extmon-bgp-agent
    • extmon-diag-agent
    • Prometheus scrape jobs and alert rules
    • direct Discord alerting plus best-effort NOC webhook
  • BGPalerter pinned to v2.0.1 with SHA256
  • extmon rendered service/config templates
  • NOC-side AS215932 router-table snapshot timer/service templates
  • Hyrule Cloud runtime/deployment support for BGP data dir and ingest token
  • Vault/static env template price knobs for:
    • BGP/IP/DNS/RDAP/WHOIS/MX/Mail
    • Web/TLS
    • Path diagnostics
    • Port checks
    • NAT/CGNAT
    • Threat reputation
    • VoIP/SIP
    • Speedtest
  • docs updates for extmon and network flows

Review notes

This is paired with:

  • hyrule-cloud PR: x402 Agentic ISP Support API/Skill implementation
  • hyrule-mcp PR: AS215932 router snapshot collector

The important design constraint remains: extmon is outside AS215932 and must not hold router SSH keys. Router snapshots stay NOC-side.

Validation

After rebasing on current origin/main:

cd network-operations && ANSIBLE_ROLES_PATH=ansible/roles ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/cloud.yml --syntax-check
playbook: ansible/playbooks/cloud.yml

Previous local validation before the branch split also covered extmon/noc/cloud syntax checks and extmon render/agent py_compile.

@Svaag Svaag requested a review from a team as a code owner June 14, 2026 00:43

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a5d7daec4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +56 to +58
resolve_public(parsed.hostname or "")
req = urllib.request.Request(url, headers={"User-Agent": "AS215932-extmon-diag/1.0"})
with urllib.request.urlopen(req, timeout=timeout) as resp:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Block redirects to private addresses

When a diagnostics request probes a public URL that returns a 30x to loopback/link-local/private infrastructure (for example metadata or 127.0.0.1), resolve_public only validates the original hostname, but urllib.request.urlopen follows redirects and resolves the redirected target itself. That bypasses the non-public-address guard and can return a body sample from internal services; disable redirects or re-validate each redirect target before fetching.

Useful? React with 👍 / 👎.

summary: 'BGP data source stale: {% raw %}{{ $labels.source }}{% endraw %}'

- alert: BGPalerterCriticalEvent
expr: increase(bgpalerter_alerts_total{severity="critical"}[10m]) > 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the first BGPalerter event

For the first critical BGPalerter webhook after the agent starts, or for any new channel/type labelset, the counter series first appears at 1; Prometheus cannot compute a positive increase() without a prior sample, and later samples remain at 1. This misses the one-shot hijack/withdrawal notification until a second matching event increments the same series, so initialize counters at 0 or alert from an event timestamp/gauge instead.

Useful? React with 👍 / 👎.


# BGP router table snapshots for paid Hyrule Cloud data products.
hyrule_mcp_bgp_snapshot_dir: "{{ hyrule_mcp_state_dir }}/bgp-snapshots"
hyrule_mcp_bgp_snapshot_ingest_url: "{{ lookup('ansible.builtin.env', 'HYRULE_BGP_INGEST_URL') | default('https://cloud.hyrule.host/v1/internal/bgp') }}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use the default ingest URL when env is unset

When HYRULE_BGP_INGEST_URL is not exported, Ansible's env lookup returns an empty string, and default(...) does not replace falsey values unless the second argument is true. The rendered bgp-router-snapshot.env therefore gets HYRULE_BGP_INGEST_URL= instead of the intended cloud endpoint, so the hourly snapshot job will not upload unless every deploy environment explicitly sets the URL.

Useful? React with 👍 / 👎.

@github-actions

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🏅 Score: 72
🧪 No relevant tests
🔒 Security concerns

Yes. Multiple security concerns:
(1) The CAIDA BGPStream bootstrap script is fetched via unauthenticated HTTP and piped to bash, creating a supply-chain compromise vector. (2) The diag agent's resolve_public function has a logic flaw that can bypass private-IP blocking under crafted DNS responses. (3) The diag agent's ping/trace probes pass user-controlled addresses to subprocess without sanitization, enabling argument injection if the token is empty or leaked. (4) The EXTMON_DIAG_AGENT_TOKEN defaults to empty string in the env file, meaning the agent is unprotected by default until the operator explicitly sets a token. (5) The EXTMON_BGP_INGEST_TOKEN and EXTMON_BGP_CLOUDFLARE_API_TOKEN are rendered into a world-readable env file (mode 0600 is set in the install task, but the generated file in ansible/generated/extmon/bgp-agent.env has no mode restriction — it inherits the default from the render task which is 0644). (6) The HYRULE_BGP_INGEST_TOKEN is rendered into bgp-router-snapshot.env with mode 0640, which is group-readable; the group is hyrule_mcp_app_group which may include non-root users.

⚡ Recommended focus areas for review

Operational Footgun

The CAIDA BGPStream repository bootstrap (lines 52-62) runs a remote shell script piped into bash via curl -s "https://pkg.caida.org/os/$(lsb_release -si | awk '{print tolower($0)}')/bootstrap.sh" | bash. This is a classic supply-chain risk: the script is fetched over plain HTTP (no HTTPS pinned), and the URL is dynamically constructed from the OS release string. If the CAIDA bootstrap endpoint is compromised or the DNS is hijacked, arbitrary code can execute as root during the apply phase. This bypasses the SHA-pinned binary pattern used elsewhere in the PR (e.g., BGPalerter). The creates: /etc/apt/sources.list.d/caida.list guard only prevents re-execution after the first run, not the initial compromise.

- name: Apply — bootstrap CAIDA BGPStream repositories
  shell: |
    set -euo pipefail
    curl -s "https://pkg.caida.org/os/$(lsb_release -si | awk '{print tolower($0)}')/bootstrap.sh" | bash
  args:
    executable: /bin/bash
    creates: /etc/apt/sources.list.d/caida.list
  when:
    - extmon_apply | bool
    - extmon_install_bgpstream | bool
  tags: [apply]
Security Concern

The resolve_public function (lines 25-39) has a logic flaw: if blocked_ip(host) raises a ValueError, the bare except ValueError: pass on line 29-30 silently swallows the exception and continues to socket.getaddrinfo. This means a caller can bypass the blocked-IP check by passing a hostname that is itself a private IP string (e.g., "127.0.0.1" or "10.0.0.1"). The function will skip the check, resolve the hostname (which is already an IP), and then the subsequent blocked_ip(addr) check on line 35 will catch it — but only if the address resolves to the same IP. If the hostname resolves to a different private IP (e.g., via a crafted DNS response), the check is effectively bypassed. In practice, an attacker who can control the host parameter to the tcp or ping/trace probes could target internal RFC1918 or loopback addresses, bypassing the intended restriction to public targets only.

def resolve_public(host: str) -> list[str]:
    try:
        if blocked_ip(host):
            raise ValueError(f"blocked non-public target {host}")
    except ValueError:
        pass
    infos = socket.getaddrinfo(host, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
    addrs = []
    for info in infos:
        addr = info[4][0]
        if blocked_ip(addr):
            raise ValueError(f"blocked resolved non-public target {addr}")
        if addr not in addrs:
            addrs.append(addr)
    return addrs
Security Concern

The run_probe function (lines 62-76) allows ping and trace probes that execute subprocess commands with user-controlled addr values (line 70). While resolve_public is called first, the addr variable is the first resolved address from getaddrinfo. If an attacker can control the host payload and the DNS resolves to a crafted address containing shell metacharacters (e.g., ; rm -rf /), the subprocess.run call on line 71 could execute arbitrary commands. Although subprocess.run with a list argument (not shell=True) prevents shell injection, the addr is passed as a positional argument to ping/traceroute, which could still allow argument injection (e.g., -c flag injection in ping). This is a realistic attack vector if the diag agent is exposed to untrusted clients (the token check on line 86 is optional — if TOKEN is empty, any POST is accepted).

if kind in {"ping", "trace"}:
    host = str(payload["host"])
    addr = resolve_public(host)[0]
    cmd = ["ping", "-c", "4", "-W", "2", addr] if kind == "ping" else ["traceroute", "-m", "20", addr]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=20, check=False)
    return {"ok": proc.returncode == 0, "returncode": proc.returncode, "output": (proc.stdout or proc.stderr)[-4096:]}

@github-actions

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Security
Fix SSRF bypass in resolve_public

The try-except block silently swallows the ValueError when a blocked IP is detected,
completely bypassing the SSRF protection. Remove the outer try-except so that any
non-public target immediately raises an exception and prevents execution of the
probe.

ansible/roles/extmon/templates/diag-agent.py.j2 [25-30]

 def resolve_public(host: str) -> list[str]:
-    try:
-        if blocked_ip(host):
-            raise ValueError(f"blocked non-public target {host}")
-    except ValueError:
-        pass
+    if blocked_ip(host):
+        raise ValueError(f"blocked non-public target {host}")
Suggestion importance[1-10]: 9

__

Why: The try-except silently swallows the ValueError when a blocked IP is detected, completely bypassing SSRF protection. Removing it ensures non-public targets are rejected immediately, which is a critical security fix.

High
Possible issue
Handle empty DNS resolution result

resolve_public may return an empty list if no addresses are resolved, causing an
unhandled IndexError that crashes the probe handler. Check the length of the list
before indexing and raise a clear error if no addresses are available.

ansible/roles/extmon/templates/diag-agent.py.j2 [42-45]

 def tcp(host: str, port: int, timeout: float = 10.0):
     ...
-    addr = resolve_public(host)[0]
+    addrs = resolve_public(host)
+    if not addrs:
+        raise ValueError(f"no public addresses resolved for {host}")
+    addr = addrs[0]
Suggestion importance[1-10]: 6

__

Why: resolve_public may return an empty list, causing an unhandled IndexError. Adding a length check prevents a crash and provides a clear error message, improving robustness.

Low
General
Prevent polling loop from crashing

An unexpected exception in poll_once() (e.g., a bug in the shared state lock) would
terminate the polling thread entirely, stopping all metric collection. Wrap the call
in a try-except to log any such error and continue the loop, ensuring the agent
remains responsive.

ansible/roles/extmon/templates/bgp-agent.py.j2 [139-142]

 def poll_loop() -> None:
     while True:
-        poll_once()
+        try:
+            poll_once()
+        except Exception:
+            pass  # Prevent loop exit on unexpected error
         time.sleep(POLL_SECONDS)
Suggestion importance[1-10]: 5

__

Why: An unexpected exception in poll_once() would terminate the polling thread, stopping all metric collection. Wrapping it in a try-except ensures the loop continues, though the individual poll functions already have error handling.

Low

@Svaag

Svaag commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

Live extmon VPS has been bootstrapped and this PR branch now points inventory at it.

Provisioned host:

  • IPv4: 45.32.179.134
  • IPv6: 2001:19f0:7402:0cd5:5400:06ff:fe40:7112
  • OS: Debian 13 / trixie
  • RAM: ~3.8GiB
  • Swap: ~7.7GiB
  • Hostname: extmon
  • SSH: key-based root access works with the AS215932 operator key
  • Personal account: svag created with AS215932 operator key and passwordless sudo
  • SSH hardening: password/kbd-interactive auth disabled; root login is key-only

Validated after inventory update:

ANSIBLE_CONFIG=ansible/ansible.cfg ansible -i ansible/inventory/hosts.yml extmon -m ping
extmon | SUCCESS

ANSIBLE_CONFIG=ansible/ansible.cfg ANSIBLE_ROLES_PATH=ansible/roles ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/extmon.yml --syntax-check
OK

ANSIBLE_CONFIG=ansible/ansible.cfg ANSIBLE_ROLES_PATH=ansible/roles ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/extmon.yml --tags validate --connection=local --limit extmon
OK

Still blocked before extmon_apply=true full service deployment:

  • EXTMON_DISCORD_WEBHOOK_URL
  • EXTMON_BGP_CLOUDFLARE_API_TOKEN
  • EXTMON_BGP_INGEST_TOKEN
  • EXTMON_DIAG_AGENT_TOKEN
  • OVH expiry collector credentials if that collector should be live on first apply

@Svaag Svaag force-pushed the chore/agentic-diagnostics-pricing-knobs branch from eed8734 to e47057f Compare June 14, 2026 14:45
Svaag and others added 2 commits June 14, 2026 16:59
Conflict resolutions:
- host_vars/api.yml: keep main's hyrule_cloud_monero_wallet_rpc_enabled: true
  (monero-wallet-rpc is active on api today; the branch's June disable is
  superseded by the #268 plan) and main's app SHA pins.
- hyrule_cloud/defaults required_env_keys: keep main's list; deliberately NOT
  adding HYRULE_BGP_INGEST_TOKEN — the render hook requires non-empty values
  and bgp_ingest_token is unconfirmed in Vault. The env line ships with an
  empty default; add to the required list after seeding Vault.
- vault_agent hyrule-cloud.env.ctmpl.j2 + configs/hyrule-cloud.env.j2: take
  the branch's Vault-toggleable PAYMENT_REQUIRE_NATIVE (default false, same
  behavior as main's #255 relax) and keep both appended blocks (branch BGP
  data/ingest + main customer-IPv6 provisioning).
- generated/noc/nftables.conf: keep all three post-base rules (extmon
  webhook, proxy dashboard, loop callback) in host_vars order; verified by
  re-rendering (firewall.yml --tags validate --limit noc → changed=0).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

2026-07-03 review + refresh — merging this closes a live drift hazard

Why now: this branch was deployed to extmon on 2026-06-14 (host up 2w5d with bgp-agent/bgpalerter/prometheus running) but never merged — so main still claimed extmon: 0.0.0.0 while a live host existed, and any extmon.yml/firewall apply from main would have regressed live state. Recorded as a drift item on #246; merging this PR removes the booby-trap.

Refresh: merged origin/main (209 commits) into the branch. Conflict resolutions, each with rationale:

File Resolution
host_vars/api.yml main's side — monero_wallet_rpc_enabled: true (wallet-rpc is active on api today; the June disable is superseded by #268) + main's app pins
hyrule_cloud/defaults required_env_keys main's side — deliberately not adding HYRULE_BGP_INGEST_TOKEN: the render hook demands non-empty values and bgp_ingest_token is unconfirmed in Vault (the #252/#253 failure pattern). The env line ships with an empty default; add it to the required list after seeding Vault.
vault_agent ctmpl + configs/hyrule-cloud.env.j2 branch's Vault-toggleable PAYMENT_REQUIRE_NATIVE (default false = same behavior as main's #255 relax; re-enabling XMR enforcement becomes a Vault write, which is what #268 wants) + both appended blocks kept (BGP data/ingest + customer-IPv6 provisioning)
generated/noc/nftables.conf all three post-base rules kept in host_vars order; verified by re-render (firewall.yml --tags validate --limit noc → changed=0)

Review findings:

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a79c0b3d65

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +37 to +39
module: [http_2xx_v6]
static_configs:
- targets: ['http://{{ extmon_bgpalerter_listen }}/status']

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Switch BGPalerter to a non-TLS probe

The http_2xx_v6 blackbox module in blackbox.yml.j2 sets fail_if_not_ssl: true, but this new job probes BGPalerter's loopback REST endpoint over plain http://. As soon as the job is scraped, blackbox marks probe_success=0, so the generic ProbeFailing rule will continuously alert even while BGPalerter is healthy; use an HTTP-without-TLS module or scrape the REST endpoint directly.

Useful? React with 👍 / 👎.

Comment on lines +120 to +122
if not CF_TOKEN:
_set_source("cloudflare_radar", False, "missing token")
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep the optional Cloudflare feed out of down alerts

When EXTMON_BGP_CLOUDFLARE_API_TOKEN is not set (the docs/host vars call Radar optional), this path publishes bgp_source_up{source="cloudflare_radar"}=0. The new BGPSourceDown rule alerts on any zero source, so a default deployment without the optional token will generate a permanent warning; skip emitting the source or mark it disabled instead of down.

Useful? React with 👍 / 👎.

Comment on lines +30 to +32
- job_name: extmon-bgp-agent
static_configs:
- targets: ['{{ extmon_bgp_agent_listen }}']

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add an up alert for the BGP agent

Adding the BGP agent as its own scrape job leaves it outside the only extmon up alert (ExtmonComponentDown matches up{job="extmon-self"}), and repo-wide search shows no other rule covers this job. If extmon-bgp-agent crashes or stops serving /metrics, Prometheus only sets up{job="extmon-bgp-agent"}=0 while the BGP metrics go stale, so none of the BGP source/visibility alerts fire; add an up alert for this job or include it in the self-monitoring set.

Useful? React with 👍 / 👎.

- P1 diag-agent SSRF-via-redirect: add _ValidatingRedirectHandler that
  re-runs resolve_public() on every 30x hop, and fix resolve_public so a
  literal private-IP host can no longer bypass the guard (the bare
  `except ValueError: pass` swallowed the intended block).
- P2 BGPalerterCriticalEvent missed the first event: increase() on a counter
  has no prior sample for a new labelset. Emit a per-severity
  bgpalerter_last_event_timestamp gauge and alert on recency instead.
- P2 optional Cloudflare Radar feed emitted bgp_source_up=0 when token-less,
  making BGPSourceDown fire permanently; skip registering the source instead.
- P2 blackbox-bgpalerter probed a plain-http loopback endpoint with the
  fail_if_not_ssl http_2xx_v6 module (always probe_success=0). Add a
  non-TLS http_2xx_local module and point the job at it.
- P2 extmon-bgp-agent scrape job had no up alert (outside the extmon-self
  matcher). Add ExtmonBGPAgentDown.

Templates + regenerated ansible/generated/extmon/ artifacts. py_compile +
yaml load clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Remedied all Codex review findings (commit c77a33e)

Went through both Codex rounds. One round-1 finding (P2 ingest-URL default) was already fixed on the branch in e47057f; the other five are addressed:

# Sev Finding Fix
1 P1 diag-agent follows 30x redirects to loopback/private without re-validating Added _ValidatingRedirectHandler re-running resolve_public() on every hop; also fixed resolve_public so a literal private-IP host can't slip past the except ValueError: pass
2 P2 BGPalerterCriticalEvent uses increase() on a counter → misses the first event Emit per-severity bgpalerter_last_event_timestamp gauge; alert on time() - max(...) < 600
3 P2 HYRULE_BGP_INGEST_URL default not applied on empty env Already fixed in e47057f (default(..., true))
4 P2 blackbox-bgpalerter probes plain-http loopback with the fail_if_not_ssl module New http_2xx_local module (no TLS requirement); job repointed
5 P2 optional Cloudflare Radar feed emits bgp_source_up=0 → permanent BGPSourceDown Skip registering the source when token-less (disabled ≠ down)
6 P2 extmon-bgp-agent scrape job outside the only up matcher Added ExtmonBGPAgentDown alert

Templates + regenerated ansible/generated/extmon/ committed together. py_compile + yaml.safe_load clean; extmon render re-run (--tags validate) reproduces the artifacts.

The deeper diag-agent items from the automated security summary (DNS-rebinding pin, arg-injection hardening, empty-token default, env-file modes, CAIDA curl|bash) are tracked separately in #361extmon-diag-agent is not running on the live host and stays off until that lands.

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c77a33e98e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +296 to +299
name: bgp-router-snapshot.timer
state: started
enabled: true
daemon_reload: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce retention before enabling snapshot timer

On noc applies where the hyrule-mcp checkout exists, this starts the hourly bgp-router-snapshot.timer writing under /var/lib/hyrule-mcp/bgp-snapshots, but the existing docs/runbooks/noc-bgp-snapshot-retention.md says this timer must not be re-enabled until retention is managed because the same snapshots previously grew to ~14 GiB and filled /. I checked this change for a tmpfiles/cleanup policy and only see the directory/unit/timer being added, so this reintroduces the documented disk-fill failure mode; add retention before enabling the timer or leave it disabled.

Useful? React with 👍 / 👎.

PREFIXES = json.loads(os.environ.get("EXTMON_BGP_PREFIXES", '{{ extmon_bgp_prefixes | to_json }}'))
CF_TOKEN = os.environ.get("EXTMON_BGP_CLOUDFLARE_API_TOKEN", "")
BGPTOOLS_UA = os.environ.get("EXTMON_BGP_BGPTOOLS_USER_AGENT", "AS215932-bgp-observer/1.0")
POLL_SECONDS = int(os.environ.get("EXTMON_BGP_POLL_SECONDS", "300"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Cache bgp.tools longer than the agent poll

With the default EXTMON_BGP_POLL_SECONDS unset, poll_loop() calls poll_bgp_tools() every 300 seconds and downloads the full https://bgp.tools/table.jsonl each time. The bgp.tools API docs state that this export updates only about every 30 minutes and ask clients not to download it more often than 30 minutes (https://bgp.tools/kb/api), so the normal deployment will fetch it 12 times/hour and risks throttling or blocking the bgp_tools source; give this feed its own >=30m cache/interval instead of using the 5m poll.

Useful? React with 👍 / 👎.

- P1 BGP snapshot timer re-enabled without retention (regresses incident
  #321, snapshots reached ~14 GiB and filled /). Add systemd-tmpfiles
  age-based retention (default 7d, matching snapshot expires_at and the
  retention runbook) and activate it BEFORE enabling bgp-router-snapshot.timer.
- P2 bgp.tools table.jsonl pulled every 5m by the poll loop; their API asks
  for <=30m. Rate-limit successful full-table fetches to a dedicated
  BGPTOOLS_MIN_INTERVAL (default 1800s), gated on last success so transient
  failures still retry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Round-3 findings remedied (commit 86e2af0)

Both new findings addressed:

  • P1 — snapshot timer re-enabled without retention (regresses Prevent noc BGP router snapshots from filling root filesystem #321): the timer previously grew /var/lib/hyrule-mcp/bgp-snapshots to ~14 GiB and filled /. Added systemd-tmpfiles age-based retention (e <dir> - - - {{ retention_days }}d, default 7d to match the snapshot expires_at and docs/runbooks/noc-bgp-snapshot-retention.md), written and activated (systemd-tmpfiles --clean) before the bgp-router-snapshot.timer enable task. Retention is now active before collection resumes, exactly as the runbook requires.
  • P2 — bgp.tools polled every 5m: their API (https://bgp.tools/kb/api) asks for ≤ every 30m. Full-table fetches now gated on BGPTOOLS_MIN_INTERVAL (default 1800s), keyed off the last successful fetch so transient failures still retry on the next cycle while healthy pulls stay within the published rate.

ansible-lint passes (production profile) on the changed role; py_compile + yaml load clean; extmon render reproduces the artifacts.

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86e2af08d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

routing = _fetch_json(f"https://stat.ripe.net/data/routing-status/data.json?resource={q}")
data = routing.get("data", {})
origins = [int(o.get("origin")) for o in data.get("origins", []) if str(o.get("origin", "")).isdigit()]
visible = bool(data.get("last_seen") or origins)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use current RIPEstat origins for visibility

When the aggregate is withdrawn, RIPEstat keeps last_seen as historical ever-seen metadata while origins is the current exact-match announcer list (https://stat.ripe.net/docs/data-api/api-endpoints/routing-status). Because this treats any non-empty last_seen as visible, a prefix that was announced before but is no longer originated is exported as bgp_prefix_visible=1, so AS215932PrefixNotVisible will not fire for an AS215932 aggregate withdrawal unless another source catches it. Base visibility on the current origins/peer counts instead of last_seen.

Useful? React with 👍 / 👎.

Comment on lines +81 to +83
resolve_public(parsed.hostname or "")
req = urllib.request.Request(url, headers={"User-Agent": "AS215932-extmon-diag/1.0"})
with _OPENER.open(req, timeout=timeout) as resp:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pin HTTP probes to the vetted DNS result

For customer-controlled HTTP/HTTPS diagnostic URLs, this validates the hostname once and then passes the original URL to urlopen, which performs its own DNS lookup; a DNS-rebinding name can return a public address for resolve_public() and then 127.0.0.1 or 169.254.169.254 for the actual request, bypassing the non-public-target guard and reaching extmon loopback/metadata services. The request needs to connect to the vetted address or revalidate the address used by the connection.

Useful? React with 👍 / 👎.

Comment on lines +21 to +23
def blocked_ip(addr: str) -> bool:
ip = ipaddress.ip_address(addr)
return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_multicast or ip.is_reserved or ip.is_unspecified

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Block non-global CGNAT targets

On hosts where provider/internal services are reachable via CGNAT or shared address space, 100.64.0.0/10 is neither is_private nor is_reserved in Python's ipaddress, so resolve_public() allows probes to non-public CGNAT targets despite the guard's intent to reject non-public addresses. Use not ip.is_global with explicit allow/deny exceptions, or add the shared-address ranges to the block list.

Useful? React with 👍 / 👎.

- P1 diag-agent DNS-rebinding SSRF: resolve_public() vetted the hostname but
  urlopen re-resolved, so a rebinding name could return a public IP to the
  check and 127.0.0.1/169.254.169.254 to the fetch. Replace urlopen with
  _Pinned{HTTP,HTTPS}Connection that connect to the pre-vetted IP while keeping
  the hostname for Host header + TLS SNI/cert; redirects are followed manually
  so every hop is re-vetted and re-pinned.
- P1 RIPEstat visibility used `last_seen` (historical ever-seen), keeping
  bgp_prefix_visible=1 after a withdrawal and silencing
  AS215932PrefixNotVisible. Base visibility on the current `origins` set.
- P2 blocked_ip missed shared/CGNAT space (100.64.0.0/10 is neither is_private
  nor is_reserved). Use `not is_global` as the primary guard (also covers ULA),
  explicit terms kept as defence-in-depth.

Verified: py_compile clean; blocked_ip unit checks (CGNAT/private/loopback
blocked, public allowed); literal-private resolve_public raises; live pinned
https fetch returns 200.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Round-4 findings remedied (commit 42dc247)

  • P1 — diag-agent DNS-rebinding SSRF: resolve_public() vetted the hostname, but urlopen re-resolved at connect time, so a rebinding name could serve a public IP to the check and 127.0.0.1/169.254.169.254 to the actual fetch. Replaced urlopen with _PinnedHTTP(S)Connection that connects to the pre-vetted IP while keeping the hostname for the Host header and TLS SNI/cert validation; redirects are now followed manually so every hop is re-vetted and re-pinned. This supersedes the round-2 redirect handler and closes the gap Harden extmon diag/BGP agents: diag-agent input validation, CAIDA curl|bash, token defaults #361 had noted as the deeper item.
  • P1 — RIPEstat visibility: was keyed on last_seen (historical ever-seen), so a withdrawn aggregate stayed bgp_prefix_visible=1 and AS215932PrefixNotVisible wouldn't fire. Now based on the current origins announcer set.
  • P2 — CGNAT/shared space: blocked_ip missed 100.64.0.0/10 (neither is_private nor is_reserved). Switched the primary guard to not is_global (also covers IPv6 ULA), explicit terms kept as defence-in-depth.

Verified locally: py_compile clean; unit checks confirm CGNAT/private/loopback/metadata/ULA are blocked and public is allowed; the old literal-private bypass now raises; and a live pinned HTTPS fetch returns 200. extmon render reproduces the artifacts.

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42dc247685

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ansible/inventory/host_vars/extmon.yml Outdated
extmon_bgp_cloudflare_api_token: "{{ lookup('ansible.builtin.env', 'EXTMON_BGP_CLOUDFLARE_API_TOKEN') | default('') }}"
extmon_bgp_hyrule_ingest_url: "{{ lookup('ansible.builtin.env', 'EXTMON_BGP_HYRULE_INGEST_URL') | default('https://cloud.hyrule.host/v1/internal/bgp', true) }}"
extmon_bgp_ingest_token: "{{ lookup('ansible.builtin.env', 'EXTMON_BGP_INGEST_TOKEN') | default('', true) }}"
extmon_bgp_bgp_tools_user_agent: "{{ lookup('ansible.builtin.env', 'EXTMON_BGP_BGPTOOLS_USER_AGENT') | default('AS215932-bgp-observer/1.0 (noc@as215932.net; +https://as215932.net)') }}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the bgp.tools default User-Agent

When EXTMON_BGP_BGPTOOLS_USER_AGENT is unset, lookup('env', ...) returns an empty string, so this default(...) never applies without the boolean true argument; the generated env in this commit already renders EXTMON_BGP_BGPTOOLS_USER_AGENT=. The new bgp.tools poller then sends a blank User-Agent even though bgp.tools requires an identifying HTTP User-Agent and may block generic/missing ones, so the bgp_tools source can fail unless every operator sets the env var manually.

Useful? React with 👍 / 👎.

Comment thread ansible/roles/extmon/tasks/render.yml Outdated
Comment on lines +74 to +75
- { src: bgp-agent.env.j2, dest: bgp-agent.env }
- { src: diag-agent.env.j2, dest: diag-agent.env }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop rendering secret extmon env files into the worktree

When an operator follows the updated extmon runbook and exports the ingest/diagnostic tokens before running the render/validate step, these two env templates are written verbatim into the git-tracked ansible/generated/extmon/*.env files with mode 0644. Since the templates contain EXTMON_BGP_INGEST_TOKEN, the Cloudflare token, and EXTMON_DIAG_AGENT_TOKEN, a normal review diff can leave bearer tokens in the worktree and make them easy to commit or leak; skip these secret env files from generated output or redact the secret values.

Useful? React with 👍 / 👎.

- P2 bgp.tools User-Agent default never applied: env lookup returns "" when
  unset and default() ignores falsey without the boolean arg, so the poller
  sent a blank UA (bgp.tools may block it). Add `, true` to the default.
- P2 secret env files rendered into the worktree: bgp-agent.env / diag-agent.env
  carry EXTMON_BGP_INGEST_TOKEN, the Cloudflare token, and
  EXTMON_DIAG_AGENT_TOKEN. If an operator exported those before the render step
  they'd land in git-tracked generated/extmon/*.env at 0644, one `git add` from
  a leak. Stop rendering them to the worktree (install.yml already templates
  them straight to /etc/extmon/*.env at 0600), untrack the two files, and
  .gitignore the pattern.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Round-5 findings remedied (commit 2484e2a)

  • P2 — bgp.tools User-Agent default not applied: lookup('env', ...) returns "" when unset and default(...) ignores falsey values without the boolean arg, so the poller sent a blank UA (which bgp.tools may block). Added , true — same fix already applied to the ingest URL/token defaults.
  • P2 — secret env files rendered into the worktree: bgp-agent.env / diag-agent.env carry EXTMON_BGP_INGEST_TOKEN, the Cloudflare token, and EXTMON_DIAG_AGENT_TOKEN. An operator who exported those before the render/validate step would get bearer tokens written into git-tracked generated/extmon/*.env at 0644 — one git add from a leak. Stopped rendering them to the worktree (install.yml already templates them straight to /etc/extmon/*.env at 0600), untracked the two files, and added a .gitignore guard for the pattern.

Re-render is now changed=0 and produces no *.env under generated/extmon/; ansible-lint passes (production profile) on the changed files.

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2484e2a051

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

q = urllib.parse.quote(prefix, safe="")
realtime = _fetch_json(f"https://api.cloudflare.com/client/v4/radar/bgp/routes/realtime?prefix={q}", headers=headers)
with STATE_LOCK:
STATE["prefixes"].setdefault(prefix, {})["cloudflare_radar"] = {"visible": bool(realtime.get("success", True)), "origins": []}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse Cloudflare routes before marking visibility

When EXTMON_BGP_CLOUDFLARE_API_TOKEN is set, this marks the prefix visible whenever the API request succeeds (and even defaults missing success to true) while recording no origins. Cloudflare Radar's realtime response carries the route/origin data in the returned route metadata, so a withdrawn prefix or MOAS/hijack can still export bgp_prefix_visible{source="cloudflare_radar"} 1 and no bgp_prefix_origin_visible series, defeating the feed that is meant to improve hijack/leak/MOAS visibility and sending misleading ingest payloads.

Useful? React with 👍 / 👎.

Comment on lines +126 to +128
with STATE_LOCK:
for prefix, row in found.items():
STATE["prefixes"].setdefault(prefix, {})["bgp_tools"] = {"visible": True, "origins": [row["asn"]]}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear bgp.tools state when the prefix disappears

On a successful bgp.tools table download that no longer contains a monitored prefix, this only updates entries in found and leaves the previous STATE["prefixes"][prefix]["bgp_tools"] value untouched. After a withdrawal or exact-prefix removal, the last visible=True/origin remains exported indefinitely for source="bgp_tools", so the visibility alert and the Hyrule ingest payload stay wrong until the agent restarts; each successful fetch should update every wanted prefix, setting missing ones invisible and clearing stale origin/hit data.

Useful? React with 👍 / 👎.

Both feeds exported stale/optimistic visibility that would mask a withdrawal:

- P2 bgp.tools left withdrawn prefixes untouched: only prefixes present in the
  latest table were updated, so a removed prefix kept exporting its last
  visible=True origin (and stale hit counters) forever. Now every wanted prefix
  is updated each successful fetch — missing ones flip to visible=False with
  empty origins, and stale per-origin hit counters are cleared.
- P2 Cloudflare Radar marked visible on HTTP success alone with no origins, so a
  withdrawal or MOAS/hijack still showed visible=1 and emitted no
  bgp_prefix_origin_visible series. Now parses result.routes for visibility and
  origin ASNs; raises (source marked down) on an unsuccessful API response.

Verified with mocked-upstream unit tests: present->visible+origins+hits,
withdrawal->invisible+cleared, for both feeds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Round-6 findings remedied (commit eb38385)

Both feeds were exporting stale/optimistic visibility that would mask a real withdrawal:

  • P2 — bgp.tools left withdrawn prefixes untouched: only prefixes present in the latest table were updated, so a removed prefix kept exporting its last visible=True origin (and stale hit counters) indefinitely. Now every wanted prefix is updated each successful fetch — missing ones flip to visible=False/empty origins and stale per-origin hit counters are cleared.
  • P2 — Cloudflare Radar marked visible on HTTP success alone with no origins, so a withdrawal or MOAS/hijack still showed visible=1 and emitted no bgp_prefix_origin_visible series. Now parses result.routes for visibility and origin ASNs, and raises (source marked down) on an unsuccessful API body.

Verified with mocked-upstream unit tests for both feeds: announced → visible=True + origins (+ hits for bgp.tools); withdrawn → visible=False, origins/hits cleared.

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb383857a5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +145 to +148
# Always return success after writing the textfile. Partial OVH API
# failures are represented by ovh_collector_errors_total and should alert
# through Prometheus; failing the oneshot leaves stale/missing metrics.
return 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Alert on OVH collector errors before marking runs successful

When any /dedicated, /ip, or /me/bill request fails, this now still writes a fresh ovh_collector_last_run_timestamp and exits 0, but the alert rules only cover staleness/expiry/unpaid metrics and never consume ovh_collector_errors_total. A credential or OVH API outage can therefore replace the textfile with missing/partial billing metrics plus a fresh timestamp, so Prometheus will not page about losing expiry coverage.

Useful? React with 👍 / 👎.

routes = (realtime.get("result") or {}).get("routes") or []
origins = []
for route in routes:
asn = route.get("origin_asn", route.get("originASN"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use Cloudflare route data to extract origins

When EXTMON_BGP_CLOUDFLARE_API_TOKEN is configured, Radar realtime route objects expose the origin through meta.prefix_origins or the last element of as_path, not origin_asn/originASN, so this loop drops every Cloudflare origin while still exporting the prefix as visible. In a MOAS or hijack that only this optional Radar feed sees, AS215932UnexpectedOrigin never gets a cloudflare_radar origin series to evaluate.

Useful? React with 👍 / 👎.

- P2 OVH collector failures were unmonitored: the collector intentionally
  writes its textfile (fresh timestamp) even when /dedicated, /ip, or /me/bill
  calls fail, representing failures via ovh_collector_errors_total — but no
  alert consumed that gauge, so a credential/API outage silently dropped expiry
  coverage. Add OVHCollectorErrors (errors_total > 0 for 30m).
- P2 Cloudflare origin extraction used origin_asn/originASN, but Radar realtime
  routes expose the origin via meta.prefix_origins or the last as_path element,
  so my round-6 code dropped every origin while still marking visible. Extract
  from all known shapes (meta.prefix_origins int|dict, as_path[-1], origin_asn
  fallback).

Verified: alerts.yml loads and includes OVHCollectorErrors; unit tests cover
all four cloudflare origin shapes plus a two-origin MOAS.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Round-7 findings remedied (commit 8a3e288)

  • P2 — OVH collector failures were unmonitored: the collector deliberately writes its textfile (with a fresh timestamp) even when /dedicated, /ip, or /me/bill calls fail, representing failures via ovh_collector_errors_total — but no alert consumed that gauge, so a credential/API outage silently dropped expiry coverage (including the rtr failover IP that protects NAT64). Added OVHCollectorErrors (ovh_collector_errors_total > 0 for 30m).
  • P2 — Cloudflare origin extraction (regression from my round-6 fix): used origin_asn/originASN, but Radar realtime routes expose the origin via meta.prefix_origins or the last as_path element, so origins were dropped while the prefix still showed visible. Now extracts from all known shapes (meta.prefix_origins int|dict, as_path[-1], origin_asn fallback).

Verified: alerts.yml loads and includes OVHCollectorErrors; unit tests cover all four Cloudflare origin shapes plus a two-origin MOAS case.

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a3e288810

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +55 to +56
enableExpirationAlerts: true
enableExpirationCheckTA: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disable ROA expiry checks with Routinator VRPs

When relying on extmon for ROA-expiration coverage, these checks are wired to the rpki provider below, which is Routinator's /json API; that JSON feed does not include the per-VRP expiration field that BGPalerter's monitorROAS expiry and TA-expiry checks require. The service will therefore appear to provide ROA-expiry alerting but will not emit those alerts from this feed; either use a provider such as rpkiclient that supplies expiry data or turn the expiry checks off to avoid false coverage.

Useful? React with 👍 / 👎.

- P2 BGPalerter monitorROAS advertised ROA/TA-expiry coverage it could not
  deliver: enableExpirationAlerts/enableExpirationCheckTA need a per-VRP
  `expires` field, but the configured rpki provider (Routinator /json) does not
  emit it, so those alerts would never fire. Disable both expiry checks (and
  the now-moot roaExpirationAlertHours) to avoid false coverage; keep
  enableDiffAlerts and enableDeletedCheckTA, which do not need expiry data.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Round-8 finding remedied (commit 1ea1b76)

  • P2 — BGPalerter advertised ROA/TA-expiry coverage it couldn't deliver: monitorROAS's enableExpirationAlerts/enableExpirationCheckTA need a per-VRP expires field, but the configured rpki provider (Routinator's /json VRP export) doesn't emit it, so those alerts would never fire. Disabled both expiry checks (and the now-moot roaExpirationAlertHours) to avoid false coverage; kept enableDiffAlerts and enableDeletedCheckTA, which don't depend on expiry data. If we later want real ROA-expiry alerting on extmon it needs an expiry-capable VRP source (e.g. rpkiclient).

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ea1b76069

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +76 to +79
isTemplateJSON: true
templates:
default: '{"source":"bgpalerter","channel":"${channel}","type":"${type}","summary":"${summary}","prefix":"${prefix}","asn":"${asn}","newprefix":"${newprefix}","neworigin":"${neworigin}","earliest":"${earliest}","latest":"${latest}"}'
showPaths: 5

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep BGPalerter JSON webhooks parseable

When BGPalerter emits a path-bearing hijack/visibility/RPKI alert, this combination makes the webhook body invalid before it reaches extmon-bgp-agent: in BGPalerter v2.0.1 reportHTTP appends the top paths to context.summary when showPaths > 0 and then, with isTemplateJSON, runs JSON.parse(blob) before posting. Since ${summary} is interpolated inside a JSON string here, those alerts can be dropped by the reporter instead of updating bgpalerter_last_event_timestamp, so the realtime critical BGPalerter alert path is lost; set showPaths: 0 for the JSON webhook or emit paths in a safely escaped field.

Useful? React with 👍 / 👎.

- P2 BGPalerter JSON webhook could drop critical alerts: with isTemplateJSON,
  showPaths>0 makes BGPalerter append the top paths into ${summary} and then
  JSON.parse() the interpolated body. The unescaped path blob breaks JSON, so a
  path-bearing hijack/visibility/RPKI alert would be dropped before reaching
  extmon-bgp-agent -- never updating bgpalerter_last_event_timestamp, losing
  the realtime critical-event path. Set showPaths: 0 (the agent only consumes
  channel/type/severity; paths are unused).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Round-9 finding remedied (commit dfb45a3)

  • P2 — BGPalerter JSON webhook could silently drop critical alerts: with isTemplateJSON, showPaths > 0 makes BGPalerter append the top paths into ${summary} and then JSON.parse() the interpolated body. That path blob isn't JSON-escaped, so a path-bearing hijack/visibility/RPKI alert would fail to parse and be dropped before reaching extmon-bgp-agent — never updating bgpalerter_last_event_timestamp, losing the realtime critical-event path this PR added. Set showPaths: 0; the agent only consumes channel/type/severity, so no information the agent uses is lost.

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dfb45a36f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

rpki_max_length: 48
extmon_bgp_expected_adjacent_asns: [34872, 210233, 58057, 56755]
extmon_bgp_agent_listen: "127.0.0.1:9188"
extmon_diag_agent_listen: "127.0.0.1:9190"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Expose a reachable path to the diagnostic agent

When the Hyrule Cloud diagnostics API tries to use this new extmon probe service from the api/cloud host, this default binds it only to extmon loopback, and the extmon UFW role still denies all inbound traffic except SSH, with no tunnel or reverse proxy configured for port 9190. That means /tcp, /http, /ping, etc. will only work from the extmon host itself; add an explicit reachable listener/firewall allowance from the intended caller or configure the tunnel/proxy the cloud service should use.

Useful? React with 👍 / 👎.

- P2 diagnostic agent had no reachable path for its intended caller: it binds
  loopback and UFW denies all inbound but SSH, so the Hyrule Cloud diagnostics
  API could not reach it. Rather than expose an unhardened active-probe service
  (its input-validation hardening is gated by #361), make the access path
  explicit and opt-in: add extmon_diag_agent_allowed_sources (default []), a
  UFW rule that opens the agent port to just those callers (no-op while empty),
  and document the SSH-port-forward default + the direct-exposure procedure.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Round-10 finding remedied (commit 961c245)

ansible-lint passes (production profile) on the changed role files.

@Svaag

Svaag commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Swish!

Reviewed commit: 961c245487

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86e77751ba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


reports:
- file: reportFile
channels: [hijack, newprefix, visibility, path, misconfiguration, rpki, roa]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add a monitor for same-AS more-specifics

This subscribes the reports to newprefix, but the config never declares monitorNewPrefix, so nothing produces that channel. BGPalerter's docs distinguish this from monitorAS: monitorAS ignores prefixes already covered by a configured aggregate, while monitorNewPrefix is the monitor that catches the expected AS announcing a new more-specific (https://github.com/nttgin/BGPalerter/blob/main/docs/configuration.md#monitornewprefix). With only the /44 aggregate configured, a stray AS215932 /48 more-specific can therefore avoid the intended new-prefix alert; add monitorNewPrefix if same-AS more-specific leaks should page.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b5e9138aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

row = json.loads(line)
cidr = row.get("CIDR")
if cidr in wanted:
found[cidr] = {"asn": int(row.get("ASN", 0)), "hits": int(row.get("Hits", 0))}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve all bgp.tools origins

When bgp.tools exports multiple rows for the same CIDR (MOAS/hijack cases are documented at https://bgp.tools/kb/more-than-one-asn-per-prefix), this assignment keeps only the last row for that prefix. If the AS215932 row happens to be last, the bgp_tools metrics and ingest payload contain only origin 215932, so AS215932UnexpectedOrigin will not fire from this feed even though bgp.tools saw another origin; accumulate all ASNs/hit counts per CIDR instead of replacing earlier rows.

Useful? React with 👍 / 👎.

Comment on lines +22 to +26
- file: monitorPath
channel: path
name: path-matching
params:
thresholdMinPeers: 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wire path rules before enabling path alerts

This enables monitorPath, but the rendered prefixes.yml contains no per-prefix path: rules, and the newly added extmon_bgp_expected_adjacent_asns is never rendered anywhere. BGPalerter's monitorPath only evaluates AS_PATH regex rules declared on prefixes (https://github.com/nttgin/BGPalerter/blob/main/docs/configuration.md#monitorpath), so in the advertised path-alerting setup an unexpected upstream/downstream AS path will never produce a BGPalerter event.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant