Skip to content

feat(auth): Phase 2 — AWS Sigv4, GCP IAP, Azure AD providers (v0.11.0)#79

Merged
initializ-mk merged 22 commits into
initializ:mainfrom
naveen-kurra:pr2-cloud-native-auth-providers
May 24, 2026
Merged

feat(auth): Phase 2 — AWS Sigv4, GCP IAP, Azure AD providers (v0.11.0)#79
initializ-mk merged 22 commits into
initializ:mainfrom
naveen-kurra:pr2-cloud-native-auth-providers

Conversation

@naveen-kurra
Copy link
Copy Markdown
Collaborator

@naveen-kurra naveen-kurra commented May 24, 2026

Summary

Phase 2 of the pluggable auth provider work — three cloud-native providers
on top of the Phase 1 foundation (#77 / 7998f12). Customers authenticate
to Forge using identities they already have in their cloud; no parallel
IdP required.

Provider What it does Client token format
aws_sigv4 Caller mints a pre-signed STS GetCallerIdentity URL with their AWS SDK; Forge invokes it. STS returns the caller's canonical ARN. Same pattern as aws-iam-authenticator (EKS). Authorization: Bearer forge-aws-v1.<base64-of-presigned-sts-url>
gcp_iap Verify the JWT GCP IAP forwards as X-Goog-Iap-Jwt-Assertion when Forge sits behind a GCP HTTPS LB + IAP. X-Goog-Iap-Jwt-Assertion: <jwt>
azure_ad Verify Microsoft Entra ID Bearer tokens with tenant lock-in and optional Microsoft Graph group enrichment. Composes the Phase 1 oidc provider. Authorization: Bearer <aad-jwt>

Forge never holds any IdP secrets — all three providers verify a caller-
minted credential against a third party (STS / GCP JWKS / AAD JWKS).

Why this matters

Today, putting Forge behind any of the three big cloud IdPs requires
standing up a parallel OIDC issuer (Cognito for AWS, Workspace SAML, etc.).
This PR removes that friction:

  • AWS: any Lambda / EC2 / EKS / IAM user calls Forge using their
    existing IAM credentials. Zero secrets stored on Forge, no token endpoint
    to host.
  • GCP: Forge sitting behind GCP LB+IAP consumes IAP's signed user
    assertion directly.
  • Azure: Entra tokens (humans, CI, Azure workloads) work natively with
    AAD-specific quirks (tenant gate, groups overage) handled correctly.

Design pivot during PR review — aws_sigv4 switched to pre-signed URL pattern

The PR went through an in-flight design correction caught by real-AWS
smoke testing. The TL;DR:

What was wrong (original design — commits 9a1ebae through 382294e)

The first design had clients sign their POST to Forge using AWS Sigv4
("header reflection"). Forge would forward the signed headers to STS,
expecting STS to validate them.

This is broken in a deterministic way: Sigv4 binds its signature to
the destination host
as part of the canonicalized signing input.
Headers signed for Forge's hostname can't be replayed against STS —
STS computes host: sts.<region>.amazonaws.com, recomputes the
signature, gets a different hash, and rejects with
SignatureDoesNotMatch.

Standard tools (awscurl, boto3.client('sts'), all AWS SDKs) always
sign for the URL they're calling, so there was no working client path.

What's correct (current design — commit 8568535)

Switched to the pre-signed URL pattern that aws-iam-authenticator
uses for EKS:

  1. Caller's AWS SDK mints a pre-signed GetCallerIdentity URL — signature
    embedded in query params, signed for STS's host.
  2. Caller base64-encodes the URL, prepends forge-aws-v1., sends as a
    standard Authorization: Bearer … header.
  3. Forge decodes, validates the URL host matches sts.<region>.amazonaws.com
    (SSRF guard), GETs the URL, parses the XML response, stamps Identity.

Why this is the right design (not just a fix)

Three design properties land cleanly:

  1. Client UX matches OIDC/JWT/Azure AD. All three Phase 2 providers
    now have the same caller experience: mint a Bearer token, attach it,
    send. Three lines of Python (or any AWS-SDK-bearing language). No
    custom signing logic, no header manipulation, no per-call procedure.
  2. No re-engineering of standard tools. aws-iam-authenticator-style
    token format works with any AWS SDK's SigV4QueryAuth API.
  3. Server is simpler. Forge doesn't reflect headers; it just GETs a
    URL. ~80 LOC of STS client code instead of header-forwarding plumbing.

Locked design decisions (unchanged through the pivot)

  • §9.1 — No aws-sdk-go-v2 dependency. STS client is hand-rolled HTTP + XML.
  • §9.2azure_ad composes Phase 1 oidc.Provider; no JWT/JWKS code in azure_ad/.
  • §9.3allowed_principals uses path.Match shell globs (no regex).
  • §9.4 — IAP issuer + JWKS URL hardcoded; no operator-overridable knob.

What clients write (the actual 3 lines)

import boto3, base64
from botocore.auth import SigV4QueryAuth
from botocore.awsrequest import AWSRequest

creds = boto3.Session().get_credentials().get_frozen_credentials()
req = AWSRequest(method='GET',
                 url='https://sts.us-east-1.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15')
SigV4QueryAuth(creds, 'sts', 'us-east-1', expires=900).add_auth(req)
token = 'forge-aws-v1.' + base64.urlsafe_b64encode(req.url.encode()).rstrip(b'=').decode()

requests.post(forge_url, headers={'Authorization': f'Bearer {token}'}, data=msg)

A reference client lives at scripts/forge-aws-sign.py (~80 LOC; CLI flags
for --region, --profile, --token-only, etc.).

Note on boto3: the high-level boto3.client('sts').generate_presigned_url('get_caller_identity', ...) returns a URL STS rejects with SignatureDoesNotMatch even on direct curl — known boto3 limitation (it signs as if the request were a POST). The reference client uses the lower-level SigV4QueryAuth API directly, same as aws-iam-authenticator does internally.

✅ Real-AWS validation (live AWS account)

End-to-end validated against real AWS STS using SSO assumed-role
credentials in account 412664885516. Forge instance built from
commit 8568535, running locally.

# Scenario Expected Actual Validates
1 curl with no auth headers 401, missing_token ✅ 401 + valid bearer token required Phase 1 middleware enforces auth; review #4 reason-code preserved
2 Pre-signed URL token, empty allowlist 200 / handler-error, auth_verify fires ✅ HTTP 400 (body shape; auth path complete), auth_verify emitted STS reflection works end-to-end against real STS
3 Same token, matching allowed_principals 200 / handler-error (auth pass) ✅ HTTP 400, auth_verify fires ArnMatcher accepts matching ARNs
4 Same token, wrong-account allowed_principals 401, rejected ✅ 401 + token rejected by auth provider ArnMatcher correctly denies non-matching ARNs

Audit log emitted (Test #2 success path)

{
  "event":"auth_verify",
  "fields":{
    "method":"POST",
    "path":"/tasks/send",
    "provider":"aws_sigv4",
    "user_id":"arn:aws:sts::412664885516:assumed-role/AWSReservedSSO_PowerUserAccess_c794d5f2c2fe4370/Naveen",
    "org_id":"412664885516",
    "token_kind":"sigv4",
    "groups_count":0,
    "remote_addr":"[::1]:62448"
  }
}

Every field correct: provider matches the configured name, user_id is the STS-returned canonical ARN (assumed-role form, including session name), org_id is the AWS account number, token_kind is the new forge-aws-v1.-prefix detection.

Two bugs caught during live testing (both fixed in 8568535)

  1. URL re-encoding through net/url. Round-tripping the presigned URL through Go's *url.URL.String() re-encoded query params in ways that differed from how the AWS SDK emitted them (e.g., / in X-Amz-Credential, + inside X-Amz-Security-Token). STS recomputed the canonical request from those re-encoded bytes and rejected. Fix: PresignedToken keeps a RawURL string field with the byte-for-byte original; the parsed *url.URL is used only for SSRF host validation and query-param inspection.
  2. boto3's generate_presigned_url quirk. The reference client originally used boto3.client('sts').generate_presigned_url('get_caller_identity', …), which produces a URL STS rejects with SignatureDoesNotMatch (known boto3 quirk — signs as if POST). Fix: the reference client now uses the lower-level SigV4QueryAuth.add_auth() directly, same pattern aws-iam-authenticator uses.

Other layers also exercised live

  • Egress allowlist auto-extended to include sts.us-east-1.amazonaws.com — Forge's outbound STS call was permitted.
  • Loopback static_token still works for the dashboard at http://localhost:9999/ (auto-prepended via PrependChain — Phase 1 review Add per-agent secrets, build signing, and forge framework #10 invariant intact under Phase 2).
  • Config hot-reload picks up forge.yaml content changes for most fields. Caveat: auth-chain providers are constructed once at startup; modifying allowed_principals requires a hard restart (Ctrl-C + forge run again) for the new allowlist to take effect. Documented as a follow-up — affects all providers, not just aws_sigv4.
  • Auth-before-Egress wizard ordering (639bfa9) ensures wizard-scaffolded configs include the STS host in egress_hosts.

What was NOT live-tested (and why)

  • gcp_iap provider — requires a GCP project with HTTPS LB + IAP enabled. Covered by unit tests with a fake JWKS signer.
  • azure_ad provider — requires an Entra tenant + app registration. Covered by unit tests with a fake AAD.

Per PHASE2_TEST_STRATEGY.md §8.2, those live tests run at release-tag time, not on every PR.

What landed (12 commits)

# Commit Theme
PR 1 55942d8 Header contract extension (foundation)
PR 2 9a1ebae aws_sigv4 provider (initial — header-reflection design)
PR 3 5b71071 gcp_iap provider
PR 4 1e23140 azure_ad provider
PR 5 47c4748 Wizard + non-interactive flags + Web UI integration
PR 6 98578f9 Operator docs + CHANGELOG
fix 639bfa9 Auth step runs before Egress; auth hosts auto-added to allowlist
audit b5f303b Audit-pass refinements (iap_jwt token_kind, URL parse caching, cleanup)
docs aaf8375 Gitignore docs/auth (per reviewer feedback)
client 382294e Reference client + "client signing contract" docstring (original pattern)
pivot b3444c2 Switch aws_sigv4 to pre-signed URL pattern
live 8568535 Live-test fixes: preserve raw URL; use SigV4QueryAuth in client

Total: ~42 files, +5,500 / -700 lines net (mostly tests + the design-pivot rewrite).

Phase 1 compatibility

  • Zero forge.yaml changes required for callers using Phase 1 providers (static_token, oidc, http_verifier). Phase 1 test suite passes unmodified.
  • The auth Headers map gained one new key — X-Goog-Iap-Jwt-Assertion for gcp_iap. Existing keys unchanged.
  • The oidc package gained an internal SkipIssuerCheck field with yaml:"-" — unreachable from forge.yaml, only set by azure_ad multi-tenant. Operators see no change.

Security model highlights

  • No IdP secret on Forge. STS / GCP / AAD do the cryptographic work; Forge reflects or verifies, never holds keys.
  • SSRF guard on aws_sigv4 — the pre-signed URL host MUST match sts.<configured-region>.amazonaws.com. A token whose URL points elsewhere is rejected at parse time, before any outbound request.
  • Algorithm whitelist before key lookup on the OIDC / IAP / AAD providers — algorithm-confusion attacks rejected before reaching JWKS.
  • Tenant gate before Graph on azure_ad — multi-tenant requires explicit opt-in; Graph calls only fire after tid validation.
  • Caller's Bearer reflected to Graph — Forge holds no Graph credentials.

Known deferred work

  • TUI sub-step input flows for the three new providers — non-interactive flag path is the production-critical surface; TUI users pick "Custom" and edit forge.yaml directly until the TUI follow-up lands.
  • Hot-reload of auth chain — config edits to auth.providers (incl. allowed_principals) require a hard forge run restart. Affects all providers, not just Phase 2.

Test plan

  • Unit tests: happy path, malformed input, scope/tenant rejection, allowlist miss, cache hit/miss, stale-grace, soft-fail, alg confusion, body caps, redirect attacks — all three new providers
  • Pre-signed URL parser fuzz test (corpus-seeded, runs in CI)
  • Phase 1 regression suite passes unmodified
  • go test -race -count=1 ./... — 42 packages green
  • golangci-lint v2.10.1 — 0 issues
  • gofmt -l forge-core forge-cli forge-plugins — clean
  • Anti-pattern grep gates (no aws-sdk-go import, IAP constants confined, no JWT in azure_ad, skip_issuer_check never in YAML) — all pass
  • Live AWS validation — happy + deny paths against real STS (see "Real-AWS validation" above)
  • Manual smoke for gcp_iap and azure_ad — runs at release-tag time per PHASE2_TEST_STRATEGY.md §8.2

Design artifacts (offline)

Full design package in ~/Desktop/forge_designs_and_PRD/phase2_implementation/:

  • PHASE2_CLOUD_NATIVE_PROVIDERS.md — top-level design + §9 locked decisions
  • PHASE2_PROGRESS_MAP.md — diagram-to-PR tracker
  • PHASE2_TEST_STRATEGY.md — pyramid, harnesses, security catalog, CI gates, manual smoke
  • 7 PlantUML diagrams (component / class / startup / request flow / per-provider deep-dives)
  • PR1_HEADER_CONTRACT.md through PR6_DOCS.md — per-PR checklists with code sketches and acceptance criteria

Naveen Kurra added 12 commits May 23, 2026 19:19
…r 1)

HeadersFromRequest gains Authorization, X-Goog-Iap-Jwt-Assertion,
X-Amz-Date, X-Amz-Security-Token so future providers consuming
non-Bearer formats (aws_sigv4, gcp_iap) can read what they need
without changing the Provider.Verify signature.

TokenKind recognizes the "AWS4-HMAC-SHA256 " prefix and returns
"sigv4", so audit logs can distinguish Sigv4 requests from "empty"
even though the Bearer extractor returns "".

Middleware now consults the chain even when no Bearer token was
extracted, provided a non-Bearer auth header is present (Sigv4
Authorization or IAP assertion). When NO auth headers at all are
present, the audit reason still resolves to ErrMissingBearer —
preserving review initializ#4's stable "missing_token" reason code.

Phase 1 providers see zero behavior change; their Verify path is
unchanged. All Phase 1 tests pass without modification.
…e 2 pr 2)

aws_sigv4 authenticates AWS-IAM callers by reflecting their Sigv4
signature to STS GetCallerIdentity. No aws-sdk-go-v2 dependency
(decision §9.1): the STS RPC is ~150 LOC of hand-rolled HTTP +
XML. Forge never holds the caller's secret key — STS validates
the signature on Forge's behalf.

Key pieces:
- sigv4_parser.go: pure string parser, fuzz-tested, never panics
- sts_client.go: 200/4xx/5xx classification per review initializ#6 contract
- identity_cache.go: hash(AKID|YYYYMMDD)-keyed TTL cache, opportunistic
  eviction past 10k entries, Put does NOT extend prior expiry
- arn_matcher.go: shell-style globs via path.Match (decision §9.3),
  invalid patterns fail at Factory time
- provider.go: scope check (service=sts, region match) before any
  STS round trip, cache hit avoids RPC, rejection does NOT poison
  the cache

security:
- Algorithm: only AWS4-HMAC-SHA256 prefix is claimed
- Scope: cross-service replay (s3->sts) and cross-region replay
  (eu-west-1->us-east-1) rejected at parse-time
- Cache: bucketing by YYYYMMDD bounds stolen-key window to a day
- Body cap: 64 KiB on STS responses
- Logs: STS error bodies summarized at 200 chars, newlines stripped

audit:
- ErrTokenNotForMe   -> not_for_me   (no AWS4 prefix)
- ErrInvalidToken    -> invalid      (malformed Sigv4)
- ErrTokenRejected   -> rejected     (scope/allowlist/STS 4xx)
- ErrProviderUnavailable -> provider_unavailable (STS 5xx/network)

extras:
- security.AuthDomains gains sts.<region>.amazonaws.com (+ override
  host when sts_endpoint set for tests)
- forge-cli/runtime/runner.go side-effect imports aws_sigv4
gcp_iap consumes the X-Goog-Iap-Jwt-Assertion header that GCP's
Identity-Aware Proxy forwards on every authenticated request when
Forge sits behind a GCP HTTPS load balancer with IAP enabled.

Decision §9.4: IAP issuer (https://cloud.google.com/iap) and JWKS
URL (https://www.gstatic.com/iap/verify/public_key-jwk) are
hardcoded. They're the only stable contract GCP exposes; an
override knob would be a footgun.

key pieces:
- iap_jwks.go: ES256-only JWKS cache, TTL refresh + backoff +
  stale-grace (mirrors Phase 1 OIDC review initializ#1 pattern)
- provider.go: header-presence check, claims projection, iss/aud
  gates, sub/email required-claims check
- parseECJWKSet drops non-EC / non-P-256 / non-ES256-labeled keys
  during parse — defense in depth against compromised JWKS
- alg whitelist rejects RS256 BEFORE key lookup (algorithm-
  confusion defense)
- aud as string OR array both parse (JWT spec allows either)
- audit reasons follow Phase 1 contract:
    rejected             — iss/aud mismatch, expired, bad signature
    invalid              — alg != ES256, missing sub/email, bad kid
    provider_unavailable — JWKS fetch failed AND no prior key cached
    not_for_me           — header absent

extras:
- security.AuthDomains returns www.gstatic.com when gcp_iap is configured
- forge-cli/runtime/runner.go side-effect imports gcp_iap
azure_ad authenticates Microsoft Entra ID tokens. Composes the
Phase 1 oidc.Provider (decision §9.2) for signature verify + base
claim validation; layers AAD-specific concerns on top:
  - Tenant lock-in via the tid claim
  - Optional Microsoft Graph group enrichment when JWT groups claim
    is empty (AAD truncates at ~200 groups)
  - Single-tenant vs multi-tenant issuer template

key pieces:
- provider.go: composed oidc + tenant gate + Source overwrite to
  "azure_ad" (replaces the inner "oidc" stamp)
- tenant.go: ExtractTenantID — typed accessor for the tid claim
- graph_client.go: Graph /me/transitiveMemberOf with pagination,
  same-host enforcement (rejects redirect attacks), 401/403 ->
  ErrTokenRejected, 5xx -> ErrProviderUnavailable, defensive
  cap at 5000 groups, body cap 1 MiB per page
- graph_cache.go: 5 min TTL, same shape as aws_sigv4's cache

key decisions:
- oidc.Config gains internal SkipIssuerCheck flag with yaml:"-"
  so it CANNOT be set via forge.yaml — only callable from another
  Go package. AAD multi-tenant uses it; everything else leaves it
  off. Surfacing it in YAML would let operators disable iss
  validation by accident.
- Soft-fail on Graph 5xx/401: Identity returned with empty Groups
  rather than blocking prod traffic. Hard-fail mode (graph_required)
  out of scope for v0.11.
- Forge reflects the CALLER's Bearer to Graph; holds no Graph
  credentials of its own.

audit reasons:
- ErrTokenRejected     -> rejected (tid mismatch, bad sig, Graph 401)
- ErrInvalidToken      -> invalid  (missing tid, malformed claims)
- ErrProviderUnavailable -> provider_unavailable (Graph 5xx, JWKS down)

extras:
- security.AuthDomains returns login.microsoftonline.com always;
  graph.microsoft.com when groups_mode=graph
- forge-cli/runtime/runner.go side-effect imports azure_ad
…(phase 2 pr 5)

Wires aws_sigv4, gcp_iap, and azure_ad into the operator surfaces:

cli (forge-cli/cmd/init*.go):
- New non-interactive flags namespaced --auth-aws-* / --auth-gcp-iap-* /
  --auth-azure-* (StringSlice for repeatable allowed-principal globs)
- buildAuthFromFlags validates required combinations and emits the
  right egress hosts per provider (sts.<region>.amazonaws.com,
  www.gstatic.com, login.microsoftonline.com + graph.microsoft.com
  when groups_mode=graph)
- authEgressHostsFromSettings mirrors the same logic for the Web UI
- renderAuthBlock supports []string lists with proper YAML quoting
  (allowed_principals)

web ui (forge-ui/handlers_create.go):
- AuthProviderTypeMeta lists the three new types with helpful labels

validate (forge-core/validate/auth.go):
- knownAuthProviderTypes admits aws_sigv4 / gcp_iap / azure_ad
- validateProviderSettings enforces per-type required keys
  (aws_sigv4.region, gcp_iap.audience, azure_ad.audience +
   tenant_id-unless-multi-tenant, azure_ad.groups_mode whitelist)

tests:
- 11 new renderer + flag-parsing tests
- Round-trip YAML parse used instead of brittle quote-pattern asserts
- Updated wizard-meta test to expect 7 auth provider types

deliberate scope cut:
- TUI step_auth.go sub-step input flows for the 3 new providers are
  NOT included. Adding them is mechanical (~100 LOC per provider,
  mirroring the OIDC issuer→audience→groups_claim phase chain) but
  out of scope for v0.11.0 cut. Non-interactive flag path covers the
  production-critical CI/CD case; operators using the TUI can pick
  "Custom" and edit forge.yaml directly until the follow-up lands.
…pr 6)

Adds the operator-facing documentation for the three Phase 2 providers
that shipped in PRs 1–5, plus a top-level auth index, chain-semantics
concepts page, CHANGELOG, and a README link.

new docs:
- docs/auth/index.md — provider matrix and chain-semantics overview
- docs/auth/concepts/chain.md — first-match-wins, no-fall-through on
  reject, non-Bearer header support, mixed-chain worked example
- docs/auth/providers/aws_sigv4.md — STS reflection setup, awscurl
  example, assumed-role-vs-IAM-role gotcha called out twice
- docs/auth/providers/gcp_iap.md — backend service ID lookup steps,
  hardcoded JWKS rationale, GCP IAM Conditions for allowlisting
- docs/auth/providers/azure_ad.md — app registration walkthrough,
  single/multi/graph mode configs, multi-tenant warning prominent

every provider doc includes:
- Prerequisites checklist
- forge.yaml example
- Configuration reference table
- Audit log shape (literal JSON)
- Troubleshooting matrix (grep-able reason codes)
- Security model + limitations sections

CHANGELOG.md (new file):
- Lists Added / Changed entries for v0.11.0
- "Notes for upgraders" makes the non-breaking nature explicit
- Calls out the known TUI sub-flow gap from PR 5

README.md:
- Adds Auth Providers row to the Security documentation table
…owlist

The wizard was asking for Egress confirmation before the operator had
picked an auth provider, so STS / AAD authority / IAP JWKS hosts never
appeared in the egress list. Forge would scaffold a forge.yaml whose
egress_hosts blocked its own auth-provider RPC calls — failure happens
later at `forge run`, with no signal the wizard could have caught.

changes:
- Swap step order in init.go: Auth now runs immediately before Egress
- Extend DeriveEgressFunc with (authMode, authSettings) so the Egress
  step's Prepare(ctx) pulls the operator's auth choice from
  WizardContext and forwards it into deriveEgressDomains
- deriveEgressDomains calls authEgressHostsFromSettings (same helper
  the non-interactive --auth=… path uses) — TUI and CLI now produce
  identical egress lists for any given auth choice
- EgressStep's inferSource() learns to label auth-derived hosts:
    sts.<region>.amazonaws.com → "aws_sigv4 auth"
    www.gstatic.com           → "gcp_iap auth"
    login.microsoftonline.com → "azure_ad auth"
    graph.microsoft.com       → "azure_ad auth (graph)"
    <oidc issuer host>        → "oidc auth"
    <http_verifier url host>  → "http_verifier auth"

tests:
- TestDeriveEgressDomains_AuthProviderHostsMerged: 8 cases pinning the
  per-provider host emission (incl. graph-mode adds graph host)
- TestDeriveEgressDomains_AuthHostsMergeNotOverwrite: auth pass is
  additive — provider / channel hosts still emit alongside auth hosts

docs:
- docs/auth/concepts/chain.md gains a "TUI wizard ordering" section
  explaining the Auth-before-Egress invariant
…, cleanup

Final-pass audit findings against the phase 2 design doc surfaced one
correctness bug and several small improvements. All gates clean
(go test -race / golangci-lint / gofmt). 42 packages pass.

BUG fix — middleware emits token_kind="iap_jwt" for IAP requests:
  The strategy doc §5/§10 lists five token_kind values: empty, opaque,
  jwt, sigv4, iap_jwt. PR1 wired sigv4 detection but missed iap_jwt,
  so successful GCP IAP requests audited with token_kind="empty" — the
  same value as no-auth requests, defeating the audit-pipeline goal of
  counting IAP traffic distinctly. Middleware now classifies
  X-Goog-Iap-Jwt-Assertion presence as kind="iap_jwt" on the
  empty-Bearer path. New regression test pins it.

Improvement — graph_client.go avoids per-page URL re-parse:
  ensureGraphHost was parsing GraphClient.endpoint via url.Parse on
  EVERY pagination step. Pre-parse the endpoint Host once at construction
  and compare against that string instead. Trims redundant work on
  multi-page Graph responses.

Improvement — gcp_iap classifyJWTErr ordering hardened:
  Replaced the bare substring match on "kid" (which would catch unrelated
  errors) with the specific patterns: "kid " (e.g. "kid X not found")
  and "not found" (covers JWKS-resolution failures). Pre-existing
  ordering invariant comment is now actually defended.

Cleanup — drop redundant single-function file:
  Moved ExtractTenantID from azure_ad/tenant.go into provider.go alongside
  other claim accessors and removed the empty tenant.go. The function
  was a 1-liner and didn't justify its own file.

Cleanup — inline audienceContains shim:
  Replaced the audienceContains() wrapper (one-liner around slices.Contains)
  with a direct call at the use site. Less indirection, same behavior.

Cleanup — middleware: simplify hasNonBearerAuth boolean expr:
  Folded the multi-line if-chain into a single boolean expression. Same
  semantics, less noise.

audit findings deferred as nits, not fixed:
- aws_sigv4 Parser as zero-value struct (cosmetic; keeps symmetry)
- egress_step.go hostOf manual URL parsing (cosmetic; non-hot path)
- 10k eviction comment wording

audit findings confirmed not bugs:
- GraphCache TTL test (already exists in graph_cache_test.go)
- PrependChain loopback invariant intact (runner.go line 2036)
The Phase 2 provider docs were committed as MD files under docs/auth/
but we don't want to version-control them — the source-of-truth lives
in the design folder, and we'll deliver via the doc site separately.

- .gitignore: add docs/auth/
- git rm --cached docs/auth/** (local files preserved)
- README.md: drop the now-broken "Auth Providers" docs row
- CHANGELOG.md: drop the docs/auth/*.md links from the v0.11.0 entry

No code or test changes.
…ontract

Real-AWS testing surfaced a documentation gap: callers cannot use raw
`awscurl` / `aws-sdk-go` against Forge's `aws_sigv4` provider because
Sigv4 binds the signature to the destination host. Standard tools sign
for the URL they're addressing (Forge) — STS then rejects the reflected
signature because the host bytes don't match.

The server-side code is correct. The client just needs to sign a
hypothetical STS request, then attach the resulting headers to its real
POST to Forge. Same pattern as aws-iam-authenticator for EKS.

This commit:
- Ships `scripts/forge-aws-sign.py`, a ~100 LOC reference client using
  boto3.session + SigV4Auth. CLI flags for --region, --url, --profile,
  --body, --verbose. Reads SSO/IRSA/profile/env credentials via boto3's
  standard chain.
- Extends the package-level docstring in
  `forge-core/auth/providers/aws_sigv4/provider.go` with a
  "Client-side signing contract" section spelling out the 4-step pattern
  and pointing readers to the reference script.
- Adds a "Client-side requirement" section to CHANGELOG.md so adopters
  know to grab the helper or write their own before integrating.

Validated against real AWS:
- STS reflection: 200, identity stamped, correct ARN/Account/UserID
- ARN allowlist match: 200 (matching pattern)
- ARN allowlist miss: 401 reason=rejected (correct authz gate)
- No-auth: 401 reason=missing_token (Phase 1 contract preserved)
…uthenticator)

Phase 2 PR 2's original "reflect Sigv4 headers" design was broken in
the obvious way: Sigv4 binds its signature to the destination host as
part of the canonicalized signing input. Headers signed for Forge's
host could not be replayed against STS — STS sees host:sts.<region>.
amazonaws.com, recomputes the signature, gets a different hash,
rejects with "SignatureDoesNotMatch". Caught during real-AWS smoke;
documented in PR initializ#79 description.

This commit replaces the pattern with the same approach
aws-iam-authenticator uses for EKS:

  Client (3 lines):
    url   = boto3.client('sts').generate_presigned_url('get_caller_identity', ExpiresIn=900)
    token = 'forge-aws-v1.' + base64.urlsafe_b64encode(url.encode()).rstrip(b'=').decode()
    requests.post(forge_url, headers={'Authorization': f'Bearer {token}'}, ...)

  Server:
    Authorization: Bearer forge-aws-v1.<base64-of-presigned-sts-url>
    → decode + validate host (SSRF guard) + GET on the URL → STS → identity

Net effect on caller experience: identical to JWT/OIDC/azure_ad —
"mint token, send Bearer, done." Three lines of client code, hidden in
~15 lines of any AWS SDK in any language.

what changed:

  forge-core/auth/providers/aws_sigv4/
    sigv4_parser.go  — was parsing AWS4-HMAC-SHA256 Authorization header
                       now parses forge-aws-v1.<base64-url> Bearer tokens
                       (URL host validation, SSRF guard, X-Amz-Credential
                       parsing for cache key derivation)
    sts_client.go    — was POST with reflected headers
                       now GET on the pre-signed URL; same 200/4xx/5xx
                       classification and 64 KiB body cap
    provider.go      — Verify() now reads the Bearer token (not raw
                       headers); SSRF guard via expectedHost field;
                       same cache + ARN allowlist semantics

  forge-core/auth/
    provider.go      — HeadersFromRequest reverts X-Amz-Date and
                       X-Amz-Security-Token (no longer needed); keeps
                       X-Goog-Iap-Jwt-Assertion for gcp_iap
    provider.go      — TokenKind detects "forge-aws-v1." prefix → "sigv4"
                       (was: "AWS4-HMAC-SHA256 " on raw Authorization)
    middleware.go    — simplify: empty-Bearer fallback only handles IAP
                       (aws_sigv4 rides standard Bearer flow now)

  scripts/forge-aws-sign.py — rewrite as a clean reference client.
    --token-only: print just the token for use with curl/other tools
    Otherwise: do the round-trip POST and print the response

  CHANGELOG.md — replace "client wrapper required" friction note with
    the 3-line happy path snippet

what stays unchanged:

  - forge.yaml shape (still type: aws_sigv4, region:, allowed_principals:)
  - identity_cache.go, arn_matcher.go (cache and authz logic untouched)
  - security.AuthDomains (sts.<region>.amazonaws.com derivation)
  - forge-cli/cmd/init* flag set and renderer
  - validate.ValidateAuthConfig (region still required)
  - forge-ui/handlers_create.go (AuthProviderTypeMeta entry)

Tests: 42 packages pass, golangci-lint v2.10.1 clean, gofmt clean,
no aws-sdk-go imports (decision §9.1 still holds).

Net diff: +732 / -625 lines (mostly test rewrites; ~80 LOC net less
in the provider package because the new flow is structurally simpler).
…n client

Two correctness fixes surfaced by live AWS testing of the pre-signed URL
pattern from b3444c2.

1. Preserve the raw URL byte-for-byte.

Round-tripping the presigned URL through Go's net/url package
re-encoded query params in subtle ways (e.g. "/" in X-Amz-Credential,
"+" inside X-Amz-Security-Token) that didn't match how the AWS SDK
emitted them on the caller side. STS recomputes the canonical request
using whatever bytes we send and gets a different hash → 4xx
SignatureDoesNotMatch → audit reason "rejected".

  - PresignedToken gains a RawURL field — the exact bytes from the
    decoded token payload.
  - The parsed *url.URL is kept ONLY for SSRF host validation and
    query-param inspection. It is NEVER used to construct the outbound
    request.
  - Provider.Verify now passes parsed.RawURL to STSClient.GetCallerIdentity.

2. Use SigV4QueryAuth directly in the reference client (not boto3's
   high-level generate_presigned_url).

boto3.client('sts').generate_presigned_url('get_caller_identity', ...)
produces a URL STS rejects with SignatureDoesNotMatch when GET. Known
quirk — the high-level presigner signs as if the request were a POST.
aws-iam-authenticator works around this by signing the AWSRequest
explicitly; scripts/forge-aws-sign.py now does the same:

    req = AWSRequest(method='GET', url='https://sts.{region}.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15')
    SigV4QueryAuth(creds, 'sts', region, expires=900).add_auth(req)
    token = 'forge-aws-v1.' + base64.urlsafe_b64encode(req.url.encode()).rstrip(b'=').decode()

Live validation against real AWS (account 412664885516, SSO assumed-role):
- Happy path: HTTP 400 body-shape error + auth_verify with correct ARN
- Deny path:  HTTP 401 + auth_fail reason="rejected" + token_kind="sigv4"

42 packages still pass; golangci-lint clean; gofmt clean.

(Known follow-up surfaced but out of scope: hot-reload of forge.yaml
doesn't rebuild the auth chain, so allowlist changes require a hard
restart. Same caveat affects all providers, not just aws_sigv4.)
@naveen-kurra naveen-kurra requested a review from initializ-mk May 24, 2026 15:58
Naveen Kurra added 2 commits May 24, 2026 12:02
Two paired additions that make Phase 2 provider onboarding actually
usable end-to-end via the TUI.

## allowed_accounts (ergonomic shortcut)

aws_sigv4.Config gains AllowedAccounts []string. Each 12-digit AWS
account ID expands at Factory time into the canonical glob set
covering every STS identity shape:

  arn:aws:iam::<acct>:user/*           — direct IAM users
  arn:aws:iam::<acct>:role/*           — direct IAM roles
  arn:aws:sts::<acct>:assumed-role/*/* — SSO, AssumeRole, IRSA
  arn:aws:sts::<acct>:federated-user/* — SAML / web-identity federation

So an operator who wants "anyone in this account" writes one line of
config instead of four globs. Composes with allowed_principals — list
specific roles AND whole accounts in the same provider entry.

Validation:
  - validateAccountID checks the 12-digit shape at Factory time
  - validate.ValidateAuthConfig catches malformed entries at
    `forge validate` time (before scaffold writes forge.yaml)

## TUI sub-flows for all three Phase 2 providers

forge init's TUI picker now has 7 entries (was 4):
  None / OIDC / HTTP Verifier / AWS Sigv4 / GCP IAP / Azure AD / Custom

Input flows:
  aws_sigv4   region → audience (opt) → accounts (opt) → done
  gcp_iap     audience → done
  azure_ad    tenant → audience → done  (single-tenant only)

Azure AD intentionally restricts the TUI to single-tenant. Enabling
allow_multi_tenant is a deliberate security trade-off (any Entra tenant
in the world is admissible) and should require editing forge.yaml, not
clicking through a wizard.

Egress hosts auto-computed from the selection:
  aws_sigv4 → sts.<region>.amazonaws.com
  gcp_iap   → www.gstatic.com  (hardcoded §9.4)
  azure_ad  → login.microsoftonline.com

So the Egress review step (which runs after Auth per 639bfa9) shows
the operator the full outbound surface they're about to allow,
including the auth-provider's STS / IAP / AAD endpoints.

## Other surfaces

forge init --auth-aws-allowed-account flag (repeatable) — matches the
non-interactive path the wizard takes internally.

CHANGELOG: drops the "TUI deferred" note (no longer deferred), adds
sections for allowed_accounts and the TUI sub-flows.

## Tests

  arn_matcher_test.go     +3 tests (validateAccountID, expansion shapes,
                                    expanded patterns match realistic ARNs)
  provider_test.go        +4 tests (AllowedAccounts happy path, deny path,
                                    factory rejects malformed, mix with
                                    AllowedPrincipals)
  step_auth_test.go       +6 tests (AWS full flow, AWS optional skips,
                                    AWS bad account rejection, GCP flow,
                                    AAD flow, Phase 2 summaries)
  validate/auth_test.go   updated validator covers allowed_accounts

42 packages pass go test -race -count=1; golangci-lint v2.10.1 clean;
gofmt clean.
Live wizard test surfaced a YAML emission bug: AWS account IDs like
"412664885516" were rendered unquoted, e.g.

    allowed_accounts:
      - 412664885516

yaml.v3 decodes that as !!int → cannot unmarshal into []string in
the aws_sigv4 provider's Config.AllowedAccounts field → provider
construction fails at startup.

needsYAMLQuoting now treats any all-digit string as requiring
quotes. The renderer emits the correct form:

    allowed_accounts:
      - "412664885516"

Fix is general — applies to anything in the auth-settings schema
that's a digit string (ZIP-shaped IDs, version segments, etc.) so
this same bug class can't surface for a new provider later.

Live-validated after the fix:
- auth_verify on the matching allowed_accounts (happy)
- auth_fail reason=rejected on a wrong account (deny)
@initializ-mk
Copy link
Copy Markdown
Contributor

Phase 2 review — three providers, focused on the outbound-HTTP defense-in-depth gaps

I ran a security-focused audit across the four surfaces: aws_sigv4, gcp_iap, azure_ad, and the middleware/CLI/UI wiring. Tests pass with -race across all 9 affected packages, golangci-lint clean across all three modules, gofmt clean. The crypto choices, sentinel-error contract, loopback prepend invariant, and SkipIssuerCheck YAML-unreachability all hold up.

The pattern in the BLOCKERs below is the same in three places: the outbound *http.Client is constructed with &http.Client{Timeout: …} and nothing else. That leaves Go's default redirect policy (follow up to 10), which silently undermines the SSRF / same-host guard that each provider does at the application layer. The repo already has security.SafeRedirectPolicy(10) and uses it in the webhook/http_request paths — wire it (or http.ErrUseLastResponse) into these three clients.


BLOCKERS (please fix before merge)

B1. azure_ad/graph_client.go:138ensureGraphHost accepts http:// nextLink
ensureGraphHost only compares got.Host; the scheme is not checked. A @odata.nextLink: \"http://graph.microsoft.com/...\" passes the gate. The next iteration sets Authorization: <caller's bearer> and sends it in plaintext. Go strips Authorization on cross-host redirects but NOT on cross-scheme downgrades to the same host. Fix: also require got.Scheme == \"https\" (or == c.endpointScheme so test mode still works).

B2. azure_ad/graph_client.go:51 — Graph client follows arbitrary HTTP redirects
&http.Client{Timeout: timeout} uses the default CheckRedirect. The application-layer ensureGraphHost only runs on @odata.nextLink (line 73), not on HTTP 301/302/307. A 302 from Graph (or a MITM with a cert that ever cracks) would silently follow, parse an attacker body, and yield to JSON unmarshal. Fix: set CheckRedirect: func(*http.Request, []*http.Request) error { return http.ErrUseLastResponse } — Graph's nextLink mechanism makes HTTP redirects unnecessary.

B3. aws_sigv4/sts_client.go:33 — STS client follows arbitrary HTTP redirects
Same root cause. The parser-side host validation (sigv4_parser.go:88) only covers the first hop; once http.Client.Do follows a redirect off sts.<region>.amazonaws.com, an attacker-controlled body becomes the parsed STS XML and controls Identity.UserID / OrgID / Arn. The token-host gate is exactly the SSRF guard the design relies on; redirect-following silently invalidates it. Fix: pin CheckRedirect to ErrUseLastResponse on this client.


MAJOR (should land in this PR or a fast follow-up)

M1. aws_sigv4/sigv4_parser.go:88 — no rejection of URL userinfo
https://user:pass@sts.us-east-1.amazonaws.com/?... passes the EqualFold(u.Host, expectedHost) check (u.Host excludes userinfo per RFC 3986). Forge then ships Basic-auth bytes from an attacker-controlled token to STS. STS ignores it, but it's still attacker bytes leaving the box. Add if u.User != nil { return error }.

M2. aws_sigv4/sigv4_parser.go — no X-Amz-Date freshness or X-Amz-Expires sanity
STS enforces a ~15min window, but the cache key is hash(AKID|YYYYMMDD), so a stolen-token replay window relies entirely on STS's own enforcement. Reject X-Amz-Expires > 900 and an X-Amz-Date older than ~15min parser-side as belt-and-braces.

M3. azure_ad/provider.go:153 — multi-tenant has no allowed_tenants allow-list
When allow_multi_tenant: true, both the iss check (via SkipIssuerCheck) AND the tid check are suppressed — any Entra tenant in the world verifies. This is the documented trade-off, but the spec at resolveIssuer line 181 says "azure_ad enforces tenant via the tid claim instead," which is misleading: in multi-tenant mode there's no tid enforcement at all. Either add an allowed_tenants: []string knob, or fix the doc to be loud about "any tenant globally."

M4. forge-core/auth/middleware.go:110token_kind=\"iap_jwt\" only recorded when Bearer is missing
The kind := TokenKind(token) runs first; the IAP upgrade only triggers when kind == \"empty\". A request with a real Bearer JWT plus an IAP assertion records kind=\"jwt\", masking IAP-fronted traffic in audit dashboards. Audit-pipeline goal of counting IAP traffic distinctly is defeated. Fix: upgrade to iap_jwt whenever the IAP header is present AND the chain verifies via the gcp_iap provider.

M5. forge-core/validate/auth.go:64validateProviderSettings permissive on unknown keys
A typo like aud: instead of audience: slips through silently — asString(settings, \"audience\") returns "" and the required-keys check misses it. Combined with handlers_create.go forwarding Settings unfiltered into the on-disk scaffold, the Web UI POST {\"settings\": {\"audience\": \"x\", \"evil_key\": \"y\"}} produces a forge.yaml with evil_key in it. Today every field has yaml:\"-\" discipline so this is harmless — but it's one missed tag away from being exploitable. Add a closed-key whitelist + warn on unknown.


NITs / follow-ups (don't block merge)

  • gcp_iap/iap_jwks.go:285classifyJWTErr substring matching is brittle. Prefer errors.Is(err, jwt.ErrTokenExpired) etc. via golang-jwt v5's named sentinels.
  • gcp_iap/iap_jwks.go:213j.keys = keys unconditionally replaces during refresh. GCP rotates with overlap so today this is fine, but a partial-but-valid JWKS body (>=1 key but missing an in-flight kid) would drop kids the stale-grace contract assumes are kept. Consider merge-on-success or at least logging the diff.
  • azure_ad/graph_cache.go:33Get returns the cached slice without copy; any caller that mutates id.Groups mutates the cache.
  • forge-cli/cmd/init_auth.go:338needsYAMLQuoting catches all-digit (the 3364ca8 fix) but misses scientific-notation (1e10), hex (0x10), leading-zero (010), and .inf/.nan. Auth-setting values rarely look like these, but the docstring says "false negatives are bugs."
  • aws_sigv4/identity_cache_test.go:69string(rune(i)) for surrogate code points yields \\ufffd and collides keys. Use strconv.Itoa(i) so the threshold test actually reaches 10,001.
  • Several files: req, _ := http.NewRequestWithContext(...) discards the error. Hardcoded URLs make the panic risk currently zero, but it's an antipattern in security-critical code.
  • gcp_iap/provider_test.go — add an HS256-with-EC-public-key-as-secret test case (most dangerous alg-confusion shape, currently only RS256 covered).

Verdict

Needs revision before merge — three BLOCKERs (B1–B3) are each 1–3 line CheckRedirect/scheme-check fixes plus tests. None require architectural change; all three undermine the application-layer SSRF / same-host guards that the design correctly identifies as the security boundary. The MAJORs (M1–M5) are higher signal-to-effort fixes and should land alongside.

What's not a concern:

  • SkipIssuerCheck is genuinely unreachable from YAML (grep + behavior verified).
  • Headers.Get is case-insensitive (forge-core/auth/provider.go:77); the lowercase-IAP-header concern is cleared.
  • ✅ Loopback static_token PrependChain still wraps Phase 2 providers; localhost dashboard works.
  • ✅ Phase 1 backward compat — zero forge.yaml changes required, Phase 1 tests pass unmodified.
  • ✅ STS XML root-element enforcement (Go's encoding/xml honors the XMLName xml.Name tag).

Live AWS validation in the PR description is excellent — that level of evidence for the happy and deny paths is exactly the right bar. Once the three redirect fixes land, this is ready.

@initializ-mk
Copy link
Copy Markdown
Contributor

@naveen-kurra also run /sync-docs command/plugin for Claude to sync the document updates based on the changes

Naveen Kurra added 8 commits May 24, 2026 15:16
Phase 2 review (PR initializ#79) flagged three same-root-cause issues across the
outbound HTTP clients in azure_ad, aws_sigv4, and gcp_iap:

- B1: ensureGraphHost accepted http:// nextLinks — caller's Bearer would
  leak in plaintext (Go strips Authorization on cross-host redirect but
  NOT on https→http same-host downgrade).
- B2: Graph client used Go's default redirect policy (follow up to 10).
  ensureGraphHost only validates @odata.nextLink, not HTTP 301/302/307.
  A redirect from Graph → attacker URL would let the attacker control
  the JSON body that becomes Identity.Groups (latent until Phase 4
  authz lands; closing at the boundary now).
- B3: aws_sigv4's STS client had the same default-redirect issue.
  The parser-side same-host gate covers only the first hop; a redirect
  off sts.<region>.amazonaws.com would let attacker bytes become the
  parsed STS XML and control Identity.UserID/OrgID/Arn.

Verified all three are real (Go default CheckRedirect follows up to 10
per net/http docs). Smoking gun confirmed in five outbound clients
across forge-core/auth/providers/:

  aws_sigv4/sts_client.go        ← fixed (B3)
  gcp_iap/iap_jwks.go            ← fixed (not in review; same pattern,
                                    same blast radius — attacker JWKS
                                    means forged tokens verify)
  azure_ad/graph_client.go       ← fixed (B1+B2)
  oidc/provider.go               ← Phase 1 code, NOT touched here;
                                    OIDC discovery sometimes legitimately
                                    redirects, needs design discussion.
                                    Filed as follow-up on issue initializ#80.
  httpverifier/provider.go       ← Phase 1 code, NOT touched here;
                                    operator-configured URL may be
                                    behind LB. Follow-up on issue initializ#80.

Fix shape (uniform across the three Phase 2 providers):

  http.Client{
    Timeout: timeout,
    CheckRedirect: func(*http.Request, []*http.Request) error {
      return http.ErrUseLastResponse
    },
  }

These three endpoints (STS GetCallerIdentity, IAP JWKS, Graph
transitiveMemberOf) NEVER legitimately issue 3xx — the auth contract
each provider validates is bound to the configured host. Returning
ErrUseLastResponse causes the 3xx response to flow into our existing
status-code switch, where the "unexpected status" default arm maps it
to ErrProviderUnavailable (and the audit logs "provider_unavailable",
not a misleading "rejected").

B1 specific (separate from the redirect-policy change):

  ensureGraphHost now checks both Host AND Scheme. GraphClient stores
  endpointScheme alongside endpointHost, pre-parsed at construction so
  the per-page check stays cheap. Test mode (httptest's http://) still
  works because the configured scheme is matched against, not hard-
  coded https.

Regression tests added (one per provider):

  TestSTSClient_DoesNotFollowRedirects
  TestJWKSCache_DoesNotFollowRedirects
  TestGraphClient_DoesNotFollowRedirects

Each spins up an httptest server that returns 302 → attacker URL,
runs the client, asserts:
  - error returned (not silently followed)
  - error is ErrProviderUnavailable (correct audit reason)
  - endpoint hit exactly once (counts redirect-follows)

Plus two scheme-specific cases for B1:
  TestEnsureGraphHost_RejectsSchemeDowngrade   — https→http same-host
  TestEnsureGraphHost_TestModeHTTPOK           — httptest http stays OK

42 packages pass go test -race -count=1; golangci-lint v2.10.1 clean;
gofmt clean.
Phase 2 review M1: a token whose pre-signed URL contained userinfo
(https://user:pass@sts.us-east-1.amazonaws.com/...) would pass our
EqualFold(u.Host, expectedHost) check because RFC 3986 separates
userinfo from host — net/url puts "user:pass" into u.User and
"sts.us-east-1.amazonaws.com" into u.Host independently.

http.Client.Do then synthesizes Authorization: Basic <b64(user:pass)>
from u.User and ships those attacker-controlled bytes to STS. STS
ignores Basic (it validates the X-Amz-Signature in the query string),
so this isn't an active auth bypass — but attacker bytes still leave
our box, with potential for exfiltration via STS access logs / timing.

Fix: reject u.User != nil at parse time, BEFORE the host check, in
sigv4_parser.go's ParseToken. One line; defense-in-depth.

Regression test TestParseToken_RejectsUserinfo confirms the userinfo
case is now caught and that the error message mentions "userinfo"
(so future maintainers know what tripped).

42 packages still pass; lint + gofmt clean.
Phase 2 review M2: STS enforces ~15min on X-Amz-Date + X-Amz-Expires
server-side, but our IdentityCache (60s default TTL) would happily
serve a cached Identity even if STS somehow accepted a stale token.
The cache key being hash(AKID|YYYYMMDD) gave the impression of
day-long replay potential — it doesn't, because the 60s TTL bounds
it — but the parser had zero own-source freshness check, so the
stolen-token replay window depended entirely on STS's enforcement.

Adding parser-side freshness as defense-in-depth:

PresignedToken gains two fields:
  SigTime time.Time      ← parsed from X-Amz-Date
  Expires time.Duration  ← parsed from X-Amz-Expires

ParseToken now requires both query params (rejects missing /
malformed / non-positive Expires). Parsing stays pure: no clock
comparison happens inside ParseToken.

PresignedToken.CheckFreshness(now, maxExpires, skew) gates:
  - X-Amz-Expires > maxExpires       → cap (default 15min — matches
                                       what all standard AWS SDKs
                                       emit for GetCallerIdentity)
  - now > SigTime + Expires + skew   → token expired
  - SigTime > now + skew             → token from the future

skew default 5min for normal clock drift.

Config additions (both default-able, no operator action needed):
  MaxTokenExpires  time.Duration  `yaml:"max_token_expires,omitempty"`
  ClockSkew        time.Duration  `yaml:"clock_skew,omitempty"`

Provider gains a `now func() time.Time` field, default time.Now,
overridable in tests so freshness can be exercised without clock
monkey-patching.

Provider.Verify calls CheckFreshness right after region check, before
the cache lookup — STS round-trip and identity cache hit are both
short-circuited when freshness fails. ErrTokenRejected, audit reason
"rejected".

Tests:
  parser-level (sigv4_parser_test.go):
    TestParseToken_PopulatesSigTimeAndExpires
    TestParseToken_RejectsMissingAmzDate
    TestParseToken_RejectsMalformedAmzDate
    TestParseToken_RejectsMissingAmzExpires
    TestParseToken_RejectsNonNumericExpires
    TestParseToken_RejectsNonPositiveExpires
    TestCheckFreshness_Expired
    TestCheckFreshness_FromTheFuture
    TestCheckFreshness_ExceedsExpiresCap
    TestCheckFreshness_HappyPathInsideSkew

  provider-level (provider_test.go):
    TestProvider_RejectsExpiredToken
    TestProvider_RejectsTokenFromFuture
    TestProvider_RejectsOverlyLongExpiresClaim
    TestProvider_AcceptsTokenAtEdgeOfSkewWindow

Existing test fixtures updated:
  tokenFor() now stamps X-Amz-Date built from fixedTestTime so
  freshness passes by default; newTestProvider pins Provider.now
  to that same instant. Day-bucket rollover test advances the
  clock by 24h to match its day-2 token.

42 packages pass, lint + gofmt clean.
Phase 2 review M4: structural token_kind detection runs BEFORE the
chain, so a request with both a Bearer JWT and an X-Goog-Iap-Jwt-
Assertion records kind="jwt" — masking IAP-fronted traffic in audit
dashboards even when gcp_iap was the actual verifier.

The structural rule answers "what bytes were on the wire?" The
audit signal we want answers "which auth path verified?" Those
diverge only when a Bearer is present alongside a non-Bearer
provider's payload — today, gcp_iap is the only such provider.

Fix:

  identity, err := chain.Verify(...)
  if err == nil {
    kind = refineTokenKind(kind, identity.Source)
  }

refineTokenKind:
  - Source == "gcp_iap"  →  return "iap_jwt"
  - otherwise            →  return structural kind unchanged

Other providers (oidc, azure_ad, aws_sigv4, http_verifier,
static_token) don't need refinement: their structural kind
already matches the auth path that verifies them.

Failure paths still record the structural kind (no chain identity
to read Source from), so the existing "missing_token" /
"not_for_me" audit reason-code contract is unaffected.

Tests:
  TestMiddleware_TokenKind_RefinedToIapJwtWhenGCPIAPVerifies
    — Bearer JWT + IAP header both present; chain stubbed to
      return Source="gcp_iap"; assert kind upgrades to iap_jwt
  TestMiddleware_TokenKind_JWT_NotRefinedForOIDCProviders
    — counter-test: Source="oidc" must NOT trigger the refinement
…nt doc (M3)

Phase 2 review M3:
  1. When allow_multi_tenant: true, both the iss check (via composed
     oidc.Provider's SkipIssuerCheck) AND the tid check were
     suppressed — any Entra tenant in the world verified.
  2. resolveIssuer()'s docstring said "azure_ad enforces tenant via
     the tid claim instead" which was misleading: in multi-tenant
     mode there was no tid enforcement at all.

Fix is two-part: add the missing knob, fix the doc.

allowed_tenants (Config.AllowedTenants []string)

  Optional allowlist of Entra tenant GUIDs matched against the JWT
  `tid` claim. Only meaningful with allow_multi_tenant=true.

  Three operational modes:
    single-tenant (default)                      tid MUST equal TenantID
    multi-tenant + AllowedTenants set            tid MUST be in list
    multi-tenant + AllowedTenants empty          no tid check ("any
                                                 tenant globally" —
                                                 documented high-risk
                                                 shape, opted into by
                                                 deliberately omitting)

  Match is case-insensitive (Entra emits lowercase GUIDs; operators
  often paste uppercase from the portal).

  Factory-time validation rejects allowed_tenants in single-tenant
  mode (TenantID is THE gate there; the combination would silently
  degrade if not caught).

Validator + non-interactive flag + tests

  - forge-core/validate/auth.go: rejects allowed_tenants in single-
    tenant mode at validate-time; WARNS when multi-tenant+empty so
    the "any tenant globally" trade-off is loud
  - forge-cli/cmd/init.go: --auth-azure-allowed-tenant flag
    (repeatable, mirrors --auth-aws-allowed-account)
  - forge-cli/cmd/init_auth.go: buildAuthFromFlags forwards through
  - forge-cli/cmd/init_auth_test.go: mock adds the new flag

  Five new tests (provider_test.go):
    AllowedTenants_AcceptsListed
    AllowedTenants_RejectsUnlisted
    AllowedTenants_CaseInsensitive
    AllowedTenants_MissingTidRejected
    SingleTenant_WithAllowedTenants_RejectedAtFactory

resolveIssuer() docstring rewrite

  Now spells out the three operational modes explicitly, including
  the "any-tenant" shape with its security implication. No more
  "azure_ad enforces tenant via the tid claim" line that's only
  half-true.

CHANGELOG gains an "allowed_tenants" section with the canonical
recipe.

42 packages pass, lint + gofmt clean.
Phase 2 review M5: validateProviderSettings was permissive on unknown
keys. A typo like `aud:` instead of `audience:` would slip past the
required-key check silently (asString returns ""), and — more
critically — the forge-ui handler forwarded a.Settings unfiltered to
the on-disk scaffold. A POST like
  {"settings": {"audience": "x", "evil_key": "y"}}
would write evil_key into forge.yaml verbatim. Today's provider Config
structs ignore unknown YAML fields via yaml.v3's default, so it's
harmless — but one missed `yaml:"-"` tag on a future field would
suddenly make it reachable from untrusted POST input. Closed at the
boundary now.

Two-part fix:

1. forge-core/validate/auth.go
   - New exported KnownAuthProviderSettings map[string]map[string]bool
     (closed whitelist per provider type — mirrors the yaml: tags on
     each provider's Config struct, kept in sync by convention).
     Internal yaml:"-" fields are intentionally absent.
   - validateProviderSettings emits a Warning per unknown settings
     key during `forge validate`. Loose-not-strict because some
     operators stash custom annotations from pre-Phase-2 days;
     warning surfaces the typo without breaking those configs.
   - New exported FilterKnownSettings(providerType, settings)
     returns a copy with unknown keys dropped. Defense-in-depth
     filter for use at trust boundaries.

2. forge-ui/handlers_create.go
   - validateAuthPayload calls FilterKnownSettings on the incoming
     a.Settings BEFORE validation OR scaffolding. Closes the
     specific exploit chain (Web UI POST → forge.yaml) regardless
     of yaml tag discipline drift on Config structs.

Tests:
  validate package:
    TestValidateAuthConfig_WarnsOnUnknownSettingsKey
    TestValidateAuthConfig_NoWarningForKnownKeys
    TestFilterKnownSettings_DropsUnknownKeys
    TestFilterKnownSettings_UnknownProviderTypePassthrough
    TestFilterKnownSettings_AllPhase2Providers (subtests per provider/key)

  forge-ui:
    TestHandleCreateAgent_FiltersUnknownAuthSettings — end-to-end
      POST with evil_key in settings; assert it does NOT survive
      into the scaffold's captured AuthCreateOptions.Settings

42 packages still pass; lint + gofmt clean.
Batch-clearing the "don't block merge" follow-ups from review of PR initializ#79.

initializ#1 gcp_iap classifyJWTErr — use jwt v5 sentinels via errors.Is rather
   than substring matching (library wording shifts across patches;
   sentinels are public API). Special-case ErrTokenSignatureInvalid
   to split alg-confusion (→ ErrInvalidToken) from real bad-signature
   (→ ErrTokenRejected) because golang-jwt wraps both under that one
   sentinel. Three internal keyFunc message-matches retained — those
   are strings WE control, not the library's.

initializ#2 gcp_iap JWKS merge-on-success — switched j.keys = newKeys to a
   per-kid merge. A partial-but-valid JWKS response (e.g. one kid
   accidentally omitted by GCP during rotation) no longer drops kids
   the stale-grace contract assumes we still have. Worst case is
   keeping a retired kid in cache; verification still fails naturally
   for any token signed with the retired private key.

initializ#3 azure_ad GraphCache defensive copies — Get returns
   append([]string(nil), e.groups...) and Put stores a copy of its
   input. Caller mutating Identity.Groups (the auth.Identity layer
   treats it as freely mutable) can't poison the cache.

initializ#4 forge-cli needsYAMLQuoting numeric edge cases — quote anything
   that resembles a YAML number (hex 0x, octal 0o, binary 0b,
   leading-zero "010", scientific 1e10, decimal float 3.14, signed
   ±N, .inf / .nan in either case). Auth-setting values rarely hit
   these shapes but the docstring promised "false negatives are
   bugs" and the Web UI POST path can supply arbitrary strings.
   Added looksNumeric() helper with separate allHexDigits /
   allOctalDigits / allBinaryDigits gates.

initializ#5 aws_sigv4 identity_cache_test — replaced string(rune(i)) with
   strconv.Itoa(i). Surrogate code points (0xD800..0xDFFF) all map
   to U+FFFD, so the eviction-threshold test was silently building
   ~10 distinct keys instead of 10_001 and the sweep never ran.

initializ#6 http.NewRequestWithContext error handling — fixed the two
   `req, _ := ...` antipatterns in gcp_iap/iap_jwks.go and
   azure_ad/graph_client.go. Hardcoded URLs make the failure currently
   unreachable, but errcheck-clean is the discipline.

initializ#7 gcp_iap HS256-with-EC-public-key alg-confusion test — pinned the
   most dangerous attack shape: attacker fetches the verifier's public
   key from JWKS (open by design), uses raw X/Y bytes as the HMAC
   "secret", signs an HS256 token. A non-whitelisting verifier would
   HMAC-verify it. Our keyFunc rejects on alg != "ES256" BEFORE key
   lookup; this test confirms.

Tests added: TestGraphCache_GetReturnsDefensiveCopy,
TestGraphCache_PutStoresDefensiveCopy,
TestProvider_HS256WithECPublicKeyAsSecret_Rejected. Existing
TestProvider_RS256Token_Rejected still passes (alg-confusion still
classified as ErrInvalidToken under the new sentinel-based path).

42 packages green, lint + gofmt clean.
Ran the /sync-docs recipe over PR initializ#79's surface (commit 745c024). Two
pre-flight fixes before the sync:

  - .claude/commands/sync-docs.md mapping table was outdated:
    pointed at flat files (`docs/commands.md`, `docs/runtime.md`, …)
    that don't exist; real layout is nested under core-concepts/,
    reference/, security/. Rewrote the table to match the actual
    tree and added a `forge-core/auth/` → authentication.md +
    audit-logging.md row.

  - docs/auth/ is gitignored (per earlier review feedback). Did NOT
    move that content into git; the new canonical home is
    docs/security/authentication.md inside the existing tracked
    docs/security/ folder.

New doc:

  docs/security/authentication.md (423 lines) — single home for the
  full auth provider chain across Phase 1 + Phase 2. Covers:
    - Provider matrix and chain semantics (first-match-wins,
      fail-closed on reject, loopback static_token auto-prepend)
    - Per-provider sections for static_token, oidc, aws_sigv4,
      gcp_iap, azure_ad, http_verifier — including yaml shape,
      wire format, client-side recipe, security model
    - aws_sigv4 client-side gotcha (boto3.generate_presigned_url
      doesn't work for STS; use SigV4QueryAuth)
    - AWS Org-wide trust recipes (Identity Center, entry role +
      aws:PrincipalOrgID condition)
    - Egress allowlist auto-extension table per provider
    - Wizard / CLI examples
    - Mesh patterns (single-account fleet, per-pair allowlist)

Updates to satellite docs:

  docs/security/overview.md
    - Added Authentication layer to the security-architecture ASCII
      diagram (between Guardrails and Egress)
    - New "Authentication" section with provider table + chain rules
    - Authentication row in Related Documentation footer

  docs/security/audit-logging.md
    - auth_verify and auth_fail event entries in the Event Types
      table
    - Full example shape for both events
    - Reason-code table (missing_token, not_for_me, rejected,
      invalid, provider_unavailable) with operator actions
    - token_kind value table (empty, opaque, jwt, sigv4, iap_jwt)
    - Audit pipeline jq grep recipes for common queries

  docs/security/egress-control.md
    - New "Auth-provider domain auto-extension" subsection under
      Allowlist Resolution with the per-provider host table
    - Notes the wizard's Auth-before-Egress ordering

  docs/reference/forge-yaml-schema.md
    - Full auth: block inserted in the schema example with all 6
      provider types and every documented settings key

  docs/reference/cli-reference.md
    - All --auth-* flags added to the forge init flag table
    - Two new example invocations (aws_sigv4 with allowed-account,
      azure_ad multi-tenant with allowlist)
    - Link to docs/security/authentication.md

  docs/reference/web-dashboard.md
    - Auth step added to the wizard steps table (between Fallback
      and Env & Security)
    - New "Auth step" subsection listing all 7 picker options +
      the Web UI's settings-key filter behavior

  README.md
    - Authentication row added to the Security documentation table

Broken-link scan: zero broken links in tracked docs/.

42 packages still pass go test -race; lint + gofmt clean. No code
changes in this commit — docs only.
@naveen-kurra
Copy link
Copy Markdown
Collaborator Author

@naveen-kurra also run /sync-docs command/plugin for Claude to sync the document updates based on the changes

DOne

@initializ-mk
Copy link
Copy Markdown
Contributor

Re-review of fix commits — LGTM, ready to merge

Verified all 8 fix commits against the prior review. Tests pass with -race across 9 packages, lint + gofmt clean.

BLOCKERs — all cleared (774268d)

  • B1 azure_ad/graph_client.go:166-181ensureGraphHost now takes configuredScheme and requires EqualFold(scheme) match. endpointScheme pre-parsed at construction. The http:// bearer-downgrade attack is closed.
  • B2 azure_ad/graph_client.go:54-67CheckRedirect: func(...) error { return http.ErrUseLastResponse } with inline comment naming the threat model. Graph's app-layer nextLink replaces the need for HTTP redirects.
  • B3 aws_sigv4/sts_client.go:39-48 — same CheckRedirect pinning with comment explicitly enumerating threats (MITM, TLS-inspecting proxy, DNS hijack).
  • Bonus gcp_iap/iap_jwks.go — author proactively pinned the same on the JWKS client even though I didn't flag it. Same blast radius (attacker JWKS → forged tokens verify); correct defensive call.

MAJORs — all addressed

  • M1 674bf2fu.User != nil check added BEFORE the host check in sigv4_parser.go:88. TestParseToken_RejectsUserinfo pins it. Comment explains why STS ignores Basic auth but bytes still shouldn't leave the box.
  • M2 c6e0affPresignedToken gains SigTime / Expires; parser requires both query params; CheckFreshness(now, maxExpires, skew) called in Provider.Verify BEFORE the cache lookup. Config gains MaxTokenExpires (default 15min) + ClockSkew (default 5min). Provider.now func() time.Time is test-injectable. 10 new tests cover happy-path, expired, future-token, exceeds-cap, edge-of-skew.
  • M3 419f20fAllowedTenants []string added with three explicit modes (single-tenant / multi+allowlist / multi+empty="any-tenant globally"). Factory rejects allowed_tenants in single-tenant mode (catches typo). Case-insensitive match. Validator emits warning on the "any-tenant globally" shape so it's loud. --auth-azure-allowed-tenant flag (repeatable). 5 tests. resolveIssuer docstring rewritten to spell out all three modes.
  • M4 57d12a5refineTokenKind(structural, identity.Source) runs post-verify; gcp_iap → iap_jwt regardless of what Bearer was on the wire. Failure paths still use structural kind (no identity to read). Counter-test confirms oidc source does NOT trigger upgrade.
  • M5 ca92921KnownAuthProviderSettings closed whitelist per provider type. ValidateAuthConfig emits warning per unknown key. FilterKnownSettings exported as defense-in-depth filter, wired into handlers_create.go BEFORE validation/scaffolding. End-to-end test TestHandleCreateAgent_FiltersUnknownAuthSettings asserts evil_key POST doesn't survive into the scaffold.

NITs — all 7 cleared (745c024)

  • gcp_iap classifyJWTErr now uses errors.Is(err, jwt.ErrTokenExpired) etc. with a thoughtful special case for ErrTokenSignatureInvalid (which wraps BOTH alg-confusion and real bad-signature in golang-jwt v5). Substring matches retained only for keyFunc messages WE control.
  • gcp_iap JWKS now per-kid merge on success — partial-but-valid rotation responses don't drop in-flight kids.
  • azure_ad GraphCache Get/Put defensive copy via append([]string(nil), ...).
  • needsYAMLQuoting extended with looksNumeric() — hex (0x), octal (0o), binary (0b), leading-zero (010), scientific (1e10), float (3.14), signed (±N), .inf/.nan in either case.
  • identity_cache_test switched to strconv.Itoa(i) — eviction-threshold test now actually reaches 10,001 distinct keys (was collapsing surrogates to U+FFFD before).
  • Two req, _ := http.NewRequestWithContext(...) antipatterns fixed (gcp_iap + azure_ad).
  • New TestProvider_HS256WithECPublicKeyAsSecret_Rejected pins the most dangerous alg-confusion shape (attacker HMACs with the public key bytes from JWKS).

Gates

  • go test -race -count=1 — 9 packages green (auth + all 7 providers + security + validate + cli/cmd + cli/tui/steps + forge-ui)
  • golangci-lint — clean across all 3 modules
  • gofmt — clean

Verdict

LGTM — ready to merge. Every blocker, major, and nit from the prior review has a focused fix commit with regression tests. The B1-B3 commit's preemptive extension to gcp_iap/iap_jwks.go and the M3 commit's three-mode explicit operational semantic show good adversarial-thinking discipline. Nicely done.

@initializ-mk initializ-mk merged commit ea8922b into initializ:main May 24, 2026
10 checks passed
initializ-mk pushed a commit that referenced this pull request May 24, 2026
…uthenticator)

Phase 2 PR 2's original "reflect Sigv4 headers" design was broken in
the obvious way: Sigv4 binds its signature to the destination host as
part of the canonicalized signing input. Headers signed for Forge's
host could not be replayed against STS — STS sees host:sts.<region>.
amazonaws.com, recomputes the signature, gets a different hash,
rejects with "SignatureDoesNotMatch". Caught during real-AWS smoke;
documented in PR #79 description.

This commit replaces the pattern with the same approach
aws-iam-authenticator uses for EKS:

  Client (3 lines):
    url   = boto3.client('sts').generate_presigned_url('get_caller_identity', ExpiresIn=900)
    token = 'forge-aws-v1.' + base64.urlsafe_b64encode(url.encode()).rstrip(b'=').decode()
    requests.post(forge_url, headers={'Authorization': f'Bearer {token}'}, ...)

  Server:
    Authorization: Bearer forge-aws-v1.<base64-of-presigned-sts-url>
    → decode + validate host (SSRF guard) + GET on the URL → STS → identity

Net effect on caller experience: identical to JWT/OIDC/azure_ad —
"mint token, send Bearer, done." Three lines of client code, hidden in
~15 lines of any AWS SDK in any language.

what changed:

  forge-core/auth/providers/aws_sigv4/
    sigv4_parser.go  — was parsing AWS4-HMAC-SHA256 Authorization header
                       now parses forge-aws-v1.<base64-url> Bearer tokens
                       (URL host validation, SSRF guard, X-Amz-Credential
                       parsing for cache key derivation)
    sts_client.go    — was POST with reflected headers
                       now GET on the pre-signed URL; same 200/4xx/5xx
                       classification and 64 KiB body cap
    provider.go      — Verify() now reads the Bearer token (not raw
                       headers); SSRF guard via expectedHost field;
                       same cache + ARN allowlist semantics

  forge-core/auth/
    provider.go      — HeadersFromRequest reverts X-Amz-Date and
                       X-Amz-Security-Token (no longer needed); keeps
                       X-Goog-Iap-Jwt-Assertion for gcp_iap
    provider.go      — TokenKind detects "forge-aws-v1." prefix → "sigv4"
                       (was: "AWS4-HMAC-SHA256 " on raw Authorization)
    middleware.go    — simplify: empty-Bearer fallback only handles IAP
                       (aws_sigv4 rides standard Bearer flow now)

  scripts/forge-aws-sign.py — rewrite as a clean reference client.
    --token-only: print just the token for use with curl/other tools
    Otherwise: do the round-trip POST and print the response

  CHANGELOG.md — replace "client wrapper required" friction note with
    the 3-line happy path snippet

what stays unchanged:

  - forge.yaml shape (still type: aws_sigv4, region:, allowed_principals:)
  - identity_cache.go, arn_matcher.go (cache and authz logic untouched)
  - security.AuthDomains (sts.<region>.amazonaws.com derivation)
  - forge-cli/cmd/init* flag set and renderer
  - validate.ValidateAuthConfig (region still required)
  - forge-ui/handlers_create.go (AuthProviderTypeMeta entry)

Tests: 42 packages pass, golangci-lint v2.10.1 clean, gofmt clean,
no aws-sdk-go imports (decision §9.1 still holds).

Net diff: +732 / -625 lines (mostly test rewrites; ~80 LOC net less
in the provider package because the new flow is structurally simpler).
initializ-mk pushed a commit that referenced this pull request May 24, 2026
Phase 2 review (PR #79) flagged three same-root-cause issues across the
outbound HTTP clients in azure_ad, aws_sigv4, and gcp_iap:

- B1: ensureGraphHost accepted http:// nextLinks — caller's Bearer would
  leak in plaintext (Go strips Authorization on cross-host redirect but
  NOT on https→http same-host downgrade).
- B2: Graph client used Go's default redirect policy (follow up to 10).
  ensureGraphHost only validates @odata.nextLink, not HTTP 301/302/307.
  A redirect from Graph → attacker URL would let the attacker control
  the JSON body that becomes Identity.Groups (latent until Phase 4
  authz lands; closing at the boundary now).
- B3: aws_sigv4's STS client had the same default-redirect issue.
  The parser-side same-host gate covers only the first hop; a redirect
  off sts.<region>.amazonaws.com would let attacker bytes become the
  parsed STS XML and control Identity.UserID/OrgID/Arn.

Verified all three are real (Go default CheckRedirect follows up to 10
per net/http docs). Smoking gun confirmed in five outbound clients
across forge-core/auth/providers/:

  aws_sigv4/sts_client.go        ← fixed (B3)
  gcp_iap/iap_jwks.go            ← fixed (not in review; same pattern,
                                    same blast radius — attacker JWKS
                                    means forged tokens verify)
  azure_ad/graph_client.go       ← fixed (B1+B2)
  oidc/provider.go               ← Phase 1 code, NOT touched here;
                                    OIDC discovery sometimes legitimately
                                    redirects, needs design discussion.
                                    Filed as follow-up on issue #80.
  httpverifier/provider.go       ← Phase 1 code, NOT touched here;
                                    operator-configured URL may be
                                    behind LB. Follow-up on issue #80.

Fix shape (uniform across the three Phase 2 providers):

  http.Client{
    Timeout: timeout,
    CheckRedirect: func(*http.Request, []*http.Request) error {
      return http.ErrUseLastResponse
    },
  }

These three endpoints (STS GetCallerIdentity, IAP JWKS, Graph
transitiveMemberOf) NEVER legitimately issue 3xx — the auth contract
each provider validates is bound to the configured host. Returning
ErrUseLastResponse causes the 3xx response to flow into our existing
status-code switch, where the "unexpected status" default arm maps it
to ErrProviderUnavailable (and the audit logs "provider_unavailable",
not a misleading "rejected").

B1 specific (separate from the redirect-policy change):

  ensureGraphHost now checks both Host AND Scheme. GraphClient stores
  endpointScheme alongside endpointHost, pre-parsed at construction so
  the per-page check stays cheap. Test mode (httptest's http://) still
  works because the configured scheme is matched against, not hard-
  coded https.

Regression tests added (one per provider):

  TestSTSClient_DoesNotFollowRedirects
  TestJWKSCache_DoesNotFollowRedirects
  TestGraphClient_DoesNotFollowRedirects

Each spins up an httptest server that returns 302 → attacker URL,
runs the client, asserts:
  - error returned (not silently followed)
  - error is ErrProviderUnavailable (correct audit reason)
  - endpoint hit exactly once (counts redirect-follows)

Plus two scheme-specific cases for B1:
  TestEnsureGraphHost_RejectsSchemeDowngrade   — https→http same-host
  TestEnsureGraphHost_TestModeHTTPOK           — httptest http stays OK

42 packages pass go test -race -count=1; golangci-lint v2.10.1 clean;
gofmt clean.
initializ-mk pushed a commit that referenced this pull request May 24, 2026
Batch-clearing the "don't block merge" follow-ups from review of PR #79.

#1 gcp_iap classifyJWTErr — use jwt v5 sentinels via errors.Is rather
   than substring matching (library wording shifts across patches;
   sentinels are public API). Special-case ErrTokenSignatureInvalid
   to split alg-confusion (→ ErrInvalidToken) from real bad-signature
   (→ ErrTokenRejected) because golang-jwt wraps both under that one
   sentinel. Three internal keyFunc message-matches retained — those
   are strings WE control, not the library's.

#2 gcp_iap JWKS merge-on-success — switched j.keys = newKeys to a
   per-kid merge. A partial-but-valid JWKS response (e.g. one kid
   accidentally omitted by GCP during rotation) no longer drops kids
   the stale-grace contract assumes we still have. Worst case is
   keeping a retired kid in cache; verification still fails naturally
   for any token signed with the retired private key.

#3 azure_ad GraphCache defensive copies — Get returns
   append([]string(nil), e.groups...) and Put stores a copy of its
   input. Caller mutating Identity.Groups (the auth.Identity layer
   treats it as freely mutable) can't poison the cache.

#4 forge-cli needsYAMLQuoting numeric edge cases — quote anything
   that resembles a YAML number (hex 0x, octal 0o, binary 0b,
   leading-zero "010", scientific 1e10, decimal float 3.14, signed
   ±N, .inf / .nan in either case). Auth-setting values rarely hit
   these shapes but the docstring promised "false negatives are
   bugs" and the Web UI POST path can supply arbitrary strings.
   Added looksNumeric() helper with separate allHexDigits /
   allOctalDigits / allBinaryDigits gates.

#5 aws_sigv4 identity_cache_test — replaced string(rune(i)) with
   strconv.Itoa(i). Surrogate code points (0xD800..0xDFFF) all map
   to U+FFFD, so the eviction-threshold test was silently building
   ~10 distinct keys instead of 10_001 and the sweep never ran.

#6 http.NewRequestWithContext error handling — fixed the two
   `req, _ := ...` antipatterns in gcp_iap/iap_jwks.go and
   azure_ad/graph_client.go. Hardcoded URLs make the failure currently
   unreachable, but errcheck-clean is the discipline.

#7 gcp_iap HS256-with-EC-public-key alg-confusion test — pinned the
   most dangerous attack shape: attacker fetches the verifier's public
   key from JWKS (open by design), uses raw X/Y bytes as the HMAC
   "secret", signs an HS256 token. A non-whitelisting verifier would
   HMAC-verify it. Our keyFunc rejects on alg != "ES256" BEFORE key
   lookup; this test confirms.

Tests added: TestGraphCache_GetReturnsDefensiveCopy,
TestGraphCache_PutStoresDefensiveCopy,
TestProvider_HS256WithECPublicKeyAsSecret_Rejected. Existing
TestProvider_RS256Token_Rejected still passes (alg-confusion still
classified as ErrInvalidToken under the new sentinel-based path).

42 packages green, lint + gofmt clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants