feat(auth): Phase 2 — AWS Sigv4, GCP IAP, Azure AD providers (v0.11.0)#79
…r 1)
HeadersFromRequest gains Authorization, X-Goog-Iap-Jwt-Assertion, X-Amz-Date, X-Amz-Security-Token so future providers consuming non-Bearer formats (aws_sigv4, gcp_iap) can read what they need without changing the Provider.Verify signature.
TokenKind recognizes the "AWS4-HMAC-SHA256 " prefix and returns "sigv4", so audit logs can distinguish Sigv4 requests from "empty" even though the Bearer extractor returns "".
Middleware now consults the chain even when no Bearer token was extracted, provided a non-Bearer auth header is present (Sigv4 Authorization or IAP assertion). When NO auth headers at all are present, the audit reason still resolves to ErrMissingBearer, preserving review initializ#4's stable "missing_token" reason code.
Phase 1 providers see zero behavior change; their Verify path is unchanged. All Phase 1 tests pass without modification.
…e 2 pr 2)
aws_sigv4 authenticates AWS-IAM callers by reflecting their Sigv4 signature to STS GetCallerIdentity. No aws-sdk-go-v2 dependency (decision §9.1): the STS RPC is ~150 LOC of hand-rolled HTTP + XML. Forge never holds the caller's secret key; STS validates the signature on Forge's behalf.
Key pieces:
- sigv4_parser.go: pure string parser, fuzz-tested, never panics
- sts_client.go: 200/4xx/5xx classification per review initializ#6 contract
- identity_cache.go: hash(AKID|YYYYMMDD)-keyed TTL cache, opportunistic eviction past 10k entries, Put does NOT extend prior expiry
- arn_matcher.go: shell-style globs via path.Match (decision §9.3), invalid patterns fail at Factory time
- provider.go: scope check (service=sts, region match) before any STS round trip, cache hit avoids RPC, rejection does NOT poison the cache
security:
- Algorithm: only AWS4-HMAC-SHA256 prefix is claimed
- Scope: cross-service replay (s3->sts) and cross-region replay (eu-west-1->us-east-1) rejected at parse-time
- Cache: bucketing by YYYYMMDD bounds stolen-key window to a day
- Body cap: 64 KiB on STS responses
- Logs: STS error bodies summarized at 200 chars, newlines stripped
audit:
- ErrTokenNotForMe -> not_for_me (no AWS4 prefix)
- ErrInvalidToken -> invalid (malformed Sigv4)
- ErrTokenRejected -> rejected (scope/allowlist/STS 4xx)
- ErrProviderUnavailable -> provider_unavailable (STS 5xx/network)
extras:
- security.AuthDomains gains sts.<region>.amazonaws.com (+ override host when sts_endpoint set for tests)
- forge-cli/runtime/runner.go side-effect imports aws_sigv4
gcp_iap consumes the X-Goog-Iap-Jwt-Assertion header that GCP's Identity-Aware Proxy forwards on every authenticated request when Forge sits behind a GCP HTTPS load balancer with IAP enabled.
Decision §9.4: IAP issuer (https://cloud.google.com/iap) and JWKS URL (https://www.gstatic.com/iap/verify/public_key-jwk) are hardcoded. They're the only stable contract GCP exposes; an override knob would be a footgun.
key pieces:
- iap_jwks.go: ES256-only JWKS cache, TTL refresh + backoff + stale-grace (mirrors Phase 1 OIDC review initializ#1 pattern)
- provider.go: header-presence check, claims projection, iss/aud gates, sub/email required-claims check
- parseECJWKSet drops non-EC / non-P-256 / non-ES256-labeled keys during parse (defense in depth against compromised JWKS)
- alg whitelist rejects RS256 BEFORE key lookup (algorithm-confusion defense)
- aud as string OR array both parse (JWT spec allows either)
- audit reasons follow Phase 1 contract:
  rejected: iss/aud mismatch, expired, bad signature
  invalid: alg != ES256, missing sub/email, bad kid
  provider_unavailable: JWKS fetch failed AND no prior key cached
  not_for_me: header absent
extras:
- security.AuthDomains returns www.gstatic.com when gcp_iap is configured
- forge-cli/runtime/runner.go side-effect imports gcp_iap
azure_ad authenticates Microsoft Entra ID tokens. Composes the
Phase 1 oidc.Provider (decision §9.2) for signature verify + base
claim validation; layers AAD-specific concerns on top:
- Tenant lock-in via the tid claim
- Optional Microsoft Graph group enrichment when JWT groups claim
is empty (AAD truncates at ~200 groups)
- Single-tenant vs multi-tenant issuer template
key pieces:
- provider.go: composed oidc + tenant gate + Source overwrite to
"azure_ad" (replaces the inner "oidc" stamp)
- tenant.go: ExtractTenantID — typed accessor for the tid claim
- graph_client.go: Graph /me/transitiveMemberOf with pagination,
same-host enforcement (rejects redirect attacks), 401/403 ->
ErrTokenRejected, 5xx -> ErrProviderUnavailable, defensive
cap at 5000 groups, body cap 1 MiB per page
- graph_cache.go: 5 min TTL, same shape as aws_sigv4's cache
key decisions:
- oidc.Config gains internal SkipIssuerCheck flag with yaml:"-"
so it CANNOT be set via forge.yaml — only callable from another
Go package. AAD multi-tenant uses it; everything else leaves it
off. Surfacing it in YAML would let operators disable iss
validation by accident.
- Soft-fail on Graph 5xx/401: Identity returned with empty Groups
rather than blocking prod traffic. Hard-fail mode (graph_required)
out of scope for v0.11.
- Forge reflects the CALLER's Bearer to Graph; holds no Graph
credentials of its own.
audit reasons:
- ErrTokenRejected -> rejected (tid mismatch, bad sig, Graph 401)
- ErrInvalidToken -> invalid (missing tid, malformed claims)
- ErrProviderUnavailable -> provider_unavailable (Graph 5xx, JWKS down)
extras:
- security.AuthDomains returns login.microsoftonline.com always;
graph.microsoft.com when groups_mode=graph
- forge-cli/runtime/runner.go side-effect imports azure_ad
…(phase 2 pr 5)
Wires aws_sigv4, gcp_iap, and azure_ad into the operator surfaces:
cli (forge-cli/cmd/init*.go):
- New non-interactive flags namespaced --auth-aws-* / --auth-gcp-iap-* / --auth-azure-* (StringSlice for repeatable allowed-principal globs)
- buildAuthFromFlags validates required combinations and emits the right egress hosts per provider (sts.<region>.amazonaws.com, www.gstatic.com, login.microsoftonline.com + graph.microsoft.com when groups_mode=graph)
- authEgressHostsFromSettings mirrors the same logic for the Web UI
- renderAuthBlock supports []string lists with proper YAML quoting (allowed_principals)
web ui (forge-ui/handlers_create.go):
- AuthProviderTypeMeta lists the three new types with helpful labels
validate (forge-core/validate/auth.go):
- knownAuthProviderTypes admits aws_sigv4 / gcp_iap / azure_ad
- validateProviderSettings enforces per-type required keys (aws_sigv4.region, gcp_iap.audience, azure_ad.audience + tenant_id-unless-multi-tenant, azure_ad.groups_mode whitelist)
tests:
- 11 new renderer + flag-parsing tests
- Round-trip YAML parse used instead of brittle quote-pattern asserts
- Updated wizard-meta test to expect 7 auth provider types
deliberate scope cut:
- TUI step_auth.go sub-step input flows for the 3 new providers are NOT included. Adding them is mechanical (~100 LOC per provider, mirroring the OIDC issuer→audience→groups_claim phase chain) but out of scope for the v0.11.0 cut. The non-interactive flag path covers the production-critical CI/CD case; operators using the TUI can pick "Custom" and edit forge.yaml directly until the follow-up lands.
…pr 6)
Adds the operator-facing documentation for the three Phase 2 providers that shipped in PRs 1–5, plus a top-level auth index, chain-semantics concepts page, CHANGELOG, and a README link.
new docs:
- docs/auth/index.md: provider matrix and chain-semantics overview
- docs/auth/concepts/chain.md: first-match-wins, no-fall-through on reject, non-Bearer header support, mixed-chain worked example
- docs/auth/providers/aws_sigv4.md: STS reflection setup, awscurl example, assumed-role-vs-IAM-role gotcha called out twice
- docs/auth/providers/gcp_iap.md: backend service ID lookup steps, hardcoded JWKS rationale, GCP IAM Conditions for allowlisting
- docs/auth/providers/azure_ad.md: app registration walkthrough, single/multi/graph mode configs, multi-tenant warning prominent
every provider doc includes:
- Prerequisites checklist
- forge.yaml example
- Configuration reference table
- Audit log shape (literal JSON)
- Troubleshooting matrix (grep-able reason codes)
- Security model + limitations sections
CHANGELOG.md (new file):
- Lists Added / Changed entries for v0.11.0
- "Notes for upgraders" makes the non-breaking nature explicit
- Calls out the known TUI sub-flow gap from PR 5
README.md:
- Adds Auth Providers row to the Security documentation table
…owlist
The wizard was asking for Egress confirmation before the operator had
picked an auth provider, so STS / AAD authority / IAP JWKS hosts never
appeared in the egress list. Forge would scaffold a forge.yaml whose
egress_hosts blocked its own auth-provider RPC calls — failure happens
later at `forge run`, with no signal the wizard could have caught.
changes:
- Swap step order in init.go: Auth now runs immediately before Egress
- Extend DeriveEgressFunc with (authMode, authSettings) so the Egress
step's Prepare(ctx) pulls the operator's auth choice from
WizardContext and forwards it into deriveEgressDomains
- deriveEgressDomains calls authEgressHostsFromSettings (same helper
the non-interactive --auth=… path uses) — TUI and CLI now produce
identical egress lists for any given auth choice
- EgressStep's inferSource() learns to label auth-derived hosts:
sts.<region>.amazonaws.com → "aws_sigv4 auth"
www.gstatic.com → "gcp_iap auth"
login.microsoftonline.com → "azure_ad auth"
graph.microsoft.com → "azure_ad auth (graph)"
<oidc issuer host> → "oidc auth"
<http_verifier url host> → "http_verifier auth"
tests:
- TestDeriveEgressDomains_AuthProviderHostsMerged: 8 cases pinning the
per-provider host emission (incl. graph-mode adds graph host)
- TestDeriveEgressDomains_AuthHostsMergeNotOverwrite: auth pass is
additive — provider / channel hosts still emit alongside auth hosts
docs:
- docs/auth/concepts/chain.md gains a "TUI wizard ordering" section
explaining the Auth-before-Egress invariant
…, cleanup
Final-pass audit findings against the phase 2 design doc surfaced one correctness bug and several small improvements. All gates clean (go test -race / golangci-lint / gofmt). 42 packages pass.
BUG fix: middleware emits token_kind="iap_jwt" for IAP requests
The strategy doc §5/§10 lists five token_kind values: empty, opaque, jwt, sigv4, iap_jwt. PR1 wired sigv4 detection but missed iap_jwt, so successful GCP IAP requests audited with token_kind="empty", the same value as no-auth requests, defeating the audit-pipeline goal of counting IAP traffic distinctly. Middleware now classifies X-Goog-Iap-Jwt-Assertion presence as kind="iap_jwt" on the empty-Bearer path. New regression test pins it.
Improvement: graph_client.go avoids per-page URL re-parse
ensureGraphHost was parsing GraphClient.endpoint via url.Parse on EVERY pagination step. Pre-parse the endpoint Host once at construction and compare against that string instead. Trims redundant work on multi-page Graph responses.
Improvement: gcp_iap classifyJWTErr ordering hardened
Replaced the bare substring match on "kid" (which would catch unrelated errors) with the specific patterns: "kid " (e.g. "kid X not found") and "not found" (covers JWKS-resolution failures). Pre-existing ordering invariant comment is now actually defended.
Cleanup: drop redundant single-function file
Moved ExtractTenantID from azure_ad/tenant.go into provider.go alongside other claim accessors and removed the empty tenant.go. The function was a 1-liner and didn't justify its own file.
Cleanup: inline audienceContains shim
Replaced the audienceContains() wrapper (one-liner around slices.Contains) with a direct call at the use site. Less indirection, same behavior.
Cleanup: middleware: simplify hasNonBearerAuth boolean expr
Folded the multi-line if-chain into a single boolean expression. Same semantics, less noise.
audit findings deferred as nits, not fixed:
- aws_sigv4 Parser as zero-value struct (cosmetic; keeps symmetry)
- egress_step.go hostOf manual URL parsing (cosmetic; non-hot path)
- 10k eviction comment wording
audit findings confirmed not bugs:
- GraphCache TTL test (already exists in graph_cache_test.go)
- PrependChain loopback invariant intact (runner.go line 2036)
The Phase 2 provider docs were committed as MD files under docs/auth/ but we don't want to version-control them: the source of truth lives in the design folder, and we'll deliver via the doc site separately.
- .gitignore: add docs/auth/
- git rm --cached docs/auth/** (local files preserved)
- README.md: drop the now-broken "Auth Providers" docs row
- CHANGELOG.md: drop the docs/auth/*.md links from the v0.11.0 entry
No code or test changes.
…ontract
Real-AWS testing surfaced a documentation gap: callers cannot use raw `awscurl` / `aws-sdk-go` against Forge's `aws_sigv4` provider because Sigv4 binds the signature to the destination host. Standard tools sign for the URL they're addressing (Forge); STS then rejects the reflected signature because the host bytes don't match.
The server-side code is correct. The client just needs to sign a hypothetical STS request, then attach the resulting headers to its real POST to Forge. Same pattern as aws-iam-authenticator for EKS.
This commit:
- Ships `scripts/forge-aws-sign.py`, a ~100 LOC reference client using boto3.session + SigV4Auth. CLI flags for --region, --url, --profile, --body, --verbose. Reads SSO/IRSA/profile/env credentials via boto3's standard chain.
- Extends the package-level docstring in `forge-core/auth/providers/aws_sigv4/provider.go` with a "Client-side signing contract" section spelling out the 4-step pattern and pointing readers to the reference script.
- Adds a "Client-side requirement" section to CHANGELOG.md so adopters know to grab the helper or write their own before integrating.
Validated against real AWS:
- STS reflection: 200, identity stamped, correct ARN/Account/UserID
- ARN allowlist match: 200 (matching pattern)
- ARN allowlist miss: 401 reason=rejected (correct authz gate)
- No-auth: 401 reason=missing_token (Phase 1 contract preserved)
…uthenticator)
Phase 2 PR 2's original "reflect Sigv4 headers" design was broken in the obvious way: Sigv4 binds its signature to the destination host as part of the canonicalized signing input. Headers signed for Forge's host could not be replayed against STS: STS sees host:sts.<region>.amazonaws.com, recomputes the signature, gets a different hash, and rejects with "SignatureDoesNotMatch". Caught during real-AWS smoke; documented in PR initializ#79 description.
This commit replaces the pattern with the same approach aws-iam-authenticator uses for EKS:
Client (3 lines):
    url = boto3.client('sts').generate_presigned_url('get_caller_identity', ExpiresIn=900)
    token = 'forge-aws-v1.' + base64.urlsafe_b64encode(url.encode()).rstrip(b'=').decode()
    requests.post(forge_url, headers={'Authorization': f'Bearer {token}'}, ...)
Server:
    Authorization: Bearer forge-aws-v1.<base64-of-presigned-sts-url>
    → decode + validate host (SSRF guard) + GET on the URL → STS → identity
Net effect on caller experience: identical to JWT/OIDC/azure_ad: "mint token, send Bearer, done." Three lines of client code, hidden in ~15 lines of any AWS SDK in any language.
what changed:
forge-core/auth/providers/aws_sigv4/
    sigv4_parser.go: was parsing AWS4-HMAC-SHA256 Authorization header; now parses forge-aws-v1.<base64-url> Bearer tokens (URL host validation, SSRF guard, X-Amz-Credential parsing for cache key derivation)
    sts_client.go: was POST with reflected headers; now GET on the pre-signed URL; same 200/4xx/5xx classification and 64 KiB body cap
    provider.go: Verify() now reads the Bearer token (not raw headers); SSRF guard via expectedHost field; same cache + ARN allowlist semantics
forge-core/auth/
    provider.go: HeadersFromRequest reverts X-Amz-Date and X-Amz-Security-Token (no longer needed); keeps X-Goog-Iap-Jwt-Assertion for gcp_iap
    provider.go: TokenKind detects "forge-aws-v1." prefix → "sigv4" (was: "AWS4-HMAC-SHA256 " on raw Authorization)
    middleware.go: simplify: empty-Bearer fallback only handles IAP (aws_sigv4 rides standard Bearer flow now)
scripts/forge-aws-sign.py: rewrite as a clean reference client.
    --token-only: print just the token for use with curl/other tools
    Otherwise: do the round-trip POST and print the response
CHANGELOG.md: replace "client wrapper required" friction note with the 3-line happy path snippet
what stays unchanged:
- forge.yaml shape (still type: aws_sigv4, region:, allowed_principals:)
- identity_cache.go, arn_matcher.go (cache and authz logic untouched)
- security.AuthDomains (sts.<region>.amazonaws.com derivation)
- forge-cli/cmd/init* flag set and renderer
- validate.ValidateAuthConfig (region still required)
- forge-ui/handlers_create.go (AuthProviderTypeMeta entry)
Tests: 42 packages pass, golangci-lint v2.10.1 clean, gofmt clean, no aws-sdk-go imports (decision §9.1 still holds).
Net diff: +732 / -625 lines (mostly test rewrites; ~80 LOC net less in the provider package because the new flow is structurally simpler).
…n client
Two correctness fixes surfaced by live AWS testing of the pre-signed URL pattern from b3444c2.
1. Preserve the raw URL byte-for-byte.
Round-tripping the presigned URL through Go's net/url package re-encoded query params in subtle ways (e.g. "/" in X-Amz-Credential, "+" inside X-Amz-Security-Token) that didn't match how the AWS SDK emitted them on the caller side. STS recomputes the canonical request using whatever bytes we send and gets a different hash → 4xx SignatureDoesNotMatch → audit reason "rejected".
- PresignedToken gains a RawURL field: the exact bytes from the decoded token payload.
- The parsed *url.URL is kept ONLY for SSRF host validation and query-param inspection. It is NEVER used to construct the outbound request.
- Provider.Verify now passes parsed.RawURL to STSClient.GetCallerIdentity.
2. Use SigV4QueryAuth directly in the reference client (not boto3's high-level generate_presigned_url).
boto3.client('sts').generate_presigned_url('get_caller_identity', ...) produces a URL STS rejects with SignatureDoesNotMatch when GET. Known quirk: the high-level presigner signs as if the request were a POST. aws-iam-authenticator works around this by signing the AWSRequest explicitly; scripts/forge-aws-sign.py now does the same:
    req = AWSRequest(method='GET', url='https://sts.{region}.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15')
    SigV4QueryAuth(creds, 'sts', region, expires=900).add_auth(req)
    token = 'forge-aws-v1.' + base64.urlsafe_b64encode(req.url.encode()).rstrip(b'=').decode()
Live validation against real AWS (account 412664885516, SSO assumed-role):
- Happy path: HTTP 400 body-shape error + auth_verify with correct ARN
- Deny path: HTTP 401 + auth_fail reason="rejected" + token_kind="sigv4"
42 packages still pass; golangci-lint clean; gofmt clean.
(Known follow-up surfaced but out of scope: hot-reload of forge.yaml doesn't rebuild the auth chain, so allowlist changes require a hard restart. Same caveat affects all providers, not just aws_sigv4.)
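The raw-URL rule in fix 1 can be sketched in Go as follows. This is a minimal illustration, not the code in sigv4_parser.go: `presignedToken`, `encodeToken`, and `decodeToken` are hypothetical names, and only the host check is modelled. The point it shows is that the decoded bytes are stored untouched while the parsed `*url.URL` is consulted for validation only.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

const tokenPrefix = "forge-aws-v1."

// presignedToken (hypothetical) keeps the exact decoded bytes next to the parsed form.
type presignedToken struct {
	RawURL string   // sent to STS unchanged
	Parsed *url.URL // used only for validation, never to rebuild the request
}

// encodeToken mirrors the client side: unpadded URL-safe base64 of the URL.
func encodeToken(rawURL string) string {
	return tokenPrefix + base64.RawURLEncoding.EncodeToString([]byte(rawURL))
}

// decodeToken strips the prefix, decodes the payload, and checks the host.
func decodeToken(token, expectedHost string) (presignedToken, error) {
	if !strings.HasPrefix(token, tokenPrefix) {
		return presignedToken{}, errors.New("token does not carry the forge-aws-v1 prefix")
	}
	raw, err := base64.RawURLEncoding.DecodeString(strings.TrimPrefix(token, tokenPrefix))
	if err != nil {
		return presignedToken{}, fmt.Errorf("decode token payload: %w", err)
	}
	u, err := url.Parse(string(raw))
	if err != nil {
		return presignedToken{}, fmt.Errorf("parse presigned URL: %w", err)
	}
	if !strings.EqualFold(u.Host, expectedHost) {
		return presignedToken{}, fmt.Errorf("unexpected host %q", u.Host)
	}
	// Keep string(raw), not u.String(): re-serializing may change the encoding.
	return presignedToken{RawURL: string(raw), Parsed: u}, nil
}

func main() {
	raw := "https://sts.us-east-1.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15"
	tok, err := decodeToken(encodeToken(raw), "sts.us-east-1.amazonaws.com")
	fmt.Println(tok.RawURL == raw, err)
}
```

`base64.RawURLEncoding` is the Go counterpart of Python's `urlsafe_b64encode(...).rstrip(b'=')`, which is why no padding handling is needed here.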
Two paired additions that make Phase 2 provider onboarding actually
usable end-to-end via the TUI.
## allowed_accounts (ergonomic shortcut)
aws_sigv4.Config gains AllowedAccounts []string. Each 12-digit AWS
account ID expands at Factory time into the canonical glob set
covering every STS identity shape:
arn:aws:iam::<acct>:user/* — direct IAM users
arn:aws:iam::<acct>:role/* — direct IAM roles
arn:aws:sts::<acct>:assumed-role/*/* — SSO, AssumeRole, IRSA
arn:aws:sts::<acct>:federated-user/* — SAML / web-identity federation
So an operator who wants "anyone in this account" writes one line of
config instead of four globs. Composes with allowed_principals — list
specific roles AND whole accounts in the same provider entry.
Validation:
- validateAccountID checks the 12-digit shape at Factory time
- validate.ValidateAuthConfig catches malformed entries at
`forge validate` time (before scaffold writes forge.yaml)
## TUI sub-flows for all three Phase 2 providers
forge init's TUI picker now has 7 entries (was 4):
None / OIDC / HTTP Verifier / AWS Sigv4 / GCP IAP / Azure AD / Custom
Input flows:
aws_sigv4 region → audience (opt) → accounts (opt) → done
gcp_iap audience → done
azure_ad tenant → audience → done (single-tenant only)
Azure AD intentionally restricts the TUI to single-tenant. Enabling
allow_multi_tenant is a deliberate security trade-off (any Entra tenant
in the world is admissible) and should require editing forge.yaml, not
clicking through a wizard.
Egress hosts auto-computed from the selection:
aws_sigv4 → sts.<region>.amazonaws.com
gcp_iap → www.gstatic.com (hardcoded §9.4)
azure_ad → login.microsoftonline.com
So the Egress review step (which runs after Auth per 639bfa9) shows
the operator the full outbound surface they're about to allow,
including the auth-provider's STS / IAP / AAD endpoints.
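The provider-to-host mapping above can be sketched as a small Go function. The name `egressHostsFor` and the `map[string]string` settings shape are assumptions for illustration; the real helper is authEgressHostsFromSettings, whose signature is not shown in this PR. The graph host is included per the groups_mode=graph rule stated earlier in the conversation.

```go
package main

import "fmt"

// egressHostsFor (hypothetical) returns the outbound hosts an auth provider needs.
func egressHostsFor(providerType string, settings map[string]string) []string {
	switch providerType {
	case "aws_sigv4":
		// The STS host is regional, so a region is required to derive it.
		if region := settings["region"]; region != "" {
			return []string{"sts." + region + ".amazonaws.com"}
		}
	case "gcp_iap":
		// Hardcoded JWKS host (decision §9.4).
		return []string{"www.gstatic.com"}
	case "azure_ad":
		hosts := []string{"login.microsoftonline.com"}
		if settings["groups_mode"] == "graph" {
			hosts = append(hosts, "graph.microsoft.com")
		}
		return hosts
	}
	return nil
}

func main() {
	fmt.Println(egressHostsFor("aws_sigv4", map[string]string{"region": "us-east-1"}))
	fmt.Println(egressHostsFor("azure_ad", map[string]string{"groups_mode": "graph"}))
}
```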
## Other surfaces
forge init --auth-aws-allowed-account flag (repeatable) — matches the
non-interactive path the wizard takes internally.
CHANGELOG: drops the "TUI deferred" note (no longer deferred), adds
sections for allowed_accounts and the TUI sub-flows.
## Tests
arn_matcher_test.go +3 tests (validateAccountID, expansion shapes,
expanded patterns match realistic ARNs)
provider_test.go +4 tests (AllowedAccounts happy path, deny path,
factory rejects malformed, mix with
AllowedPrincipals)
step_auth_test.go +6 tests (AWS full flow, AWS optional skips,
AWS bad account rejection, GCP flow,
AAD flow, Phase 2 summaries)
validate/auth_test.go updated validator covers allowed_accounts
42 packages pass go test -race -count=1; golangci-lint v2.10.1 clean;
gofmt clean.
Live wizard test surfaced a YAML emission bug: AWS account IDs like
"412664885516" were rendered unquoted, e.g.
allowed_accounts:
- 412664885516
yaml.v3 decodes that as !!int → cannot unmarshal into []string in
the aws_sigv4 provider's Config.AllowedAccounts field → provider
construction fails at startup.
needsYAMLQuoting now treats any all-digit string as requiring
quotes. The renderer emits the correct form:
allowed_accounts:
- "412664885516"
Fix is general — applies to anything in the auth-settings schema
that's a digit string (ZIP-shaped IDs, version segments, etc.) so
this same bug class can't surface for a new provider later.
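The all-digit rule is small enough to sketch. This is an illustrative stand-in, not the body of needsYAMLQuoting: `isAllDigits` and `renderListItem` are hypothetical names, and the real renderer presumably handles other quoting cases that are omitted here.

```go
package main

import (
	"fmt"
	"strconv"
)

// isAllDigits reports whether s is non-empty and made only of ASCII digits.
// Such a scalar would be decoded as an integer if emitted unquoted.
func isAllDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// renderListItem (hypothetical) emits one YAML sequence entry, quoting digit strings.
func renderListItem(s string) string {
	if isAllDigits(s) {
		return "- " + strconv.Quote(s)
	}
	return "- " + s
}

func main() {
	fmt.Println(renderListItem("412664885516"))
	fmt.Println(renderListItem("us-east-1"))
}
```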
Live-validated after the fix:
- auth_verify on the matching allowed_accounts (happy)
- auth_fail reason=rejected on a wrong account (deny)
Phase 2 review: three providers, focused on the outbound-HTTP defense-in-depth gaps
I ran a security-focused audit across the four surfaces:
The pattern in the BLOCKERs below is the same in three places: the outbound
BLOCKERS (please fix before merge)
B1. B2. B3.
MAJOR (should land in this PR or a fast follow-up)
M1. M2. M3. M4. M5.
NITs / follow-ups (don't block merge)
Verdict
Needs revision before merge: three BLOCKERs (B1–B3) are each 1–3 line
What's not a concern:
Live AWS validation in the PR description is excellent; that level of evidence for the happy and deny paths is exactly the right bar. Once the three redirect fixes land, this is ready.
@naveen-kurra also run the /sync-docs command/plugin for Claude to sync the document updates based on the changes
Phase 2 review (PR initializ#79) flagged three same-root-cause issues across the outbound HTTP clients in azure_ad, aws_sigv4, and gcp_iap:
- B1: ensureGraphHost accepted http:// nextLinks; the caller's Bearer would leak in plaintext (Go strips Authorization on cross-host redirect but NOT on https→http same-host downgrade).
- B2: Graph client used Go's default redirect policy (follow up to 10). ensureGraphHost only validates @odata.nextLink, not HTTP 301/302/307. A redirect from Graph → attacker URL would let the attacker control the JSON body that becomes Identity.Groups (latent until Phase 4 authz lands; closing at the boundary now).
- B3: aws_sigv4's STS client had the same default-redirect issue. The parser-side same-host gate covers only the first hop; a redirect off sts.<region>.amazonaws.com would let attacker bytes become the parsed STS XML and control Identity.UserID/OrgID/Arn.
Verified all three are real (Go default CheckRedirect follows up to 10 per net/http docs). Smoking gun confirmed in five outbound clients across forge-core/auth/providers/:
    aws_sigv4/sts_client.go ← fixed (B3)
    gcp_iap/iap_jwks.go ← fixed (not in review; same pattern, same blast radius: attacker JWKS means forged tokens verify)
    azure_ad/graph_client.go ← fixed (B1+B2)
    oidc/provider.go ← Phase 1 code, NOT touched here; OIDC discovery sometimes legitimately redirects, needs design discussion. Filed as follow-up on issue initializ#80.
    httpverifier/provider.go ← Phase 1 code, NOT touched here; operator-configured URL may be behind LB. Follow-up on issue initializ#80.
Fix shape (uniform across the three Phase 2 providers):
    http.Client{
        Timeout: timeout,
        CheckRedirect: func(*http.Request, []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
These three endpoints (STS GetCallerIdentity, IAP JWKS, Graph transitiveMemberOf) NEVER legitimately issue 3xx; the auth contract each provider validates is bound to the configured host. Returning ErrUseLastResponse causes the 3xx response to flow into our existing status-code switch, where the "unexpected status" default arm maps it to ErrProviderUnavailable (and the audit logs "provider_unavailable", not a misleading "rejected").
B1 specific (separate from the redirect-policy change):
ensureGraphHost now checks both Host AND Scheme. GraphClient stores endpointScheme alongside endpointHost, pre-parsed at construction so the per-page check stays cheap. Test mode (httptest's http://) still works because the configured scheme is matched against, not hard-coded https.
Regression tests added (one per provider):
    TestSTSClient_DoesNotFollowRedirects
    TestJWKSCache_DoesNotFollowRedirects
    TestGraphClient_DoesNotFollowRedirects
Each spins up an httptest server that returns 302 → attacker URL, runs the client, asserts:
- error returned (not silently followed)
- error is ErrProviderUnavailable (correct audit reason)
- endpoint hit exactly once (counts redirect-follows)
Plus two scheme-specific cases for B1:
    TestEnsureGraphHost_RejectsSchemeDowngrade: https→http same-host
    TestEnsureGraphHost_TestModeHTTPOK: httptest http stays OK
42 packages pass go test -race -count=1; golangci-lint v2.10.1 clean; gofmt clean.
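The redirect behaviour can be observed with two local httptest servers. The sketch below is not one of the regression tests named above; `newNoRedirectClient` and `probeRedirect` are hypothetical names. It shows that with `http.ErrUseLastResponse` the client hands back the 302 itself and never contacts the redirect target.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// newNoRedirectClient builds a client that surfaces 3xx responses instead of following them.
func newNoRedirectClient(timeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout,
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
}

// probeRedirect returns the status the client saw and how many requests
// reached the redirect target (the stand-in for an attacker host).
func probeRedirect() (int, int, error) {
	var hits atomic.Int32
	attacker := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hits.Add(1)
	}))
	defer attacker.Close()

	origin := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, attacker.URL, http.StatusFound)
	}))
	defer origin.Close()

	resp, err := newNoRedirectClient(5 * time.Second).Get(origin.URL)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, int(hits.Load()), nil
}

func main() {
	status, attackerHits, err := probeRedirect()
	fmt.Println(status, attackerHits, err)
}
```

A caller would then treat any status outside its expected set, including this 302, as a provider failure, which matches the "unexpected status" arm described above.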
Phase 2 review M1: a token whose pre-signed URL contained userinfo (https://user:pass@sts.us-east-1.amazonaws.com/...) would pass our EqualFold(u.Host, expectedHost) check because RFC 3986 separates userinfo from host: net/url puts "user:pass" into u.User and "sts.us-east-1.amazonaws.com" into u.Host independently. http.Client.Do then synthesizes Authorization: Basic <b64(user:pass)> from u.User and ships those attacker-controlled bytes to STS.
STS ignores Basic (it validates the X-Amz-Signature in the query string), so this isn't an active auth bypass, but attacker bytes still leave our box, with potential for exfiltration via STS access logs / timing.
Fix: reject u.User != nil at parse time, BEFORE the host check, in sigv4_parser.go's ParseToken. One line; defense-in-depth.
Regression test TestParseToken_RejectsUserinfo confirms the userinfo case is now caught and that the error message mentions "userinfo" (so future maintainers know what tripped).
42 packages still pass; lint + gofmt clean.
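A minimal sketch of the check order described above, assuming a standalone validator; `checkPresignedURL` and `errUserinfo` are hypothetical names, and the real check lives inside ParseToken. It demonstrates that a URL with userinfo still reports the expected host, so the host comparison alone cannot catch it.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

var errUserinfo = errors.New("presigned URL must not contain userinfo")

// checkPresignedURL rejects userinfo first, then compares the host.
func checkPresignedURL(raw, expectedHost string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse presigned URL: %w", err)
	}
	if u.User != nil {
		// u.Host would still equal expectedHost here, so this gate must come first.
		return errUserinfo
	}
	if !strings.EqualFold(u.Host, expectedHost) {
		return fmt.Errorf("unexpected host %q", u.Host)
	}
	return nil
}

func main() {
	const host = "sts.us-east-1.amazonaws.com"
	fmt.Println(checkPresignedURL("https://user:pass@"+host+"/?Action=GetCallerIdentity", host))
	fmt.Println(checkPresignedURL("https://"+host+"/?Action=GetCallerIdentity", host))
}
```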
Phase 2 review M2: STS enforces ~15min on X-Amz-Date + X-Amz-Expires
server-side, but our IdentityCache (60s default TTL) would happily
serve a cached Identity even if STS somehow accepted a stale token.
The cache key being hash(AKID|YYYYMMDD) gave the impression of
day-long replay potential — it doesn't, because the 60s TTL bounds
it — but the parser had zero own-source freshness check, so the
stolen-token replay window depended entirely on STS's enforcement.
Adding parser-side freshness as defense-in-depth:
PresignedToken gains two fields:
SigTime time.Time ← parsed from X-Amz-Date
Expires time.Duration ← parsed from X-Amz-Expires
ParseToken now requires both query params (rejects missing /
malformed / non-positive Expires). Parsing stays pure: no clock
comparison happens inside ParseToken.
PresignedToken.CheckFreshness(now, maxExpires, skew) gates:
- X-Amz-Expires > maxExpires → cap (default 15min — matches
what all standard AWS SDKs
emit for GetCallerIdentity)
- now > SigTime + Expires + skew → token expired
- SigTime > now + skew → token from the future
skew default 5min for normal clock drift.
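The three gates can be sketched as a pure function over an injected clock value. This is an illustration under assumptions: the type, error, and constant names below are hypothetical, the error values are simplified, and the X-Amz-Date layout is the compact ISO 8601 form used in SigV4 query strings.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	defaultMaxExpires = 15 * time.Minute // cap on X-Amz-Expires
	defaultSkew       = 5 * time.Minute  // tolerated clock drift
)

var (
	errExpiresTooLong = errors.New("X-Amz-Expires exceeds the configured cap")
	errExpired        = errors.New("token expired")
	errFromFuture     = errors.New("token signed in the future")
)

// presignedToken (hypothetical) carries the two freshness inputs.
type presignedToken struct {
	SigTime time.Time     // from X-Amz-Date
	Expires time.Duration // from X-Amz-Expires
}

// parseAmzDate parses the compact timestamp form, e.g. 20240101T120000Z.
func parseAmzDate(s string) (time.Time, error) {
	return time.Parse("20060102T150405Z", s)
}

// checkFreshness applies the gates in the order listed above.
func (t presignedToken) checkFreshness(now time.Time, maxExpires, skew time.Duration) error {
	switch {
	case t.Expires > maxExpires:
		return errExpiresTooLong
	case now.After(t.SigTime.Add(t.Expires + skew)):
		return errExpired
	case t.SigTime.After(now.Add(skew)):
		return errFromFuture
	}
	return nil
}

func main() {
	sig, err := parseAmzDate("20240101T120000Z")
	if err != nil {
		panic(err)
	}
	tok := presignedToken{SigTime: sig, Expires: defaultMaxExpires}
	fmt.Println(tok.checkFreshness(sig.Add(time.Minute), defaultMaxExpires, defaultSkew))
	fmt.Println(tok.checkFreshness(sig.Add(time.Hour), defaultMaxExpires, defaultSkew))
}
```

Passing `now` as an argument keeps the function deterministic, which is the same reason the commit adds an overridable `now` field to the provider.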
Config additions (both default-able, no operator action needed):
MaxTokenExpires time.Duration `yaml:"max_token_expires,omitempty"`
ClockSkew time.Duration `yaml:"clock_skew,omitempty"`
Provider gains a `now func() time.Time` field, default time.Now,
overridable in tests so freshness can be exercised without clock
monkey-patching.
Provider.Verify calls CheckFreshness right after region check, before
the cache lookup — STS round-trip and identity cache hit are both
short-circuited when freshness fails. ErrTokenRejected, audit reason
"rejected".
Tests:
parser-level (sigv4_parser_test.go):
TestParseToken_PopulatesSigTimeAndExpires
TestParseToken_RejectsMissingAmzDate
TestParseToken_RejectsMalformedAmzDate
TestParseToken_RejectsMissingAmzExpires
TestParseToken_RejectsNonNumericExpires
TestParseToken_RejectsNonPositiveExpires
TestCheckFreshness_Expired
TestCheckFreshness_FromTheFuture
TestCheckFreshness_ExceedsExpiresCap
TestCheckFreshness_HappyPathInsideSkew
provider-level (provider_test.go):
TestProvider_RejectsExpiredToken
TestProvider_RejectsTokenFromFuture
TestProvider_RejectsOverlyLongExpiresClaim
TestProvider_AcceptsTokenAtEdgeOfSkewWindow
Existing test fixtures updated:
tokenFor() now stamps X-Amz-Date built from fixedTestTime so
freshness passes by default; newTestProvider pins Provider.now
to that same instant. Day-bucket rollover test advances the
clock by 24h to match its day-2 token.
42 packages pass, lint + gofmt clean.
Phase 2 review M4: structural token_kind detection runs BEFORE the
chain, so a request with both a Bearer JWT and an X-Goog-Iap-Jwt-
Assertion records kind="jwt" — masking IAP-fronted traffic in audit
dashboards even when gcp_iap was the actual verifier.
The structural rule answers "what bytes were on the wire?" The
audit signal we want answers "which auth path verified?" Those
diverge only when a Bearer is present alongside a non-Bearer
provider's payload — today, gcp_iap is the only such provider.
Fix:
identity, err := chain.Verify(...)
if err == nil {
kind = refineTokenKind(kind, identity.Source)
}
refineTokenKind:
- Source == "gcp_iap" → return "iap_jwt"
- otherwise → return structural kind unchanged
Other providers (oidc, azure_ad, aws_sigv4, http_verifier,
static_token) don't need refinement: their structural kind
already matches the auth path that verifies them.
Failure paths still record the structural kind (no chain identity
to read Source from), so the existing "missing_token" /
"not_for_me" audit reason-code contract is unaffected.
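As a runnable version of the rule above, the refinement can be written as a two-argument function. This sketch assumes the structural kind and the verifying provider's Source are both plain strings; the real middleware types are not shown in this PR.

```go
package main

import "fmt"

// refineTokenKind upgrades the structural kind only when gcp_iap verified the request.
func refineTokenKind(structural, source string) string {
	if source == "gcp_iap" {
		return "iap_jwt"
	}
	return structural
}

func main() {
	// A Bearer JWT and an IAP assertion were both present; gcp_iap verified.
	fmt.Println(refineTokenKind("jwt", "gcp_iap"))
	// Same wire shape, but an OIDC provider verified: the kind is left alone.
	fmt.Println(refineTokenKind("jwt", "oidc"))
}
```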
Tests:
TestMiddleware_TokenKind_RefinedToIapJwtWhenGCPIAPVerifies
— Bearer JWT + IAP header both present; chain stubbed to
return Source="gcp_iap"; assert kind upgrades to iap_jwt
TestMiddleware_TokenKind_JWT_NotRefinedForOIDCProviders
— counter-test: Source="oidc" must NOT trigger the refinement
…nt doc (M3)
Phase 2 review M3:
1. When allow_multi_tenant: true, both the iss check (via composed
oidc.Provider's SkipIssuerCheck) AND the tid check were
suppressed — any Entra tenant in the world verified.
2. resolveIssuer()'s docstring said "azure_ad enforces tenant via
the tid claim instead" which was misleading: in multi-tenant
mode there was no tid enforcement at all.
Fix is two-part: add the missing knob, fix the doc.
allowed_tenants (Config.AllowedTenants []string)
Optional allowlist of Entra tenant GUIDs matched against the JWT
`tid` claim. Only meaningful with allow_multi_tenant=true.
Three operational modes:
single-tenant (default): tid MUST equal TenantID
multi-tenant + AllowedTenants set: tid MUST be in list
multi-tenant + AllowedTenants empty: no tid check ("any tenant globally"; documented high-risk shape, opted into by deliberately omitting the list)
Match is case-insensitive (Entra emits lowercase GUIDs; operators
often paste uppercase from the portal).
Factory-time validation rejects allowed_tenants in single-tenant
mode (TenantID is THE gate there; the combination would silently
degrade if not caught).
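A rough sketch of how the three modes and the factory-time check could fit together. The type `tenantGate` and its methods are hypothetical names for illustration; only the field names TenantID and AllowedTenants and the allow_multi_tenant knob come from the commit message. Comparing the single-tenant tid case-insensitively is also an assumption here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// tenantGate is a hypothetical stand-in for the relevant azure_ad config fields.
type tenantGate struct {
	TenantID         string
	AllowMultiTenant bool
	AllowedTenants   []string
}

var errTenantRejected = errors.New("tid not permitted")

// validate mirrors the factory-time rule: an allowlist without
// multi-tenant mode is a configuration error, not a silent no-op.
func (g tenantGate) validate() error {
	if !g.AllowMultiTenant && len(g.AllowedTenants) > 0 {
		return errors.New("allowed_tenants requires allow_multi_tenant: true")
	}
	return nil
}

// checkTid applies the three operational modes to a token's tid claim.
func (g tenantGate) checkTid(tid string) error {
	switch {
	case !g.AllowMultiTenant:
		// Single-tenant: tid must equal the configured tenant.
		if tid == "" || !strings.EqualFold(tid, g.TenantID) {
			return errTenantRejected
		}
	case len(g.AllowedTenants) > 0:
		// Multi-tenant with allowlist: a missing tid is rejected too.
		if tid == "" {
			return errTenantRejected
		}
		for _, allowed := range g.AllowedTenants {
			if strings.EqualFold(allowed, tid) {
				return nil
			}
		}
		return errTenantRejected
	}
	// Multi-tenant with empty allowlist: no tid check (high-risk shape).
	return nil
}

func main() {
	g := tenantGate{AllowMultiTenant: true, AllowedTenants: []string{"AAAA-1111"}}
	fmt.Println(g.validate(), g.checkTid("aaaa-1111"), g.checkTid("bbbb-2222"))
}
```

The empty-allowlist branch falls through on purpose, which is why the validator warning matters: nothing in the verify path can tell an intentional "any tenant" config from a forgotten list.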
Validator + non-interactive flag + tests
- forge-core/validate/auth.go: rejects allowed_tenants in single-
tenant mode at validate-time; WARNS when multi-tenant+empty so
the "any tenant globally" trade-off is loud
- forge-cli/cmd/init.go: --auth-azure-allowed-tenant flag
(repeatable, mirrors --auth-aws-allowed-account)
- forge-cli/cmd/init_auth.go: buildAuthFromFlags forwards through
- forge-cli/cmd/init_auth_test.go: mock adds the new flag
Five new tests (provider_test.go):
AllowedTenants_AcceptsListed
AllowedTenants_RejectsUnlisted
AllowedTenants_CaseInsensitive
AllowedTenants_MissingTidRejected
SingleTenant_WithAllowedTenants_RejectedAtFactory
resolveIssuer() docstring rewrite
Now spells out the three operational modes explicitly, including
the "any-tenant" shape with its security implication. No more
"azure_ad enforces tenant via the tid claim" line that's only
half-true.
CHANGELOG gains an "allowed_tenants" section with the canonical
recipe.
42 packages pass, lint + gofmt clean.
Phase 2 review M5: validateProviderSettings was permissive on unknown
keys. A typo like `aud:` instead of `audience:` would slip past the
required-key check silently (asString returns ""), and — more
critically — the forge-ui handler forwarded a.Settings unfiltered to
the on-disk scaffold. A POST like
{"settings": {"audience": "x", "evil_key": "y"}}
would write evil_key into forge.yaml verbatim. Today's provider Config
structs ignore unknown YAML fields via yaml.v3's default, so it's
harmless — but one missed `yaml:"-"` tag on a future field would
suddenly make it reachable from untrusted POST input. Closed at the
boundary now.
Two-part fix:
1. forge-core/validate/auth.go
- New exported KnownAuthProviderSettings map[string]map[string]bool
(closed whitelist per provider type — mirrors the yaml: tags on
each provider's Config struct, kept in sync by convention).
Internal yaml:"-" fields are intentionally absent.
- validateProviderSettings emits a Warning per unknown settings
key during `forge validate`. Loose-not-strict because some
operators stash custom annotations from pre-Phase-2 days;
warning surfaces the typo without breaking those configs.
- New exported FilterKnownSettings(providerType, settings)
returns a copy with unknown keys dropped. Defense-in-depth
filter for use at trust boundaries.
2. forge-ui/handlers_create.go
- validateAuthPayload calls FilterKnownSettings on the incoming
a.Settings BEFORE validation OR scaffolding. Closes the
specific exploit chain (Web UI POST → forge.yaml) regardless
of yaml tag discipline drift on Config structs.
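The boundary filter can be sketched as below. This is an illustration under assumptions, not the real implementation: the lower-case names are hypothetical, the whitelist holds only an illustrative subset of keys (`tenant_id` in particular is a guess), and the pass-through for unknown provider types is inferred from the test name TestFilterKnownSettings_UnknownProviderTypePassthrough.

```go
package main

import "fmt"

// knownSettings is an illustrative subset of a per-provider whitelist.
// The real map is described as mirroring each Config struct's yaml tags.
var knownSettings = map[string]map[string]bool{
	"azure_ad":  {"tenant_id": true, "audience": true, "allow_multi_tenant": true, "allowed_tenants": true},
	"aws_sigv4": {"region": true, "allowed_principals": true},
}

// filterKnownSettings returns a copy of settings with unknown keys dropped.
// Unknown provider types are copied through unchanged (assumed behavior).
// The input map is never mutated.
func filterKnownSettings(providerType string, settings map[string]any) map[string]any {
	out := make(map[string]any, len(settings))
	known, typeKnown := knownSettings[providerType]
	for k, v := range settings {
		if !typeKnown || known[k] {
			out[k] = v
		}
	}
	return out
}

func main() {
	posted := map[string]any{"audience": "x", "evil_key": "y"}
	fmt.Println(filterKnownSettings("azure_ad", posted)) // map[audience:x]
}
```

Returning a fresh map rather than deleting in place keeps the caller's payload intact for logging or error messages.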
Tests:
validate package:
TestValidateAuthConfig_WarnsOnUnknownSettingsKey
TestValidateAuthConfig_NoWarningForKnownKeys
TestFilterKnownSettings_DropsUnknownKeys
TestFilterKnownSettings_UnknownProviderTypePassthrough
TestFilterKnownSettings_AllPhase2Providers (subtests per provider/key)
forge-ui:
TestHandleCreateAgent_FiltersUnknownAuthSettings — end-to-end
POST with evil_key in settings; assert it does NOT survive
into the scaffold's captured AuthCreateOptions.Settings
42 packages still pass; lint + gofmt clean.
Batch-clearing the "don't block merge" follow-ups from review of PR #79.
#1 gcp_iap classifyJWTErr: use jwt v5 sentinels via errors.Is rather than substring matching (library wording shifts across patches; sentinels are public API). Special-case ErrTokenSignatureInvalid to split alg-confusion (→ ErrInvalidToken) from real bad-signature (→ ErrTokenRejected) because golang-jwt wraps both under that one sentinel. Three internal keyFunc message-matches retained: those are strings WE control, not the library's.
#2 gcp_iap JWKS merge-on-success: switched j.keys = newKeys to a per-kid merge. A partial-but-valid JWKS response (e.g. one kid accidentally omitted by GCP during rotation) no longer drops kids the stale-grace contract assumes we still have. Worst case is keeping a retired kid in cache; verification still fails naturally for any token signed with the retired private key.
#3 azure_ad GraphCache defensive copies: Get returns append([]string(nil), e.groups...) and Put stores a copy of its input. Caller mutating Identity.Groups (the auth.Identity layer treats it as freely mutable) can't poison the cache.
#4 forge-cli needsYAMLQuoting numeric edge cases: quote anything that resembles a YAML number (hex 0x, octal 0o, binary 0b, leading-zero "010", scientific 1e10, decimal float 3.14, signed ±N, .inf / .nan in either case). Auth-setting values rarely hit these shapes but the docstring promised "false negatives are bugs" and the Web UI POST path can supply arbitrary strings. Added looksNumeric() helper with separate allHexDigits / allOctalDigits / allBinaryDigits gates.
#5 aws_sigv4 identity_cache_test: replaced string(rune(i)) with strconv.Itoa(i). Surrogate code points (0xD800..0xDFFF) all map to U+FFFD, so the eviction-threshold test was silently building ~10 distinct keys instead of 10_001 and the sweep never ran.
#6 http.NewRequestWithContext error handling: fixed the two `req, _ := ...` antipatterns in gcp_iap/iap_jwks.go and azure_ad/graph_client.go. Hardcoded URLs make the failure currently unreachable, but errcheck-clean is the discipline.
#7 gcp_iap HS256-with-EC-public-key alg-confusion test: pinned the most dangerous attack shape: attacker fetches the verifier's public key from JWKS (open by design), uses raw X/Y bytes as the HMAC "secret", signs an HS256 token. A non-whitelisting verifier would HMAC-verify it. Our keyFunc rejects on alg != "ES256" BEFORE key lookup; this test confirms.
Tests added: TestGraphCache_GetReturnsDefensiveCopy, TestGraphCache_PutStoresDefensiveCopy, TestProvider_HS256WithECPublicKeyAsSecret_Rejected. Existing TestProvider_RS256Token_Rejected still passes (alg-confusion still classified as ErrInvalidToken under the new sentinel-based path).
42 packages green, lint + gofmt clean.
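The defensive-copy change in the GraphCache item can be illustrated with a stripped-down cache. This is a sketch only: `groupCache` is a hypothetical type without the TTL handling a real cache would have, and it shows just the copy-on-Put and copy-on-Get idiom quoted in the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// groupCache is a hypothetical, TTL-free stand-in for a groups cache.
type groupCache struct {
	mu      sync.Mutex
	entries map[string][]string
}

func newGroupCache() *groupCache {
	return &groupCache{entries: make(map[string][]string)}
}

// Put stores a copy, so later writes to the caller's slice cannot reach the cache.
func (c *groupCache) Put(key string, groups []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = append([]string(nil), groups...)
}

// Get returns a copy, so callers may mutate the result freely.
func (c *groupCache) Get(key string) ([]string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	g, ok := c.entries[key]
	if !ok {
		return nil, false
	}
	return append([]string(nil), g...), true
}

func main() {
	c := newGroupCache()
	in := []string{"admins"}
	c.Put("user-1", in)
	in[0] = "tampered" // mutate the input after Put

	out, _ := c.Get("user-1")
	out[0] = "tampered-too" // mutate the returned slice

	again, _ := c.Get("user-1")
	fmt.Println(again) // [admins]
}
```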
Ran the /sync-docs recipe over PR #79's surface (commit 745c024).
Two pre-flight fixes before the sync:
- .claude/commands/sync-docs.md mapping table was outdated: it pointed at flat files (`docs/commands.md`, `docs/runtime.md`, …) that don't exist; the real layout is nested under core-concepts/, reference/, security/. Rewrote the table to match the actual tree and added a `forge-core/auth/` → authentication.md + audit-logging.md row.
- docs/auth/ is gitignored (per earlier review feedback). Did NOT move that content into git; the new canonical home is docs/security/authentication.md inside the existing tracked docs/security/ folder.
New doc: docs/security/authentication.md (423 lines), the single home for the full auth provider chain across Phase 1 + Phase 2. Covers:
- Provider matrix and chain semantics (first-match-wins, fail-closed on reject, loopback static_token auto-prepend)
- Per-provider sections for static_token, oidc, aws_sigv4, gcp_iap, azure_ad, http_verifier, including yaml shape, wire format, client-side recipe, security model
- aws_sigv4 client-side gotcha (boto3.generate_presigned_url doesn't work for STS; use SigV4QueryAuth)
- AWS Org-wide trust recipes (Identity Center, entry role + aws:PrincipalOrgID condition)
- Egress allowlist auto-extension table per provider
- Wizard / CLI examples
- Mesh patterns (single-account fleet, per-pair allowlist)
Updates to satellite docs:
docs/security/overview.md
- Added Authentication layer to the security-architecture ASCII diagram (between Guardrails and Egress)
- New "Authentication" section with provider table + chain rules
- Authentication row in Related Documentation footer
docs/security/audit-logging.md
- auth_verify and auth_fail event entries in the Event Types table
- Full example shape for both events
- Reason-code table (missing_token, not_for_me, rejected, invalid, provider_unavailable) with operator actions
- token_kind value table (empty, opaque, jwt, sigv4, iap_jwt)
- Audit pipeline jq grep recipes for common queries
docs/security/egress-control.md
- New "Auth-provider domain auto-extension" subsection under Allowlist Resolution with the per-provider host table
- Notes the wizard's Auth-before-Egress ordering
docs/reference/forge-yaml-schema.md
- Full auth: block inserted in the schema example with all 6 provider types and every documented settings key
docs/reference/cli-reference.md
- All --auth-* flags added to the forge init flag table
- Two new example invocations (aws_sigv4 with allowed-account, azure_ad multi-tenant with allowlist)
- Link to docs/security/authentication.md
docs/reference/web-dashboard.md
- Auth step added to the wizard steps table (between Fallback and Env & Security)
- New "Auth step" subsection listing all 7 picker options + the Web UI's settings-key filter behavior
README.md
- Authentication row added to the Security documentation table
Broken-link scan: zero broken links in tracked docs/. 42 packages still pass go test -race; lint + gofmt clean. No code changes in this commit, docs only.
Done
Re-review of fix commits: LGTM, ready to merge
Verified all 8 fix commits against the prior review. Tests pass with BLOCKERs: all cleared (
…uthenticator)
Phase 2 PR 2's original "reflect Sigv4 headers" design was broken in the obvious way: Sigv4 binds its signature to the destination host as part of the canonicalized signing input. Headers signed for Forge's host could not be replayed against STS: STS sees host:sts.<region>.amazonaws.com, recomputes the signature, gets a different hash, rejects with "SignatureDoesNotMatch". Caught during real-AWS smoke; documented in PR #79 description.
This commit replaces the pattern with the same approach aws-iam-authenticator uses for EKS:
Client (3 lines):
    url = boto3.client('sts').generate_presigned_url('get_caller_identity', ExpiresIn=900)
    token = 'forge-aws-v1.' + base64.urlsafe_b64encode(url.encode()).rstrip(b'=').decode()
    requests.post(forge_url, headers={'Authorization': f'Bearer {token}'}, ...)
Server:
    Authorization: Bearer forge-aws-v1.<base64-of-presigned-sts-url>
    → decode + validate host (SSRF guard) + GET on the URL → STS → identity
Net effect on caller experience: identical to JWT/OIDC/azure_ad: "mint token, send Bearer, done." Three lines of client code, hidden in ~15 lines of any AWS SDK in any language.
what changed:
forge-core/auth/providers/aws_sigv4/
- sigv4_parser.go: was parsing the AWS4-HMAC-SHA256 Authorization header; now parses forge-aws-v1.<base64-url> Bearer tokens (URL host validation, SSRF guard, X-Amz-Credential parsing for cache key derivation)
- sts_client.go: was POST with reflected headers; now GET on the pre-signed URL; same 200/4xx/5xx classification and 64 KiB body cap
- provider.go: Verify() now reads the Bearer token (not raw headers); SSRF guard via expectedHost field; same cache + ARN allowlist semantics
forge-core/auth/
- provider.go: HeadersFromRequest reverts X-Amz-Date and X-Amz-Security-Token (no longer needed); keeps X-Goog-Iap-Jwt-Assertion for gcp_iap
- provider.go: TokenKind detects "forge-aws-v1." prefix → "sigv4" (was: "AWS4-HMAC-SHA256 " on raw Authorization)
- middleware.go: simplified; the empty-Bearer fallback only handles IAP (aws_sigv4 rides the standard Bearer flow now)
scripts/forge-aws-sign.py: rewritten as a clean reference client.
- --token-only: print just the token for use with curl/other tools
- Otherwise: do the round-trip POST and print the response
CHANGELOG.md: replace the "client wrapper required" friction note with the 3-line happy path snippet
what stays unchanged:
- forge.yaml shape (still type: aws_sigv4, region:, allowed_principals:)
- identity_cache.go, arn_matcher.go (cache and authz logic untouched)
- security.AuthDomains (sts.<region>.amazonaws.com derivation)
- forge-cli/cmd/init* flag set and renderer
- validate.ValidateAuthConfig (region still required)
- forge-ui/handlers_create.go (AuthProviderTypeMeta entry)
Tests: 42 packages pass, golangci-lint v2.10.1 clean, gofmt clean, no aws-sdk-go imports (decision §9.1 still holds).
Net diff: +732 / -625 lines (mostly test rewrites; ~80 LOC net less in the provider package because the new flow is structurally simpler).
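The structural detection mentioned for TokenKind might look roughly like this. Only the "forge-aws-v1." prefix rule and the kind names come from the source; the lower-case function name and the "two dots means JWT" heuristic are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenKind classifies a bearer token by its shape alone, without verifying it.
func tokenKind(token string) string {
	switch {
	case token == "":
		return "empty"
	case strings.HasPrefix(token, "forge-aws-v1."):
		// Checked before the JWT heuristic so the prefix always wins.
		return "sigv4"
	case strings.Count(token, ".") == 2:
		// Assumption: three dot-separated segments look like a JWT.
		return "jwt"
	default:
		return "opaque"
	}
}

func main() {
	for _, tok := range []string{"", "forge-aws-v1.aHR0cHM6Ly9zdHM", "a.b.c", "s3cr3t"} {
		fmt.Printf("%q -> %s\n", tok, tokenKind(tok))
	}
}
```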
Phase 2 review (PR #79) flagged three same-root-cause issues across the outbound HTTP clients in azure_ad, aws_sigv4, and gcp_iap:
- B1: ensureGraphHost accepted http:// nextLinks, so the caller's Bearer would leak in plaintext (Go strips Authorization on cross-host redirect but NOT on https→http same-host downgrade).
- B2: Graph client used Go's default redirect policy (follow up to 10). ensureGraphHost only validates @odata.nextLink, not HTTP 301/302/307. A redirect from Graph → attacker URL would let the attacker control the JSON body that becomes Identity.Groups (latent until Phase 4 authz lands; closing at the boundary now).
- B3: aws_sigv4's STS client had the same default-redirect issue. The parser-side same-host gate covers only the first hop; a redirect off sts.<region>.amazonaws.com would let attacker bytes become the parsed STS XML and control Identity.UserID/OrgID/Arn.
Verified all three are real (Go default CheckRedirect follows up to 10 per net/http docs). Smoking gun confirmed in five outbound clients across forge-core/auth/providers/:
    aws_sigv4/sts_client.go ← fixed (B3)
    gcp_iap/iap_jwks.go ← fixed (not in review; same pattern, same blast radius: attacker JWKS means forged tokens verify)
    azure_ad/graph_client.go ← fixed (B1+B2)
    oidc/provider.go ← Phase 1 code, NOT touched here; OIDC discovery sometimes legitimately redirects, needs design discussion. Filed as follow-up on issue #80.
    httpverifier/provider.go ← Phase 1 code, NOT touched here; operator-configured URL may be behind LB. Follow-up on issue #80.
Fix shape (uniform across the three Phase 2 providers):
    http.Client{
        Timeout: timeout,
        CheckRedirect: func(*http.Request, []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
These three endpoints (STS GetCallerIdentity, IAP JWKS, Graph transitiveMemberOf) NEVER legitimately issue 3xx: the auth contract each provider validates is bound to the configured host. Returning ErrUseLastResponse causes the 3xx response to flow into our existing status-code switch, where the "unexpected status" default arm maps it to ErrProviderUnavailable (and the audit logs "provider_unavailable", not a misleading "rejected").
B1 specific (separate from the redirect-policy change): ensureGraphHost now checks both Host AND Scheme. GraphClient stores endpointScheme alongside endpointHost, pre-parsed at construction so the per-page check stays cheap. Test mode (httptest's http://) still works because the configured scheme is matched against, not hard-coded https.
Regression tests added (one per provider):
    TestSTSClient_DoesNotFollowRedirects
    TestJWKSCache_DoesNotFollowRedirects
    TestGraphClient_DoesNotFollowRedirects
Each spins up an httptest server that returns 302 → attacker URL, runs the client, asserts:
- error returned (not silently followed)
- error is ErrProviderUnavailable (correct audit reason)
- endpoint hit exactly once (counts redirect-follows)
Plus two scheme-specific cases for B1:
    TestEnsureGraphHost_RejectsSchemeDowngrade: https→http same-host
    TestEnsureGraphHost_TestModeHTTPOK: httptest http stays OK
42 packages pass go test -race -count=1; golangci-lint v2.10.1 clean; gofmt clean.
Summary
Phase 2 of the pluggable auth provider work — three cloud-native providers
on top of the Phase 1 foundation (#77 /
7998f12). Customers authenticate to Forge using identities they already have in their cloud; no parallel
IdP required.
- `aws_sigv4`: caller pre-signs an STS `GetCallerIdentity` URL with their AWS SDK; Forge invokes it. STS returns the caller's canonical ARN. Same pattern as `aws-iam-authenticator` (EKS). Wire format: `Authorization: Bearer forge-aws-v1.<base64-of-presigned-sts-url>`
- `gcp_iap`: verifies `X-Goog-Iap-Jwt-Assertion` when Forge sits behind a GCP HTTPS LB + IAP. Wire format: `X-Goog-Iap-Jwt-Assertion: <jwt>`
- `azure_ad`: composes the Phase 1 `oidc` provider. Wire format: `Authorization: Bearer <aad-jwt>`
Forge never holds any IdP secrets: all three providers verify a caller-minted credential against a third party (STS / GCP JWKS / AAD JWKS).
Why this matters
Today, putting Forge behind any of the three big cloud IdPs requires
standing up a parallel OIDC issuer (Cognito for AWS, Workspace SAML, etc.).
This PR removes that friction:
- … existing IAM credentials. Zero secrets stored on Forge, no token endpoint to host.
- … assertion directly.
- … AAD-specific quirks (tenant gate, groups overage) handled correctly.
Design pivot during PR review: `aws_sigv4` switched to pre-signed URL pattern
The PR went through an in-flight design correction caught by real-AWS smoke testing. The TL;DR:
What was wrong (original design, commits `9a1ebae` through `382294e`)
The first design had clients sign their POST to Forge using AWS Sigv4 ("header reflection"). Forge would forward the signed headers to STS, expecting STS to validate them.
This is broken in a deterministic way: Sigv4 binds its signature to the destination host as part of the canonicalized signing input. Headers signed for Forge's hostname can't be replayed against STS: STS sees `host: sts.<region>.amazonaws.com`, recomputes the signature, gets a different hash, and rejects with `SignatureDoesNotMatch`.
Standard tools (`awscurl`, `boto3.client('sts')`, all AWS SDKs) always sign for the URL they're calling, so there was no working client path.
What's correct (current design, commit `8568535`)
Switched to the pre-signed URL pattern that `aws-iam-authenticator` uses for EKS:
1. Client pre-signs an STS `GetCallerIdentity` URL: signature embedded in query params, signed for STS's host.
2. Client base64url-encodes the URL, prefixes it with `forge-aws-v1.`, and sends it as a standard `Authorization: Bearer …` header.
3. Forge decodes the token, validates that the host is `sts.<region>.amazonaws.com` (SSRF guard), GETs the URL, parses the XML response, stamps Identity.
Why this is the right design (not just a fix)
Three design properties land cleanly:
- … now have the same caller experience: mint a Bearer token, attach it, send. Three lines of Python (or any AWS-SDK-bearing language). No custom signing logic, no header manipulation, no per-call procedure.
- … `aws-iam-authenticator`-style token format works with any AWS SDK's `SigV4QueryAuth` API.
- … URL. ~80 LOC of STS client code instead of header-forwarding plumbing.
Locked design decisions (unchanged through the pivot)
- No `aws-sdk-go-v2` dependency. STS client is hand-rolled HTTP + XML.
- `azure_ad` composes Phase 1 `oidc.Provider`; no JWT/JWKS code in `azure_ad/`.
- `allowed_principals` uses `path.Match` shell globs (no regex).
What clients write (the actual 3 lines)
A reference client lives at `scripts/forge-aws-sign.py` (~80 LOC; CLI flags for `--region`, `--profile`, `--token-only`, etc.).
✅ Real-AWS validation (live AWS account)
End-to-end validated against real AWS STS using SSO assumed-role credentials in account `412664885516`. Forge instance built from commit `8568535`, running locally.
- `curl` with no auth headers | `missing_token` | valid bearer token required
- … | `auth_verify` fires | `auth_verify` emitted
- … `allowed_principals` | `auth_verify` fires | `ArnMatcher` accepts matching ARNs
- … `allowed_principals` | rejected | token rejected by auth provider | `ArnMatcher` correctly denies non-matching ARNs
Audit log emitted (Test #2 success path)
    {
      "event": "auth_verify",
      "fields": {
        "method": "POST",
        "path": "/tasks/send",
        "provider": "aws_sigv4",
        "user_id": "arn:aws:sts::412664885516:assumed-role/AWSReservedSSO_PowerUserAccess_c794d5f2c2fe4370/Naveen",
        "org_id": "412664885516",
        "token_kind": "sigv4",
        "groups_count": 0,
        "remote_addr": "[::1]:62448"
      }
    }
Every field correct: `provider` matches the configured name, `user_id` is the STS-returned canonical ARN (assumed-role form, including session name), `org_id` is the AWS account number, `token_kind` is the new `forge-aws-v1.`-prefix detection.
Two bugs caught during live testing (both fixed in `8568535`)
- … `net/url`. Round-tripping the presigned URL through Go's `*url.URL.String()` re-encoded query params in ways that differed from how the AWS SDK emitted them (e.g., `/` in `X-Amz-Credential`, `+` inside `X-Amz-Security-Token`). STS recomputed the canonical request from those re-encoded bytes and rejected. Fix: `PresignedToken` keeps a `RawURL string` field with the byte-for-byte original; the parsed `*url.URL` is used only for SSRF host validation and query-param inspection.
- … `generate_presigned_url` quirk. The reference client originally used `boto3.client('sts').generate_presigned_url('get_caller_identity', …)`, which produces a URL STS rejects with `SignatureDoesNotMatch` (known boto3 quirk: signs as if POST). Fix: the reference client now uses the lower-level `SigV4QueryAuth.add_auth()` directly, same pattern `aws-iam-authenticator` uses.
Other layers also exercised live
- … `sts.us-east-1.amazonaws.com`: Forge's outbound STS call was permitted.
- `static_token` still works for the dashboard at `http://localhost:9999/` (auto-prepended via PrependChain; Phase 1 review #10 invariant intact under Phase 2).
- … `forge.yaml` content changes for most fields. Caveat: auth-chain providers are constructed once at startup; modifying `allowed_principals` requires a hard restart (`Ctrl-C` + `forge run` again) for the new allowlist to take effect. Documented as a follow-up; affects all providers, not just `aws_sigv4`.
- … `639bfa9`) ensures wizard-scaffolded configs include the STS host in `egress_hosts`.
What was NOT live-tested (and why)
- `gcp_iap` provider: requires a GCP project with HTTPS LB + IAP enabled. Covered by unit tests with a fake JWKS signer.
- `azure_ad` provider: requires an Entra tenant + app registration. Covered by unit tests with a fake AAD.
Per `PHASE2_TEST_STRATEGY.md` §8.2, those live tests run at release-tag time, not on every PR.
What landed (12 commits)
- `55942d8` …
- `9a1ebae` `aws_sigv4` provider (initial, header-reflection design)
- `5b71071` `gcp_iap` provider
- `1e23140` `azure_ad` provider
- `47c4748` …
- `98578f9` …
- `639bfa9` …
- `b5f303b` …
- `aaf8375` … `docs/auth` (per reviewer feedback)
- `382294e` …
- `b3444c2` … `aws_sigv4` to pre-signed URL pattern
- `8568535` … `SigV4QueryAuth` in client
Total: ~42 files, +5,500 / -700 lines net (mostly tests + the design-pivot rewrite).
Phase 1 compatibility
- … `static_token`, `oidc`, `http_verifier`). Phase 1 test suite passes unmodified.
- `Headers` map gained one new key, `X-Goog-Iap-Jwt-Assertion`, for `gcp_iap`. Existing keys unchanged.
- `oidc` package gained an internal `SkipIssuerCheck` field with `yaml:"-"`: unreachable from `forge.yaml`, only set by `azure_ad` multi-tenant. Operators see no change.
Security model highlights
- `aws_sigv4`: the pre-signed URL host MUST match `sts.<configured-region>.amazonaws.com`. A token whose URL points elsewhere is rejected at parse time, before any outbound request.
- `azure_ad`: multi-tenant requires explicit opt-in; Graph calls only fire after `tid` validation.
Known deferred work
- … `forge.yaml` directly until the TUI follow-up lands.
- … `auth.providers` (incl. `allowed_principals`) require a hard `forge run` restart. Affects all providers, not just Phase 2.
Test plan
- `go test -race -count=1 ./...`: 42 packages green
- `golangci-lint v2.10.1`: 0 issues
- `gofmt -l forge-core forge-cli forge-plugins`: clean
- … `aws-sdk-go` import, IAP constants confined, no JWT in `azure_ad`, `skip_issuer_check` never in YAML): all pass
- … `gcp_iap` and `azure_ad`: runs at release-tag time per `PHASE2_TEST_STRATEGY.md §8.2`
Design artifacts (offline)
Full design package in `~/Desktop/forge_designs_and_PRD/phase2_implementation/`:
- `PHASE2_CLOUD_NATIVE_PROVIDERS.md`: top-level design + §9 locked decisions
- `PHASE2_PROGRESS_MAP.md`: diagram-to-PR tracker
- `PHASE2_TEST_STRATEGY.md`: pyramid, harnesses, security catalog, CI gates, manual smoke
- `PR1_HEADER_CONTRACT.md` through `PR6_DOCS.md`: per-PR checklists with code sketches and acceptance criteria