
Clusters with cluster-external control planes cannot start the multicluster gateway, readiness probes are blocked #7560

Closed
AaronFriel opened this issue Jan 3, 2022 · 5 comments
Labels: enhancement, priority/P1 Planned for Release


AaronFriel commented Jan 3, 2022

What problem are you trying to solve?

On some Kubernetes distributions, requests from the control plane may not originate from a private IP address range, or even from a consistent IP address. This poses a problem because the admin server used in a multicluster mesh needs to simultaneously serve its /live and /ready routes to:

  • The Kubernetes control plane, for liveness and readiness probes respectively
  • Remote clusters, as part of probing the remote gateway

To avoid exposing the other admin routes, the multicluster gateway uses an authorization policy that forbids unauthorized and out-of-cluster requests. This causes the gateway to fail its readiness and liveness probes.

Example: On Linode Kubernetes Engine (LKE), probes originate from outside the cluster (e.g. from 45.79.0.0/21), but the ServerAuthorization policy on the linkerd-gateway by default only allows localhost. (A workaround sketch follows the trace logs below.)

See these trace logs:

# This line edited for readability:
[    29.766629s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=45.79.3.202:60606}: linkerd_app_inbound::policy::authorize::http: Authorizing request policy=AllowPolicy { dst: OrigDstAddr(0.0.0.0:4191), 
  server: Receiver { shared: Shared { value: RwLock(RwLock { data: ServerPolicy { protocol: Http1, 
    authorizations: [
      Authorization { networks: [Network { net: 0.0.0.0/0, except: [] }, Network { net: ::/0, except: [] }], authentication: TlsAuthenticated { identities: {}, suffixes: [Suffix { ends_with: "" }] }, name: "linkerd-gateway-probe" }, 
      Authorization { networks: [Network { net: 10.0.0.0/8, except: [] }, Network { net: 100.64.0.0/10, except: [] }, Network { net: 172.16.0.0/12, except: [] }, Network { net: 192.168.0.0/16, except: [] }], authentication: Unauthenticated, name: "proxy-admin" }, 
      Authorization { networks: [Network { net: 127.0.0.1/32, except: [] }, Network { net: ::1/128, except: [] }], authentication: Unauthenticated, name: "default:localhost" }
    ], name: "gateway-proxy-admin" } }), state: AtomicState(2), ref_count_rx: 8, notify_rx: Notify { state: 4, waiters: Mutex(Mutex { data: LinkedList { head: None, tail: None } }) }, notify_tx: Notify { state: 1, waiters: Mutex(Mutex { data: LinkedList { head: Some(0x7fd619cb8d78), tail: Some(0x7fd619cb8d78) } }) } }, version: Version(0) } }
[    29.766730s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=45.79.3.202:60606}: linkerd_app_inbound::policy::authorize::http: Request denied server=gateway-proxy-admin tls=None(NoClientHello) client=45.79.3.202:60606
[    29.766757s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=45.79.3.202:60606}:rescue{client.addr=45.79.3.202:60606}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server gateway-proxy-admin
[    29.766776s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=45.79.3.202:60606}: linkerd_app_core::errors::respond: Handling error on HTTP connection status=403 Forbidden version=HTTP/1.1 close=false
[    29.766794s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=45.79.3.202:60606}:encode_headers: hyper::proto::h1::role: Server::encode status=403, body=None, req_method=Some(GET)
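
For completeness, a per-cluster workaround is possible today: an additional ServerAuthorization that admits unauthenticated probes from the provider's probe range. This is only a sketch; the namespace and authorization name below are assumptions, and the Server name is taken from the logs above:

apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  namespace: linkerd-multicluster   # assumed extension namespace
  name: gateway-probe-source        # hypothetical name
spec:
  server:
    name: gateway-proxy-admin       # the Server named in the logs above
  client:
    # Kubelet probes arrive unauthenticated from the provider's range.
    unauthenticated: true
    networks:
      - cidr: 45.79.0.0/21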

How should the problem be solved?

I would suggest adding[1] a separate server to the proxy on a distinct port (see the probe sketch after this list). The implementation could occur in a series of steps:

  1. Merge in and release a new proxy with the next stable release, exposing /ready and /live on a new port while maintaining the existing routes on the admin port.
  2. When that feature reaches stable, update the charts and CLI in this repo to use that image, modifying the injector to point probes at that port.
  3. On a timeline agreeable to vendors rolling their own injection method or relying upon the existing /ready and /live routes on the admin server, deprecate those routes.
  4. Eventually, remove /ready and /live from the admin server.
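
To make steps 1 and 2 concrete, the injected probes would end up looking roughly like the sketch below. Port 4192 is the port used in the pull requests referenced later in this thread; the port name and the rest of the container stanza are illustrative only:

# Excerpt of the linkerd-proxy sidecar after injection, with probes moved
# off the admin port (4191) onto a dedicated health port (4192).
containers:
  - name: linkerd-proxy
    ports:
      - name: linkerd-admin
        containerPort: 4191
      - name: linkerd-health         # hypothetical name for the new port
        containerPort: 4192
    livenessProbe:
      httpGet:
        path: /live
        port: 4192
    readinessProbe:
      httpGet:
        path: /ready
        port: 4192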

[1] I have done so in these two pull requests:

Any alternatives you've considered?

In the Linkerd community Discord, @olix0r has suggested that route-based authorizations, being worked on for a future Linkerd release, would allow this dual role.

My arguments in favor of a separate health server are:

  1. The separate server provides defense in depth and least privilege to readiness and liveness probes.
  2. Routes that do not require an authorization policy mitigate the risk that accidental or temporary deletion of policies exposes admin server routes to the internet.
  3. Route-based authorization, even in the presence of a strict default-deny cluster policy, leaves the proxy injector with a cumbersome and significant amount of additional work to maintain.

1.

Best practice with apps on Kubernetes, and generally, is least privilege: a port that only exposes an HTTP server serving /ready is easier to secure than one that also exposes /fireTheMissiles (hyperbole... but only a little). Separate ports with separate concerns are easily handled using existing tooling, and safely exposed (if the user wishes) using the L3 routing Kubernetes provides by default to containers and via load balancers.
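
To illustrate the "separate ports, separate concerns" point: with a dedicated health port, a plain Kubernetes NetworkPolicy can expose only that port. The name, labels, namespace, and port numbers below are assumptions, and a real policy would also need to admit the gateway's data-plane port; this is only a sketch of the port-scoping idea:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gateway-health-only          # hypothetical
  namespace: linkerd-multicluster
spec:
  podSelector:
    matchLabels:
      app: linkerd-gateway           # illustrative pod label
  policyTypes:
    - Ingress
  ingress:
    # Allow the health port from any source (kubelet probes, remote
    # gateways); nothing below admits the admin port (4191), so other
    # ingress to it is dropped once this policy selects the pod.
    - ports:
        - protocol: TCP
          port: 4192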

2.

In a default cluster install, the absence of a server authorization fails open (all-unauthenticated), which means that any mistake that removes the server authorization from a gateway will expose privileged routes to the internet. Infrastructure as code could cause a ServerAuthorization to be briefly deleted (replaced), which would leave those routes open to the internet for that window. As long as the default authorization policy remains all-unauthenticated, a multicluster gateway exposing the admin port to the internet is a large and risky footgun. Consider the proposed solution versus a route-based authorization: which is simpler to maintain?

One may note, with respect to my second argument, that perhaps the real issue is the all-unauthenticated default. One could argue (and I certainly would!) that if a cluster operator is running untrusted workloads, running a multi-tenant cluster, and so on, they should change the default authorization policy. No question there. The risk profile, however, is very different for most cluster operators, and ease of use (for now) dictates that the installation default to an open policy which is simpler for users to deploy and operate.
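
Operators who do want the stricter posture can change the default with the config.linkerd.io/default-inbound-policy annotation on a namespace or workload; a minimal sketch with a placeholder namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: my-tenant                    # placeholder namespace
  annotations:
    # Reject inbound connections that are not explicitly authorized.
    config.linkerd.io/default-inbound-policy: deny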

3.

Suppose that an operator does deploy with a default-deny policy and very carefully manages ServerAuthorizations for all of their workloads. The proxy injector would then have to become not just an injector, but also an operator managing additional authorizations for each workload it injects. Why? Because, going back to the original issue, on clusters such as the one described there, readiness probes arrive as plain HTTP requests from unpredictable IP addresses.

The proxy injector, in this scenario, would therefore have to add a ServerAuthorization for each workload it injects, authorizing /ready and /live. Otherwise, either the default-deny policy would need an asterisk ("default deny, except for two routes on port 4191"), or cluster operators would have to add those authorizations themselves.
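
To give a sense of the per-workload burden, this is roughly the pair of resources that would have to be managed for every injected workload just to keep kubelet probes working under a default deny. Note that a ServerAuthorization applies to the whole Server; it cannot be narrowed to only /ready and /live without the route-based machinery mentioned above. All names and the CIDR below are placeholders:

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: my-app-admin                 # hypothetical, one per workload
spec:
  podSelector:
    matchLabels:
      app: my-app                    # placeholder workload label
  port: linkerd-admin                # the proxy's admin port (4191)
  proxyProtocol: HTTP/1
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: my-app-probes                # hypothetical
spec:
  server:
    name: my-app-admin
  client:
    # Probes arrive unauthenticated, from provider-specific addresses
    # that may be outside the cluster.
    unauthenticated: true
    networks:
      - cidr: 0.0.0.0/0              # placeholder; narrow if the range is known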

How would users interact with this feature?

This feature and/or resolution of this issue should be transparent to any user.

Would you like to work on this feature?

yes

AaronFriel added a commit to AaronFriel/linkerd2 that referenced this issue Jan 4, 2022
Related to linkerd#7560, this
modifies the proxy injector to use port 4192 and updates the
multicluster manifest to match.

See: linkerd/linkerd2-proxy#1428

Signed-off-by: Aaron Friel <mayreply@aaronfriel.com>

adleong commented Jan 4, 2022

related: #7050

@adleong adleong added this to the stable-2.12.0 milestone Jan 4, 2022
AaronFriel added a commit to AaronFriel/linkerd2 that referenced this issue Jan 23, 2022
Related to linkerd#7560, this
modifies the proxy injector to use port 4192 and updates the
multicluster manifest to match.

See: linkerd/linkerd2-proxy#1428

Signed-off-by: Aaron Friel <mayreply@aaronfriel.com>
AaronFriel added a commit to AaronFriel/linkerd2 that referenced this issue Jan 24, 2022
Related to linkerd#7560, this
modifies the proxy injector to use port 4192 and updates the
multicluster manifest to match.

See: linkerd/linkerd2-proxy#1428

Signed-off-by: Aaron Friel <mayreply@aaronfriel.com>

stale bot commented Apr 4, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 4, 2022
@stale stale bot closed this as completed Apr 19, 2022
@olix0r olix0r removed the wontfix label Apr 19, 2022
@olix0r olix0r reopened this Apr 19, 2022
@adleong adleong added the priority/P1 Planned for Release label Jul 7, 2022
kleimkuhler (Contributor) commented:

As mentioned in the latest comment on #7050, edge-22.8.1 has shipped with the ability to authorize probes by default on default-deny clusters. We'll be going through some more testing of this feature, but this should be fixed by those changes.

AaronFriel (Author) commented:

It seems simpler to configure, and more consistent with common firewall deployments, to serve the readiness probe on a separate port.

It looks like the work @olix0r refers to on #7050 would address this, but I do think that, from an operational perspective, separate ports are just much simpler to manage. And for the sake of compatibility, this PR preserves the existing /live and /ready routes.

I don't see a downside in advertising /live and /ready on two ports. Advanced operators who feel comfortable using a single port for all authorizations can do so, and most operators with L3/L4 firewalls can easily add defense in depth via port-based firewall rules.

@olix0r olix0r self-assigned this Aug 16, 2022
@adleong adleong assigned adleong and unassigned olix0r Aug 18, 2022

adleong commented Aug 22, 2022

Multicluster probes are authorized by default, even when the default policy is deny.

@adleong adleong closed this as completed Aug 22, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2022