Add a KEP for DNS Autopath in pod API #967

228 changes: 228 additions & 0 deletions keps/sig-network/2009-dns-autopath/README.md
# KEP-2009: Autopath for DNS

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [New EDNS0 Option](#new-edns0-option)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
<!-- /toc -->

## Release Signoff Checklist

<!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.

For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.

Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes


[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

DNS search path expansion is one of the top reasons for high DNS latency and query volume in Kubernetes. This proposal aims to minimize the number of DNS queries generated by a pod for a single lookup by moving search path expansion logic to the DNS server side.

## Motivation

DNS search path expansion for pods using the ClusterFirst DNS mode can lead to DNS latency issues and race conditions, because a single lookup from a pod turns into several DNS queries (and musl additionally sends queries to all configured nameservers in parallel). This is especially problematic for lookups of names outside the cluster, where all of the expanded queries return NXDOMAIN. Even for pods using glibc, which sends these requests serially, the reduced load on the client resolver and the reduction in client latency are a strong motivation to move this logic to the server side.
The search path currently includes:

1. "$NS.svc.$SUFFIX"
2. "svc.$SUFFIX"
3. "$SUFFIX"
4. Host-level suffixes, which might be 2 or 3 in number.

where $NS stands for the namespace that the pod belongs to and $SUFFIX is the Kubernetes cluster suffix (cluster.local by default).

These search paths are set to make sure:

1. Pods can discover Services in the same namespace using just the service name.
2. Pods can discover Services across namespaces using a shorthand of the form "$SVCNAME.$NSNAME".
3. Pods can discover resources within the same cluster.

These search paths are written into each pod's /etc/resolv.conf by the kubelet and are enforced by setting ndots to 5. This means any hostname lookup with fewer than 5 dots will be expanded using all of the search paths listed.
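For example, for a pod in the "default" namespace with the default cluster suffix `cluster.local`, the kubelet-written /etc/resolv.conf looks roughly like the following (the nameserver address and the two host-level suffixes are placeholders):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local example.internal corp.internal
options ndots:5
```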

When a pod issues a query to look up the hostname "service123", it is expanded into 6 queries: one for the original hostname and one with each of the search paths appended. Some resolvers issue both A and AAAA queries, so this can be a total of 12 or more queries for every single DNS lookup. When these queries are issued in parallel, they arrive at the node with the same source tuple and need to be DNAT'ed, increasing the chance of a [netfilter race condition](https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts).
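For instance, with five search paths (the three cluster-scoped ones plus two host-level suffixes, as in the example resolv.conf above), a lookup of "service123" from a pod in the "default" namespace produces roughly the following names on the wire; the original name is also tried, typically last since it has fewer dots than ndots:

```
service123.default.svc.cluster.local
service123.svc.cluster.local
service123.cluster.local
service123.example.internal
service123.corp.internal
service123
```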


> When a pod issues a query to look up the hostname "service123", it is expanded into 6 queries

Generally (always?) a client will terminate searching when it gets something other than NXDOMAIN. E.g.

- If "service123" matches a service name in the client's namespace, then it is only expanded to 1 query (2 for dual stack).
- If "service123" matches a namespace in the cluster, then it is expanded to 2 queries (4 for dual stack).

Generally speaking, queries that must traverse the entire search path are of the following types:

- non-FQDN names external to the cluster, e.g. "google.com" (probably the most common occurrence)
- almost-FQDN cluster service names missing the trailing dot, e.g. "service123.namespace123.svc.cluster.local"
- cluster service queries for services that don't exist or otherwise don't have a DNS entry.

Contributor Author @prameshj, Oct 8, 2020:

If "service123" matches a service name in the client's namespace, then it is only expanded to 1 query (2 for dual stack).

I assume this is the case if the searchpath-expanded queries are sent one by one? If they all go out in parallel, the client will wait for all to return, even if one of them got a valid response?

@chrisohaver, Oct 8, 2020:

> If they all go out in parallel, the client will wait for all to return, even if one of them got a valid response?

Are there clients that implement searching all domains in the list in parallel? I have only seen sequential (with A and AAAA in parallel for each domain).

Contributor Author:

I think you're right. I looked at the pcap from an Alpine client just now. Looks like A and AAAA reuse the same port and get sent at the same time. Each searchpath-expanded query is being sent sequentially. Also, searchpath expansion does terminate once a valid response is received. Thanks for pointing this out, I will update this section. Really useful to see the 3 cases worst affected by this behavior.

Member:

I thought one of the libcs would do them in parallel, but I forget which. Regardless, I think it's not illegal to do so, with reasonable respect for ordering when you get multiple responses.

Member @aojea, Nov 14, 2020:

TIL, it seems that musl parallelizes the nameserver queries, not the search domains:

> Traditional resolvers, including glibc's, make use of multiple nameserver lines in resolv.conf by trying each one in sequence and falling to the next after one times out. musl's resolver queries them all in parallel and accepts whichever response arrives first.

https://andydote.co.uk/2019/12/30/consul-alpine-dns-revisited/

If a response is not received for one of the queries, the DNS lookup on the client side will fail after a 5s timeout. This issue is most applicable to A/AAAA lookups, which form the bulk of DNS lookups. PTR records are not subject to search path expansion (since they always have more than 5 dots).

### Goals

* Provide a solution that minimizes the number of DNS queries issued on the client side for a single DNS lookup, while preserving the current behavior that allows short-name lookups of Kubernetes resources.

* Make this solution configurable via the k8s API.

### Non-Goals

Modifying the current Kubernetes DNS schema is not a goal of this KEP.


## Proposal

This proposal introduces the use of a single search path by client pods when performing a DNS lookup. This search path contains all the information needed to derive the list of search paths to apply for that DNS lookup.

As observed from the list of search paths in the current DNS schema, the search paths used by each pod are the same except for the namespace. If this information can be included in a single search path, some entity that is aware of the DNS schema can expand it into the full list of search paths. This entity could be the cluster DNS server (CoreDNS), NodeLocal DNSCache, or a sidecar container running in the client pod.

The new search path is of the form `search.$NS.$SUFFIX.ap.k8s.io`, where $NS is the namespace of the pod and $SUFFIX is the cluster suffix. `ap.k8s.io` is the delimiter used to identify whether a query needs search expansion.
Member:

It's not clear why SUFFIX is needed - I guess it is to allow servers to handle multiple clusters?

Member:

Problem: any DNS processor that ISN'T aware of this magic might route the query to the real k8s.io zone servers. That's an info leak at best, and more likely it's a CVE.

I think you would have to preserve the suffix, so maybe $NS.autopath.$SUFFIX instead? That should route to the cluster zone and not leak out into the world. "autopath" could be "ap" or "search" or "srch" or something.


This search path can be set by using `dnsPolicy: None` to start with, as described [here](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-dns-config).
Once this feature graduates to GA, the default search path for `ClusterFirst` and `ClusterFirstWithHostNet` will include this new search path instead of the 3 different Kubernetes-specific search paths.
Member:

I am not sure how we can do that - we need to know that the cluster's DNS resolver supports it (either query parsing or EDNS0), so it seems like it has to be opt-in basically forever. It might be possible to change the default behavior on a cluster-by-cluster basis, but not globally.


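As an illustration only (the nameserver address and container image are placeholders, and whether `ndots:5` is still required with a single search path is a design detail), a pod opting in via `dnsPolicy: None` might look like:

```
apiVersion: v1
kind: Pod
metadata:
  name: autopath-demo
  namespace: default
spec:
  containers:
  - name: app
    image: example.com/app:latest      # placeholder image
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
    - 10.96.0.10                       # kube-dns service IP or NodeLocal DNSCache address (placeholder)
    searches:
    - search.default.cluster.local.ap.k8s.io
    options:
    - name: ndots
      value: "5"
```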
With this new search path, when a client pod in the "default" namespace looks up a service named "mytestsvc":

1) The query will be sent out as `mytestsvc.search.default.cluster.local.ap.k8s.io`.
Member:

Is this a potential DNS amplification attack vector? (or no?)

Contributor Author:

A rogue pod can achieve more amplification today by using the current set of search paths, e.g. by running an Alpine base image and using web apps that do search path expansion.

Member:

With autopath, a bad-acting client can cause more total load to be generated with less CPU, because "the system" is doing more work that was previously done by the client.


2) The cluster DNS server (CoreDNS by default) receives this, strips off the delimiter "ap.k8s.io", and identifies "search" as the start of the custom search path. The namespace and cluster suffix can be obtained from the rest of the string. The DNS server can now construct the full service name for this query as `mytestsvc.default.svc.cluster.local`.
@chrisohaver, Oct 6, 2020:

Despite the fact that the pod subdomain was recently removed from the DNS spec, I don't think we should always assume the type of the query is svc ... IOW the original query should contain that information... e.g. `mytestsvc.search.default.svc.cluster.local.ap.k8s.io`

Contributor Author:

Good point, looks like it should be part of the custom search path suffix then.

Member:

It's not clear to me why. The "pod" subdomain was never part of any standard search, so why do we need to accommodate it?


This approach minimizes the number of DNS queries on the client side to at most 2 (A and AAAA). The search path expansion logic moves to the server side.
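For the "mytestsvc" example, the server-side expansion mirrors today's client-side search list, roughly as follows (illustrative only; whether the `svc` label is assumed by the server or carried in the query is discussed in the review comments above):

```
mytestsvc.default.svc.cluster.local
mytestsvc.svc.cluster.local
mytestsvc.cluster.local
```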

However, the above approach requires the cluster DNS server to know about the Kubernetes DNS schema as well as the syntax of this custom search path. It knows about the Kubernetes DNS schema anyway, but knowledge of the custom search path is an additional requirement.
Member:

Also we would certainly have to version this, so that it becomes at least `search.$VERSION.$NS.$SUFFIX.ap.k8s.io` - this allows us to eventually change the schema if we need to. In fact we might choose to distinguish "DNS schema version" from "well-known search-expansion set".

E.g. even in the v1 DNS schema we all know and love, it might be valuable to change the search expansion to drop the #2 search (dodging the "namespace com" ambiguity).


### New EDNS0 Option

As an alternative, we introduce an EDNS0 option that includes the list of search paths to be applied to the base query.

A new EDNS0 option, SearchPaths, is introduced, with an option code from the [experimental range, 65001 to 65534](https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-11).
The value of this option is a comma-separated string consisting of all the search paths that are to be appended to the main query and looked up. This option can be useful outside of Kubernetes as well.
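As a sketch of the option contents (the option code shown is only a placeholder from the experimental range; the layout follows the standard EDNS0 option format of code, length, and data from RFC 6891):

```
OPTION-CODE:   65001        (placeholder from the experimental range)
OPTION-LENGTH: <length of OPTION-DATA in octets>
OPTION-DATA:   "default.svc.cluster.local,svc.cluster.local,cluster.local,example.internal,corp.internal"
```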


Instead of having to modify all client pod images to insert an EDNS0 option in their requests, a new CoreDNS plugin, "gensearchpaths", will be introduced. This plugin generates the list of search paths and attaches it to a given query as an EDNS0 option.

This plugin can be used by [NodeLocal DNSCache](https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0030-nodelocal-dns-cache.md), which is a DaemonSet running a per-node DNS cache. It can also be used by a sidecar container attached to a pod, to use this feature selectively. The sidecar container will run a stripped-down version of CoreDNS with the above plugin.
Contributor Author:

The sidecar will basically be a CoreDNS instance running the "gensearchpaths" plugin. It will use the output from that to populate the new EDNS0 option. So, this sidecar determines the namespace by parsing the search path in /etc/resolv.conf, which will be set to "search...ap.k8s.io".

Good catch about determining the namespace from a sidecar, I thought this was exposed in an env var already. In this case though, I think we can get away with the limitation.



The cluster DNS service needs to support this new EDNS0 option and look up multiple query names by expanding a single incoming query. As a proof of concept, this was tried on a test setup by [modifying the autopath plugin](https://github.com/coredns/coredns/compare/master...prameshj:auto) in CoreDNS to extract the search paths from an EDNS0 option.

If upstream ClusterDNS uses something other than CoreDNS, support for this EDNS0 option should be added for autopath to work.
Member:

s/should/must


### Risks and Mitigations

1) If NodeLocal DNSCache is responsible for adding the EDNS0 option, DNS resolution can break if NodeLocal DNSCache is down, or if pods configured with the special search path point directly to the kube-dns service to resolve query names. This is because, without the EDNS0 option, the custom search path is not resolvable by kube-dns/CoreDNS. Running 2 DNSCache instances would be necessary to keep search path expansion working during upgrades.
Member:

This is a significant limitation, IMO, and could make it infeasible to rely on the proxy.


2) The extra EDNS0 option increases the size of DNS requests. This can result in the query getting upgraded to TCP automatically. This is not an issue when using NodeLocal DNSCache, which upgrades connections to TCP by default for cluster names.

3) If the EDNS0 option is set and sent to a server that does not support the option, queries will fail. However, this mode is enabled in the podSpec by the user and is not turned on by default.


## Design Details

There are 2 CoreDNS plugins involved in making the new DNSPolicy work:

1) autopath (needs a change to look up paths from the EDNS0 option)
2) gensearchpath - a new plugin to generate search paths as an EDNS0 option and attach them to the main query. It takes 3 parameters: the autopath suffix, the cluster suffix, and the schema version.

NodeLocal DNSCache, or any sidecar that generates search paths, needs config blocks similar to the following (only the relevant plugins are included in this example):

```
< Add any custom stubdomains config here >

ap.k8s.io:53 {
    cache
    gensearchpath ap.k8s.io cluster.local v1 {
        forward <kube-dns service IP>
        fallthrough
    }
    forward . /etc/resolv.conf
}

cluster.local:53 {
    cache
    forward . <kube-dns service IP>
}

:53 {
    cache
    forward . /etc/resolv.conf
}
```


Let us consider a few scenarios and how DNS resolution will work.

1) Client pod looks up "mytestservice", but did not request search path expansion.

This can happen if the user issued a command like `dig mytestservice` (dig does not use the search paths from /etc/resolv.conf unless `+search` is specified).

This query is expected to fail, since it is not fully qualified and the name does not exist as such. The query hits the 3rd block above and is looked up by the external DNS server. No EDNS0 option gets added.

2) Client pod looks up "mytestservice" with search path expansion. The query is sent as `mytestservice.search.default.cluster.local.ap.k8s.io` and hits the first block. The namespace and suffix are extracted, and the EDNS0 option contains "default.svc.cluster.local, svc.cluster.local, cluster.local, <other searchpaths>". Assuming the service name is found, no fallthrough is needed.

The same steps apply when a client pod looks up any other service name with part or even all of the FQDN.

3) Client pod looks up "google.com" with search path expansion. The query is received as `google.com.search.default.cluster.local.ap.k8s.io`. The search path expanded queries have no response, so "google.com" falls through to the external DNS server without the EDNS0 option.

4) A service FQDN "mytestservice.default.svc.cluster.local" is looked up, with ndots set to 4.
There is no search path expansion; the query hits the 2nd block ("cluster.local") and is forwarded to the kube-dns service.
5) Queries matching a custom stubdomain will match their own config block and will not be subject to this search path expansion. An example stubdomain block is sketched below.
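For example, a stubdomain block would sit alongside the blocks above and win for queries in its zone, since CoreDNS routes each query to the most specific matching server block (the domain and upstream address below are hypothetical):

```
example.corp:53 {
    cache
    forward . 10.150.0.1
}
```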

There are 3 parts to the implementation:

1) Add a new CoreDNS plugin to generate search paths and attach them to a query as an EDNS0 option.

2) Enhance autopath plugin to read searchpaths from this new EDNS0 option.

3) Enhance autopath to also parse the new search path format, `search.$NS.$SUFFIX.ap.k8s.io`. Doing this ensures that DNS resolution works even if the user does not have a sidecar or node-local-dns pod running.

### Test Plan

The existing DNS conformance tests will be run against clusters with pods using the new "clusterFirstWithAutopath" dnsPolicy.

## Graduation Criteria

In order to graduate to beta, we will need:

* Conformance tests exercising this new search path.
* Performance and scalability verified via automated testing, measuring the improvement over the existing `clusterFirst` mode.
Member @pacoxu, Apr 19, 2021:

In some use cases, we want to enable autopath in some of our clusters; however, we still want a benchmark comparing autopath enabled vs. disabled, to make sure that there is no performance decrease after we enable it.

A benchmark would be appreciated when promoting to beta or GA.


## Implementation History

* <TBD> - Implementable version merged
* 2019-04-15 - Initial discussion around this KEP

## Alternatives

* Use the current autopath plugin in CoreDNS and set a single search path in the podSpec. This approach requires watching all pods in order to map each pod's IP address to its namespace; the pod's namespace can then be determined from the source IP of the DNS request. This additional watch can be resource intensive, and the solution is specific to CoreDNS. A rough Corefile sketch for this alternative is included at the end of this section.

* Introduce a new dnsPolicy `clusterFirstWithAutopath` that sets this custom search path automatically.

* An optimization of the EDNS0 option could be to include only a version number and metadata, moving the logic of determining search paths to the server side. The benefit of this approach is that the size overhead of the EDNS0 option is minimal, but it requires the server to have logic to compute search paths in addition to performing the expanded queries. It also makes the EDNS0 option Kubernetes-specific.
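For reference, the first alternative above would use the existing CoreDNS autopath plugin, roughly as sketched below; it requires the `pods verified` mode in the kubernetes plugin so that the client's namespace can be derived from the source IP of the request:

```
.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified
        fallthrough in-addr.arpa ip6.arpa
    }
    autopath @kubernetes
    forward . /etc/resolv.conf
    cache 30
}
```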
24 changes: 24 additions & 0 deletions keps/sig-network/2009-dns-autopath/kep.yaml
---
title: DNS Autopath in PodSpec
kep-number: 2009
authors:
- "@prameshj"
owning-sig: sig-network
participating-sigs:
- sig-network
status: implementable
creation-date: 2019-04-15
last-updated: 2020-09-23
reviewers:
- "@thockin"
- "@bowei"
- "@johnbelamaric"
approvers:
- "@thockin"
- "@bowei"
- "@johnbelamaric"
stage: alpha
latest-milestone: "v1.21"

Please change this to 1.20. Even though you intend to work on it in the 1.21 release cycle also, this field is used to signal in which milestone the latest work was done.

milestone:
alpha: "v1.21"

Change this back to 1.20, and let's change this after the 6th of October if this KEP doesn't get into this release.

---