Add a KEP for DNS Autopath in pod API #967

228 changes: 228 additions & 0 deletions keps/sig-network/2009-dns-autopath/README.md
# KEP-2009: Autopath for DNS

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [New EDNS0 Option](#new-edns0-option)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
<!-- /toc -->

## Release Signoff Checklist

<!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.

For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.

Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes


[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

DNS search path expansion is one of the top reasons for high DNS latency and query volume in Kubernetes. This proposal aims to minimize the number of DNS queries generated by a pod for a single lookup by moving search path expansion logic to the DNS server side.

## Motivation

DNS search path expansion for pods using the ClusterFirst DNS mode can lead to DNS latency issues and race conditions, because a single lookup from a pod turns into several DNS queries (and musl additionally sends queries to all configured nameservers in parallel). This is especially problematic for lookups of names outside the cluster, where all of the expanded queries return NXDOMAIN. Even for pods using glibc, which sends these requests serially, the reduced load on the client resolver and the reduction in client latency are a strong motivation to move this logic to the server side.
The search path currently includes:

1. "$NS.svc.$SUFFIX"
2. "svc.$SUFFIX"
3. "$SUFFIX"
4. Host-level suffixes, which might be 2 or 3 in number.

where $NS stands for the namespace that the pod belongs to and $SUFFIX is the Kubernetes cluster suffix (cluster.local by default).

These search paths are set to make sure:

1. Pods can discover Services in the same namespace using just the service name.
2. Pods can discover Services across namespaces using a shorthand of the form "$SVCNAME.$NSNAME".
3. Pods can discover resources within the same cluster.

These search paths are written into each pod's /etc/resolv.conf by the kubelet and are enforced by setting ndots to 5. This means any hostname lookup with fewer than 5 dots will be expanded using all of the search paths listed.
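For example, for a pod in the "default" namespace with the default cluster suffix `cluster.local`, the kubelet-written /etc/resolv.conf looks roughly like the following (the nameserver address and the two host-level suffixes are placeholders):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local example.internal corp.internal
options ndots:5
```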

When a pod issues a query to look up the hostname "service123", it is expanded into 6 queries: one for the original hostname and one with each of the search paths appended. Some resolvers issue both A and AAAA queries, so this can be a total of 12 or more queries for every single DNS lookup. When these queries are issued in parallel, they arrive at the node with the same source tuple and need to be DNAT'ed, increasing the chance of a [netfilter race condition](https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts).
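For instance, with five search paths (the three cluster-scoped ones plus two host-level suffixes, as in the example resolv.conf above), a lookup of "service123" from a pod in the "default" namespace produces roughly the following names on the wire; the original name is also tried, typically last since it has fewer dots than ndots:

```
service123.default.svc.cluster.local
service123.svc.cluster.local
service123.cluster.local
service123.example.internal
service123.corp.internal
service123
```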


> When a pod issues a query to look up the hostname "service123", it is expanded into 6 queries

Generally (always?) a client will terminate searching when it gets something other than NXDOMAIN. E.g.

- If "service123" matches a service name in the client's namespace, then it is only expanded to 1 query (2 for dual stack).
- If "service123" matches a namespace in the cluster, then it is expanded to 2 queries (4 for dual stack).

Generally speaking, queries that must traverse the entire search path are of the following types:

- non-FQDN names external to the cluster, e.g. "google.com" (probably the most common occurrence)
- almost-FQDN cluster service names missing the trailing dot, e.g. "service123.namespace123.svc.cluster.local"
- cluster service queries for services that don't exist or otherwise don't have a DNS entry.

Contributor Author @prameshj, Oct 8, 2020:

If "service123" matches a service name in the client's namespace, then it is only expanded to 1 query (2 for dual stack).

I assume this is the case if the searchpath-expanded queries are sent one by one? If they all go out in parallel, the client will wait for all to return, even if one of them got a valid response?

@chrisohaver, Oct 8, 2020:

> If they all go out in parallel, the client will wait for all to return, even if one of them got a valid response?

Are there clients that implement searching all domains in the list in parallel? I have only seen sequential (with A and AAAA in parallel for each domain).

Contributor Author:

I think you're right. I looked at the pcap from an Alpine client just now. Looks like A and AAAA reuse the same port and get sent at the same time. Each searchpath-expanded query is being sent sequentially. Also, searchpath expansion does terminate once a valid response is received. Thanks for pointing this out, I will update this section. Really useful to see the 3 cases worst affected by this behavior.

Member:

I thought one of the libcs would do them in parallel, but I forget which. Regardless, I think it's not illegal to do so, with reasonable respect for ordering when you get multiple responses.

Member @aojea, Nov 14, 2020:

TIL, it seems that musl parallelizes the nameserver queries, not the search domains:

> Traditional resolvers, including glibc's, make use of multiple nameserver lines in resolv.conf by trying each one in sequence and falling to the next after one times out. musl's resolver queries them all in parallel and accepts whichever response arrives first.

https://andydote.co.uk/2019/12/30/consul-alpine-dns-revisited/

If a response is not received for one of the queries, the DNS lookup on the client side will fail after a 5s timeout. This issue is most applicable to A/AAAA lookups, which form the bulk of DNS lookups. PTR records are not subject to search path expansion (since they always have more than 5 dots).

### Goals

* Provide a solution that minimizes the number of DNS queries issued on the client side for a single DNS lookup, while preserving the current behavior that allows short-name lookups of Kubernetes resources.

* Make this solution configurable via the k8s API.

### Non-Goals

Modifying the current Kubernetes DNS schema is not a goal of this KEP.


## Proposal

This proposal introduces the use of a single search path by client pods when performing a DNS lookup. This search path contains all the information needed to derive the list of search paths to apply for that DNS lookup.

As observed from the list of search paths in the current DNS schema, the search paths used by each pod are the same except for the namespace. If this information can be included in a single search path, some entity that is aware of the DNS schema can expand it into the full list of search paths. This entity could be the cluster DNS server (CoreDNS), NodeLocal DNSCache, or a sidecar container running in the client pod.

The new search path is of the form `search.$NS.$SUFFIX.ap.k8s.io`, where $NS is the namespace of the pod and $SUFFIX is the cluster suffix. `ap.k8s.io` is the delimiter used to identify whether a query needs search expansion.
Member:

It's not clear why SUFFIX is needed - I guess it is to allow servers to handle multiple clusters?

Member:

Problem: any DNS processor that ISN'T aware of this magic might route the query to the real k8s.io zone servers. That's an info leak at best, and more likely it's a CVE.

I think you would have to preserve the suffix, so maybe $NS.autopath.$SUFFIX instead? That should route to the cluster zone and not leak out into the world. "autopath" could be "ap" or "search" or "srch" or something.


This search path can be set by using `dnsPolicy: None` to start with, as described [here](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-dns-config).
Once this feature graduates to GA, the default search path for `ClusterFirst` and `ClusterFirstWithHostNet` will include this new search path instead of the 3 different Kubernetes-specific search paths.
Member:

I am not sure how we can do that - we need to know that the cluster's DNS resolver supports it (either query parsing or EDNS0), so it seems like it has to be opt-in basically forever. It might be possible to change the default behavior on a cluster-by-cluster basis, but not globally.


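As an illustration only (the nameserver address and container image are placeholders, and whether `ndots:5` is still required with a single search path is a design detail), a pod opting in via `dnsPolicy: None` might look like:

```
apiVersion: v1
kind: Pod
metadata:
  name: autopath-demo
  namespace: default
spec:
  containers:
  - name: app
    image: example.com/app:latest      # placeholder image
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
    - 10.96.0.10                       # kube-dns service IP or NodeLocal DNSCache address (placeholder)
    searches:
    - search.default.cluster.local.ap.k8s.io
    options:
    - name: ndots
      value: "5"
```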
With this new search path, when a client pod in the "default" namespace looks up a service named "mytestsvc":

1) The query will be sent out as `mytestsvc.search.default.cluster.local.ap.k8s.io`.
Member:

Is this a potential DNS amplification attack vector? (or no?)

Contributor Author:

A rogue pod can achieve more amplification today by using the current set of search paths, e.g. by running an Alpine base image and using web apps that do search path expansion.

Member:

With autopath, a bad-acting client can cause more total load to be generated with less CPU, because "the system" is doing more work that was previously done by the client.


2) The cluster DNS server (CoreDNS by default) receives this, strips off the delimiter "ap.k8s.io", and identifies "search" as the start of the custom search path. The namespace and cluster suffix can be obtained from the rest of the string. The DNS server can now construct the full service name for this query as `mytestsvc.default.svc.cluster.local`.
@chrisohaver, Oct 6, 2020:

Despite the fact that the pod subdomain was recently removed from the DNS spec, I don't think we should always assume the type of the query is svc ... IOW the original query should contain that information... e.g. `mytestsvc.search.default.svc.cluster.local.ap.k8s.io`

Contributor Author:

Good point, looks like it should be part of the custom search path suffix then.

Member:

It's not clear to me why. The "pod" subdomain was never part of any standard search, so why do we need to accommodate it?


This approach minimizes the number of DNS queries on the client side to at most 2 (A and AAAA). The search path expansion logic moves to the server side.
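For the "mytestsvc" example, the server-side expansion mirrors today's client-side search list, roughly as follows (illustrative only; whether the `svc` label is assumed by the server or carried in the query is discussed in the review comments above):

```
mytestsvc.default.svc.cluster.local
mytestsvc.svc.cluster.local
mytestsvc.cluster.local
```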

However, the above approach requires the cluster DNS server to know about the Kubernetes DNS schema as well as the syntax of this custom search path. It knows about the Kubernetes DNS schema anyway, but knowledge of the custom search path is an additional requirement.
Member:

Also we would certainly have to version this, so that it becomes at least `search.$VERSION.$NS.$SUFFIX.ap.k8s.io` - this allows us to eventually change the schema if we need to. In fact we might choose to distinguish "DNS schema version" from "well-known search-expansion set".

E.g. even in the v1 DNS schema we all know and love, it might be valuable to change the search expansion to drop the #2 search (dodging the "namespace com" ambiguity).


### New EDNS0 Option

As an alternative, we introduce an EDNS0 option that includes the list of search paths to be applied to the base query.

A new EDNS0 option, SearchPaths, is introduced, with an option code from the [experimental range, 65001 to 65534](https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-11).
The value of this option is a comma-separated string consisting of all the search paths that are to be appended to the main query and looked up. This option can be useful outside of Kubernetes as well.
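As a sketch of the option contents (the option code shown is only a placeholder from the experimental range; the layout follows the standard EDNS0 option format of code, length, and data from RFC 6891):

```
OPTION-CODE:   65001        (placeholder from the experimental range)
OPTION-LENGTH: <length of OPTION-DATA in octets>
OPTION-DATA:   "default.svc.cluster.local,svc.cluster.local,cluster.local,example.internal,corp.internal"
```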


Instead of having to modify all client pod images to insert an EDNS0 option in their requests, a new CoreDNS plugin, "gensearchpaths", will be introduced. This plugin generates the list of search paths and attaches it to a given query as an EDNS0 option.

This plugin can be used by [NodeLocal DNSCache](https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0030-nodelocal-dns-cache.md), which is a DaemonSet running a per-node DNS cache. It can also be used by a sidecar container attached to a pod, to use this feature selectively. The sidecar container will run a stripped-down version of CoreDNS with the above plugin.
Contributor Author:

The sidecar will basically be a CoreDNS instance running the "gensearchpaths" plugin. It will use the output from that to populate the new EDNS0 option. So, this sidecar determines the namespace by parsing the search path in /etc/resolv.conf, which will be set to "search...ap.k8s.io".

Good catch about determining the namespace from a sidecar, I thought this was exposed in an env var already. In this case though, I think we can get away with the limitation.



The cluster DNS service needs to support this new EDNS0 option and look up multiple query names by expanding a single incoming query. As a proof of concept, this was tried on a test setup by [modifying the autopath plugin](https://github.com/coredns/coredns/compare/master...prameshj:auto) in CoreDNS to extract the search paths from an EDNS0 option.

If upstream ClusterDNS uses something other than CoreDNS, support for this EDNS0 option should be added for autopath to work.
Member:

s/should/must


### Risks and Mitigations

1) If NodeLocal DNSCache is responsible for adding the EDNS0 option, DNS resolution can break if NodeLocal DNSCache is down, or if pods configured with the special search path point directly to the kube-dns service to resolve query names. This is because, without the EDNS0 option, the custom search path is not resolvable by kube-dns/CoreDNS. Running 2 DNSCache instances would be necessary to keep search path expansion working during upgrades.
Member:

This is a significant limitation, IMO, and could make it infeasible to rely on the proxy.


2) The extra EDNS0 option increases the size of DNS requests. This can result in the query getting upgraded to TCP automatically. This is not an issue when using NodeLocal DNSCache, which upgrades connections to TCP by default for cluster names.

3) If the EDNS0 option is set and sent to a server that does not support the option, queries will fail. However, this mode is enabled in the podSpec by the user and is not turned on by default.


## Design Details

There are 2 CoreDNS plugins involved in making the new DNSPolicy work:

1) autopath (needs a change to look up paths from the EDNS0 option)
2) gensearchpath - a new plugin to generate search paths as an EDNS0 option and attach them to the main query. It takes 3 parameters: the autopath suffix, the cluster suffix, and the schema version.

NodeLocal DNSCache, or any sidecar that generates search paths, needs config blocks similar to the following (only the relevant plugins are included in this example):

```
< Add any custom stubdomains config here >

ap.k8s.io:53 {
    cache
    gensearchpath ap.k8s.io cluster.local v1 {
        forward <kube-dns service IP>
        fallthrough
    }
    forward . /etc/resolv.conf
}

cluster.local:53 {
    cache
    forward . <kube-dns service IP>
}

:53 {
    cache
    forward . /etc/resolv.conf
}
```


Let us consider a few scenarios and how DNS resolution will work.

1) Client pod looks up "mytestservice", but did not request search path expansion.

This can happen if the user issued a command like `dig mytestservice` (dig does not use the search paths from /etc/resolv.conf unless `+search` is specified).

This query is expected to fail, since it is not fully qualified and the name does not exist as such. The query hits the 3rd block above and is looked up by the external DNS server. No EDNS0 option gets added.

2) Client pod looks up "mytestservice" with search path expansion. The query is sent as `mytestservice.search.default.cluster.local.ap.k8s.io` and hits the first block. The namespace and suffix are extracted, and the EDNS0 option contains "default.svc.cluster.local, svc.cluster.local, cluster.local, <other searchpaths>". Assuming the service name is found, no fallthrough is needed.

The same steps apply when a client pod looks up any other service name with part or even all of the FQDN.

3) Client pod looks up "google.com" with search path expansion. The query is received as `google.com.search.default.cluster.local.ap.k8s.io`. The search path expanded queries have no response, so "google.com" falls through to the external DNS server without the EDNS0 option.

4) A service FQDN "mytestservice.default.svc.cluster.local" is looked up, with ndots set to 4.
There is no search path expansion; the query hits the 2nd block ("cluster.local") and is forwarded to the kube-dns service.
5) Queries matching a custom stubdomain will match their own config block and will not be subject to this search path expansion. An example stubdomain block is sketched below.
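For example, a stubdomain block would sit alongside the blocks above and win for queries in its zone, since CoreDNS routes each query to the most specific matching server block (the domain and upstream address below are hypothetical):

```
example.corp:53 {
    cache
    forward . 10.150.0.1
}
```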

There are 3 parts to the implementation:

1) Add a new CoreDNS plugin to generate search paths and attach them to a query as an EDNS0 option.

2) Enhance autopath plugin to read searchpaths from this new EDNS0 option.

3) Enhance autopath to also parse the new search path format, `search.$NS.$SUFFIX.ap.k8s.io`. Doing this ensures that DNS resolution works even if the user does not have a sidecar or node-local-dns pod running.

### Test Plan

The existing DNS conformance tests will be run against clusters with pods using the new "clusterFirstWithAutopath" dnsPolicy.

## Graduation Criteria

In order to graduate to beta, we will need:

* Conformance tests exercising this new search path.
* Performance and scalability verified via automated testing, measuring the improvement over the existing `clusterFirst` mode.
Member @pacoxu, Apr 19, 2021:

In some use cases, we want to enable autopath in some of our clusters; however, we still want a benchmark comparing autopath enabled vs. disabled, to make sure that there is no performance decrease after we enable it.

A benchmark would be appreciated when promoting to beta or GA.


## Implementation History

* <TBD> - Implementable version merged
* 2019-04-15 - Initial discussion around this KEP

## Alternatives

* Use the current autopath plugin in CoreDNS and set a single search path in the podSpec. This approach requires watching all pods in order to map each pod's IP address to its namespace; the pod's namespace can then be determined from the source IP of the DNS request. This additional watch can be resource intensive, and the solution is specific to CoreDNS. A rough Corefile sketch for this alternative is included at the end of this section.

* Introduce a new dnsPolicy `clusterFirstWithAutopath` that sets this custom search path automatically.

* An optimization of the EDNS0 option could be to include only a version number and metadata, moving the logic of determining search paths to the server side. The benefit of this approach is that the size overhead of the EDNS0 option is minimal, but it requires the server to have logic to compute search paths in addition to performing the expanded queries. It also makes the EDNS0 option Kubernetes-specific.
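For reference, the first alternative above would use the existing CoreDNS autopath plugin, roughly as sketched below; it requires the `pods verified` mode in the kubernetes plugin so that the client's namespace can be derived from the source IP of the request:

```
.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified
        fallthrough in-addr.arpa ip6.arpa
    }
    autopath @kubernetes
    forward . /etc/resolv.conf
    cache 30
}
```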
24 changes: 24 additions & 0 deletions keps/sig-network/2009-dns-autopath/kep.yaml
---
title: DNS Autopath in PodSpec
kep-number: 2009
authors:
- "@prameshj"
owning-sig: sig-network
participating-sigs:
- sig-network
status: implementable
creation-date: 2019-04-15
last-updated: 2020-09-23
reviewers:
- "@thockin"
- "@bowei"
- "@johnbelamaric"
approvers:
- "@thockin"
- "@bowei"
- "@johnbelamaric"
stage: alpha
latest-milestone: "v1.21"

Please change this to 1.20. Even though you intend to work on it in the 1.21 release cycle also, this field is used to signal in which milestone the latest work was done.

milestone:
alpha: "v1.21"

Change this back to 1.20, and let's change this after the 6th of October if this KEP doesn't get into this release.

---