KEP for graduating nodelocaldns to beta #1005
Conversation
Force-pushed from 8f8e9f5 to 8407130
N.B. Although CoreDNS is now the default DNS server on Kubernetes clusters, this document still uses the name kube-dns since the service name is still the same.
Based on the initial feedback for the NodeLocal DNSCache feature, HA seems to be the common ask.
I'm skeptical. I 100% agree that it is a little different than kubelet or kube-proxy (this is "data plane" vs "control plane"). That said, this proposal is kind of going to heroic lengths to get HA within a single failure-domain. We have a number of node-agents and they are all, more or less, subject to this problem.
What you propose here only reduces the outage window, and carries its own risks. We know that running iptables can consume a lot of memory (yay iptables), which means we have to set aside a significant amount of RAM for this, or it will actually make the system less stable (OOM).
This also removes our ability to do anything "clever" in the cache, such as the autopath proposal. It requires that the pod <-> cache protocol be the same as cache <-> upstream. This makes me sad, because I like the autopath proposal.
Additionally this doesn't work for IPVS, which is a major problem, and it sounds like we don't have a good answer.
Given all this, can we instead do something like SO_REUSEPORT? If we ensure the cache is SO_REUSEPORT enabled, users who want HA can run 2 copies of the cache (HA, for the low, low price of 2x?), and users who don't care can just... not?
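For reference, here is a minimal sketch of what SO_REUSEPORT-enabled listening could look like in Go. This is an illustration only, not the actual node-local-dns/CoreDNS code; the bind address 169.254.20.10:53 and the use of golang.org/x/sys/unix are assumptions.

```go
// Hedged sketch: a UDP listener with SO_REUSEPORT set, so a second cache
// instance on the same node could bind the exact same address:port and the
// kernel would spread incoming queries across both sockets.
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

func reusePortListen(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				// Allow multiple sockets to bind the same addr:port.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	conn, err := reusePortListen("169.254.20.10:53") // illustrative node-local IP
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... serve DNS queries from conn ...
}
```

With this in place, a second copy of the cache would not hit an "address already in use" error; the two instances share the port and the kernel load-balances queries between them.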
Do we need HA for Beta? I propose we push it beyond beta. It's way more valuable to get this into more hands ASAP, IMO.
Put another way - HA capability sounds great, but we should not block Beta on it.
Not a hard requirement; I put it as a criterion since it was the most asked question about the feature. I am ok with decoupling this from the Beta graduation criteria and continuing to work on a solution.
Do we need HA for Beta?
Does beta mean this feature is enabled by default in new clusters?
Can this be "backported", or will it be available only in newer versions?
The feature will not be enabled by default on new clusters. It can be backported to older versions.
As it is not a bugfix, it probably wouldn't make sense to backport. However, given it has almost no dependencies, we can give users a very easy way to configure their cluster to use it.
1) Pod Evicted - We create this daemonset with the `priorityClassName: system-node-critical` setting to greatly reduce the likelihood of eviction (see the sketch after this list).
2) Config error - node-local-dns restarts repeatedly due to incorrect config. This will be resolved only when the config error has been fixed. There will be DNS downtime until then, even though the kube-dns pods might be available.
3) OOMKilled - node-local-dns gets OOMKilled due to its own memory usage or some other component using up all memory resources on the node. There is a chance this will cause other disruptions on the node in addition to DNS downtime though.
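For item 1 above, a hedged sketch of how the mitigation looks when the DaemonSet pod spec is built with the Kubernetes Go API types. This is not the actual node-local-dns manifest; the image reference is an illustrative assumption, and the 30Mi memory limit is taken from the measurement discussion later in this thread.

```go
// Hedged sketch: a DaemonSet whose pods use priorityClassName
// system-node-critical (failure mode 1) and a small memory limit to bound
// the damage from a leak (failure mode 3).
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func nodeLocalDNSDaemonSet() *appsv1.DaemonSet {
	labels := map[string]string{"k8s-app": "node-local-dns"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "node-local-dns", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Greatly reduces the likelihood of eviction under node pressure.
					PriorityClassName: "system-node-critical",
					Containers: []corev1.Container{{
						Name:  "node-cache",
						Image: "k8s.gcr.io/k8s-dns-node-cache:latest", // illustrative image reference
						Resources: corev1.ResourceRequirements{
							Limits: corev1.ResourceList{
								corev1.ResourceMemory: resource.MustParse("30Mi"),
							},
						},
					}},
				},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", nodeLocalDNSDaemonSet())
}
```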
I would like to see some consideration of which object types (and how many of them) local-dns-cache's memory usage is bound by. This is valuable information for cluster operations.
It would be best to add some measurements.
Are you asking what components of local-dns-cache take up most of the memory?
I had done some measurements of max memory usage by the node-local-dns pod when running some dnsperf tests - https://github.com/kubernetes/perf-tests/tree/master/dns. These tests bring up a single pod running dnsperf; the measured memory usage of node-local-dns was 20Mi.
I also ran the image you referenced in issue coredns/coredns#2593 (mikkeloscar/go-dnsperf:latest) with the ubuntu yaml at 10k rps, with 3 replicas on one node and 2 on another. The node-local-dns pods serving those pods had memory usage of ~30Mi, which is the default limit in the node-local-dns yaml.
/assign @thockin @bowei @johnbelamaric
Force-pushed from b2883d1 to 6bbf8a6
Force-pushed from 6bbf8a6 to 3877b49
I am approving now, to get it in, but we should revisit the details of HA.
Specifically we have called out 2 possible HA modes, which warrant testing and documenting (if they work well) and some small affordances:
- Flag: Listen on a second IP; don't manage the NOTRACK rule in the cache; allow an external agent to do that (rough sketch below).
- Flag: enable SO_REUSEPORT; allow a second daemonset per node.
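To make the first mode concrete, here is a rough sketch (an assumption for illustration, not existing node-local-dns code) of how an external agent could own the NOTRACK rules using github.com/coreos/go-iptables, with the cache told via a flag to leave them alone. The listen IP is illustrative.

```go
// Hedged sketch: an external agent installs NOTRACK rules for DNS traffic to
// the node-local cache IP, so queries bypass conntrack even while the cache
// pod itself is down or restarting.
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatal(err)
	}

	for _, proto := range []string{"udp", "tcp"} {
		rule := []string{"-d", "169.254.20.10/32", "-p", proto, "--dport", "53", "-j", "NOTRACK"}
		for _, chain := range []string{"PREROUTING", "OUTPUT"} {
			// AppendUnique adds the rule to the raw table only if it is not already present.
			if err := ipt.AppendUnique("raw", chain, rule...); err != nil {
				log.Fatalf("adding NOTRACK rule to %s: %v", chain, err)
			}
		}
	}
}
```

The key point is that the rule's lifecycle would be owned outside the cache pod, which is what the proposed flag would allow.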
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED.
This pull-request has been approved by: prameshj, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@thockin @bowei @johnbelamaric