
KEP for graduating nodelocaldns to beta #1005

Merged 2 commits into kubernetes:master on Apr 29, 2019

Conversation

prameshj (Contributor)

@k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory), size/L (denotes a PR that changes 100-499 lines, ignoring generated files), and sig/network (categorizes an issue or PR as relevant to SIG Network) labels on Apr 26, 2019.

N.B. Although CoreDNS is now the default DNS server on Kubernetes clusters, this document still uses the name kube-dns since the service name is still the same.

Based on the initial feedback for the NodeLocal DNSCache feature, HA seems to be the common ask.
@thockin (Member) commented on Apr 26, 2019:

I'm skeptical. I 100% agree that it is a little different than kubelet or kube-proxy (this is "data plane" vs "control plane"). That said, this proposal is kind of going to heroic lengths to get HA within a single failure-domain. We have a number of node-agents and they are all, more or less, subject to this problem.

What you propose here only reduces the outage window, and carries its own risks. We know that running iptables can consume a lot of memory (yay iptables), which means we have to set aside a significant amount of RAM for this, or it will actually make the system less stable (OOM).

This also removes our ability to do anything "clever" in the cache, such as the autopath proposal. It requires that the pod <-> cache protocol be the same as cache <-> upstream. This makes me sad, because I like the autopath proposal.

Additionally this doesn't work for IPVS, which is a major problem, and it sounds like we don't have a good answer.

Given all this, can we instead do something like SO_REUSEPORT? If we ensure the cache is SO_REUSEPORT enabled, users who want HA can run 2 copies of the cache (HA, for the low, low price of 2x?), and users who don't care can just ... not?
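
To make the SO_REUSEPORT option concrete, here is a minimal sketch, under assumptions, of binding a UDP listener with SO_REUSEPORT on Linux in Go; this is not CoreDNS or node-local-dns code, and the 169.254.20.10:53 address is only an illustrative local listen address:

```go
// Minimal sketch: open a UDP socket with SO_REUSEPORT so a second cache
// process on the same node can bind the identical ip:port for HA.
// Assumes Linux and golang.org/x/sys/unix; not actual node-local-dns code.
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

func listenReusePort(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				// With SO_REUSEPORT, the kernel load-balances incoming
				// packets across all sockets bound to this ip:port.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	// Illustrative local listen address; not a claim about the default config.
	conn, err := listenReusePort("169.254.20.10:53")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... hand conn to the DNS server loop ...
}
```

Two copies of a daemonset using such a listener would give the "2x" HA mode described above; a single copy behaves exactly as today.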

Member:

Do we need HA for Beta? I propose we push it beyond beta. It's way more valuable to get this into more hands ASAP, IMO.

Member:

Put another way - HA capability sounds great, but we should not block Beta on it.

prameshj (Contributor, author):

Not a hard requirement; I put it as a criterion since it was the most-asked question about the feature. I am OK with decoupling this from the Beta graduation criteria and continuing to work on a solution.

Member:

Do we need HA for Beta?

Does beta mean this feature is enabled by default in new clusters?
Can this be "backported", or will it be available only in newer versions?

prameshj (Contributor, author):

The feature will not be enabled by default on new clusters. It can be backported to older versions.

Member:

As it is not a bugfix, it probably wouldn't make sense to backport. However, given it has almost no dependencies, we can give users a very easy way to configure their cluster to use it.


1) Pod Evicted - We create this daemonset with `priorityClassName: system-node-critical` setting to greatly reduce the likelihood of eviction.
2) Config error - node-local-dns restarts repeatedly due to incorrect config. This will be resolved only when the config error has been fixed. There will be DNS downtime until then, even though the kube-dns pods might be available.
3) OOMKilled - node-local-dns gets OOMKilled due to its own memory usage or some other component using up all memory resources on the node. There is a chance this will cause other disruptions on the node in addition to DNS downtime though.
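
For reference, a minimal sketch of where the priorityClassName setting from item 1 sits, expressed with the k8s.io/api Go types; the names, labels, and image are illustrative placeholders, not the actual node-local-dns manifest:

```go
// Sketch only: a stripped-down DaemonSet showing priorityClassName set to
// system-node-critical so the cache pod is very unlikely to be evicted.
// Names, labels, and the image value are illustrative placeholders.
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func nodeLocalDNSDaemonSet() *appsv1.DaemonSet {
	labels := map[string]string{"k8s-app": "node-local-dns"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "node-local-dns", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Failure mode 1: system-node-critical priority greatly
					// reduces the likelihood that this pod is evicted under
					// node resource pressure.
					PriorityClassName: "system-node-critical",
					Containers: []corev1.Container{{
						Name:  "node-cache",
						Image: "<node-local-dns image>", // placeholder
					}},
				},
			},
		},
	}
}

func main() {
	_ = nodeLocalDNSDaemonSet()
}
```
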
Member:

I would like to see some consideration of which object types, and how many of them, local-dns-cache's memory usage is bound to. This is valuable information for cluster operations. It would be best to add some measurements.

@prameshj (Contributor, author) commented on Apr 29, 2019:

Are you asking what components of local-dns-cache take up most of the memory?

I had done some measurements of max memory usage by the node-local-dns pod when running some dnsperf tests - https://github.com/kubernetes/perf-tests/tree/master/dns. This brought up a single pod running dnsperf and measured the memory usage of node-local-dns to be 20 Mi.

I also ran the image you referenced in issue coredns/coredns#2593 (mikkeloscar/go-dnsperf:latest) with the ubuntu yaml at 10k rps. I ran 3 replicas on one node and 2 on another. The node-local-dns pods serving those pods had memory usage of ~30Mi, which is the default limit in the node-local-dns yaml.

@justaugustus (Member):

/assign @thockin @bowei @johnbelamaric

@thockin (Member) left a review:

I am approving now, to get it in, but we should revisit the details of HA.

Specifically we have called out 2 possible HA modes, which warrant testing and documenting (if they work well) and some small affordances:

  1. Flag: Listen on a second IP; don't manage the NOTRACK rule in the cache; allow an external agent to do that.

  2. Flag: enable SO_REUSEPORT; allow a second daemonset per node.
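
To make mode 1 concrete, here is a rough sketch of what the "external agent" piece could look like, using the github.com/coreos/go-iptables bindings; the listen IP is an assumption, and a real agent would need to cover both directions and both TCP and UDP, matching what node-local-dns installs today:

```go
// Rough sketch of HA mode 1's external agent: something other than the cache
// installs the NOTRACK rule in the raw table for the node-local DNS IP, so
// the cache itself no longer manages it. Illustrative only.
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	const localDNSIP = "169.254.20.10" // assumed link-local listen IP

	ipt, err := iptables.New()
	if err != nil {
		log.Fatal(err)
	}

	// Skip connection tracking for UDP DNS traffic addressed to the local cache.
	rule := []string{"-d", localDNSIP, "-p", "udp", "--dport", "53", "-j", "NOTRACK"}
	for _, chain := range []string{"PREROUTING", "OUTPUT"} {
		if err := ipt.AppendUnique("raw", chain, rule...); err != nil {
			log.Fatalf("adding NOTRACK rule to raw/%s: %v", chain, err)
		}
	}
}
```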

Thanks!

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Apr 29, 2019.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: prameshj, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 29, 2019.
@k8s-ci-robot merged commit 9891db3 into kubernetes:master on Apr 29, 2019.
@BSWANG commented on Jul 17, 2019:

@prameshj @thockin
In an IPVS-backend cluster, can we use libpcap to capture the kube-dns service's UDP requests for nodelocaldns, drop the packets to the kube-dns endpoints on node output, and then have nodelocaldns reply with the DNS response directly to the pod? The DNS packet flow would look like this diagram:

[packet-flow diagram]
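
To illustrate the capture half of this idea, a rough sketch using the gopacket libpcap bindings; the interface name and kube-dns ClusterIP are assumptions, and dropping the original packets plus replying from nodelocaldns would still need separate handling (e.g. IPVS/iptables rules), which is not shown:

```go
// Rough sketch of the capture side of the proposal above: snoop UDP DNS
// queries destined for the kube-dns ClusterIP with libpcap (via gopacket).
// Interface name and ClusterIP are assumptions; dropping the original
// packets and replying from nodelocaldns are not shown.
package main

import (
	"fmt"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	handle, err := pcap.OpenLive("eth0", 1600, true, pcap.BlockForever)
	if err != nil {
		panic(err)
	}
	defer handle.Close()

	// Only capture UDP DNS traffic addressed to the (assumed) kube-dns ClusterIP.
	if err := handle.SetBPFFilter("udp and dst host 10.96.0.10 and dst port 53"); err != nil {
		panic(err)
	}

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range src.Packets() {
		dnsLayer := packet.Layer(layers.LayerTypeDNS)
		if dnsLayer == nil {
			continue
		}
		dns := dnsLayer.(*layers.DNS)
		for _, q := range dns.Questions {
			// A real implementation would answer from the local cache here.
			fmt.Printf("captured query for %s (id %d)\n", string(q.Name), dns.ID)
		}
	}
}
```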
