kubelet should track tcp_mem stats also along with cpu/ram/disk #62334
/cc @thockin in continuation to our discussion at https://twitter.com/thockin/status/973965476173725696, took me some time to fixup the repro steps 😄 |
Since we are collecting network status information, should connection tracking count/limit be considered? |
@cizixs I think it should. Do you know any other parameters which can cause a network failure/degradation, but easy to detect? |
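For reference, conntrack saturation is easy to check from the node itself; a minimal sketch, assuming the nf_conntrack module is loaded so these procfs files exist (these are standard kernel paths, not anything kubelet exposes today):

```bash
#!/usr/bin/env bash
# Compare the connection-tracking table usage against its limit.
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
echo "conntrack entries: ${count} / ${max}"
```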
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle rotten |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
just ran into this issue, and i'd like that feature as well :) |
Observed the issue in our systems. |
Somewhat tangential to this, but more of an informational thing: from the network perspective, what is namespaced and what isn't? I am currently trying to debug a "performance" issue and was starting to focus on the network. From my research it appears settings like tcp_rmem and tcp_wmem (read and write buffers) are namespaced, meaning you can set those values within a container and they don't affect the host settings. But a setting like tcp_mem (which lists the max page allocations for the TCP stack) seems to only be set at the host level. Yet I would think tcp_mem's setting directly affects what you can set in tcp_rmem and tcp_wmem. |
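A minimal sketch of checking this on a node, illustrating the point above (assumes nsenter is available and you pass a PID of a process in the pod, obtained from your container runtime; /proc/sys/net reflects the network namespace of the reading process, so the per-namespace values show up when entering the pod's netns):

```bash
#!/usr/bin/env bash
# Compare per-netns TCP buffer sysctls with the host, and show the host-wide tcp_mem.
PID="${1:?usage: $0 <container-pid>}"   # PID of a process inside the pod
for k in tcp_rmem tcp_wmem; do
  echo "host $k: $(cat /proc/sys/net/ipv4/$k)"
  echo "pod  $k: $(nsenter -t "$PID" -n cat /proc/sys/net/ipv4/$k)"
done
# tcp_mem (page limits for the whole TCP stack) is effectively a host-level setting.
echo "host tcp_mem: $(cat /proc/sys/net/ipv4/tcp_mem)"
```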
Having the same issue on the master node. The resources are not getting deleted as the API server is unable to take new requests. Kubelet is showing healthy though.
|
This caught me today on several Azure AKS clusters, Kubernetes v1.13.5 |
We have run into this issue as well: we had a workload that was leaking open connections, which led to a whole node being unusable and introduced a noisy neighbor problem. |
Recently we experienced an interesting production problem. The application was running on multiple AWS EC2 instances behind an Elastic Load Balancer, on GNU/Linux, Java 8, and Tomcat 8. All of a sudden one of the application instances became unresponsive, while all other instances were handling the traffic properly. Whenever an HTTP request was sent to this instance from the browser, we got the following response printed in the browser.
Let us see how we resolved this issue by assigning values to these properties on the server:
|
We've had this problem as well on GKE nodes (Container-Optimized OS). It would be great to see Kubernetes handle this as it can effectively break the network stack of an entire node. Slightly off topic, does anyone have any tips for determining which container/process is leaking the TCP memory? As a quick workaround we have increased the TCP memory but that can't work forever. |
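Not an exact answer, but one rough way to narrow it down is to compare per-pod socket counts, since tcp_mem accounting itself is global and the counts only act as a proxy for buffer usage. A sketch, assuming containerd with crictl installed, host /proc access, and nsenter/ss available on the node:

```bash
#!/usr/bin/env bash
# List TCP socket counts per pod sandbox, worst offenders first.
for p in $(crictl pods -q); do
  pid=$(crictl inspectp "$p" | grep -m1 '"pid"' | tr -dc '0-9')   # sandbox PID (containerd layout)
  [ -n "$pid" ] || continue
  n=$(nsenter -t "$pid" -n ss -tan | tail -n +2 | wc -l)
  echo "$n TCP sockets  sandbox $p (pid $pid)"
done | sort -rn
```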
/triage accepted |
Workarounds are good, but it's not clear to me if we should be doing more here - anyone who has direct context? |
If a pod is leaking connections the pod will kill the node, without any alert or monitoring. Happy to give more context if you need it. |
This is an old issue, which I won't have time to tackle in the near future - any context you can add here, to make it more approachable by some volunteer (could be you!) would help. |
/remove-lifecycle frozen |
|
/assign |
We still experience periodic GKE node outages because of TCP OOM. Pods hosted on the affected node become inoperable. |
👍 Thanks for the info. I'm busy working on this. I hope to have a PR created soon. |
/accept |
/triage accepted |
/triage accepted |
I've been working on this. It's taking me some time, sorry about that. I've gotten my code to mostly work, but I need to spend time finishing up the specifics. When the socket buffer is full, which of these should happen:
Based on this conversation, I assume only bullet point 1 should happen (Node becomes unready). Additionally, does this feature need to be feature gated? |
@adrianmoisey I would prefer "2", Kubelet should evict the pod causing the memory usage, just as it would evict pods exceeding their memory or ephemeralStorage allowance. |
/cc @aojea Is tcp_mem accounted as part of the global memory? Is it namespaced? What are the behaviours we want to implement? TCP is bursty by nature; what happens if there are peaks of congestion? |
I see a reference here in the Linux kernel documentation: Search for section 2.7.1: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
|
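For cgroup v1, the files described in that section can be read directly to see per-cgroup TCP kernel memory; a sketch, assuming the cgroupfs driver and the usual kubepods hierarchy (systemd-driver nodes use kubepods.slice instead):

```bash
#!/usr/bin/env bash
# Show per-cgroup TCP buffer usage from the v1 memory controller, largest first.
CG=/sys/fs/cgroup/memory/kubepods
find "$CG" -name memory.kmem.tcp.usage_in_bytes | while read -r f; do
  echo "$(cat "$f") bytes  ${f%/memory.kmem.tcp.usage_in_bytes}"
done | sort -rn | head
```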
Independently of everything else, we depend on the mechanisms exposed by the kernel. Based on https://lpc.events/event/16/contributions/1212/attachments/1079/2052/LPC%202022%20-%20TCP%20memory%20isolation.pdf this is still WIP, and there are also some interesting lessons learned
|
Interesting share, thanks @aojea
From https://docs.kernel.org/admin-guide/cgroup-v2.html:
This makes me think that it may be possible to evict pods that are using up too much TCP transmission buffer memory. I'm not sure if it's what we want to do though. From an end user perspective, if kubelet is going to be evicting pods based on some behaviour, I'd like the ability to determine the bounds of what is good and what is bad (memory limits being an example). Which makes me think that this would work better as a Pod resource, much like memory and CPU. |
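For reference, on cgroup v2 the per-cgroup counter the doc describes shows up as the sock field in memory.stat; a sketch of inspecting it per pod cgroup, assuming a systemd cgroup driver with the usual kubepods.slice layout:

```bash
#!/usr/bin/env bash
# Show per-cgroup socket buffer memory ("sock" in memory.stat), largest first.
find /sys/fs/cgroup/kubepods.slice -name memory.stat | while read -r f; do
  printf '%s %s\n' "$(awk '/^sock /{print $2}' "$f")" "${f%/memory.stat}"
done | sort -rn | head
```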
Is this accounted per-cgroup or just at the root? It looks like cgroup v2 is per-cgroup. Does that get accumulated into the total memory usage of the container? Obviously, the ideal would be to kill the process/cgroup which is abusive. But it's not easy to know who that is unless it is accounted properly. Anything with a global (machine-wide) limit which is shared by cgroups is likely to be an isolation problem. |
This may be related to #116895. It's in the same area ("invisible" pod memory causing OOM), but seems to need |
@thockin just a thought: If it is accounted per cgroup, it could be used to evict the culprit pod. But even if not, it could be handled on node level (like DiskPressure, PIDPressure, ..) which could lead to the node being marked as NotReady, or even be drained and removed completely. Any well-designed application could then fail over to other pods. |
If a regular-privilege pod can cause a machine to go NotReady, that's a DoS vector. Now, I know that pods with memory limit > request fall into this category, but that is something an admin can prevent by policy. I am far enough away from kubelet's resource management code now that I am hand-waving. The pattern we want, I think is:
|
I'm going to unassign myself from this issue for now. I've got other tasks I'm working on at the moment, and this one seems to be a little complicated for me right now. I'll happily pick it up in the future, if nobody else has done it. /unassign |
/kind feature
/sig node
What happened:
A program started leaking TCP memory, which filled up the node's TCP stack memory. The network performance on the node degraded, and connections to pods running on the node either time out or hang for a long time.
The node's `dmesg` had lines mentioning `TCP: out of memory -- consider tuning tcp_mem`.
Further reading and investigation reveal that this could happen when the TCP stack runs out of memory pages allocated by the kernel, or when there are a lot of orphaned/open sockets.
TCP stack limits: max 86514
$ cat /proc/sys/net/ipv4/tcp_mem
43257 57676 86514    # min pressure max
Usage when the issue happened: mem 87916
kubelet posts node status as ready.
What you expected to happen:
kubelet should report the node as not ready.
It would be great if `kubelet` could track the `tcp_mem` stats along with CPU/RAM/disk, as the network is also an important factor. If the `tcp_mem` limit is hit, for whatever reason, the node is not usable. Notifying the user that the node has an issue helps with debugging and identifying the cause.
How to reproduce it (as minimally and precisely as possible):
`cat /proc/sys/net/ipv4/tcp_mem` and `cat /proc/net/sockstat`, and scale the deployment until the current mem exceeds the limit.
Anything else we need to know?:
This is more of a feature request for kubelet rather than a bug. TCP memory can get filled if the node is running a lot of TCP-heavy workloads; it need not necessarily be a leak. Since kubelet is ultimately responsible for reporting the node's health, the network should also be a parameter.
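As an illustration of the signal being asked for here, a minimal sketch of the check a node-level monitor could run today, comparing current TCP page usage from /proc/net/sockstat against the tcp_mem max (the 90% threshold is an arbitrary example value, not anything kubelet implements):

```bash
#!/usr/bin/env bash
# Warn when TCP memory pages approach the tcp_mem "max" threshold.
max=$(awk '{print $3}' /proc/sys/net/ipv4/tcp_mem)
cur=$(awk '/^TCP:/ {for (i = 1; i <= NF; i++) if ($i == "mem") print $(i + 1)}' /proc/net/sockstat)
echo "tcp mem pages: ${cur} / ${max}"
if [ "$cur" -ge $((max * 90 / 100)) ]; then
  echo "WARNING: node is close to tcp_mem exhaustion"
fi
```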
Environment:
- Kubernetes version (use `kubectl version`): v1.9.6-gke.0
- Kernel (e.g. `uname -a`): 4.4.111+