kubelet should track tcp_mem stats also along with cpu/ram/disk #62334
/cc @thockin in continuation to our discussion at https://twitter.com/thockin/status/973965476173725696, took me some time to fixup the repro steps 😄 |
Since we are collecting network status information, should connection tracking count/limit be considered? |
@cizixs I think it should. Do you know any other parameters which can cause a network failure/degradation, but easy to detect? |
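For reference, conntrack saturation is easy to check from the node itself; a minimal sketch, assuming the nf_conntrack module is loaded so these procfs files exist (these are standard kernel paths, not anything kubelet exposes today):

```bash
#!/usr/bin/env bash
# Compare the connection-tracking table usage against its limit.
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
echo "conntrack entries: ${count} / ${max}"
```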
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle rotten |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
just ran into this issue, and i'd like that feature as well :) |
Observed the issue in our systems. |
Somewhat tangential to this, but more of an informational thing: from the network perspective, what is namespaced and what isn't? I am currently trying to debug a "performance" issue and was starting to focus on the network. From my research it appears settings like tcp_rmem and tcp_wmem (read and write buffers) are namespaced, meaning you can set those values within a container and they don't affect the host settings. But a setting like tcp_mem (which lists the max page allocations for the TCP stack) seems to only be set at the host level. Yet I would think tcp_mem's setting directly affects what you can set in tcp_rmem and tcp_wmem. |
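A minimal sketch of checking this on a node, illustrating the point above (assumes nsenter is available and you pass a PID of a process in the pod, obtained from your container runtime; /proc/sys/net reflects the network namespace of the reading process, so the per-namespace values show up when entering the pod's netns):

```bash
#!/usr/bin/env bash
# Compare per-netns TCP buffer sysctls with the host, and show the host-wide tcp_mem.
PID="${1:?usage: $0 <container-pid>}"   # PID of a process inside the pod
for k in tcp_rmem tcp_wmem; do
  echo "host $k: $(cat /proc/sys/net/ipv4/$k)"
  echo "pod  $k: $(nsenter -t "$PID" -n cat /proc/sys/net/ipv4/$k)"
done
# tcp_mem (page limits for the whole TCP stack) is effectively a host-level setting.
echo "host tcp_mem: $(cat /proc/sys/net/ipv4/tcp_mem)"
```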
Having the same issue on the master node. The resources are not getting deleted as the API server is unable to take new requests. Kubelet is showing healthy though.
|
This caught me today on several Azure AKS clusters, Kubernetes v1.13.5 |
We have run into this issue as well: we had a workload that was leaking open connections, which led to a whole node being unusable and introduced a noisy neighbor problem. |
Recently we experienced an interesting production problem. The application was running on multiple AWS EC2 instances behind an Elastic Load Balancer, on GNU/Linux, Java 8, and Tomcat 8. All of a sudden one of the application instances became unresponsive, while all other instances were handling the traffic properly. Whenever an HTTP request was sent to this instance from the browser, we got the following response printed in the browser.
Let us see how we resolved this issue by assigning values to these properties on the server:
|
We've had this problem as well on GKE nodes (Container-Optimized OS). It would be great to see Kubernetes handle this as it can effectively break the network stack of an entire node. Slightly off topic, does anyone have any tips for determining which container/process is leaking the TCP memory? As a quick workaround we have increased the TCP memory but that can't work forever. |
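Not an exact answer, but one rough way to narrow it down is to compare per-pod socket counts, since tcp_mem accounting itself is global and the counts only act as a proxy for buffer usage. A sketch, assuming containerd with crictl installed, host /proc access, and nsenter/ss available on the node:

```bash
#!/usr/bin/env bash
# List TCP socket counts per pod sandbox, worst offenders first.
for p in $(crictl pods -q); do
  pid=$(crictl inspectp "$p" | grep -m1 '"pid"' | tr -dc '0-9')   # sandbox PID (containerd layout)
  [ -n "$pid" ] || continue
  n=$(nsenter -t "$pid" -n ss -tan | tail -n +2 | wc -l)
  echo "$n TCP sockets  sandbox $p (pid $pid)"
done | sort -rn
```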
/triage accepted |
Workarounds are good, but it's not clear to me if we should be doing more here - anyone who has direct context? |
If a pod is leaking connections the pod will kill the node, without any alert or monitoring. Happy to give more context if you need it. |
This is an old issue, which I won't have time to tackle in the near future - any context you can add here, to make it more approachable by some volunteer (could be you!) would help. |
/remove-lifecycle frozen |
|
/assign |
We still experience periodic GKE node outages because of TCP OOM. Pods hosted on the affected node become inoperable. |
👍 Thanks for the info. I'm busy working on this. I hope to have a PR created soon. |
/accept |
/triage accepted |
/triage accepted |
I've been working on this. It's taking me some time, sorry about that. I've gotten my code to mostly work, but I need to spend time finishing up the specifics. When the socket buffer is full, which of these should happen:
Based on this conversation, I assume only bullet point 1 should happen (Node becomes unready). Additionally, does this feature need to be feature gated? |
@adrianmoisey I would prefer "2", Kubelet should evict the pod causing the memory usage, just as it would evict pods exceeding their memory or ephemeralStorage allowance. |
/cc @aojea Is tcp_mem accounted as part of the global memory? Is it namespaced? What are the behaviours we want to implement? TCP is bursty by nature; what happens if there are peaks of congestion? |
I see a reference here in the Linux kernel documentation: Search for section 2.7.1: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
|
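For cgroup v1, the files described in that section can be read directly to see per-cgroup TCP kernel memory; a sketch, assuming the cgroupfs driver and the usual kubepods hierarchy (systemd-driver nodes use kubepods.slice instead):

```bash
#!/usr/bin/env bash
# Show per-cgroup TCP buffer usage from the v1 memory controller, largest first.
CG=/sys/fs/cgroup/memory/kubepods
find "$CG" -name memory.kmem.tcp.usage_in_bytes | while read -r f; do
  echo "$(cat "$f") bytes  ${f%/memory.kmem.tcp.usage_in_bytes}"
done | sort -rn | head
```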
Independently of everything else, we depend on the mechanisms exposed by the kernel. Based on https://lpc.events/event/16/contributions/1212/attachments/1079/2052/LPC%202022%20-%20TCP%20memory%20isolation.pdf this is still WIP, and there are also some interesting lessons learned
|
Interesting share, thanks @aojea
From https://docs.kernel.org/admin-guide/cgroup-v2.html:
This makes me think that it may be possible to evict pods that are using up too much TCP transmission buffer memory. I'm not sure if it's what we want to do though. From an end user perspective, if kubelet is going to be evicting pods based on some behaviour, I'd like the ability to determine the bounds of what is good and what is bad (memory limits being an example). Which makes me think that this would work better as a Pod resource, much like memory and CPU. |
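For reference, on cgroup v2 the per-cgroup counter the doc describes shows up as the sock field in memory.stat; a sketch of inspecting it per pod cgroup, assuming a systemd cgroup driver with the usual kubepods.slice layout:

```bash
#!/usr/bin/env bash
# Show per-cgroup socket buffer memory ("sock" in memory.stat), largest first.
find /sys/fs/cgroup/kubepods.slice -name memory.stat | while read -r f; do
  printf '%s %s\n' "$(awk '/^sock /{print $2}' "$f")" "${f%/memory.stat}"
done | sort -rn | head
```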
Is this accounted per-cgroup or just at the root? It looks like cgroup v2 is per-cgroup. Does that get accumulated into the total memory usage of the container? Obviously, the ideal would be to kill the process/cgroup which is abusive. But it's not easy to know who that is unless it is accounted properly. Anything with a global (machine-wide) limit which is shared by cgroups is likely to be an isolation problem. |
This may be related to #116895. It's in the same area ("invisible" pod memory causing OOM), but seems to need |
@thockin just a thought: If it is accounted per cgroup, it could be used to evict the culprit pod. But even if not, it could be handled on node level (like DiskPressure, PIDPressure, ..) which could lead to the node being marked as NotReady, or even be drained and removed completely. Any well-designed application could then fail over to other pods. |
If a regular-privilege pod can cause a machine to go NotReady, that's a DoS vector. Now, I know that pods with memory limit > request fall into this category, but that is something an admin can prevent by policy. I am far enough away from kubelet's resource management code now that I am hand-waving. The pattern we want, I think is:
|
I'm going to unassign myself from this issue for now. I've got other tasks I'm working on at the moment, and this one seems to be a little complicated for me right now. I'll happily pick it up in the future, if nobody else has done it. /unassign |
/kind feature
/sig node
What happened:
A program started leaking TCP memory, which filled up the node's TCP stack memory. The network performance on the node degraded, and connections to pods running on the node either time out or hang for a long time.
The node's `dmesg` had lines mentioning `TCP: out of memory -- consider tuning tcp_mem`.
Further reading and investigation reveal that this could happen when the TCP stack runs out of memory pages allocated by the kernel, or when there are a lot of orphaned/open sockets.
TCP stack limits: max 86514
$ cat /proc/sys/net/ipv4/tcp_mem
43257 57676 86514    # min pressure max
Usage when the issue happened: mem 87916
kubelet posts node status as ready.
What you expected to happen:
kubelet should report the node as not ready.
It would be great if `kubelet` could track the `tcp_mem` stats along with CPU/RAM/disk, as the network is also an important factor. If the `tcp_mem` limit is hit, for whatever reason, the node is not usable. Notifying the user that the node has an issue helps with debugging and identifying the cause.
How to reproduce it (as minimally and precisely as possible):
`cat /proc/sys/net/ipv4/tcp_mem` and `cat /proc/net/sockstat`, and scale the deployment until the current mem exceeds the limit.
Anything else we need to know?:
This is more of a feature request for kubelet rather than a bug. TCP memory can get filled if the node is running a lot of TCP-heavy workloads; it need not necessarily be a leak. Since kubelet is ultimately responsible for reporting the node's health, the network should also be a parameter.
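As an illustration of the signal being asked for here, a minimal sketch of the check a node-level monitor could run today, comparing current TCP page usage from /proc/net/sockstat against the tcp_mem max (the 90% threshold is an arbitrary example value, not anything kubelet implements):

```bash
#!/usr/bin/env bash
# Warn when TCP memory pages approach the tcp_mem "max" threshold.
max=$(awk '{print $3}' /proc/sys/net/ipv4/tcp_mem)
cur=$(awk '/^TCP:/ {for (i = 1; i <= NF; i++) if ($i == "mem") print $(i + 1)}' /proc/net/sockstat)
echo "tcp mem pages: ${cur} / ${max}"
if [ "$cur" -ge $((max * 90 / 100)) ]; then
  echo "WARNING: node is close to tcp_mem exhaustion"
fi
```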
Environment:
- Kubernetes version (use `kubectl version`): v1.9.6-gke.0
- Kernel (e.g. `uname -a`): 4.4.111+