
runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064


Description

@prattmic

Split from #31908 (comment); there is a full write-up at https://jazco.dev/2024/01/10/golang-and-epoll/.

tl;dr: a program on a 192 core machine with >2500 open sockets, >1k of which become ready at once, sees huge costs in netpoll -> epoll_wait (~65% of total CPU).

Most interesting is that sharding these connections across 8 processes seems to solve the problem, implying some kind of super-linear scaling.

Since the profile shows the time being spent in epoll_wait itself, this may be a scalability problem in the kernel, but we may still be able to mitigate it on the Go side.
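
For context on where that time accumulates: on Linux the runtime drains readiness from a single shared epoll instance through a fixed 128-entry event buffer per epoll_wait call. Below is a minimal userspace sketch of that drain pattern, using golang.org/x/sys/unix with eventfds standing in for sockets; it is an illustration only, not the runtime's actual code (which, among other differences, registers fds edge-triggered):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// One epoll instance shared by the whole process, as in the Go runtime.
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(epfd)

	// Make ~1k fds ready at once (eventfds as stand-ins for sockets).
	// May require raising `ulimit -n`.
	const numFDs = 1000
	for i := 0; i < numFDs; i++ {
		fd, err := unix.Eventfd(1, unix.EFD_NONBLOCK) // initval=1 => immediately readable
		if err != nil {
			panic(err)
		}
		ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(fd)}
		if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
			panic(err)
		}
	}

	// Drain with a fixed 128-entry buffer, as netpoll does: 1000 ready
	// fds need at least 8 epoll_wait calls, and in the runtime those
	// calls can come from many threads contending on the same epfd.
	var events [128]unix.EpollEvent
	drained, calls := 0, 0
	for drained < numFDs {
		n, err := unix.EpollWait(epfd, events[:], 0)
		if err == unix.EINTR {
			continue
		}
		if err != nil {
			panic(err)
		}
		calls++
		for i := 0; i < n; i++ {
			// Deregister and close so each fd is reported exactly once.
			unix.EpollCtl(epfd, unix.EPOLL_CTL_DEL, int(events[i].Fd), nil)
			unix.Close(int(events[i].Fd))
		}
		drained += n
	}
	fmt.Printf("drained %d fds in %d epoll_wait calls\n", drained, calls)
}
```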

@ericvolp12, some questions if you don't mind answering:

  • Which version of Go are you using? And which kernel version?
  • Do you happen to have a reproducer for this problem that you could share? (Sounds like no?) A sketch of what one might look like follows this list.
  • On a similar note, do you have a perf profile of this problem that shows where the time in the kernel is spent?
  • The 128 event buffer size is mentioned several times, but it is not obvious to me that increasing this size would actually solve the problem. Did you try increasing the size and see improved results?
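
In case it helps anyone pick this up, here is a sketch of what such a reproducer might look like; the connection count and payload size are placeholder assumptions, not the reporter's actual workload. Run it under perf on a many-core machine to see where the kernel time goes:

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// Assumed workload shape: ~2500 connections parked in the netpoller,
// all made readable at (nearly) the same instant. Needs `ulimit -n`
// raised, since both ends of each connection live in this process.
const numConns = 2500

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	done := make(chan struct{}, numConns)

	// Accept side: one goroutine per connection, blocked in Read and
	// therefore parked in the runtime's epoll-based netpoller.
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				defer c.Close()
				io.Copy(io.Discard, c)
				done <- struct{}{}
			}(c)
		}
	}()

	conns := make([]net.Conn, 0, numConns)
	for i := 0; i < numConns; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		conns = append(conns, c)
	}

	// Make every connection readable at once, then close so the
	// readers hit EOF and finish.
	payload := make([]byte, 512)
	start := time.Now()
	for _, c := range conns {
		c.Write(payload)
		c.Close()
	}
	for i := 0; i < numConns; i++ {
		<-done
	}
	fmt.Printf("drained %d connections in %v\n", numConns, time.Since(start))
}
```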

cc @golang/runtime


Labels: NeedsInvestigation, OS-Linux, Performance, Scalability, compiler/runtime
