
runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064


Description

@prattmic

Split from #31908 (comment); there is a full write-up at https://jazco.dev/2024/01/10/golang-and-epoll/.

tl;dr: a program on a 192 core machine with >2500 open sockets, >1k of which become ready at once, sees huge costs in netpoll -> epoll_wait (~65% of total CPU).

Most interesting is that sharding these connections across 8 processes seems to solve the problem, implying some kind of super-linear scaling.

Since the profile shows the time being spent in epoll_wait itself, this may be a scalability problem in the kernel, but we may still be able to mitigate it on the Go side.
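
For context on where that time accumulates: on Linux the runtime drains readiness from a single shared epoll instance through a fixed 128-entry event buffer per epoll_wait call. Below is a minimal userspace sketch of that drain pattern, using golang.org/x/sys/unix with eventfds standing in for sockets; it is an illustration only, not the runtime's actual code (which, among other differences, registers fds edge-triggered):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// One epoll instance shared by the whole process, as in the Go runtime.
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(epfd)

	// Make ~1k fds ready at once (eventfds as stand-ins for sockets).
	// May require raising `ulimit -n`.
	const numFDs = 1000
	for i := 0; i < numFDs; i++ {
		fd, err := unix.Eventfd(1, unix.EFD_NONBLOCK) // initval=1 => immediately readable
		if err != nil {
			panic(err)
		}
		ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(fd)}
		if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
			panic(err)
		}
	}

	// Drain with a fixed 128-entry buffer, as netpoll does: 1000 ready
	// fds need at least 8 epoll_wait calls, and in the runtime those
	// calls can come from many threads contending on the same epfd.
	var events [128]unix.EpollEvent
	drained, calls := 0, 0
	for drained < numFDs {
		n, err := unix.EpollWait(epfd, events[:], 0)
		if err == unix.EINTR {
			continue
		}
		if err != nil {
			panic(err)
		}
		calls++
		for i := 0; i < n; i++ {
			// Deregister and close so each fd is reported exactly once.
			unix.EpollCtl(epfd, unix.EPOLL_CTL_DEL, int(events[i].Fd), nil)
			unix.Close(int(events[i].Fd))
		}
		drained += n
	}
	fmt.Printf("drained %d fds in %d epoll_wait calls\n", drained, calls)
}
```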

@ericvolp12, some questions if you don't mind answering:

  • Which version of Go are you using? And which kernel version?
  • Do you happen to have a reproducer for this problem that you could share? (Sounds like no?) A sketch of what one might look like follows this list.
  • On a similar note, do you have a perf profile of this problem that shows where the time in the kernel is spent?
  • The 128 event buffer size is mentioned several times, but it is not obvious to me that increasing this size would actually solve the problem. Did you try increasing the size and see improved results?
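
In case it helps anyone pick this up, here is a sketch of what such a reproducer might look like; the connection count and payload size are placeholder assumptions, not the reporter's actual workload. Run it under perf on a many-core machine to see where the kernel time goes:

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// Assumed workload shape: ~2500 connections parked in the netpoller,
// all made readable at (nearly) the same instant. Needs `ulimit -n`
// raised, since both ends of each connection live in this process.
const numConns = 2500

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	done := make(chan struct{}, numConns)

	// Accept side: one goroutine per connection, blocked in Read and
	// therefore parked in the runtime's epoll-based netpoller.
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				defer c.Close()
				io.Copy(io.Discard, c)
				done <- struct{}{}
			}(c)
		}
	}()

	conns := make([]net.Conn, 0, numConns)
	for i := 0; i < numConns; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		conns = append(conns, c)
	}

	// Make every connection readable at once, then close so the
	// readers hit EOF and finish.
	payload := make([]byte, 512)
	start := time.Now()
	for _, c := range conns {
		c.Write(payload)
		c.Close()
	}
	for i := 0; i < numConns; i++ {
		<-done
	}
	fmt.Printf("drained %d connections in %v\n", numConns, time.Since(start))
}
```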

cc @golang/runtime


Labels: NeedsInvestigation, OS-Linux, Performance, Scalability, compiler/runtime
