Description
We have a workload pipeline that chains several thousand Actors together via an AsyncStream processing pipeline.
There is a multiplication effect: a single event at the start of the pipeline is amplified, because it is delivered to several Tasks that process events concurrently. The processing time of each wakeup is currently quite small, on the order of a few microseconds.
Under Linux, when this pipeline is stressed, ~45% of the sampled stacks show __DISPATCH_ROOT_QUEUE_CONTENDED_WAIT__(), which leads to lock contention in glibc rand(): ~60 threads are created, and they all contend here:
7f193794a2db futex_wait+0x2b (inlined)
7f193794a2db __GI___lll_lock_wait_private+0x2b (inlined)
7f19378ff29b __random+0x6b (/usr/lib/x86_64-linux-gnu/libc.so.6)
7f19378ff76c rand+0xc (/usr/lib/x86_64-linux-gnu/libc.so.6)
7f1937bac612 __DISPATCH_ROOT_QUEUE_CONTENDED_WAIT__+0x12 (/usr/lib/swift/linux/libdispatch.so)
This occurs on every entry into DISPATCH_ROOT_QUEUE_CONTENDED_WAIT(), via the macro _dispatch_contention_wait_until(), which in turn uses _dispatch_contention_spins(). That is where the rand() call comes in, and the macro produces just four values (31, 63, 95 and 127) for how many pause/yield instructions to execute.
The following example reproduces the issue: ~28% of sampled time is spent in the code path mentioned above.
The example creates 5000 tasks that each work for between 1 μs and 3 μs and then sleep for a random 6-10 ms. The point of the test is to create contention and illustrate the issue with rand():
// $ swift package init --type executable --name RandomTasks
// $ cat Sources/RandomTasks/main.swift && swift run -c release
import Foundation

let numberOfTasks = 5000
let randomSleepRangeMs: ClosedRange<UInt64> = 6 ... 10
// correlates closely to processing amount in micros
let randomWorkRange: ClosedRange<UInt32> = 1 ... 3

@available(macOS 10.15, *)
func smallInfinitiveTask() async {
    let randomWork = UInt32.random(in: randomWorkRange)
    let randomSleepNs = UInt64.random(in: randomSleepRangeMs) * 1_000_000

    print("Task start; sleep: \(randomSleepNs) ns, randomWork: \(randomWork)")

    while true {
        do {
            var x2: String = ""
            x2.reserveCapacity(2000)
            for _ in 1 ... 50 * randomWork {
                x2 += "hi"
            }
            // Thread.sleep(forTimeInterval: 0.001) // 1ms
            try await Task.sleep(nanoseconds: randomSleepNs)
        } catch {}
    }
}

@available(macOS 10.15, *)
func startLotsOfTasks(_ tasks: Int) {
    for _ in 1 ... tasks {
        Task {
            await smallInfinitiveTask()
        }
    }
}

if #available(macOS 10.15, *) {
    startLotsOfTasks(numberOfTasks)
} else {
    // Fallback on earlier versions
    print("Unsupported")
}

sleep(600)
When run on a Ryzen 5950X system, 18-19 HT cores are spent processing the workload; on an M1 Pro, only ~4.