Swift Actor/Tasks concurrency on Linux - Lock contention in rand() #760

Closed
@freef4ll

Description

We have a workload that chains several thousand Actors together into an AsyncStream processing pipeline.
There is a multiplication effect: a single event at the start of the pipeline is amplified, because it is delivered to several Tasks that process events concurrently. The processing time of each wakeup is currently quite small, in the range of a few microseconds.
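
As a rough illustration only (the names, the Int event type, and the wiring are hypothetical, not taken from our code), each stage looks something like an actor draining one AsyncStream and yielding into the continuation of the next:

actor Stage {
    // Each wakeup does a few microseconds of work.
    func process(_ event: Int) -> Int { event &+ 1 }
}

// Links one stage between an upstream stream and a downstream continuation.
func link(_ stage: Stage,
          input: AsyncStream<Int>,
          output: AsyncStream<Int>.Continuation) -> Task<Void, Never> {
    Task {
        for await event in input {
            output.yield(await stage.process(event))
        }
        output.finish()
    }
}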

Under Linux, when stressing this processing pipeline, ~45% of the sampled stacks show __DISPATCH_ROOT_QUEUE_CONTENDED_WAIT__(), which leads to lock contention in glibc rand(): ~60 threads are created, and they all contend here:

            7f193794a2db futex_wait+0x2b (inlined)
            7f193794a2db __GI___lll_lock_wait_private+0x2b (inlined)
            7f19378ff29b __random+0x6b (/usr/lib/x86_64-linux-gnu/libc.so.6)
            7f19378ff76c rand+0xc (/usr/lib/x86_64-linux-gnu/libc.so.6)
            7f1937bac612 __DISPATCH_ROOT_QUEUE_CONTENDED_WAIT__+0x12 (/usr/lib/swift/linux/libdispatch.so)

This occurs on every entry to DISPATCH_ROOT_QUEUE_CONTENDED_WAIT(), which uses the macro _dispatch_contention_wait_until(), which in turn uses _dispatch_contention_spins(). This is where the rand() call comes in, and the macro produces just these 4 values: 31, 63, 95 and 127, for how many pause/yield instructions to execute.
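
In libdispatch's src/shims/yield.h the spin count appears to be computed as roughly (rand() & DISPATCH_CONTENTION_SPINS_MAX) | DISPATCH_CONTENTION_SPINS_MIN, with MAX = 127 and MIN = 31. Under that assumption, a quick sketch shows why exactly those four values come out: the low five bits are always forced on, so only bits 5 and 6 of rand() matter.

import Glibc // Linux: rand() is the contended glibc call

var seen = Set<Int32>()
for _ in 0 ..< 10_000 {
    // Assumed expansion of _dispatch_contention_spins() on Linux.
    seen.insert((rand() & 127) | 31)
}
print(seen.sorted()) // [31, 63, 95, 127]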

The following example reproduces the issue: when sampling, ~28% of the time is spent in the code path mentioned above.
The example creates 5000 tasks which each do between 1 μs and 3 μs of work and then sleep for a random 6-10 milliseconds. The point of the test is to create contention and illustrate the issue with rand():

// $ swift package init --type executable --name RandomTasks
// $ cat Sources/RandomTasks/main.swift && swift run -c release

import Foundation

let numberOfTasks = 5000
let randomSleepRangeMs: ClosedRange<UInt64> = 6 ... 10

// correlates closely to the amount of work done, in microseconds
let randomWorkRange: ClosedRange<UInt32> = 1 ... 3

@available(macOS 10.15, *)
func smallInfinitiveTask() async {
    let randomWork = UInt32.random(in: randomWorkRange)
    let randomSleepNs = UInt64.random(in: randomSleepRangeMs) * 1_000_000
    print("Task start; sleep: \(randomSleepNs) ns, randomWork: \(randomWork) ")

    while true {
        do {
            var x2: String = ""
            x2.reserveCapacity(2000)
            for _ in 1 ... 50 * randomWork {
                x2 += "hi"
            }
            // Thread.sleep(forTimeInterval: 0.001) // 1ms
            try await Task.sleep(nanoseconds: randomSleepNs)
        } catch {}
    }
}

@available(macOS 10.15, *)
func startLotsOfTasks(_ tasks: Int) {
    for _ in 1 ... tasks {
        Task {
            await smallInfinitiveTask()
        }
    }
}

if #available(macOS 10.15, *) {
    startLotsOfTasks(numberOfTasks)
} else {
    // Fallback on earlier versions
    print("Unsupported")
}

sleep(600)

When run on a Ryzen 5950X system, 18-19 hyper-threaded cores are spent processing the workload, while on an M1 Pro only ~4 are.

[Image: rand-contention]
