The https://github.com/RelationalAI-oss/MultithreadingBenchmarks.jl package contains benchmarking experiments we've written to measure performance scaling for the new experimental cooperative multithreading introduced in Julia v1.3.
I'm opening this issue to discuss surprising results from our "all-threads allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#3
In that benchmark, we `@spawn` 1000 "queries", each of which calls `work(...)`, which performs 1e6 multiplications in a type-unstable way, causing O(1e6) allocations.
The experiment then runs this benchmark with increasing values of `JULIA_NUM_THREADS=` to measure how the benchmark's performance scales as Julia gets more threads. The experiment was run on a 48-core (96-vCPU) machine, though the results are similar on my local 6-core laptop.
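For reference, the core of the benchmark looks roughly like the following sketch. This is a hypothetical reconstruction, not the actual MultithreadingBenchmarks.jl code; the `Ref{Any}` accumulator here is just one way to force a boxed, allocating multiply on every iteration, standing in for the real benchmark's type instability:

```julia
using Base.Threads

# Hypothetical sketch of the benchmark shape described above; the real
# `work` lives in MultithreadingBenchmarks.jl and differs in its details.
function work(n)
    x = Ref{Any}(1.0)      # Any-typed slot: each multiply re-boxes the
    for _ in 1:n           # result, so the loop allocates O(n) objects
        x[] = x[] * 1.000001
    end
    return x[]
end

function run_queries(nqueries)
    # Spawn one task per "query" and wait for them all to finish.
    tasks = [Threads.@spawn work(1_000_000) for _ in 1:nqueries]
    return map(fetch, tasks)
end
```

Running `run_queries(1000)` under increasing `JULIA_NUM_THREADS` settings reproduces the shape of the experiment: constant total work, varying thread count.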
The experiment shows that, surprisingly, total time increases as the number of threads increases, despite the total work remaining constant (number of queries handled, number of allocations, and total memory allocated):

As outlined in the linked results, there appear to be two causes of the increase in time for each additional thread:
- The `gc_time` is increasing in each run, until we hit nthreads > number of physical cores. After that, the `gc_time` drops to 0, presumably because the GC never gets run at all. This is one surprise: we would expect the GC time to remain constant, since the number of allocations and the size of allocated memory remain constant.
- The latency for every individual query is increasing in each run -- even though the `gc_time` drops to 0 between 48 threads and 72 threads. So this increase is (at least in part) independent of the increase in `gc_time`.
So the GC is actually taking longer as you add threads, even though there's the same amount of work to do, and something else (maybe the allocation itself) is also taking longer as you add threads.
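One way to double-check whether wall-clock GC time really grows with thread count, independently of the benchmark harness's own reporting, is to diff `Base.gc_num()` around the workload. A minimal sketch, with the caveat that `GC_Num` and its `total_time` field (nanoseconds) are internal and may change between Julia versions:

```julia
# Hedged sketch: measure cumulative GC time around a workload using the
# internal Base.gc_num() counters (field names may change across versions).
function gc_seconds(f)
    before = Base.gc_num().total_time   # nanoseconds spent in GC so far
    f()
    after = Base.gc_num().total_time
    return (after - before) / 1e9
end
```

Wrapping the whole query batch in `gc_seconds` at each `JULIA_NUM_THREADS` setting would give a GC-time-vs-threads curve to compare against the reported `gc_time`.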
@staticfloat and I have noticed via profiling (detailed in RelationalAI-oss/MultithreadingBenchmarks.jl#3 (comment)) that the profiles spend increasing amounts of time in `jl_mutex_wait` as the number of threads increases.
It's not clear to me whether the profiles reflect GC time or not.
To summarize:
This benchmark is getting slower when adding threads, which is surprising. Even if garbage collection / allocation acquired a single global lock, forcing everything to run serially, I would still expect near constant time as you add threads. Instead, the time increases linearly, with a slope greater than 1. So this seems like maybe a bug somewhere?