"All-threads allocating garbage" Multithreading Benchmark shows significant slowdown as nthreads increases #33033

@NHDaly

Description

The https://github.com/RelationalAI-oss/MultithreadingBenchmarks.jl package contains benchmarking experiments we've written to measure performance scaling for the new experimental cooperative multithreading introduced in Julia v1.3.

I'm opening this issue to discuss surprising results from our "all-threads allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#3

In that benchmark, we @spawn 1000 "queries", each of which calls work(...), which performs 1e6 multiplications in a type-unstable way, causing O(1e6) allocations.
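The actual work(...) lives in the linked MultithreadingBenchmarks.jl repo; a minimal stand-in that reproduces the shape of the benchmark (names here are hypothetical, not the package's API) might look like:

```julia
using Base.Threads: @spawn

# Stand-in for the benchmark's work(...): the Any-typed box makes every
# multiplication type-unstable, so each iteration allocates a fresh boxed Float64.
function work(n)
    x = Ref{Any}(1.0)
    for _ in 1:n
        x[] = x[] * 1.000001
    end
    return x[]::Float64
end

# Spawn nqueries independent tasks and wait for all of them,
# mirroring the 1000-query setup described above.
function run_queries(nqueries, n)
    tasks = [@spawn work(n) for _ in 1:nqueries]
    foreach(wait, tasks)
end
```

Running `@time run_queries(1000, 10^6)` reports the allocation count and GC percentage, which is how the totals stay constant across thread counts.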

The experiment then runs this benchmark with increasing values of JULIA_NUM_THREADS= to measure how performance scales as Julia has more threads. The experiment was run on a 48-core (96-vCPU) machine, though the results are similar on my local 6-core laptop.

The experiment shows that, surprisingly, total time increases as the number of threads increases, despite the total work remaining constant (number of queries handled, number of allocations, and total memory allocated):
[Plot: absolute benchmark time vs. number of threads (all_tasks_allocating-abs_time_plot, 96 vCPUs)]

This seems to be explained by two factors, as outlined in the linked results:

Notice that there seem to be two causes of the increase in time for each addition of threads:

  1. The gc_time is increasing in each run, until we hit nthreads > number of physical cores. After that, the gc_time drops to 0, presumably because the GC never gets run at all. This is one surprise: we would expect the GC time to remain constant, since the number of allocations and size of allocated memory remain constant.
  2. The latency for every individual query is increasing in each run -- even though the gc_time drops to 0 between 48 threads and 72 threads. So this increase is (at least in part) independent of the increase in gc_time.
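For reference, the per-run gc_time quoted above is the kind of number @timed reports (on Julia ≥ 1.5 it returns a named tuple with a gctime field, in seconds; on 1.3 the same values come back as a plain tuple):

```julia
# Allocate a lot of short-lived boxed values, then inspect time spent in GC.
stats = @timed sum(Any[rand() for _ in 1:10^6])
println("total: ", stats.time, " s, gc: ", stats.gctime, " s, bytes: ", stats.bytes)
```

If the GC never runs during the workload, gctime comes back as 0 even though bytes allocated is unchanged, which matches the nthreads > physical-cores behavior described above.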

So the GC is actually taking longer as you add threads, even though there's the same amount of work to do, and something else (perhaps allocation itself) is also taking longer as you add threads, again despite the constant amount of work.

@staticfloat and I have noticed via profiling (detailed in RelationalAI-oss/MultithreadingBenchmarks.jl#3 (comment)) that the profiles spend increasing amounts of time in jl_mutex_wait as the number of threads increases.
It's not clear to me whether the profiles reflect GC time or not.
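The jl_mutex_wait frames come from the runtime's C code, so Profile only shows them when C frames are included in the report. A sketch of how such a profile can be taken (workload here is a hypothetical stand-in for the benchmark, not its actual code):

```julia
using Profile
using Base.Threads: @spawn

# Any allocation-heavy multithreaded workload will do as a stand-in.
workload() = foreach(wait, [@spawn sum(Any[rand() for _ in 1:10^4]) for _ in 1:50])

workload()               # warm up so compilation isn't profiled
Profile.clear()
@profile workload()
Profile.print(C = true)  # C = true includes runtime frames such as jl_mutex_wait
```

Without C = true, time spent inside the runtime (GC, allocation, lock waits) is attributed to whichever Julia frame triggered it, which is why it's hard to tell from the default view whether these samples are GC time.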


To summarize:

This benchmark is getting slower when adding threads, which is surprising. Even if garbage collection / allocation acquired a single global lock, forcing everything to run serially, I would still expect near constant time as you add threads. Instead, the time increases linearly, with a slope greater than 1. So this seems like maybe a bug somewhere?

Labels: multithreading (Base.Threads and related functionality), performance (Must go faster)