The https://github.com/RelationalAI-oss/MultithreadingBenchmarks.jl package contains benchmarking experiments we've written to measure performance scaling for the new experimental cooperative multithreading introduced in Julia v1.3.
I'm opening this issue to discuss surprising results from our "all-threads allocating garbage" experiment, detailed here:
RelationalAI-oss/MultithreadingBenchmarks.jl#3
In that benchmark, we `@spawn` 1000 "queries", each of which calls `work(...)`, which performs 1e6 multiplications in a type-unstable way, causing O(1e6) allocations.
The experiment then runs this benchmark with increasing values of `JULIA_NUM_THREADS=` to measure how the benchmark's performance scales as Julia gets more threads. The experiment was run on a 48-core (96-vCPU) machine, though the results are similar on my local 6-core laptop.
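For reference, the core of the benchmark looks roughly like the following sketch. This is a hypothetical reconstruction, not the actual MultithreadingBenchmarks.jl code; the `Ref{Any}` accumulator here is just one way to force a boxed, allocating multiply on every iteration, standing in for the real benchmark's type instability:

```julia
using Base.Threads

# Hypothetical sketch of the benchmark shape described above; the real
# `work` lives in MultithreadingBenchmarks.jl and differs in its details.
function work(n)
    x = Ref{Any}(1.0)      # Any-typed slot: each multiply re-boxes the
    for _ in 1:n           # result, so the loop allocates O(n) objects
        x[] = x[] * 1.000001
    end
    return x[]
end

function run_queries(nqueries)
    # Spawn one task per "query" and wait for them all to finish.
    tasks = [Threads.@spawn work(1_000_000) for _ in 1:nqueries]
    return map(fetch, tasks)
end
```

Running `run_queries(1000)` under increasing `JULIA_NUM_THREADS` settings reproduces the shape of the experiment: constant total work, varying thread count.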
The experiment shows that, surprisingly, total time increases as the number of threads increases, despite the total work remaining constant (number of queries handled, number of allocations, and total memory allocated):

As outlined in the linked results, there appear to be two causes of the increase in time for each additional thread:
- The `gc_time` is increasing in each run, until we hit nthreads > number of physical cores. After that, the `gc_time` drops to 0, presumably because the GC never gets run at all. This is one surprise: we would expect the GC time to remain constant, since the number of allocations and the size of allocated memory remain constant.
- The latency for every individual query is increasing in each run -- even though the `gc_time` drops to 0 between 48 threads and 72 threads. So this increase is (at least in part) independent of the increase in `gc_time`.
So the GC is actually taking longer as you add threads, even though there's the same amount of work to do, and something else (maybe the allocation itself) is also taking longer as you add threads.
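One way to double-check whether wall-clock GC time really grows with thread count, independently of the benchmark harness's own reporting, is to diff `Base.gc_num()` around the workload. A minimal sketch, with the caveat that `GC_Num` and its `total_time` field (nanoseconds) are internal and may change between Julia versions:

```julia
# Hedged sketch: measure cumulative GC time around a workload using the
# internal Base.gc_num() counters (field names may change across versions).
function gc_seconds(f)
    before = Base.gc_num().total_time   # nanoseconds spent in GC so far
    f()
    after = Base.gc_num().total_time
    return (after - before) / 1e9
end
```

Wrapping the whole query batch in `gc_seconds` at each `JULIA_NUM_THREADS` setting would give a GC-time-vs-threads curve to compare against the reported `gc_time`.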
@staticfloat and I have noticed via profiling (detailed in RelationalAI-oss/MultithreadingBenchmarks.jl#3 (comment)) that the profiles spend increasing amounts of time in `jl_mutex_wait` as the number of threads increases.
It's not clear to me whether the profiles reflect GC time or not.
To summarize:
This benchmark is getting slower when adding threads, which is surprising. Even if garbage collection / allocation acquired a single global lock, forcing everything to run serially, I would still expect near constant time as you add threads. Instead, the time increases linearly, with a slope greater than 1. So this seems like maybe a bug somewhere?