Perf profile on i9-9980XE #2
That's definitely not an expected output. At first I suspected that the compiler might have become more clever and optimized the microbenchmark out, but I just checked that, as of the latest stable (1.39), that's not the case on my old i3-based machine.
Intuitively, the fact that all parallel versions take the same time on your machine makes me suspect that there's an issue with parallelization. Can you cross-check in your system monitor that all CPU cores actually get used during your parallel runs? Regarding RNGs, I am not comfortable with precomputing all data in advance because it...
We could probably agree on some common choice of RNG + seed though. FYI, Rust's rand library uses a few rounds of the ChaCha block cipher as its default PRNG, for security reasons. I didn't bother changing it at the time, since it was fast enough not to hide the effect I was interested in and speed of development was important. But since you're interested in a more easily reproducible configuration, I updated the latest master to use xoshiro128+ with a fixed initial seed. Note that this is not enough to make execution reproducible between e.g. sequential and N-thread runs, only across runs with the same number of threads. That should be enough: with 300M iterations, this benchmark has largely enough statistics for the exact choice of seed not to have a significant influence on the results.
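For illustration, per-worker seeding along these lines gives run-to-run reproducibility for a fixed thread count (a minimal sketch using the rand_xoshiro crate; the seed derivation is made up and not the one used in the benchmark):

```rust
use rand::SeedableRng;
use rand_xoshiro::Xoshiro128Plus;

/// Hypothetical per-worker RNG construction: each stream depends only on the
/// worker index, so two runs with the same thread count draw the same numbers,
/// while a sequential run and an N-thread run do not.
fn rng_for_worker(worker_index: u64) -> Xoshiro128Plus {
    Xoshiro128Plus::seed_from_u64(0x5EED ^ worker_index)
}
```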
Also, bear in mind that this benchmark was built to study and compare the overhead of various concurrent histogramming strategies, and therefore puts a somewhat unfair amount of worst-case emphasis on said overhead as opposed to trying to be a fair representation of how people fill histograms in practice.
What is strange is that I see multiple threads during the sequential execution, but they seem idle during the parallel execution. I don't really mind the non-reproducibility as long as it gives me a ballpark order of magnitude for the run time and for the sequential-vs-parallel speedup. In any case, the good news is that I ended up finding another 2D histogram benchmark and porting it to my runtime (and the bad news is that it's a pathological case for it). Anyway, feel free to close the issue; unfortunately I don't think I'll have the time to relearn Rust and then understand why it behaves this way on my machine.
Aha, I see what your problem is: it's in how you run the benchmarks. You see, the thing is that Rust provides a nice built-in test runner, and I find it great to be able to just annotate a bunch of functions as tests and have them all built and run for me (stable Rust does not provide a built-in benchmark runner yet). So what I do in the meantime is to abuse the test runner as a benchmark runner by...
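A sketch of that pattern (the attributes and libtest flags shown are standard Rust, but the exact invocation used by the repository is my assumption):

```rust
// Benchmarks disguised as tests: `#[ignore]` keeps them out of normal test runs.
//
// Assumed invocation (not necessarily the repository's exact one):
//   cargo test --release -- --ignored --nocapture --test-threads=1
// `--test-threads=1` stops the test runner from executing several "tests"
// concurrently, so only one benchmark runs at a time.

#[test]
#[ignore]
fn bench_sequential() {
    let start = std::time::Instant::now();
    // ... fill the histogram sequentially ...
    println!("sequential: {:?}", start.elapsed());
}

#[test]
#[ignore]
fn bench_parallel() {
    let start = std::time::Instant::now();
    // ... fill the histogram using all hardware threads ...
    println!("parallel: {:?}", start.elapsed());
}
```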
The latter flag only stops several "tests" from running at once; it does not prevent an individual "test" from spawning multiple threads internally, and that's exactly what happens here when you run the parallel versions. Anyhow, good luck with your project. Reductions are always the most "interesting" part of parallel programming toolkits due to the amount of communication involved ;) I'm personally skeptical that using a pure message-passing approach for everything is ever going to approach the performance of a well-tuned deque-based runtime in communication-heavy scenarios, but I'd love to be surprised!
I wanted to add some special support for them, but I guess I'll just do a for loop + critical sections/atomics ;)
You're in for a treat then: for both task parallelism and data parallelism, my library actually has significantly less overhead and similar or better load distribution than the other runtimes I compared against.
I didn't find any task-parallel Rust library to compare with, unfortunately. For the scheduler-overhead-focused benchmarks in my bench suite (https://github.com/mratsim/weave/tree/master/benchmarks), the difference can be very significant.
It can also deal with properly nested loops and not just OpenMP collapsed loops, which was the main reason I started this runtime: for the linear algebra / deep learning compiler I'm planning, I want nested loop parallelism to simplify, for example, batched matrix multiplication implementations. Example on a 2D tiled matrix transposition algorithm (you can embed pure C in Nim): https://github.com/mratsim/weave/tree/333375df1e268f42207bc88a4881157de3af054e/benchmarks/matrix_transposition
```nim
proc omp2DTiledCollapsedTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose with 2D tiling and collapsed loops
  const blck = 64
  {.emit: """
  #define min(a,b) (((a)<(b))?(a):(b))
  #pragma omp parallel for collapse(2)
  for (int j = 0; j < `N`; j+=`blck`)
    for (int i = 0; i < `M`; i+=`blck`)
      for (int jj = j; jj<j+`blck` && jj<`N`; jj++)
        for (int ii = i; ii<min(i+`blck`,`M`); ii++)
          `bufOut`[ii+jj*`M`] = `bufIn`[jj+ii*`N`];
  """.}
```
```nim
proc weave2DTiledNestedTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose with 2D tiling and nested parallel loops
  const blck = 64 # consts do not need to be captured
  parallelForStrided j in 0 ..< N, stride = blck:
    captures: {M, N, bufIn, bufOut}
    parallelForStrided i in 0 ..< M, stride = blck:
      captures: {j, M, N, bufIn, bufOut}
      for jj in j ..< min(j+blck, N):
        for ii in i ..< min(i+blck, M):
          bufOut[jj*M+ii] = bufIn[ii*N+jj]
```

Perf: the bench is quite unstable due to the matrices being small, but it gives a ±20% ballpark approximation.
Can you give me a sample build command for the purpose of experimenting with this further? I've never built a Nim program before, and did not find anything looking like a build system in the weave repo, so I just downloaded the latest Nim nightly (as stable wouldn't build your benchmarks) and looked around the repository. So I just wanted to check with you what's the recommended way to build a Nim program in an "as fast as possible" configuration.
In any case, here are quick and dirty Rayon ports of your simplest benchmarks. On my machine, they exhibit performance that's roughly on par with the Nim version (compiled with either the bunch of flags above or the simpler configuration I eventually found). Which, honestly, is already a pretty impressive achievement from your library, given the design constraints that you're operating under.
Nice that you figured this out, yes, that's the right way to do it. Note that I think in your fib benchmark you are spawning an extra thread compared to my bench, if I read this correctly: https://github.com/HadrienG2/weave-parallel-benchmarks-rs/blob/004a559970bf74ebd22f69fd0abe101f85765c0c/src/bin/fib.rs#L5 Here are the build command line arguments for Weave:
So, actually, that was a benchmark bug, in the sense that I was inadvertently cheating by using a Rayon construct that's idiomatic for users of that library, but less general-purpose than the Weave spawn construct and therefore more amenable to optimization by the Rayon implementors. The latest master provides a clearer choice between "use the idiomatic Rayon tool for the job" and "closely match what the Nim version of the benchmark does". It does the latter by default (let's play fair), and does the former when built with an optional Cargo feature. You will notice that the fib benchmark performs much worse in the non-idiomatic configuration, because Rayon's scope construct carries significantly more overhead than its idiomatic join.
One more note on the non-idiomatic version: the Rayon team has plans to integrate support for Rust futures in the longer term, which could be more directly comparable with Weave's future-based design (and hopefully will motivate them to improve scope's performance ;)).
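To make the contrast concrete, the two flavours look roughly like this (a simplified sketch of the idea, not the exact code from the benchmark repository):

```rust
// Idiomatic Rayon: `join` runs two closures, potentially in parallel, and is
// heavily optimized for this fork-join recursion pattern.
fn fib_join(n: u64) -> u64 {
    if n < 2 {
        return n;
    }
    let (a, b) = rayon::join(|| fib_join(n - 1), || fib_join(n - 2));
    a + b
}

// Closer to Weave's spawn: a `scope` with explicit `spawn`s is more general
// purpose, but pays for that generality with extra overhead.
fn fib_scope(n: u64) -> u64 {
    if n < 2 {
        return n;
    }
    let (mut a, mut b) = (0, 0);
    rayon::scope(|s| {
        s.spawn(|_| a = fib_scope(n - 1));
        s.spawn(|_| b = fib_scope(n - 2));
    });
    a + b
}
```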
Well, my goal with those benches is in the end to produce a tool that helps me write fast code; as long as there is a way and you don't have to bend too much compared to something natural, I think it's OK in the end. In those benches, though, the goal was to compare scheduler overhead, so those benchmarks need to spawn a couple trillion tasks to be meaningful ;). Now I think it's fine to have both implementations alongside. In my own library, reductions are unfortunately broken, so it would probably fare badly with a map-reduce implementation. (But I have ways around that: OpenMP reductions are also a bit inflexible, so I used a plain parallel for when I was still hoping to salvage OpenMP - https://github.com/numforge/laser/blob/master/examples/ex05_tensor_parallel_reduction.nim) The pattern for both fib and dfs is pretty regular, though; I think nqueens might be a bit harder without futures.
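For reference, a map-reduce style histogram fill with Rayon's fold/reduce could look roughly like this (my own sketch, not code from either repository; the binning of values into [0, 1) is an arbitrary choice for the example):

```rust
use rayon::prelude::*;

// Each Rayon job folds samples into its own private histogram, and the
// per-job histograms are then merged pairwise: a classic map-reduce fill.
fn fill_histogram(samples: &[f64], num_bins: usize) -> Vec<u64> {
    samples
        .par_iter()
        .fold(
            || vec![0u64; num_bins],
            |mut hist, &x| {
                // Assume samples lie in [0, 1) and map them linearly to bins.
                let bin = ((x * num_bins as f64) as usize).min(num_bins - 1);
                hist[bin] += 1;
                hist
            },
        )
        .reduce(
            || vec![0u64; num_bins],
            |mut acc, partial| {
                for (a, p) in acc.iter_mut().zip(partial) {
                    *a += p;
                }
                acc
            },
        )
}
```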
At the risk of sounding pedantic, even though I understand what you mean (and largely agree!), the word "task" can have two meanings in modern parallel runtimes: the unit of work that the user submits through the API (a spawned function call, a parallel loop...), and the unit of work that the runtime actually schedules on its internal queues.
The former is an API-visible runtime user concern, the latter is a runtime implementor concern that should remain hidden from the API and only emerge when the runtime takes bad scheduling decisions that force the user to exert some manual control over it. A very fertile ground for parallel runtime ergonomics and optimization over the years has been to realize that these two kinds of work items do not need to have the same granularity: a runtime can split one large user-visible task (such as a parallel loop) into many internal work items, or conversely batch many small user tasks into fewer work items.
Runtimes with dynamic work splitting make more complex dynamic scheduling decisions, and are therefore intrinsically slower (assuming an equal degree of optimization) in the ideal case of a homogeneous "parallel for" iteration workload whose execution schedule can easily be decided at compile time. But they make up for it by handling more complex scenarios like quicksort or heterogeneous hardware efficiently, without requiring manual user intervention such as specifying work granularity.

Therefore, although I agree with you that scheduling overhead can and should be benchmarked at the granularity of individual work item handling in scheduling queues, it's important to keep in mind that such a benchmark is not 100% fair to the cleverness and flexibility of modern dynamic runtimes (which cannot win against an equally well-optimized static runtime with dumber and faster scheduling decisions, only get close), and that we also need benchmarks which stress a runtime's ability to make good work splitting and merging decisions.

And this is why I thought it was interesting to have both versions of the benchmark: one which is a pure stress test for the runtime's scheduling queues (and therefore tests the runtime's low-level scheduling queue optimizations), and one which uses the native runtime API in a way that is both more convenient for the user and a bit less stressful for the runtime's queues (and therefore tests the runtime's high-level scheduling logic and "exposing the right API" optimizations).

Anyway, enough philosophical babbling about APIs and dynamic scheduling, let's try to add an nqueens port to my repo ;)
Agreed
Actually, Weave doesn't merge back parallel loops. What it does is there: https://github.com/mratsim/weave/blob/333375df1e268f42207bc88a4881157de3af054e/weave/parallel_for.nim#L23-L47

```nim
template parallelForWrapper(
    idx: untyped{ident},
    prologue, loopBody, epilogue,
    remoteAccum, resultTy,
    returnStmt: untyped): untyped =
  ## To be called within a loop task
  ## Gets the loop bounds and iterates over them
  ## Also polls steal requests in-between iterations
  ##
  ## Loop prologue, epilogue,
  ## remoteAccum, resultTy and returnStmt
  ## are unused
  block:
    let this = myTask()
    ascertain: this.isLoop
    ascertain: this.start == this.cur
    var idx {.inject.} = this.start
    this.cur += this.stride
    while idx < this.stop:
      loopBody
      idx += this.stride
      this.cur += this.stride
      loadBalance(Weave)
```

After each loop iteration, the runtime checks for incoming steal requests (an unfortunate overhead of using message passing: the workers that have actual work also have to do the load balancing, while in classic Chase-Lev deque work-stealing it's the thieves, which have nothing to do, that incur this overhead). There are many benefits to this; among others, it makes writing generic algorithms easier.
The tested operations covered both contiguous and discontiguous tensor operations.
This leads to optimizing for the common denominator (say a 4~8 core CPU) instead of properly using the hardware. My own low-level library, which I'm building as a linear-algebra/HPC/deep-learning backend, struggles with the same thing:

```nim
const OMP_MEMORY_BOUND_GRAIN_SIZE*{.intdefine.} = 1024
  ## This is the minimum amount of work per physical core
  ## for memory-bound processing.
  ## - "copy" and "addition" are considered memory-bound
  ## - "float division" can be considered 2x~4x more complex
  ##   and should be scaled down accordingly
  ## - "exp" and "sin" operations are compute-bound and
  ##   there is a perf boost even when processing
  ##   only 1000 items on 28 cores
  ##
  ## Launching 2 threads per core (HyperThreading) is probably desirable:
  ## - https://medium.com/data-design/destroying-the-myth-of-number-of-threads-number-of-physical-cores-762ad3919880
  ##
  ## Raising the following parameters can have the following impact:
  ## - number of sockets: higher, more overhead per memory fetch
  ## - number of memory channels: lower, less overhead per memory fetch
  ## - RAM speed: lower, less overhead per memory fetch
  ## - Private L2 cache: higher, feed more data per CPU
  ## - Hyperthreading and cache associativity
  ## - Cores, shared L3 cache: memory contention
  ##
  ## Note that setting num_threads manually might impact performance negatively:
  ## - http://studio.myrian.fr/openmp-et-num_threads/
  ##   > 2x2ms overhead when changing num_threads from 16->6->16

const OMP_NON_CONTIGUOUS_SCALE_FACTOR*{.intdefine.} = 4
  ## Due to striding computation, we can use a lower grainsize
  ## for non-contiguous tensors
```

Unfortunately, I'm also pretty sure that Intel TBB suffers from the same curse of "how to choose a grain size": the splitting is eager and there is no feedback loop that would inform the runtime "is it worth it to parallelize if I have 1000 exponentiations?", "what if it's 1000 cos?". So it's up to the developer to figure out the grain size, and you get issues that cannot be solved: pytorch/pytorch#3146. Now, maybe re-merging would be interesting, but I think not splitting in the first place is best ;).

Also, lazy loop splitting composes much better with nested parallelism, when parallelism is found at a higher level like another for loop (batch matrix multiplication) or graph-level parallelism for neural networks with split code-paths, see pytorch/glow#1749, taskflow/taskflow#97. Dealing with nested parallelism is actually one of the main Julia focuses with PARTR, which uses something relatively unknown: a Parallel Depth-First scheduler instead of a work-stealing scheduler. Instead of spawning tasks from the root of the task tree (assuming spawn/sync semantics), it goes depth-first to reuse the parallel hard work done by the people behind BLAS or FFTW.

In short, I think that in today's era of many-core, NUMA and hyperthreaded hardware,
a static scheduler cannot keep up, even for a simple for-loop. If you like reading about runtimes, I have a treasure chest of papers there: https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md For dealing with overheads, I have also laid out my techniques for multithreaded memory management.
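To illustrate the grain-size tuning problem discussed above, here is a rough Rayon analogue (my own sketch: the constant value and the workload are made up, and with_min_len is Rayon's mechanism for imposing a minimum chunk size per job):

```rust
use rayon::prelude::*;

// Made-up grain size in the spirit of OMP_MEMORY_BOUND_GRAIN_SIZE above: the
// "right" value depends on the operation and on the machine, and the runtime
// gets no feedback to correct a bad guess.
const MEMORY_BOUND_GRAIN_SIZE: usize = 1024;

fn exp_inplace(data: &mut [f32]) {
    data.par_iter_mut()
        // Forbid Rayon from splitting below the chosen grain size, so that
        // cheap per-element work is not drowned in scheduling overhead.
        .with_min_len(MEMORY_BOUND_GRAIN_SIZE)
        .for_each(|x| *x = x.exp());
}
```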
Some good news on my side: I finally understood why I had very low performance on parallel reductions in mratsim/weave#83. It was because my loadBalance routine didn't split parallel loop tasks, meaning I had sequential speed plus the overhead of steal requests that couldn't be satisfied. Following the fix, Weave is now faster than OpenMP on my used-to-be-pathological histogram benchmark. The last frontier is implementing a matrix multiplication that is as fast as Intel MKL GEMM; on my machine, which has a max throughput of 4 TFlops, the test case is a 1920x1920 by 1920x1920 matrix multiplication.
I released v0.3.0 of Weave with several fixes and improvements to nested data parallelism. For matrix multiplication on my 18-core machine, measured against a single-threaded baseline, I get the following speedups with Weave ("backoff" means the worker goes to sleep on steal failure rather than keep trying):
The main problem with a message-passing scheduler is that with coarse-grained tasks, the thief needs the cooperation of its victims, which might be busy in a heavy computation kernel. So despite a worst-case scenario, with a relatively long inner matmul kernel during which the victim cannot reply to thieves, and lots of synchronization due to double or triple nested parallelism, message-passing based runtimes can scale. I tried looking into pure Rust matrix multiplication as well, but ran into some roadblocks there.
Hello Hadrien,
I was looking for some nice map-reduce benchmarks to add in order to evaluate my multithreading runtime (Weave), and was thinking that histograms could be a nice example.
Your benchmark's performance profile is very strange on my machine:
This is the output of the default configuration:
Not too sure why all parallel versions are as slow as each other.
Side note on RNG
Also, you might want to preallocate the array of values so that it's easier to compare across languages without the influence of the RNG implementation.
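Something along these lines, sketched with the rand crate (the sample type and seed handling are placeholders, not what the benchmark does):

```rust
use rand::{Rng, SeedableRng};
use rand_xoshiro::Xoroshiro128Plus;

/// Pre-generate every input value up front so that the timed section only
/// measures the histogram fill, not the random number generator.
fn precompute_inputs(count: usize, seed: u64) -> Vec<f64> {
    let mut rng = Xoroshiro128Plus::seed_from_u64(seed);
    (0..count).map(|_| rng.gen::<f64>()).collect()
}
```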
AFAIK many languages (Nim, Lua, Julia) are using Xoroshiro instead of the XorShift that Rust seems to be using.
Furthermore, the Xoroshiro family all have a `jump` function that allows them to jump ahead by 2^64 generated numbers (see http://prng.di.unimi.it/xoroshiro128plus.c). This would allow reproducible RNG sequences from a single seed without conflicts, as long as each thread's stream is started 2^64 RNG iterations apart. Reference: http://prng.di.unimi.it/
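A sketch of that partitioning with the rand_xoshiro crate (my own illustration; `jump()` advances the generator by 2^64 draws, matching the C reference above):

```rust
use rand::SeedableRng;
use rand_xoshiro::Xoroshiro128Plus;

/// Derive one non-overlapping RNG stream per thread from a single seed:
/// stream i starts 2^64 * i draws into the master sequence thanks to jump().
fn per_thread_rngs(seed: u64, num_threads: usize) -> Vec<Xoroshiro128Plus> {
    let mut master = Xoroshiro128Plus::seed_from_u64(seed);
    (0..num_threads)
        .map(|_| {
            let stream = master.clone();
            master.jump(); // advance by 2^64 draws before handing out the next stream
            stream
        })
        .collect()
}
```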