Experimental: OpenMP #1013
Conversation
Changes Unknown when pulling 5f1455f on wiredfool:openmp into python-pillow:master.
I'm very impressed by how easy OMP is to use. But we should understand that this is not a true performance win; it is just one way to use parallelism. For example, my application is a web server which resizes images on the fly. It already works in parallel if two requests come at the same time. And in my case it is more important to ensure that small and moderate images are processed in consistent time than to try to speed up large-image resizing using all available resources. But without a doubt, OMP can be useful for a wide range of tasks.
You can easily test any number of threads with
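The command itself is not preserved in this excerpt; presumably it refers to the standard OMP_NUM_THREADS environment variable (e.g. OMP_NUM_THREADS=2), which caps the thread count at run time without rebuilding. As a hedged C illustration of the equivalent programmatic control, with a toy loop standing in for real work (this is not Pillow code):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i;
    long sum = 0;

    /* Same effect as running the program with OMP_NUM_THREADS=2 set
     * in the environment. */
    omp_set_num_threads(2);

#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000; i++) {
        sum += i;                 /* placeholder work */
    }

    printf("threads: %d, sum: %ld\n", omp_get_max_threads(), sum);
    return 0;
}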
I have an i5 with two cores and hyper-threading. I get a speedup of 1.8–1.95x with two threads and no additional speedup for three or four threads. Same for the SSE and scalar versions. So SSE instructions on different cores don't interfere with each other.
I don't think so. The SSE version with OMP works almost two times faster, so the version without SSE can't be constrained by memory bandwidth. Maximum throughput for bilinear resize on my system is 820 Mpx/s, or about 3.2 GB/s at 4 bytes per pixel, which is far from the maximum 25 GB/s memory bandwidth.
I think that this is a win, for a couple of reasons. We're basically able to speed up Pillow in 3 ways: better algorithms (e.g. iterative approximations in box blur), better implementations of those algorithms (cache awareness, SSE), and parallelization. Parallelization is orthogonal to the other two in most cases, and that's the real win here.

This is a minimally invasive, fine-grained parallelization technique. It is easily enabled and disabled, should you have coarse-grained methods already running. It doesn't appear to be dangerous: it could introduce bugs, but it's likely to introduce far fewer than if we were adding threads manually. And it is additive over the other methods, as there are SSE units available on each core (though it may impact cache awareness, but that's a tuning thing).

I remember a guest lecture back in school: the speaker was talking about the advances in large-scale finite element analysis solving. Over a decade or more, there was a 6-order-of-magnitude increase in the raw capacity of the machines, and over the same time there was just as much of an advance in solving the linear equations (iirc, it was Invert -> LU -> several different iterative methods, the later ones converging very quickly). Of course, what that really meant was that the problems got bigger and the solution time never really changed.

The other option we have for parallelization is to look at GPU-based kernels, likely with OpenCL. That is a far more developer-intensive method, and much less useful for most server-based operations. On the other hand, the speedups would likely be better than the roughly throughput * ncores we'd get here.
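To make the "easily enabled and disabled" point concrete, here is a minimal, hypothetical sketch (the function and buffer names are invented, not Pillow's actual code): the pragma is simply ignored when the compiler is not given an OpenMP flag, and an if() clause can keep small images sequential at run time.

#include <stddef.h>

/* Hypothetical per-pixel operation over an interleaved 8-bit image.
 * Compiled with -fopenmp the rows run in parallel; compiled without it,
 * the pragma is ignored and this is an ordinary sequential loop. */
void invert_image(unsigned char *data, int width, int height, int bands)
{
    int x, y;

    /* The if() clause skips threading for small images, where the
     * fork/join overhead would outweigh any gain. */
#pragma omp parallel for private(x) if(height > 32)
    for (y = 0; y < height; y++) {
        for (x = 0; x < width * bands; x++) {
            data[(size_t)y * width * bands + x] ^= 0xFF;
        }
    }
}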
I may be jumping to conclusions, but that would seem to imply that RealCores(tm) are reasonably well saturated by this workload, and that there aren't a whole lot of stalls or other inefficiencies that are easily exploitable by additional hardware threads. I could probably test this by CPU-pinning my virtual machine to specific cores. Or there are stalls, but the other thread isn't able to take advantage of them because it's stalled as well. I know I've seen some details on how to dig in and instrument that, but I'd have to find them again.
Running tonight, I'm noticing that the benchmarks are jumping around a lot: I'm seeing ±50% on some of them on sequential runs with the same code.
Changes Unknown when pulling 522db0f on wiredfool:openmp into python-pillow:master.
libImaging/Antialias.c
        case IMAGING_TYPE_FLOAT32:
            break;
        default:
            ImagingSectionLeave(&cookie);
At least there is an error in this line. In general, we could move this check before any allocations, so there is nothing to free on the error path.
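A hedged sketch of the pattern being suggested (the function name and body are illustrative, not the actual Antialias.c code): do the type check before anything is allocated or entered, so the unsupported-type path can simply return with no cleanup.

Imaging
ImagingStretchSketch(Imaging imOut, Imaging imIn)   /* hypothetical name */
{
    ImagingSectionCookie cookie;

    /* Validate first: nothing has been allocated yet, so bailing out
     * here needs no frees and no ImagingSectionLeave(). */
    switch (imIn->type) {
        case IMAGING_TYPE_UINT8:
        case IMAGING_TYPE_INT32:
        case IMAGING_TYPE_FLOAT32:
            break;
        default:
            return (Imaging) ImagingError_ModeError();
    }

    /* ... allocate any temporary buffers here ... */

    ImagingSectionEnter(&cookie);
    /* ... resampling work ... */
    ImagingSectionLeave(&cookie);

    return imOut;
}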
Right, I see that.
Indeed. Results across 20 runs of each size:
For two threads, the maximum is always almost as large as the minimum on one thread, and the average for two threads is only 1.5–1.6 times faster.
I've added standard deviations to mine, and I've noticed that the more iterations, generally the higher the standard deviation. The non-openmp versions are somewhat more consistent, but not significantly. All of these are with n=40, but I've tried up to 200. The results are representative, not necessarily the most consistent or the worst of the runs. Runs are much more consistent when the VM cores are pinned to specific processors; with <= 4 cores they go to 100% each, at 8 cores (4 + hyperthreading) they were all running at 85% or so. Deviation % is the standard deviation / mean.

No openmp:

OpenMP, 4 separate cores:
two cores + two hyperthreads
8 cores
2 cores
Another 2 core run
Current test script:
I've just re-merged this to master.
So, I've had a bit of a go with this on a monster machine: a 2 GHz, 96-core ARM machine with 128 GB of memory. Individual cores are about half as fast as my (old) laptop's cores on the test suite, but I can't say that we're particularly well optimized here. There's a sub-linear speedup across the cores: I'm seeing about 30-60% usage and speeds in the 20-30x range of a single core.

Without openmp:
With openmp, resize
While looking through the SSE version of @homm's benchmarking of the new stretch implementation, I ran across an Intel paper on speeding up imaging operations. In addition to vectorization, they used OpenMP to parallelize the loops with very little developer effort. GCC 4.4+ ships with OpenMP 3.0, which is good enough for what we would need here. I've put the pragma on the horizontal stretch inner loop and put together the necessary bits to get Pillow built with OpenMP (a rough sketch of the sort of pragma involved follows the build commands below):
python setup.py build_ext --enable-openmp install
or simply: make install-openmp
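A rough sketch of the kind of change involved (simplified, hypothetical code with a nearest-neighbour lookup standing in for the real filter loop; this is not the actual libImaging/Antialias.c implementation):

#include <stddef.h>

/* Each output row is independent, so a single pragma parallelizes the
 * horizontal pass when built with --enable-openmp (i.e. -fopenmp). */
void stretch_horizontal(unsigned char *out, const unsigned char *in,
                        int xin, int xout, int ysize)
{
    int x, y;

    /* x must be declared private, otherwise the threads would race on a
     * shared inner loop counter. */
#pragma omp parallel for private(x)
    for (y = 0; y < ysize; y++) {
        for (x = 0; x < xout; x++) {
            out[(size_t)y * xout + x] =
                in[(size_t)y * xin + ((size_t)x * xin) / xout];
        }
    }
}

In this sketch the pragma sits on the row loop to keep per-thread chunks large; where exactly it pays off in the real resampling code depends on the loop structure and is part of the tuning still to be done.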
Using the same benchmark as #977,
Current master:
With OpenMP on 4 cores of an i7 from 2008, in an Ubuntu 12.04 VM running under KVM. (So, not cutting edge, but not wimpy.)
Note that the speeds of the OpenMP version seem to be roughly constrained by memory bandwidth rather than by processor ops. I think this is a win, but it needs further investigation.
There are at least a few places where I think work may be required:
Links: