Experimental: OpenMP #1013

Open · wants to merge 4 commits into main

Conversation

wiredfool
Member

While looking through @homm's benchmarking of the SSE version of the new stretch implementation, I ran across an Intel paper on speeding up imaging operations. In addition to vectorization, they used OpenMP to parallelize the loops with very little developer effort. GCC 4.4+ ships with OpenMP 3.0, which is good enough for what we would need here. I've put the pragma on the horizontal stretch inner loop and put together the necessary bits to get Pillow built with OpenMP.

python setup.py build_ext --enable-openmp install or simply make install-openmp
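
For reference, a minimal sketch of the kind of pragma involved (illustrative only -- this is not the actual Stretch() code from this PR, and the names, bounds layout and indexing are made up):

/* Illustrative sketch, not the Pillow code in this PR.  One pragma
 * parallelizes the loop over output rows of a horizontal resample;
 * each thread writes a disjoint set of rows, so no locking is needed.
 * Compile with gcc -fopenmp. */
static void
resample_horiz(float *out, const float *in,
               int in_xsize, int out_xsize, int ysize,
               const int *bounds, const float *kernel, int ksize)
{
    int yy;
#pragma omp parallel for schedule(static)
    for (yy = 0; yy < ysize; yy++) {
        int xx;
        for (xx = 0; xx < out_xsize; xx++) {
            float ss = 0.0f;
            int k;
            /* bounds[xx] is the first source pixel contributing to output xx */
            for (k = 0; k < ksize; k++)
                ss += in[yy * in_xsize + bounds[xx] + k] *
                      kernel[xx * ksize + k];
            out[yy * out_xsize + xx] = ss;
        }
    }
}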

Using the same benchmark as #977,

Current master:

Interpolation | Size      | Time
------------- | --------- | -------
Antialias     | 2048x1152 | 0.4238
Antialias     | 320x240   | 0.2331
Bicubic       | 2048x1152 | 0.3306
Bicubic       | 320x240   | 0.1509
Bilinear      | 2048x1152 | 0.2369
Bilinear      | 320x240   | 0.08423

With OpenMP on 4 cores of a 2008-era i7, in an Ubuntu 12.04 VM running under KVM (so, not cutting edge, but not wimpy):

Interpolation | Size      | Time
------------- | --------- | -------
Antialias     | 2048x1152 | 0.1845
Antialias     | 320x240   | 0.08882
Bicubic       | 2048x1152 | 0.212
Bicubic       | 320x240   | 0.1025
Bilinear      | 2048x1152 | 0.1741
Bilinear      | 320x240   | 0.06423

Note that the speeds of the OpenMP version seem to be roughly constrained by memory bandwidth rather than by processor ops. I think this is a win, but it needs further investigation.

There are at least a few places where I think work may be required:

  • Currently gcc only. I think Windows could be supported easily, but there's currently no support for OpenMP in clang/Xcode. There is OpenMP for clang, but it's not in mainline yet.
  • Unsure of the packaging issues for binaries.
  • Not sure what the performance will be like on systems that advertise 32 threads but only provide 1.5 cores' worth (e.g. Travis).
  • Unsure of interaction with SSE. Would be awesome if it boosted that as well.
  • Testing and benching is going to be important.
  • Can't return from inside the loop, so errors need to be trapped elsewhere. I didn't fix that here, but it would need to be done prior to actual use; see the sketch below.
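
As a sketch of what trapping errors outside the loop might look like (process_row() is a hypothetical per-row worker, not an existing function):

/* Sketch only: an OpenMP loop body can't return early, so record a
 * failure flag inside the loop and act on it once the parallel region
 * has finished. */
extern int process_row(void *im, int yy);   /* hypothetical worker */

static int
process_image(void *im, int ysize)
{
    int error = 0;
    int yy;
#pragma omp parallel for
    for (yy = 0; yy < ysize; yy++) {
        if (process_row(im, yy) < 0)
            error = 1;   /* benign race: every failing thread stores the same value */
    }
    if (error) {
        /* clean up and report the failure here, outside the parallel loop */
        return -1;
    }
    return 0;
}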

Links:

@coveralls

Coverage Status: Changes Unknown when pulling 5f1455f on wiredfool:openmp into python-pillow:master.

@homm
Member

homm commented Nov 16, 2014

I'm very impressed by how easy OMP is to use. We should understand that this is not a true performance win; it is just one way to use parallelism. For example, my application is a web server that resizes images on the fly. It already works in parallel if two requests come in at the same time. And in my case it is more important to ensure that small and moderate images are processed in a consistent time than to try to speed up large-image resizing using all available resources. But without a doubt, OMP can be useful for a wide range of tasks.

Not sure what the performance will be like on systems that advertise 32 threads but only provide 1.5 worth of cores. (e.g. travis).

You can easily test any number of threads with the OMP_NUM_THREADS env var. I haven't noticed any slowdown on a VM with one core and OMP_NUM_THREADS=32. We could also set OMP_NUM_THREADS to 4, for example, specifically for Travis.
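
For completeness, a tiny sketch of the runtime side: OMP_NUM_THREADS only sets the initial default that omp_get_max_threads() reports, and the same cap can be applied from C:

/* Minimal sketch: query and cap the OpenMP thread count from code.
 * OMP_NUM_THREADS in the environment sets the initial default.
 * Compile with gcc -fopenmp. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("default max threads: %d\n", omp_get_max_threads());
    omp_set_num_threads(4);   /* e.g. cap CI runs at 4 threads */
    printf("capped max threads:  %d\n", omp_get_max_threads());
    return 0;
}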

Unsure of interaction with SSE. Would be awesome if it boosted that as well.

I have an i5 with two cores and hyper-threading. I get a speedup of 1.8–1.95× with two threads and no additional speedup with three or four threads. The same holds for the SSE and scalar versions, so SSE instructions on different cores don't interfere with each other.

speeds of the OpenMP version seem to be roughly constrained by memory bandwidth

I don't think so. The SSE version with OMP runs almost twice as fast, so the version without SSE can't be constrained by memory bandwidth. Maximum throughput for bilinear resize on my system is 820 Mpx/s, or about 3.2 GB/s at 4 bytes per pixel, which is far from the maximum 25 GB/s of memory bandwidth.

@wiredfool
Member Author

I think that this is a win, for a couple of reasons. We're basically able to speed up Pillow in three ways: better algorithms (e.g. iterative approximations in box blur), better implementations of those algorithms (cache awareness, SSE), and parallelization. Parallelization is orthogonal to the other two in most cases, and that's the real win here. This is a minimally invasive, fine-grained parallelization technique. It is easily enabled and disabled, should you have coarse-grained methods already running. It doesn't appear to be dangerous -- it could introduce bugs, but it's likely to introduce far fewer than if we were adding threads manually. And it is additive over the other methods, as there are SSE units available on each core (though it may impact cache awareness, but that's a tuning matter).

I remember a guest lecture back in school -- the speaker was talking about the advances in large-scale finite element analysis solving. Over a decade or more, there was a six order of magnitude increase in the raw capacity of the machines, and over the same time there was just as much of an advance in solving the linear equations (IIRC it went from inversion to LU to several different iterative methods, the later ones converging very quickly). Of course, what that really meant was that the problems got bigger and the solution time never really changed.

The other option we have for parallelization is to look at GPU-based kernels, likely with OpenCL. This is a far more developer-intensive approach, and much less useful for most server-based operations. On the other hand, the speedups are likely to be better than the roughly throughput × ncores we'd see here.

I get a speedup of 1.8–1.95× with two threads and no additional speedup with three or four threads

I may be jumping to conclusions, but that would seem to imply that RealCores(tm) are reasonably well saturated by this workload, and that there aren't a whole lot of stalls or other inefficiencies that are easily exploitable by additional hardware threads. I could probably test this by CPU-pinning my virtual machine to specific cores. Or, there are stalls, but the other thread isn't able to take advantage of them because it's stalled as well. I know I've seen some details on how to dig in and instrument that, but I'd have to find them again.

@wiredfool
Member Author

Running tonight, I'm noticing that the benchmarks are jumping around a lot -- I'm seeing ±50% on some of them on sequential runs with the same code.

@coveralls

Coverage Status: Changes Unknown when pulling 522db0f on wiredfool:openmp into python-pillow:master.

    case IMAGING_TYPE_FLOAT32:
        break;
    default:
        ImagingSectionLeave(&cookie);
Member

At least, there is an error in this line.

In general, we could move this code before any allocations, so that nothing needs to be freed on the error path.

Member Author

Right, I see that.

@homm
Member

homm commented Nov 19, 2014

Running tonight I'm noticing that the benchmarks are jumping around a lot

Indeed. Results across 20 runs for each size:

Without OpenMP
Antialias | 2048x1152 | min 0.4323 max 0.4961 average 0.4561
Antialias | 320x240   | min 0.2299 max 0.2884 average 0.2408
Bicubic   | 2048x1152 | min 0.3391 max 0.3988 average 0.3549
Bicubic   | 320x240   | min 0.1581 max 0.1913 average 0.1656
Bilinear  | 2048x1152 | min 0.2419 max 0.2977 average 0.2586
Bilinear  | 320x240   | min 0.0845 max 0.1098 average 0.0887

OpenMP, One thread
Antialias | 2048x1152 | min 0.4404 max 0.5129 average 0.4621
Antialias | 320x240   | min 0.2337 max 0.2973 average 0.2477
Bicubic   | 2048x1152 | min 0.3417 max 0.4104 average 0.3591
Bicubic   | 320x240   | min 0.1607 max 0.1918 average 0.1688
Bilinear  | 2048x1152 | min 0.2492 max 0.3304 average 0.2624
Bilinear  | 320x240   | min 0.0856 max 0.1060 average 0.0886

OpenMP, Two threads
Antialias | 2048x1152 | min 0.2344 max 0.3851 average 0.2925
Antialias | 320x240   | min 0.1198 max 0.2172 average 0.1585
Bicubic   | 2048x1152 | min 0.1780 max 0.3180 average 0.2313
Bicubic   | 320x240   | min 0.0816 max 0.1428 average 0.0936
Bilinear  | 2048x1152 | min 0.1312 max 0.2244 average 0.1741
Bilinear  | 320x240   | min 0.0442 max 0.0843 average 0.0548

For two threads, the maximum is always almost as large as the minimum on one thread, and the average for two threads is only 1.5–1.6 times faster.

@wiredfool
Member Author

I've added standard deviations to mine, and I've noticed that, generally, the more iterations, the higher the standard deviation. The non-OpenMP versions are somewhat more consistent, but not significantly. All of these are with n=40, but I've tried up to 200. The results are representative, not necessarily the most consistent or the worst of the runs. Runs are much more consistent when the VM cores are pinned to specific processors: with <= 4 cores they each go to 100%; at 8 cores (4 + hyperthreading) they were all running at about 85%.

Deviation % is the standard deviation/mean

No openmp

Interpolation | Size      |  min  |  max  |  mean | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.517 | 0.663 | 0.589 | 0.599  | 0.0459 |  7.8%
Antialias     | 320x240   | 0.310 | 0.376 | 0.330 | 0.329  | 0.0178 |  5.4%
Bicubic       | 2048x1152 | 0.439 | 0.543 | 0.472 | 0.463  | 0.0339 |  7.2%
Bicubic       | 320x240   | 0.205 | 0.229 | 0.208 | 0.206  | 0.0071 |  3.4%
Bilinear      | 2048x1152 | 0.318 | 0.357 | 0.338 | 0.336  | 0.0111 |  3.3%
Bilinear      | 320x240   | 0.114 | 0.124 | 0.115 | 0.114  | 0.0023 |  2.0%

OpenMP

4 separate cores:

Interpolation | Size      |  min  |  max  |  mean | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.248 | 0.313 | 0.274 | 0.276  | 0.0222 |  8.1%
Antialias     | 320x240   | 0.172 | 0.198 | 0.179 | 0.174  | 0.0101 |  5.6%
Bicubic       | 2048x1152 | 0.265 | 0.311 | 0.289 | 0.286  | 0.0146 |  5.0%
Bicubic       | 320x240   | 0.136 | 0.162 | 0.149 | 0.159  | 0.0114 |  7.7%
Bilinear      | 2048x1152 | 0.201 | 0.244 | 0.214 | 0.216  | 0.0105 |  4.9%
Bilinear      | 320x240   | 0.075 | 0.077 | 0.075 | 0.075  | 0.0005 |  0.7%

two cores + two hyperthreads

Interpolation | Size      |  min  |  max  |  mean | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.250 | 0.331 | 0.284 | 0.287  | 0.0249 |  8.8%
Antialias     | 320x240   | 0.183 | 0.215 | 0.193 | 0.185  | 0.0121 |  6.3%
Bicubic       | 2048x1152 | 0.261 | 0.307 | 0.271 | 0.265  | 0.0129 |  4.8%
Bicubic       | 320x240   | 0.146 | 0.179 | 0.162 | 0.173  | 0.0135 |  8.3%
Bilinear      | 2048x1152 | 0.200 | 0.253 | 0.226 | 0.228  | 0.0127 |  5.6%
Bilinear      | 320x240   | 0.074 | 0.079 | 0.076 | 0.076  | 0.0014 |  1.9%

8 cores

Interpolation | Size      |  min  |  max  |  mean | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.149 | 0.234 | 0.181 | 0.183  | 0.0199 | 11.0%
Antialias     | 320x240   | 0.102 | 0.197 | 0.128 | 0.119  | 0.0216 | 16.9%
Bicubic       | 2048x1152 | 0.190 | 0.275 | 0.234 | 0.234  | 0.0245 | 10.4%
Bicubic       | 320x240   | 0.105 | 0.178 | 0.133 | 0.130  | 0.0176 | 13.2%
Bilinear      | 2048x1152 | 0.142 | 0.276 | 0.191 | 0.188  | 0.0279 | 14.6%
Bilinear      | 320x240   | 0.050 | 0.140 | 0.087 | 0.086  | 0.0206 | 23.8%

2 cores

Interpolation | Size      |  min  |  max  |  mean | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.293 | 0.408 | 0.329 | 0.324  | 0.0310 |  9.4%
Antialias     | 320x240   | 0.191 | 0.234 | 0.210 | 0.208  | 0.0102 |  4.9%
Bicubic       | 2048x1152 | 0.283 | 0.344 | 0.298 | 0.284  | 0.0180 |  6.0%
Bicubic       | 320x240   | 0.135 | 0.166 | 0.140 | 0.136  | 0.0071 |  5.1%
Bilinear      | 2048x1152 | 0.208 | 0.258 | 0.223 | 0.222  | 0.0119 |  5.3%
Bilinear      | 320x240   | 0.071 | 0.158 | 0.086 | 0.072  | 0.0273 | 31.7%

Another 2 core run

Interpolation | Size      |  min  |  max  |  mean | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.318 | 0.403 | 0.347 | 0.346  | 0.0200 |  5.8%
Antialias     | 320x240   | 0.186 | 0.301 | 0.220 | 0.215  | 0.0189 |  8.6%
Bicubic       | 2048x1152 | 0.286 | 0.397 | 0.318 | 0.311  | 0.0237 |  7.5%
Bicubic       | 320x240   | 0.137 | 0.168 | 0.151 | 0.150  | 0.0084 |  5.6%
Bilinear      | 2048x1152 | 0.230 | 0.304 | 0.251 | 0.253  | 0.0127 |  5.0%
Bilinear      | 320x240   | 0.076 | 0.087 | 0.082 | 0.083  | 0.0027 |  3.3%

Current test script:

from PIL import Image
import time
import math

def timeit(n, f, *args, **kwargs):
    def run():
        start = time.time()
        f(*args, **kwargs)
        return time.time() - start

    runs = [run() for _ in range(n)]
    mean = sum(runs)/float(n)
    stddev = math.sqrt(sum((r-mean)**2 for r in runs)/float(n))
    return {'mean':mean,
            'median': sorted(runs)[int(n/2)],
            'min': min(runs),
            'max': max(runs),
            'stddev':stddev,
            'dev_pct': stddev/mean*100.0
            }

    #return min(run() for _ in range(n))

n = 40
image = Image.open('5k_image.png').copy()
print 'warmup {mean:.4}'.format(**timeit(n // 4, image.im.stretch, (2048, 1152), Image.ANTIALIAS))
print "%s runs"%n
print "Interpolation | Size  |  min  |  max  |  mean | median| stddev | Dev %"
print "--------- | --------- | ----- | ----- | ----- | ----- | -----  | ----"
print 'Antialias | 2048x1152 | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (2048, 1152), Image.ANTIALIAS))
print 'Antialias | 320x240   | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (320, 240),   Image.ANTIALIAS))
print 'Bicubic   | 2048x1152 | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (2048, 1152), Image.BICUBIC))
print 'Bicubic   | 320x240   | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (320, 240),   Image.BICUBIC))
print 'Bilinear  | 2048x1152 | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (2048, 1152), Image.BILINEAR))
print 'Bilinear  | 320x240   | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (320, 240),   Image.BILINEAR))
"""

@homm homm mentioned this pull request Feb 11, 2015
@aclark4life aclark4life added this to the Future milestone Apr 1, 2015
@wiredfool wiredfool removed the No Auto label Jun 17, 2015
@wiredfool
Member Author

I've just remerged this to master.

@wiredfool
Member Author

wiredfool commented Nov 30, 2017

So, I've had a bit of a go with this on a monster machine: a 2 GHz, 96-core ARM machine with 128 GB of memory. Individual cores are about half as fast as my (old) laptop cores on the test suite, but I can't say that we're particularly well optimized here. There's a sub-linear speedup across the cores; I'm seeing about 30-60% usage and speeds in the 20-30x range of a single core.

without openmp:

Interpolation | Size      |  min  |  max  |  mean | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.947 | 0.952 | 0.949 | 0.949  | 0.0011 |  0.1%
Antialias     | 320x240   | 0.565 | 0.571 | 0.566 | 0.565  | 0.0010 |  0.2%
Bicubic       | 2048x1152 | 0.637 | 0.649 | 0.638 | 0.637  | 0.0019 |  0.3%
Bicubic       | 320x240   | 0.391 | 0.391 | 0.391 | 0.391  | 0.0000 |  0.0%
Bilinear      | 2048x1152 | 0.416 | 0.416 | 0.416 | 0.416  | 0.0000 |  0.0%
Bilinear      | 320x240   | 0.227 | 0.227 | 0.227 | 0.227  | 0.0000 |  0.0%

With openmp, resize

Interpolation | Size      |  min  |  max  |  mean | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.036 | 0.043 | 0.037 | 0.037  | 0.0010 |  2.7%
Antialias     | 320x240   | 0.027 | 0.030 | 0.028 | 0.028  | 0.0007 |  2.5%
Bicubic       | 2048x1152 | 0.019 | 0.023 | 0.020 | 0.020  | 0.0006 |  3.0%
Bicubic       | 320x240   | 0.008 | 0.016 | 0.012 | 0.013  | 0.0023 | 18.9%
Bilinear      | 2048x1152 | 0.008 | 0.009 | 0.008 | 0.008  | 0.0001 |  1.1%
Bilinear      | 320x240   | 0.005 | 0.005 | 0.005 | 0.005  | 0.0001 |  1.8%

(Screenshot: big_arm_htop2)
