image/jpeg: Decode is slow #24499
Comments
/cc @horgh @ericlagergren
But the allocation difference is crazy. I am not sure what's worse.
@v47 can you share the code for your benchmarks?
Change https://golang.org/cl/125138 mentions this issue:
Performance does seem quite bad. Using stb_image (via cgo) seems to be much faster in a simple test I've done:
vs:
This wasn't a scientific test, but the results were markedly different between the two approaches, and consistently so (+/- 10ms here or there with a few repeats).
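For illustration only, the stb_image side of such a comparison could look roughly like the sketch below; this is not the exact code used in the test above, the input file name is a placeholder, and stb_image.h is assumed to sit next to the source file:

```go
package main

/*
#include <stdlib.h>
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h" // assumed to be in the same directory
*/
import "C"

import (
	"fmt"
	"time"
	"unsafe"
)

func main() {
	path := C.CString("sample.jpg") // placeholder input file
	defer C.free(unsafe.Pointer(path))

	var w, h, ch C.int
	start := time.Now()
	pix := C.stbi_load(path, &w, &h, &ch, 0) // decode via stb_image
	fmt.Printf("stb_image: %dx%d decoded in %v\n", int(w), int(h), time.Since(start))
	if pix != nil {
		C.stbi_image_free(unsafe.Pointer(pix))
	}
}
```

The pure-Go side of the comparison is simply `image/jpeg`'s `jpeg.Decode` on the same file, timed the same way.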
I cut what I have into a standalone benchmark, and pushed it up at https://github.com/rburchell/imgtestcase. In my case, the format conversion is costly, but loading the image itself is also far from fast. A CPU profile of repeatedly loading images gives me this:
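As a rough sketch of how such a repeated-decode benchmark and CPU profile can be collected (the testdata path is a placeholder, not taken from the repository above):

```go
package imgbench

import (
	"bytes"
	"image/jpeg"
	"os"
	"testing"
)

// BenchmarkDecode decodes the same JPEG over and over, which is the
// kind of loop being profiled above.
func BenchmarkDecode(b *testing.B) {
	data, err := os.ReadFile("testdata/sample.jpg") // placeholder image
	if err != nil {
		b.Fatal(err)
	}
	b.SetBytes(int64(len(data)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := jpeg.Decode(bytes.NewReader(data)); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running `go test -bench=Decode -cpuprofile=cpu.out` followed by `go tool pprof cpu.out` produces the kind of CPU profile referred to here.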
Thanks for creating the benchmarks!
We use benchstat for that. Could you please run each benchmark on a quiet machine, with browsers shut down, with
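For anyone unfamiliar with that workflow: the usual approach (the flag values here are only typical choices, not a prescription from this thread) is `go test -run=NONE -bench=. -count=10 > old.txt` on the unmodified tree, the same command into `new.txt` after the change, and then `benchstat old.txt new.txt` to get the significance-tested comparison tables like the ones shown later in this thread.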
Sorry, no - I don't have any more time to spend on this. I posted the results in the README, but they are quite easy to run yourself. The benchmarks show the same trend as my original (application) numbers - loading using stb_image via cgo takes around half the time.
I took a look into it. The majority of the improvement in the C version comes from SIMD (SSE2) support. If I disable SIMD and then run the benchmarks, the C version runs much slower, although the Go version is still slow compared to it.
Other than that, there are some BCE improvements that can be done in idct and fdct, but those barely give an improvement of 40-50ns. I don't see any major algorithmic improvements that can be applied. All math operations like huffman decoding, idct etc. have their fast paths, the same as the C version. There may be some logic tweaks that can be done in the top time-taking functions like
Anybody is welcome to investigate and send CLs.
Change https://golang.org/cl/167417 mentions this issue:
Before -

$ gotip build -gcflags="-d=ssa/check_bce/debug=1" fdct.go idct.go
./fdct.go:89:10: Found IsInBounds
./fdct.go:90:10: Found IsInBounds
./fdct.go:91:10: Found IsInBounds
./fdct.go:92:10: Found IsInBounds
./fdct.go:93:10: Found IsInBounds
./fdct.go:94:10: Found IsInBounds
./fdct.go:95:10: Found IsInBounds
./fdct.go:96:10: Found IsInBounds
./idct.go:77:9: Found IsInBounds
./idct.go:77:27: Found IsInBounds
./idct.go:77:45: Found IsInBounds
./idct.go:78:7: Found IsInBounds
./idct.go:78:25: Found IsInBounds
./idct.go:78:43: Found IsInBounds
./idct.go:78:61: Found IsInBounds
./idct.go:79:13: Found IsInBounds
./idct.go:92:13: Found IsInBounds
./idct.go:93:12: Found IsInBounds
./idct.go:94:12: Found IsInBounds
./idct.go:95:12: Found IsInBounds
./idct.go:97:12: Found IsInBounds
./idct.go:98:12: Found IsInBounds
./idct.go:99:12: Found IsInBounds

After -

$ gotip build -gcflags="-d=ssa/check_bce/debug=1" fdct.go idct.go
./fdct.go:90:9: Found IsSliceInBounds
./idct.go:76:11: Found IsSliceInBounds
./idct.go:145:11: Found IsSliceInBounds

name                  old time/op    new time/op    delta
FDCT-4                1.85µs ± 2%    1.74µs ± 1%    -5.95%  (p=0.000 n=10+10)
IDCT-4                1.94µs ± 2%    1.89µs ± 1%    -2.67%  (p=0.000 n=10+9)
DecodeBaseline-4      1.45ms ± 2%    1.46ms ± 1%    ~       (p=0.156 n=9+10)
DecodeProgressive-4   2.21ms ± 1%    2.21ms ± 1%    ~       (p=0.796 n=10+10)
EncodeRGBA-4          24.9ms ± 1%    25.0ms ± 1%    ~       (p=0.075 n=10+10)
EncodeYCbCr-4         26.1ms ± 1%    26.2ms ± 1%    ~       (p=0.573 n=8+10)

name                  old speed      new speed      delta
DecodeBaseline-4      42.5MB/s ± 2%  42.4MB/s ± 1%  ~       (p=0.162 n=9+10)
DecodeProgressive-4   27.9MB/s ± 1%  27.9MB/s ± 1%  ~       (p=0.796 n=10+10)
EncodeRGBA-4          49.4MB/s ± 1%  49.1MB/s ± 1%  ~       (p=0.066 n=10+10)
EncodeYCbCr-4         35.3MB/s ± 1%  35.2MB/s ± 1%  ~       (p=0.586 n=8+10)

name                  old alloc/op   new alloc/op   delta
DecodeBaseline-4      63.0kB ± 0%    63.0kB ± 0%    ~       (all equal)
DecodeProgressive-4   260kB ± 0%     260kB ± 0%     ~       (all equal)
EncodeRGBA-4          4.40kB ± 0%    4.40kB ± 0%    ~       (all equal)
EncodeYCbCr-4         4.40kB ± 0%    4.40kB ± 0%    ~       (all equal)

name                  old allocs/op  new allocs/op  delta
DecodeBaseline-4      5.00 ± 0%      5.00 ± 0%      ~       (all equal)
DecodeProgressive-4   13.0 ± 0%      13.0 ± 0%      ~       (all equal)
EncodeRGBA-4          4.00 ± 0%      4.00 ± 0%      ~       (all equal)
EncodeYCbCr-4         4.00 ± 0%      4.00 ± 0%      ~       (all equal)

Updates #24499

Change-Id: I6828d077b851817503a7c1a08235763f81bdadf9
Reviewed-on: https://go-review.googlesource.com/c/go/+/167417
Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
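The general shape of a bounds-check-elimination change like this is the standard re-slicing hint; the snippet below is only a generic illustration of the technique, not the actual fdct.go/idct.go code from the CL:

```go
package jpegbce

// Before: each of the eight loads carries its own bounds check
// (eight IsInBounds checks in the ssa/check_bce output).
func sumRow(src []int32, x int) int32 {
	return src[x] + src[x+1] + src[x+2] + src[x+3] +
		src[x+4] + src[x+5] + src[x+6] + src[x+7]
}

// After: re-slice once so the compiler can prove the slice has length 8,
// leaving a single IsSliceInBounds check instead of eight IsInBounds checks.
func sumRowBCE(src []int32, x int) int32 {
	s := src[x : x+8]
	return s[0] + s[1] + s[2] + s[3] + s[4] + s[5] + s[6] + s[7]
}
```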
What is the assembly policy for the image/** packages? I'm working on an SSE2 implementation of the IDCT, but I'm wondering if there's a reason there are no .s files in image/**.
We usually stick to pure Go in high-level packages like image/. In general, assembly is limited to a handful of lower-level packages. @nigeltao can say whether adding SIMD assembly is apt for image/ or not.
First, the AssemblyPolicy (for Go packages overall) applies equally well to the image/** packages.

Amongst other things, standard library code is often read by people learning Go, so for that code we favor simplicity and readability over raw performance more than other Go packages might. Neither position is wrong, just a different trade-off.

If adding a small amount of SIMD assembly to the standard library gets you a 1.5x benchmark improvement, then I'd probably take it. If adding a large amount of SIMD assembly gets you a 1.05x benchmark improvement, then I'd probably reject it. What "small" and "large" mean is subjective, and it's hard to say without a specific SIMD code listing.

Note also that, when I say benchmarks, I'm primarily concerned with overall decode/encode benchmarks, not just FDCT/IDCT benchmarks. Users want to decode a JPEG image; they don't want to run IDCTs directly. For example, https://go-review.googlesource.com/c/go/+/167417/2//COMMIT_MSG shows a 1.03x benchmark improvement to IDCTs, but no significant change to the overall decode benchmarks. If that change were a SIMD assembly change, I'd reject it as being of too small a benefit compared to the costs of supporting assembly.
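For concreteness, the mechanical side of such a change would follow the usual standard-library pattern of an architecture-gated assembly stub plus a pure-Go fallback. The sketch below is hypothetical: the names idctSSE2 and idct_amd64.s are invented for illustration, and the actual SSE2 body would live in the .s file.

```go
// idct_amd64.go (hypothetical file)
//go:build amd64

package jpeg

// idctSSE2 would be implemented in SSE2 assembly in idct_amd64.s.
// block is the package's existing 8x8 coefficient array type.
//go:noescape
func idctSSE2(src *block)

func idct(src *block) { idctSSE2(src) }

// The existing pure-Go idct in idct.go would then be guarded by a
// //go:build !amd64 constraint so every other architecture keeps
// the current implementation.
```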
benchmark          old ns/op   new ns/op   delta
Benchmark_Load-8   49097440    10258184    -79.11%

benchmark          old allocs  new allocs  delta
Benchmark_Load-8   616333      55          -99.99%

benchmark          old bytes   new bytes   delta
Benchmark_Load-8   2937995     1013548     -65.50%

Ref: golang/go#24499 golang/go#15759
macOS Sierra
go version go1.10 darwin/amd64
CPU: 3.5 GHz Intel Core i7

I noticed that decoding a JPEG is very slow.
Decoding a 1920x1080 JPEG image, I tested github.com/pixiv/go-libjpeg/jpeg against the native image/jpeg:

go 1.10 jpeg.Decode ≈ 30 ms, CPU ≈ 15 %
libjpeg jpeg.Decode ≈ 7 ms, CPU ≈ 4 %

Will it ever be as fast as other libraries? Is it possible that the native implementation will become faster in future versions?
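For reference, the kind of measurement behind numbers like these is just a wall-clock timing around the decode call. A minimal sketch with the standard library is below; the file name is a placeholder, and swapping the decode call for go-libjpeg's decoder on the same file gives the second row of the comparison.

```go
package main

import (
	"fmt"
	"image/jpeg"
	"os"
	"time"
)

func main() {
	f, err := os.Open("frame_1920x1080.jpg") // placeholder input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	start := time.Now()
	img, err := jpeg.Decode(f) // standard library decoder
	if err != nil {
		panic(err)
	}
	fmt.Println(img.Bounds(), "decoded in", time.Since(start))
}
```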