Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

image/jpeg: Decode is slow #24499

Open
deepch opened this issue Mar 23, 2018 · 13 comments
Open

image/jpeg: Decode is slow #24499

deepch opened this issue Mar 23, 2018 · 13 comments
Labels
help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@deepch
Copy link

deepch commented Mar 23, 2018

Mac OS Sierra
go version go1.10 darwin/amd64
CPU 3,5 GHz Intel Core i7

I noticed that the use of decode jpeg is very slow.

decode image jpeg 1920x1080

I test github.com/pixiv/go-libjpeg/jpeg and native jpeg

go 1.10 jpeg.decode ≈ 30 ms cpu ≈ 15 %
libjpeg jpeg.decode ≈ 7 ms cpu ≈ 4 %

will it ever go as fast as other libraries?
is it possible that in the next versions the native implementation will become faster?

@dsnet dsnet changed the title jpeg decode slow more 50% other lib image/jpeg: Decode is slow Mar 23, 2018
@dsnet dsnet added Performance help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 23, 2018
@dsnet dsnet added this to the Unplanned milestone Mar 23, 2018
@odeke-em
Copy link
Member

/cc @horgh @ericlagergren

@v47
Copy link

v47 commented Apr 9, 2018

BenchmarkJPEG-8    1        56942870038 ns/op       12443537192 B/op            1558 allocs/op
BenchmarkPixiv-8   1        19192858155 ns/op        4058312304 B/op          141851 allocs/op

But allocation difference is crazy. I am not sure what's worse.

@ysmolski
Copy link
Member

@v47 can you share the code for your benchmarks?

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/125138 mentions this issue: image/jpeg: decomposes scan loops and pre-computes values

@rburchell
Copy link

rburchell commented Nov 8, 2018

Performance does seem quite bad. Using stb_image (via cgo) seems to be much faster in a simple test I've done:

2018/11/08 14:58:33 Loaded ../resources/Free_Spring_Blossoms_on_Blue_Creative_Commons_(3457362713).jpg via image: 234.715104ms
2018/11/08 14:58:34 Loaded ../resources/Free_Unedited_Happy_Little_Yellow_Stars_in_Pink_Flower_Creative_Commons_(2898759838).jpg via image: 307.379436ms

vs:

2018/11/08 15:00:07 Loaded ../resources/Free_Spring_Blossoms_on_Blue_Creative_Commons_(3457362713).jpg via stb_image: 93.289255ms
2018/11/08 15:00:07 Loaded ../resources/Free_Unedited_Happy_Little_Yellow_Stars_in_Pink_Flower_Creative_Commons_(2898759838).jpg via stb_image: 167.929126ms

This wasn't a scientific test, but the results were markedly different between the two approaches, and consistently so (+/- 10ms here or there with a few repeats).

@rburchell
Copy link

I cut what I have into a standalone benchmark, and pushed it up at https://github.com/rburchell/imgtestcase. In my case, the format conversion is costly, but loading the image itself is also far from fast.

A CPU profile of repeatedly loading images gives me this:

      flat  flat%   sum%        cum   cum%
    5340ms 19.51% 19.51%     5340ms 19.51%  image/internal/imageutil.DrawYCbCr
    5260ms 19.22% 38.73%     9290ms 33.94%  image/jpeg.(*decoder).reconstructBlock
    4030ms 14.72% 53.45%     4030ms 14.72%  image/jpeg.idct
    3310ms 12.09% 65.55%    20310ms 74.21%  image/jpeg.(*decoder).processSOS
    3180ms 11.62% 77.16%     4790ms 17.50%  image/jpeg.(*decoder).decodeHuffman
    2510ms  9.17% 86.34%     2640ms  9.65%  image/jpeg.(*decoder).receiveExtend
     920ms  3.36% 89.70%     1740ms  6.36%  image/jpeg.(*decoder).ensureNBits

@agnivade
Copy link
Contributor

agnivade commented Nov 8, 2018

Thanks for creating the benchmarks !

This wasn't a scientific test,

We use benchstat for that. Could you please run each benchmark in a quiet machine with browsers shutdown with -count=10, and then compare them with benchstat and post the results ?

@rburchell
Copy link

Sorry, no - I don't have any more time to spend on this. I posted the results on the README, but they are quite easy to run yourself. The benchmarks show the same trend as my original (application) numbers - loading using stb_image via cgo takes around half the time.

@agnivade
Copy link
Contributor

I took a look into it. The majority of the improvements in the C version come from SIMD(SSE2) support. If I disable SIMD and then run the benchmarks, the C version runs much slower. Although the Go version is still slow compared to it.

SIMD --
14:52:02-agniva-~/play/go/src/github.com/rburchell/imgtestcase$go test -run=xxx -bench=.
goos: linux
goarch: amd64
pkg: github.com/rburchell/imgtestcase
BenchmarkLoadImageOneSTB-4   	      20	  82806154 ns/op
BenchmarkLoadImageOneGo-4    	      10	 195602342 ns/op	 235.85 MB/s
BenchmarkLoadImageTwoSTB-4   	      10	 131362504 ns/op
BenchmarkLoadImageTwoGo-4    	       3	 342820337 ns/op	 117.59 MB/s
PASS
ok  	github.com/rburchell/imgtestcase	8.409s

NO-SIMD--
14:52:31-agniva-~/play/go/src/github.com/rburchell/imgtestcase$go test -run=xxx -bench=.
goos: linux
goarch: amd64
pkg: github.com/rburchell/imgtestcase
BenchmarkLoadImageOneSTB-4   	      10	 162488211 ns/op
BenchmarkLoadImageOneGo-4    	      10	 195514068 ns/op	 235.96 MB/s
BenchmarkLoadImageTwoSTB-4   	       5	 204837289 ns/op
BenchmarkLoadImageTwoGo-4    	       3	 343425094 ns/op	 117.38 MB/s
PASS
ok  	github.com/rburchell/imgtestcase	7.246s

Other than that, there are some BCE improvements that can be done in idct and fdct. But that barely gives improvement of 40-50ns.

I don't see any major algorithm improvements that can be applied. All math operations like huffman decoding, idct etc. have their fast paths, which is the same as the C version.

There may be some logic tweaks that can be done in the top time-taking functions like

    1070ms 13.39% 44.56%     7000ms 87.61%  image/jpeg.(*decoder).processSOS
    |------>1100ms 13.77% 31.16%     1360ms 17.02%  image/jpeg.(*decoder).refineNonZeroes
    |------>870ms 10.89% 55.44%     1720ms 21.53%  image/jpeg.(*decoder).reconstructBlock
    |------>860ms 10.76% 66.21%      900ms 11.26%  image/jpeg.(*decoder).receiveExtend
    |------>			 26.66%  image/jpeg.(*decoder).decodeHuffman

Anybody is welcome to investigate and send CLs.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/167417 mentions this issue: image/jpeg: reduce bound checks from idct and fdct

gopherbot pushed a commit that referenced this issue Apr 2, 2019
Before -
$gotip build -gcflags="-d=ssa/check_bce/debug=1" fdct.go idct.go
./fdct.go:89:10: Found IsInBounds
./fdct.go:90:10: Found IsInBounds
./fdct.go:91:10: Found IsInBounds
./fdct.go:92:10: Found IsInBounds
./fdct.go:93:10: Found IsInBounds
./fdct.go:94:10: Found IsInBounds
./fdct.go:95:10: Found IsInBounds
./fdct.go:96:10: Found IsInBounds
./idct.go:77:9: Found IsInBounds
./idct.go:77:27: Found IsInBounds
./idct.go:77:45: Found IsInBounds
./idct.go:78:7: Found IsInBounds
./idct.go:78:25: Found IsInBounds
./idct.go:78:43: Found IsInBounds
./idct.go:78:61: Found IsInBounds
./idct.go:79:13: Found IsInBounds
./idct.go:92:13: Found IsInBounds
./idct.go:93:12: Found IsInBounds
./idct.go:94:12: Found IsInBounds
./idct.go:95:12: Found IsInBounds
./idct.go:97:12: Found IsInBounds
./idct.go:98:12: Found IsInBounds
./idct.go:99:12: Found IsInBounds

After -
$gotip build -gcflags="-d=ssa/check_bce/debug=1" fdct.go idct.go
./fdct.go:90:9: Found IsSliceInBounds
./idct.go:76:11: Found IsSliceInBounds
./idct.go:145:11: Found IsSliceInBounds

name                 old time/op    new time/op    delta
FDCT-4                 1.85µs ± 2%    1.74µs ± 1%  -5.95%  (p=0.000 n=10+10)
IDCT-4                 1.94µs ± 2%    1.89µs ± 1%  -2.67%  (p=0.000 n=10+9)
DecodeBaseline-4       1.45ms ± 2%    1.46ms ± 1%    ~     (p=0.156 n=9+10)
DecodeProgressive-4    2.21ms ± 1%    2.21ms ± 1%    ~     (p=0.796 n=10+10)
EncodeRGBA-4           24.9ms ± 1%    25.0ms ± 1%    ~     (p=0.075 n=10+10)
EncodeYCbCr-4          26.1ms ± 1%    26.2ms ± 1%    ~     (p=0.573 n=8+10)

name                 old speed      new speed      delta
DecodeBaseline-4     42.5MB/s ± 2%  42.4MB/s ± 1%    ~     (p=0.162 n=9+10)
DecodeProgressive-4  27.9MB/s ± 1%  27.9MB/s ± 1%    ~     (p=0.796 n=10+10)
EncodeRGBA-4         49.4MB/s ± 1%  49.1MB/s ± 1%    ~     (p=0.066 n=10+10)
EncodeYCbCr-4        35.3MB/s ± 1%  35.2MB/s ± 1%    ~     (p=0.586 n=8+10)

name                 old alloc/op   new alloc/op   delta
DecodeBaseline-4       63.0kB ± 0%    63.0kB ± 0%    ~     (all equal)
DecodeProgressive-4     260kB ± 0%     260kB ± 0%    ~     (all equal)
EncodeRGBA-4           4.40kB ± 0%    4.40kB ± 0%    ~     (all equal)
EncodeYCbCr-4          4.40kB ± 0%    4.40kB ± 0%    ~     (all equal)

name                 old allocs/op  new allocs/op  delta
DecodeBaseline-4         5.00 ± 0%      5.00 ± 0%    ~     (all equal)
DecodeProgressive-4      13.0 ± 0%      13.0 ± 0%    ~     (all equal)
EncodeRGBA-4             4.00 ± 0%      4.00 ± 0%    ~     (all equal)
EncodeYCbCr-4            4.00 ± 0%      4.00 ± 0%    ~     (all equal)

Updates #24499

Change-Id: I6828d077b851817503a7c1a08235763f81bdadf9
Reviewed-on: https://go-review.googlesource.com/c/go/+/167417
Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
@smasher164
Copy link
Member

What is the assembly policy for the image/** packages? I'm working on an SSE2 implementation of the IDCT, but I'm wondering if there's a reason there are no .s files in image/**.

@agnivade
Copy link
Contributor

We usually stick to pure Go in high level packages like image/. In general, assembly is limited to crypto, math and highly specialized bytes and strings functions. Also see https://github.com/golang/go/wiki/AssemblyPolicy.

@nigeltao can say whether adding SIMD assembly is apt or not in image/.

@nigeltao
Copy link
Contributor

First, the AssemblyPolicy (for Go packages overall) applies equally well to image/**.

Amongst other things, standard library code is often read by people learning Go, so for that code, we favor simplicity and readability over raw performance more than what other Go packages choose. Neither position is wrong, just a different trade-off.

If adding a small amount of SIMD assembly to the standard library gets you a 1.5x benchmark improvement, then I'd probably take it. If adding a large amount of SIMD assembly gets you a 1.05x benchmark improvement, then I'd probably reject it.

What "small" and "large" means is subjective. It's hard to say without a specific SIMD code listing.

Note also that, when I say benchmarks, I'm primarily concerned about overall decode / encode benchmarks, not just FDCT / IDCT benchmarks. Users want to decode a JPEG image, they don't want to run IDCTs directly.

For example, https://go-review.googlesource.com/c/go/+/167417/2//COMMIT_MSG shows a 1.03x benchmark improvement to IDCTs, but no significant change to the overall decode benchmarks. If that change was a SIMD assembly change, I'd reject it as being of too small a benefit, compared to the costs of supporting assembly.

gqgs added a commit to gqgs/miru that referenced this issue Apr 26, 2020
benchmark            old ns/op     new ns/op     delta
Benchmark_Load-8     49097440      10258184      -79.11%

benchmark            old allocs     new allocs     delta
Benchmark_Load-8     616333         55             -99.99%

benchmark            old bytes     new bytes     delta
Benchmark_Load-8     2937995       1013548       -65.50%

Ref:
golang/go#24499
golang/go#15759
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

10 participants