zstd: x86 assembler implementation of sequenceDecs.decode #528

Merged (6 commits) on Mar 17, 2022

WojciechMula
Contributor

@WojciechMula WojciechMula commented Mar 10, 2022

This PR adds plain x86 and x86-with-BMI2 implementations of sequenceDecs.decode. Part of #515.

Since the benchmarks use decodeSync, I temporarily replaced its implementation with one that calls decode and then execute, at the cost of allocating the seqVals array every time.

There are some (IMHO) nice improvements and small regressions in a few cases. From previous experience I can tell that we'll get quite a big speedup when we rewrite execute. And of course we'll get the biggest speedup when we fuse decode and execute into a single procedure.

Marking the PR as a draft because one test, TestNewDecoderBad/Reader-4/6f88497edbc9059998f9e6d0ea0d0eed8d8af38d.zst, fails. I have to investigate why. [fixed]

Below are benchmarks.

  • old.txt was produced by the command go generate && go test -tags noasm -run XYZ -bench BenchmarkDecoder.
  • new.txt was produced by the command go generate && go test -run XYZ -bench BenchmarkDecoder.
  • new-bmi2.txt was produced by the command go generate && GOAMD64=v3 go test -run XYZ -bench BenchmarkDecoder.

Comparison of old.txt with new.txt

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            237.60       238.72       1.00x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        764.22       868.49       1.14x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         192.39       197.70       1.03x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           226.37       233.74       1.03x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         211.46       216.24       1.02x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          189.03       190.01       1.01x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1743.29      1951.14      1.12x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3079.05      3309.04      1.07x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       6696.56      7926.76      1.18x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             337.14       365.98       1.09x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 613.59       687.53       1.12x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        345.78       374.54       1.08x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               244.27       241.55       0.99x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           785.09       912.82       1.16x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            200.18       203.32       1.02x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              237.47       239.27       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            211.75       214.06       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             190.52       188.88       0.99x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1397.55      1488.07      1.06x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          3428.09      3716.01      1.08x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          9548.90      10887.83     1.14x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                356.48       386.03       1.08x
BenchmarkDecoder_DecodeAll/html.zst-16                                    598.01       666.57       1.11x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           333.55       364.23       1.09x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      208.57       222.10       1.06x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      206.44       206.26       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       219.64       224.09       1.02x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         215.01       212.89       0.99x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9564.96      10855.81     1.13x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          230.88       242.77       1.05x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           300.04       357.40       1.19x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             481.01       629.48       1.31x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1153.28      1167.83      1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1178.62      1197.84      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1061.04      1085.91      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 431.41       434.83       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 260.26       288.78       1.11x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 178.69       182.34       1.02x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  178.37       183.01       1.03x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    163.05       166.78       1.02x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       379.84       413.38       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       364.05       398.18       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        396.48       440.21       1.11x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          346.17       375.39       1.08x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9561.29      10873.23     1.14x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         224.96       240.51       1.07x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          303.94       364.00       1.20x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            476.76       635.21       1.33x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1602.64      1693.12      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1470.20      1556.52      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1781.10      1894.80      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1542.00      1661.43      1.08x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9556.32      10879.80     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9571.19      10882.04     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9572.19      10887.11     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9566.54      10886.24     1.14x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     1343.71      1607.61      1.20x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     1311.12      1407.07      1.07x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      1401.88      1587.61      1.13x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        1402.53      1484.78      1.06x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         66679.48     94849.27     1.42x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         1306.06      1502.06      1.15x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          1927.42      2336.68      1.21x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            3472.36      4863.07      1.40x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             6276.41      6383.51      1.02x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5490.46      5771.14      1.05x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              6008.30      6052.66      1.01x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                3799.77      3895.76      1.03x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1412.04      1441.62      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                962.89       949.33       0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 984.52       969.09       0.98x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   801.71       795.11       0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      1738.74      1974.58      1.14x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      1633.98      1841.22      1.13x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       1817.99      2012.59      1.11x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         1717.94      1874.86      1.09x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        66526.60     96359.49     1.45x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        1268.55      1490.20      1.17x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         1947.34      2373.92      1.22x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           3458.55      4850.24      1.40x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   8243.76      8724.07      1.06x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   8197.25      8948.34      1.09x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    9020.42      9939.28      1.10x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      10641.45     11529.73     1.08x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    66560.21     95518.08     1.44x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    66587.20     94626.59     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     66651.43     94356.64     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       66512.25     95444.30     1.43x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       1466.81      1604.80      1.09x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   4831.35      5497.25      1.14x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1215.11      1374.36      1.13x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      1464.34      1623.73      1.11x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1240.22      1406.06      1.13x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1094.77      1206.81      1.10x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        10029.54     11377.73     1.13x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  19167.95     22324.31     1.16x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  66383.12     95910.86     1.44x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        2687.11      3090.49      1.15x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            3748.35      4307.61      1.15x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1959.71      2145.97      1.10x
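
For readers checking the table, the speedup column appears to be the plain ratio of new to old throughput, rounded to two decimals. The sketch below reproduces two rows under that assumption; the helper name `speedup` is illustrative and not part of any tool.

```go
package main

import "fmt"

// speedup formats the ratio new/old the way the table's last column shows it.
// Assumption: the column is a simple throughput ratio rounded to two decimals.
func speedup(oldMBs, newMBs float64) string {
	return fmt.Sprintf("%.2fx", newMBs/oldMBs)
}

func main() {
	// Values copied from the DecoderSmall rows above.
	fmt.Println(speedup(237.60, 238.72)) // kppkn.gtb.zst
	fmt.Println(speedup(764.22, 868.49)) // geo.protodata.zst
}
```

Running it prints `1.00x` and `1.14x`, matching the corresponding rows.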

Comparison of old.txt with new-bmi2.txt

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            237.60       238.72       1.00x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        764.22       868.49       1.14x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         192.39       197.70       1.03x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           226.37       233.74       1.03x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         211.46       216.24       1.02x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          189.03       190.01       1.01x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1743.29      1951.14      1.12x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3079.05      3309.04      1.07x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       6696.56      7926.76      1.18x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             337.14       365.98       1.09x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 613.59       687.53       1.12x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        345.78       374.54       1.08x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               244.27       241.55       0.99x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           785.09       912.82       1.16x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            200.18       203.32       1.02x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              237.47       239.27       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            211.75       214.06       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             190.52       188.88       0.99x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1397.55      1488.07      1.06x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          3428.09      3716.01      1.08x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          9548.90      10887.83     1.14x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                356.48       386.03       1.08x
BenchmarkDecoder_DecodeAll/html.zst-16                                    598.01       666.57       1.11x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           333.55       364.23       1.09x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      208.57       222.10       1.06x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      206.44       206.26       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       219.64       224.09       1.02x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         215.01       212.89       0.99x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9564.96      10855.81     1.13x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          230.88       242.77       1.05x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           300.04       357.40       1.19x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             481.01       629.48       1.31x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1153.28      1167.83      1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1178.62      1197.84      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1061.04      1085.91      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 431.41       434.83       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 260.26       288.78       1.11x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 178.69       182.34       1.02x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  178.37       183.01       1.03x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    163.05       166.78       1.02x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       379.84       413.38       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       364.05       398.18       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        396.48       440.21       1.11x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          346.17       375.39       1.08x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9561.29      10873.23     1.14x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         224.96       240.51       1.07x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          303.94       364.00       1.20x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            476.76       635.21       1.33x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1602.64      1693.12      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1470.20      1556.52      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1781.10      1894.80      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1542.00      1661.43      1.08x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9556.32      10879.80     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9571.19      10882.04     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9572.19      10887.11     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9566.54      10886.24     1.14x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     1343.71      1607.61      1.20x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     1311.12      1407.07      1.07x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      1401.88      1587.61      1.13x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        1402.53      1484.78      1.06x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         66679.48     94849.27     1.42x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         1306.06      1502.06      1.15x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          1927.42      2336.68      1.21x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            3472.36      4863.07      1.40x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             6276.41      6383.51      1.02x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5490.46      5771.14      1.05x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              6008.30      6052.66      1.01x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                3799.77      3895.76      1.03x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1412.04      1441.62      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                962.89       949.33       0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 984.52       969.09       0.98x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   801.71       795.11       0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      1738.74      1974.58      1.14x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      1633.98      1841.22      1.13x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       1817.99      2012.59      1.11x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         1717.94      1874.86      1.09x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        66526.60     96359.49     1.45x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        1268.55      1490.20      1.17x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         1947.34      2373.92      1.22x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           3458.55      4850.24      1.40x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   8243.76      8724.07      1.06x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   8197.25      8948.34      1.09x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    9020.42      9939.28      1.10x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      10641.45     11529.73     1.08x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    66560.21     95518.08     1.44x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    66587.20     94626.59     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     66651.43     94356.64     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       66512.25     95444.30     1.43x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       1466.81      1604.80      1.09x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   4831.35      5497.25      1.14x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1215.11      1374.36      1.13x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      1464.34      1623.73      1.11x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1240.22      1406.06      1.13x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1094.77      1206.81      1.10x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        10029.54     11377.73     1.13x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  19167.95     22324.31     1.16x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  66383.12     95910.86     1.44x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        2687.11      3090.49      1.15x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            3748.35      4307.61      1.15x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1959.71      2145.97      1.10x

@klauspost
Owner

Looks great! Impressive numbers. Is this comparing to a similar setup, i.e. not using "decodeSync"?

I wouldn't mind a version for vbmi and one without. I guess the difference is measurable.

Yes, I expect "execute" to be much better, especially if we "overallocate" the destination by 7 bytes so that we can do non-overlapping copies in blocks of 8 bytes.
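
The over-allocation idea can be sketched in plain Go. This is only an illustration of the technique, not code from this PR: `appendWide` and `slack` are hypothetical names, and a real implementation would use 8-byte loads and stores in assembly rather than `copy`.

```go
package main

import "fmt"

// slack is the number of spare bytes the caller reserves past the real data,
// so that the last 8-byte store may run over the logical end.
const slack = 7

// appendWide appends src to dst using only whole 8-byte block stores.
// Assumption: cap(dst) >= len(dst)+len(src)+slack, and src does not alias dst.
func appendWide(dst, src []byte) []byte {
	start, n := len(dst), len(src)
	rounded := (n + slack) &^ slack // n rounded up to a multiple of 8
	dst = dst[:start+rounded]       // panics if the slack was not reserved
	i := 0
	for ; i+8 <= n; i += 8 {
		copy(dst[start+i:start+i+8], src[i:i+8])
	}
	if i < n {
		// Tail: still store a full block; the padding lands in the slack area.
		var block [8]byte
		copy(block[:], src[i:])
		copy(dst[start+i:start+i+8], block[:])
	}
	return dst[:start+n] // trim back to the logical length
}

func main() {
	src := []byte("hello, zstd")
	dst := make([]byte, 0, len(src)+slack)
	dst = appendWide(dst, src)
	fmt.Printf("%s\n", dst)
}
```

The point of the slack is that the copy loop never needs a byte-granular tail store: every store has the same width, and the overrun is harmless because it falls in reserved space.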

@WojciechMula
Contributor Author

> Looks great! Impressive numbers. Is this comparing to a similar setup, i.e. not using "decodeSync"?

From what I can gather, there are no benchmarks that use the async API, so I had to hack decodeSync. BTW, a few days ago I compared the new master (after the split) with the code I forked from, and there is a 5-8% regression on Ice Lake.

> Yes, I expect "execute" to be much better, especially if we "overallocate" the destination by 7 bytes so that we can do non-overlapping copies in blocks of 8 bytes.

My experiments showed that we get the biggest speedup if we copy in 32-byte blocks (i.e. using AVX2 registers). However, I optimized just the most common path, i.e. copy from literals plus copy from s.out, and bailed out to the Go code to handle more complex cases.
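
To make the "common path plus fallback" split concrete, here is a minimal Go sketch of sequence execution. It is an assumption-laden illustration, not the library's code: the `seq` type, `execute`, and `errCorrupt` are hypothetical names, and history and dictionary lookups (matches that reach back before `out`) are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
)

// seq is a hypothetical decoded sequence: literal length, match length, match offset.
type seq struct{ ll, ml, mo int }

var errCorrupt = errors.New("corrupt sequence")

// execute applies each sequence: copy ll literals, then copy ml bytes that
// start mo bytes back in the output produced so far.
func execute(out, lits []byte, seqs []seq) ([]byte, error) {
	for _, s := range seqs {
		if s.ll < 0 || s.ml < 0 || s.ll > len(lits) {
			return nil, errCorrupt
		}
		out = append(out, lits[:s.ll]...)
		lits = lits[s.ll:]
		if s.mo <= 0 || s.mo > len(out) {
			return nil, errCorrupt
		}
		start := len(out) - s.mo
		if s.mo >= s.ml {
			// Common path: source and destination do not overlap,
			// so a wide block copy is safe.
			out = append(out, out[start:start+s.ml]...)
			continue
		}
		// Complex path: the match overlaps its own output, so bytes must be
		// produced one at a time to repeat the pattern correctly.
		for i := 0; i < s.ml; i++ {
			out = append(out, out[start+i])
		}
	}
	// Any literals left after the last sequence are appended verbatim.
	return append(out, lits...), nil
}

func main() {
	out, err := execute(nil, []byte("abcd"), []seq{{3, 6, 3}, {1, 3, 10}})
	fmt.Println(string(out), err)
}
```

With the input above, the first sequence exercises the overlapping path and the second the non-overlapping one; the program prints `abcabcabcdabc <nil>`.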

Owner

@klauspost klauspost left a comment

Some quick questions.

(Five review threads on zstd/seqdec_amd64.s, all outdated and resolved.)

@klauspost
Owner

> From what I can gather, there are no benchmarks that use the async API, so I had to hack decodeSync.

Yes, it is only useful for streams, and in that case the execute stage is usually the slowest. When we don't split decode/execute, it is fastest not to write the intermediate values. What I meant was whether "before" uses the old decodeSync or the modified one.

I was actually thinking of making a microbench for decode and execute separately, so they could be tested independently. I can look into that.

> BTW, a few days ago I compared the new master (after the split) with the code I forked from, and there is a 5-8% regression on Ice Lake.

Benchmarks can be really hard to interpret. It could be that it doesn't like this change - but it could also just be something like jump alignment - it can easily vary by that much with seemingly unrelated changes.

@WojciechMula
Contributor Author

WojciechMula commented Mar 10, 2022

> What I meant was whether "before" uses the old decodeSync or the modified one.

The modified one.

> I was actually thinking of making a microbench for decode and execute separately, so they could be tested independently. I can look into that.

That would be great!

> Benchmarks can be really hard to interpret. It could be that it doesn't like this change - but it could also just be something like jump alignment - it can easily vary by that much with seemingly unrelated changes.

Yeah, I know. The Ice Lake machine I use is an AWS one, and sometimes exactly the same code run twice can be slower or faster for no clear reason. So I gave up on micro-optimizations, as I couldn't tell whether differences were due to my changes or the Moon's phase. BTW, speaking of loop alignment, I wanted to enforce it, but it seems that it is disabled right now for the x86 target: https://groups.google.com/g/golang-nuts/c/M86PTw1jl6w.

@WojciechMula WojciechMula marked this pull request as ready for review March 10, 2022 18:50
(Review threads on zstd/seqdec_amd64.s and zstd/seqdec.go, both outdated and resolved.)

@WojciechMula
Contributor Author

The failed CI was due to a race test of flate/deflate_test.go: https://github.com/klauspost/compress/runs/5508309654?check_suite_focus=true, unrelated to this PR.

@klauspost
Owner

> The failed CI was due to a race test of flate/deflate_test.go: https://github.com/klauspost/compress/runs/5508309654?check_suite_focus=true, unrelated to this PR.

Yeah, sometimes we get a very slow machine. This is just a timeout. Restarting the actions.

@klauspost
Owner

klauspost commented Mar 11, 2022

@WojciechMula I have added benchmarks for decode, execute and decodeSync in #530

I hope it doesn't conflict too much. I will merge when tests pass.

@WojciechMula
Contributor Author

> @WojciechMula I have added benchmarks for decode, execute and decodeSync in #530

> I hope it doesn't conflict too much. I will merge when tests pass.

@klauspost Great! I'll rebase onto master once you merge the benchmarks, and then squash the commits.

@klauspost
Owner

Pure AMD64:

benchmark                                                                                          old ns/op     new ns/op     delta
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32            141344        105927        -25.06%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32           146059        104225        -28.64%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32                   132771        97027         -26.92%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32            16573         10860         -34.47%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32            40159         25812         -35.73%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32             98292         69494         -29.30%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                      9979          7078          -29.07%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32         184877        133433        -27.83%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                       58.7          138           +135.21%
Benchmark_seqdec_decode/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                        950           760           -19.91%
Benchmark_seqdec_decode/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                    9333          6521          -30.13%
Benchmark_seqdec_decode/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                       33926         21103         -37.80%
Benchmark_seqdec_decode/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32                  50962         31479         -38.23%
Benchmark_seqdec_decode/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-32                        99586         67130         -32.59%

With BMI:

benchmark                                                                                          old ns/op     new ns/op     delta
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32            141344        92330         -34.68%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32           146059        95747         -34.45%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32                   132771        85915         -35.29%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32            16573         9315          -43.79%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32            40159         22776         -43.29%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32             98292         62976         -35.93%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                      9979          5995          -39.92%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32         184877        117648        -36.36%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                       58.7          141           +139.82%
Benchmark_seqdec_decode/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                        950           678           -28.56%
Benchmark_seqdec_decode/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                    9333          5621          -39.77%
Benchmark_seqdec_decode/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                       33926         18070         -46.74%
Benchmark_seqdec_decode/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32                  50962         27972         -45.11%
Benchmark_seqdec_decode/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-32                        99586         60449         -39.30%

BMI speedup alone:

benchmark                                                                                          old ns/op     new ns/op     delta
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32            105927        92330         -12.84%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32           104225        95747         -8.13%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32                   97027         85915         -11.45%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32            10860         9315          -14.23%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32            25812         22776         -11.76%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32             69494         62976         -9.38%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                      7078          5995          -15.30%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32         133433        117648        -11.83%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                       138           141           +1.96%
Benchmark_seqdec_decode/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                        760           678           -10.80%
Benchmark_seqdec_decode/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                    6521          5621          -13.80%
Benchmark_seqdec_decode/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                       21103         18070         -14.37%
Benchmark_seqdec_decode/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32                  31479         27972         -11.14%
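
For reference, the delta column is the relative change in ns/op between the two runs, so negative values mean the new build is faster. A minimal sketch of that arithmetic follows; `deltaPercent` is a hypothetical helper written for illustration, not part of the benchmark tooling, and the inputs are taken from the first row of the table above.

```go
package main

import "fmt"

// deltaPercent returns the relative change from oldNs to newNs in percent.
// Negative results mean fewer ns/op, i.e. the new measurement is faster.
// Hypothetical helper for illustration only.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// First row of the table above: plain x86 assembly vs. BMI2 assembly.
	fmt.Printf("%+.2f%%\n", deltaPercent(105927, 92330))
}
```

Small rows such as the 2-sequence block are computed from unrounded timings, so recomputing them from the rounded values shown in the table may differ slightly.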

So it seems like it is worth it to have runtime BMI detection.

With only 2 sequences there is a small regression. Good to know, but not worth special code IMO.

@klauspost
Owner

> @klauspost Great! I'll rebase to the master when you merge the benchmarks and then squash commits.

Merged. The conflict should be trivial. Please remove the decodeSync change for now. It looks good and we can move forward from here.

@WojciechMula
Contributor Author

These are the results from Ice Lake (noasm vs. GOAMD64=v3):

benchmark                                                                                          old ns/op     new ns/op     delta
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-16            163287        131585        -19.41%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-16           173628        134655        -22.45%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-16                   145739        131145        -10.01%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-16            17244         13604         -21.11%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-16            39179         31955         -18.44%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-16             113786        86306         -24.15%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-16                      10915         8957          -17.94%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-16         203105        154422        -23.97%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-16                       62.8          106           +69.24%

> So it seems like it is worth it to have runtime BMI detection.

Do you think a simple CPUID invocation just to detect BMI1/BMI2 would be sufficient? Or would you prefer to add your library https://github.com/klauspost/cpuid as a dependency?
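
Either way, the detection result would only need to be consulted once. A rough sketch of how the selection could look, assuming the flags are already known: `cpuFeatures`, `decoder`, `decodeX86`, `decodeBMI2` and `pickDecoder` are all hypothetical names for illustration and not the code in this PR. In real code the two flags would come from a raw CPUID call or from a detection library.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for the detection result,
// whether it comes from a raw CPUID call or from a library.
type cpuFeatures struct {
	BMI1, BMI2 bool
}

// decoder is a hypothetical signature; the real routine works on decoder state.
type decoder func(nSeqs int) string

// decodeX86 and decodeBMI2 stand in for the two assembly variants.
func decodeX86(nSeqs int) string  { return fmt.Sprintf("x86: %d sequences", nSeqs) }
func decodeBMI2(nSeqs int) string { return fmt.Sprintf("bmi2: %d sequences", nSeqs) }

// pickDecoder chooses the implementation once, so the hot path
// carries no per-call feature checks.
func pickDecoder(f cpuFeatures) decoder {
	if f.BMI1 && f.BMI2 {
		return decodeBMI2
	}
	return decodeX86
}

func main() {
	decode := pickDecoder(cpuFeatures{BMI1: true, BMI2: true})
	fmt.Println(decode(2))
}
```

Requiring both flags keeps the sketch conservative; whether the BMI variant actually needs BMI1 as well as BMI2 depends on the instructions it uses.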

@WojciechMula
Contributor Author

@klauspost What do you plan to do with this PR? Merge or amend with your avo implementation?

@klauspost
Owner

@WojciechMula Let's put in the avo version, add CPU tests and remove the decodeSync hack.

@WojciechMula
Contributor Author

> @WojciechMula Let's put in the avo version, add CPU tests and remove the decodeSync hack.

@klauspost OK, I'm doing this right now.

@klauspost
Owner

The failed test can be ignored. I will add a check for it separately.

Owner

@klauspost klauspost left a comment

Great stuff. Do you want to add anything else before I merge?

@WojciechMula
Copy link
Contributor Author

Thanks :) And thank you for such great support. I think there's nothing to add.

@klauspost klauspost merged commit 2d457e5 into klauspost:master Mar 17, 2022
@WojciechMula WojciechMula deleted the asm-seqdec-decode branch March 17, 2022 13:17