huff0: Assembler improvements #736

Merged (1 commit) on Jan 9, 2023

Commits on Jan 8, 2023

  1. huff0: Assembler improvements

    Main changes:
    
    * Compute out[id * dstEvery + i] statically. This shaves four
      instructions off the main loops. (It also frees up a register.)
    
    * Track "exhausted" by addition instead of OR. This gets rid of an
      additional instruction. The variable is now also zeroed inside the
      loop as a dependency hint.
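As a rough illustration of the first change, the Python sketch below models the 4X decoder's output layout, where stream `id` writes its `i`-th symbol to `out[id * dstEvery + i]`. It assumes, as the expression suggests, that `id` is the stream index (0 to 3) and `dstEvery` the number of output bytes per stream, and that the four streams are already decoded and equally long. The helper names `write_naive` and `write_hoisted` are hypothetical and not part of huff0. The hoisted version computes the per-stream offsets once, before the loop, which is the loop-invariant work that computing the address statically removes from the loop body.

```python
def write_naive(out, streams, dst_every):
    """Store symbol i of stream id at out[id * dst_every + i],
    recomputing id * dst_every for every symbol."""
    for i in range(dst_every):
        for stream_id in range(4):
            out[stream_id * dst_every + i] = streams[stream_id][i]


def write_hoisted(out, streams, dst_every):
    """Same layout, but the per-stream offsets are computed once,
    outside the loop, so the body only adds i to fixed offsets."""
    off1, off2, off3 = dst_every, 2 * dst_every, 3 * dst_every
    s0, s1, s2, s3 = streams
    for i in range(dst_every):
        out[i] = s0[i]
        out[off1 + i] = s1[i]
        out[off2 + i] = s2[i]
        out[off3 + i] = s3[i]


dst_every = 3
streams = [b"abc", b"def", b"ghi", b"jkl"]
a = bytearray(4 * dst_every)
b = bytearray(4 * dst_every)
write_naive(a, streams, dst_every)
write_hoisted(b, streams, dst_every)
print(bytes(a), a == b)  # b'abcdefghijkl' True
```

Both functions produce the same buffer; the difference the commit targets is only how many instructions the address calculation costs inside the hot loop, which a Python model cannot show.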
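The second change swaps a logical OR for an addition. The two are interchangeable for this purpose: summing four 0/1 flags gives at most 4, and the sum is zero exactly when every flag is zero, so "sum != 0" answers the same question as "any flag set". The sketch below checks this over all 16 flag combinations. The per-stream test `remaining < threshold` in `any_stream_exhausted` is a hypothetical stand-in, not the condition the assembly actually uses.

```python
from itertools import product


def exhausted_or(flags):
    """Original formulation: OR the per-stream flags together."""
    exhausted = False
    for f in flags:
        exhausted = exhausted or f
    return exhausted


def exhausted_add(flags):
    """New formulation: add the 0/1 flags and test for non-zero."""
    # Starts from zero on every call, loosely mirroring the note that
    # the variable is now zeroed inside the loop.
    exhausted = 0
    for f in flags:
        exhausted += int(f)
    return exhausted != 0


def any_stream_exhausted(remaining, threshold):
    """Hypothetical loop check: is any stream's remaining input below threshold?"""
    return exhausted_add(r < threshold for r in remaining)


# Exhaustive check over all 16 combinations of four stream flags.
for flags in product([False, True], repeat=4):
    assert exhausted_or(flags) == exhausted_add(flags)
print("OR and ADD agree on all 16 cases")
```

The saving in the commit is an instruction at the machine level; the sketch only demonstrates that the counting formulation gives the same answer as the OR formulation.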
    
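    Benchmark results show small speedups on some datasets. The gains are
    concentrated in the 4X decoders (up to +5.57%); a few benchmarks, mostly
    1X, regress by less than 1%, and the geometric mean improves by 0.96%: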
    
    name                                           old speed      new speed      delta
    Decompress1XTable/digits-8                      350MB/s ± 0%   350MB/s ± 1%    ~     (p=0.764 n=10+9)
    Decompress1XTable/gettysburg-8                  270MB/s ± 1%   268MB/s ± 1%  -0.72%  (p=0.001 n=10+10)
    Decompress1XTable/twain-8                       329MB/s ± 1%   328MB/s ± 0%    ~     (p=0.035 n=10+9)
    Decompress1XTable/low-ent.10k-8                 387MB/s ± 1%   386MB/s ± 0%    ~     (p=0.027 n=10+8)
    Decompress1XTable/superlow-ent-10k-8            377MB/s ± 0%   375MB/s ± 0%  -0.48%  (p=0.000 n=10+10)
    Decompress1XTable/crash2-8                     17.0MB/s ± 0%  16.9MB/s ± 0%  -0.36%  (p=0.004 n=9+10)
    Decompress1XTable/endzerobits-8                53.3MB/s ± 0%  53.0MB/s ± 0%  -0.55%  (p=0.000 n=10+9)
    Decompress1XTable/endnonzero-8                 11.3MB/s ± 0%  11.3MB/s ± 1%    ~     (p=0.060 n=10+10)
    Decompress1XTable/case1-8                      22.0MB/s ± 0%  21.9MB/s ± 1%    ~     (p=0.015 n=9+9)
    Decompress1XTable/case2-8                      18.1MB/s ± 1%  18.1MB/s ± 1%    ~     (p=0.202 n=10+9)
    Decompress1XTable/case3-8                      19.1MB/s ± 1%  19.2MB/s ± 1%    ~     (p=0.056 n=9+10)
    Decompress1XTable/pngdata.001-8                 374MB/s ± 0%   374MB/s ± 0%    ~     (p=0.148 n=10+10)
    Decompress1XTable/normcount2-8                 54.4MB/s ± 1%  54.4MB/s ± 1%    ~     (p=0.617 n=10+10)
    Decompress1XNoTable/digits/100-8                280MB/s ± 0%   280MB/s ± 1%    ~     (p=0.951 n=9+10)
    Decompress1XNoTable/digits/10000-8              366MB/s ± 1%   367MB/s ± 0%    ~     (p=0.090 n=10+9)
    Decompress1XNoTable/digits/262143-8             348MB/s ± 1%   349MB/s ± 0%    ~     (p=0.043 n=10+10)
    Decompress1XNoTable/gettysburg/100-8            276MB/s ± 0%   277MB/s ± 1%  +0.44%  (p=0.009 n=10+10)
    Decompress1XNoTable/gettysburg/10000-8          363MB/s ± 1%   363MB/s ± 0%    ~     (p=0.041 n=10+7)
    Decompress1XNoTable/gettysburg/262143-8         349MB/s ± 1%   350MB/s ± 0%    ~     (p=0.123 n=10+10)
    Decompress1XNoTable/twain/100-8                 267MB/s ± 0%   268MB/s ± 0%    ~     (p=0.052 n=10+10)
    Decompress1XNoTable/twain/10000-8               357MB/s ± 3%   363MB/s ± 0%  +1.74%  (p=0.000 n=10+10)
    Decompress1XNoTable/twain/262143-8              320MB/s ± 2%   329MB/s ± 0%  +3.09%  (p=0.000 n=10+10)
    Decompress1XNoTable/low-ent.10k/100-8           183MB/s ± 1%   184MB/s ± 0%    ~     (p=0.211 n=9+10)
    Decompress1XNoTable/low-ent.10k/10000-8         377MB/s ± 3%   385MB/s ± 1%  +2.14%  (p=0.000 n=10+10)
    Decompress1XNoTable/low-ent.10k/262143-8        386MB/s ± 1%   389MB/s ± 1%  +0.84%  (p=0.005 n=10+10)
    Decompress1XNoTable/superlow-ent-10k/262143-8   382MB/s ± 2%   389MB/s ± 1%  +1.89%  (p=0.001 n=10+10)
    Decompress1XNoTable/crash2/100-8                276MB/s ± 2%   278MB/s ± 0%    ~     (p=0.180 n=10+8)
    Decompress1XNoTable/crash2/10000-8              373MB/s ± 1%   374MB/s ± 1%    ~     (p=0.315 n=10+10)
    Decompress1XNoTable/crash2/262143-8             373MB/s ± 1%   375MB/s ± 0%    ~     (p=0.165 n=10+8)
    Decompress1XNoTable/endzerobits/100-8           184MB/s ± 0%   184MB/s ± 1%    ~     (p=0.845 n=9+9)
    Decompress1XNoTable/endzerobits/10000-8         384MB/s ± 1%   386MB/s ± 0%  +0.61%  (p=0.007 n=10+10)
    Decompress1XNoTable/endzerobits/262143-8        387MB/s ± 2%   389MB/s ± 0%    ~     (p=0.963 n=9+8)
    Decompress1XNoTable/endnonzero/100-8            181MB/s ± 2%   183MB/s ± 0%    ~     (p=0.017 n=9+10)
    Decompress1XNoTable/endnonzero/10000-8          385MB/s ± 0%   382MB/s ± 1%  -0.88%  (p=0.001 n=8+10)
    Decompress1XNoTable/endnonzero/262143-8         387MB/s ± 1%   385MB/s ± 2%    ~     (p=0.143 n=10+10)
    Decompress1XNoTable/case1/100-8                 278MB/s ± 2%   282MB/s ± 1%    ~     (p=0.013 n=10+9)
    Decompress1XNoTable/case1/10000-8               373MB/s ± 1%   373MB/s ± 0%    ~     (p=0.274 n=10+8)
    Decompress1XNoTable/case1/262143-8              374MB/s ± 1%   374MB/s ± 0%    ~     (p=0.589 n=10+9)
    Decompress1XNoTable/case2/100-8                 274MB/s ± 0%   274MB/s ± 0%  -0.26%  (p=0.002 n=10+9)
    Decompress1XNoTable/case2/10000-8               378MB/s ± 0%   377MB/s ± 0%    ~     (p=0.093 n=10+10)
    Decompress1XNoTable/case2/262143-8              377MB/s ± 1%   376MB/s ± 1%    ~     (p=0.225 n=10+10)
    Decompress1XNoTable/case3/100-8                 266MB/s ± 0%   265MB/s ± 0%  -0.20%  (p=0.007 n=10+9)
    Decompress1XNoTable/case3/10000-8               371MB/s ± 0%   372MB/s ± 0%    ~     (p=0.211 n=10+9)
    Decompress1XNoTable/case3/262143-8              373MB/s ± 0%   374MB/s ± 0%    ~     (p=0.073 n=10+10)
    Decompress1XNoTable/pngdata.001/100-8           239MB/s ± 0%   239MB/s ± 0%    ~     (p=0.889 n=9+10)
    Decompress1XNoTable/pngdata.001/10000-8         384MB/s ± 0%   384MB/s ± 0%    ~     (p=0.228 n=10+8)
    Decompress1XNoTable/pngdata.001/262143-8        377MB/s ± 0%   379MB/s ± 0%  +0.56%  (p=0.000 n=10+10)
    Decompress1XNoTable/normcount2/100-8            281MB/s ± 1%   282MB/s ± 1%    ~     (p=0.015 n=10+10)
    Decompress1XNoTable/normcount2/10000-8          368MB/s ± 0%   370MB/s ± 0%  +0.37%  (p=0.004 n=10+10)
    Decompress1XNoTable/normcount2/262143-8         371MB/s ± 0%   371MB/s ± 0%    ~     (p=0.034 n=8+10)
    Decompress4XNoTable/digits/100-8                200MB/s ± 1%   201MB/s ± 0%    ~     (p=0.274 n=8+10)
    Decompress4XNoTable/digits/10000-8              603MB/s ± 0%   622MB/s ± 1%  +3.20%  (p=0.000 n=8+10)
    Decompress4XNoTable/digits/262143-8             578MB/s ± 0%   595MB/s ± 1%  +2.87%  (p=0.000 n=8+10)
    Decompress4XNoTable/gettysburg/100-8            260MB/s ± 0%   260MB/s ± 1%    ~     (p=0.011 n=8+10)
    Decompress4XNoTable/gettysburg/10000-8          643MB/s ± 0%   657MB/s ± 1%  +2.19%  (p=0.000 n=10+9)
    Decompress4XNoTable/gettysburg/262143-8         572MB/s ± 0%   589MB/s ± 0%  +2.93%  (p=0.000 n=8+10)
    Decompress4XNoTable/twain/100-8                 206MB/s ± 1%   206MB/s ± 1%    ~     (p=0.436 n=10+10)
    Decompress4XNoTable/twain/10000-8               639MB/s ± 1%   653MB/s ± 1%  +2.25%  (p=0.000 n=10+10)
    Decompress4XNoTable/twain/262143-8              516MB/s ± 0%   522MB/s ± 1%  +1.09%  (p=0.004 n=10+10)
    Decompress4XNoTable/low-ent.10k/100-8           207MB/s ± 1%   207MB/s ± 0%    ~     (p=1.000 n=10+9)
    Decompress4XNoTable/low-ent.10k/10000-8         631MB/s ± 0%   653MB/s ± 0%  +3.42%  (p=0.000 n=10+9)
    Decompress4XNoTable/low-ent.10k/262143-8        685MB/s ± 1%   696MB/s ± 0%  +1.61%  (p=0.000 n=10+10)
    Decompress4XNoTable/superlow-ent-10k/262143-8   684MB/s ± 1%   695MB/s ± 1%  +1.51%  (p=0.000 n=9+10)
    Decompress4XNoTable/case1/100-8                 208MB/s ± 1%   207MB/s ± 0%    ~     (p=0.353 n=10+10)
    Decompress4XNoTable/case1/10000-8               601MB/s ± 0%   621MB/s ± 1%  +3.22%  (p=0.000 n=10+10)
    Decompress4XNoTable/case1/262143-8              613MB/s ± 1%   632MB/s ± 0%  +3.14%  (p=0.000 n=10+10)
    Decompress4XNoTable/case2/100-8                 210MB/s ± 2%   208MB/s ± 2%    ~     (p=0.315 n=10+9)
    Decompress4XNoTable/case2/10000-8               618MB/s ± 0%   636MB/s ± 0%  +2.95%  (p=0.000 n=10+10)
    Decompress4XNoTable/case2/262143-8              635MB/s ± 0%   651MB/s ± 0%  +2.56%  (p=0.000 n=7+10)
    Decompress4XNoTable/case3/100-8                 199MB/s ± 1%   200MB/s ± 1%    ~     (p=0.055 n=10+10)
    Decompress4XNoTable/case3/10000-8               615MB/s ± 0%   633MB/s ± 1%  +2.94%  (p=0.000 n=10+10)
    Decompress4XNoTable/case3/262143-8              620MB/s ± 0%   639MB/s ± 1%  +3.00%  (p=0.000 n=10+10)
    Decompress4XNoTable/pngdata.001/100-8           212MB/s ± 0%   211MB/s ± 1%    ~     (p=0.211 n=10+9)
    Decompress4XNoTable/pngdata.001/10000-8         649MB/s ± 0%   667MB/s ± 1%  +2.76%  (p=0.000 n=10+10)
    Decompress4XNoTable/pngdata.001/262143-8        646MB/s ± 0%   660MB/s ± 0%  +2.28%  (p=0.000 n=9+10)
    Decompress4XNoTable/normcount2/100-8            261MB/s ± 1%   262MB/s ± 1%    ~     (p=0.031 n=9+9)
    Decompress4XNoTable/normcount2/10000-8          589MB/s ± 1%   613MB/s ± 0%  +3.99%  (p=0.000 n=10+9)
    Decompress4XNoTable/normcount2/262143-8         585MB/s ± 3%   617MB/s ± 1%  +5.57%  (p=0.000 n=10+10)
    Decompress4XNoTableTableLog8/digits-8           579MB/s ± 2%   610MB/s ± 0%  +5.33%  (p=0.000 n=10+10)
    Decompress4XTable/digits-8                      584MB/s ± 1%   607MB/s ± 1%  +3.89%  (p=0.000 n=10+10)
    Decompress4XTable/gettysburg-8                  370MB/s ± 0%   373MB/s ± 1%  +0.67%  (p=0.009 n=10+10)
    Decompress4XTable/twain-8                       512MB/s ± 2%   523MB/s ± 1%  +2.08%  (p=0.000 n=9+10)
    Decompress4XTable/low-ent.10k-8                 656MB/s ± 1%   677MB/s ± 1%  +3.21%  (p=0.000 n=10+10)
    Decompress4XTable/superlow-ent-10k-8            603MB/s ± 4%   626MB/s ± 1%  +3.91%  (p=0.000 n=9+10)
    Decompress4XTable/case1-8                      21.1MB/s ± 0%  21.0MB/s ± 0%  -0.55%  (p=0.000 n=9+9)
    Decompress4XTable/case2-8                      17.6MB/s ± 0%  17.6MB/s ± 1%    ~     (p=0.736 n=9+10)
    Decompress4XTable/case3-8                      18.7MB/s ± 1%  18.7MB/s ± 1%    ~     (p=0.642 n=10+10)
    Decompress4XTable/pngdata.001-8                 648MB/s ± 0%   657MB/s ± 0%  +1.50%  (p=0.000 n=10+8)
    Decompress4XTable/normcount2-8                 49.7MB/s ± 1%  49.7MB/s ± 1%    ~     (p=0.839 n=10+10)
    [Geo mean]                                      271MB/s        274MB/s       +0.96%
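The delta column and the `[Geo mean]` row are summary arithmetic from benchstat. A minimal sketch, assuming a row's delta is the relative change of the new mean over the old mean and the geo-mean row is the geometric mean of each column's per-benchmark means, recomputes both for three rows taken from the table. Because the table prints rounded speeds while benchstat works from the unrounded samples, recomputed deltas differ slightly from the printed ones (585 to 617 MB/s gives about +5.47% rather than the printed +5.57%).

```python
import math


def delta_percent(old, new):
    """Relative change of new over old, in percent."""
    return (new / old - 1.0) * 100.0


def geomean(values):
    """Geometric mean: exp of the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# (old MB/s, new MB/s) for three rows of the table, as printed (rounded).
rows = {
    "Decompress4XNoTable/normcount2/262143-8": (585, 617),
    "Decompress4XTable/digits-8": (584, 607),
    "Decompress1XTable/gettysburg-8": (270, 268),
}

for name, (old, new) in rows.items():
    print(f"{name}: {delta_percent(old, new):+.2f}%")

old_gm = geomean([o for o, _ in rows.values()])
new_gm = geomean([n for _, n in rows.values()])
print(f"geo mean: {old_gm:.0f} -> {new_gm:.0f} MB/s ({delta_percent(old_gm, new_gm):+.2f}%)")
```

A geometric mean suits throughput ratios because the ratio of the two columns' geometric means equals the geometric mean of the per-benchmark ratios, so a few very fast benchmarks do not dominate the summary the way they would in an arithmetic mean.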
    greatroar committed Jan 8, 2023
    3683ed4