
One last improvement for thomaswue #702

Merged: 5 commits merged on Feb 1, 2024

Conversation

thomaswue
Contributor

@thomaswue thomaswue commented Jan 31, 2024

Check List:

  • You have run ./mvnw verify and the project builds successfully
  • Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
  • All formatting changes by the build are committed
  • Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
  • Output matches that of calculate_average_baseline.sh
  • For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (line number)
  • Execution time: 0.31
  • Execution time of reference implementation: 120.37

OK, so I can't let @jerrinot have all the fun alone ;-). I adopted the approach that combines the <8 and 1-16 cases to avoid branch misprediction, which was the biggest improvement, and then also the masking instead of bit shifting, which added another few percent. Added @jerrinot and @abeobk to the credits section. This should now be in the ballpark of @jerrinot's submission, if the results from my machine don't lie. The major differences to his solution are that I am using Result objects for the hash table instead of unsafe access and that I am unrolling 3 times. It also avoids the name length check by having the delimiter in the word mask.
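The "delimiter in the word mask" idea can be sketched as follows; this is a generic SWAR illustration under my own naming, not the exact code from the submission (0x3B is ';', the field delimiter in the challenge input):

```java
// Generic SWAR sketch (illustration, not the PR code): returns a long with the
// high bit set in every byte of `word` that equals ';' (0x3B).
static long semicolonMatchBits(long word) {
    long x = word ^ 0x3B3B3B3B3B3B3B3BL;                 // matching bytes become 0x00
    return (x - 0x0101010101010101L) & ~x & 0x8080808080808080L;
}

// The byte index of the first delimiter then falls out without a length check
// or a branch; returns 8 when the word contains no ';' at all.
static int firstSemicolonIndex(long matchBits) {
    return Long.numberOfTrailingZeros(matchBits) >>> 3;
}
```

For a little-endian word holding the bytes `ab;cdefg`, the match bits flag byte 2 and the index comes out as 2.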

@thomaswue
Contributor Author

@gunnarmorling Added one last code layout change that improves by 1-2%, but that is really it now ;-).

@jerrinot mentioned this pull request Jan 31, 2024
@jerrinot
Contributor

@thomaswue challenge accepted!
also: can we both agree that if you put null bytes into a string then you deserve to suffer? :)

@royvanrijn
Contributor

Wait, now everybody switches to processing 16 bytes? I was experimenting with that more than a week ago haha. 🤦🏻‍♂️

@jerrinot
Contributor

jerrinot commented Feb 1, 2024

@royvanrijn :)))) I have no idea how you managed to do so well while benchmarking on an M2! 🙇 🪄 I found perf results on ARM not to be representative at all.

I included the 16-byte processing in this PR. @mtopolnik had been playing with a similar thing even before. It performed quite well from the start, but 2 major things were holding it back:

  1. Bit-twiddling increased the total instruction count significantly (as visible in the perf stat attached to the PR). This was eventually solved by using @abeobk's lookup tables.
  2. The state was too large and the compiler resorted to stack spilling. @franz1981 and @thomaswue helped me to identify it.
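One form such a lookup table can take (my own illustration, not @abeobk's actual code): instead of deriving a byte mask with shift-heavy bit twiddling, the position of the first delimiter indexes a small precomputed table of masks.

```java
// Illustrative lookup table (assumption, not the actual PR code): KEEP_MASK[p]
// keeps the p low-order bytes of a word, i.e. everything before a delimiter
// found at byte index p; index 8 means "no delimiter in this word".
static final long[] KEEP_MASK = new long[9];
static {
    for (int pos = 0; pos < 8; pos++) {
        KEEP_MASK[pos] = (1L << (8 * pos)) - 1;
    }
    KEEP_MASK[8] = -1L; // keep all 8 bytes
}

// Branch-free truncation of a word at the first delimiter:
static long maskedWord(long word, int firstDelimiterIndex) {
    return word & KEEP_MASK[firstDelimiterIndex];
}
```

A table load replaces a data-dependent shift, which is one way to cut the bit-twiddling instruction count.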

@thomaswue
Contributor Author

Based on the various experiments around 16-byte processing, I also thought it would not help that much. But it seems that for my solution it was the last missing piece. It is amazing how, once other bottlenecks are solved, a certain optimization can suddenly become super critical.

It completely changes the bottlenecks identified by perf stat from a lot of bad speculation:

Performance counter stats for './target/CalculateAverage_thomaswue_image':
      12,061.39 msec task-clock                       #   30.313 CPUs utilized             
              681      context-switches                 #   56.461 /sec                      
               99      cpu-migrations                   #    8.208 /sec                      
          220,203      page-faults                      #   18.257 K/sec                     
   25,250,300,213      cpu_core/cycles/                 #    2.093 GHz                         (93.88%)
   26,877,576,823      cpu_atom/cycles/                 #    2.228 GHz                         (46.36%)
   42,740,848,935      cpu_core/instructions/           #    1.69  insn per cycle              (93.88%)
   57,425,448,002      cpu_atom/instructions/           #    2.27  insn per cycle              (55.29%)
    3,938,912,502      cpu_core/branches/               #  326.572 M/sec                       (93.88%)
    5,271,137,390      cpu_atom/branches/               #  437.026 M/sec                       (55.61%)
      246,089,883      cpu_core/branch-misses/          #    6.25% of all branches             (93.88%)
      329,280,193      cpu_atom/branch-misses/          #    8.36% of all branches             (56.92%)
            TopdownL1 (cpu_core)                 #     12.8 %  tma_backend_bound      
                                                 #     24.9 %  tma_bad_speculation    
                                                 #      7.9 %  tma_frontend_bound     
                                                 #     54.4 %  tma_retiring             (93.88%)
            TopdownL1 (cpu_atom)                 #     39.8 %  tma_bad_speculation      (57.63%)
                                                 #      6.6 %  tma_frontend_bound       (57.99%)
                                                 #     12.9 %  tma_backend_bound      
                                                 #     12.9 %  tma_backend_bound_aux    (58.20%)
                                                 #     39.6 %  tma_retiring             (57.62%)

      0.397893788 seconds time elapsed

      0.000000000 seconds user
      0.006337000 seconds sys

To almost no bad speculation and much higher instructions per cycle and almost no branch misses anymore:

 Performance counter stats for './target/CalculateAverage_thomaswue_image':
             9,778.04 msec task-clock                       #   30.927 CPUs utilized             
               811      context-switches                 #   82.941 /sec                      
                81      cpu-migrations                   #    8.284 /sec                      
           218,792      page-faults                      #   22.376 K/sec                     
    21,654,496,541      cpu_core/cycles/                 #    2.215 GHz                         (88.73%)
    19,474,462,268      cpu_atom/cycles/                 #    1.992 GHz                         (54.36%)
    49,328,285,361      cpu_core/instructions/           #    2.28  insn per cycle              (88.73%)
    68,071,427,732      cpu_atom/instructions/           #    3.14  insn per cycle              (59.45%)
     3,690,652,954      cpu_core/branches/               #  377.443 M/sec                       (88.73%)
     5,032,375,583      cpu_atom/branches/               #  514.661 M/sec                       (59.45%)
        14,111,445      cpu_core/branch-misses/          #    0.38% of all branches             (88.73%)
        19,354,582      cpu_atom/branch-misses/          #    0.52% of all branches             (59.46%)
             TopdownL1 (cpu_core)                 #     18.7 %  tma_backend_bound      
                                                  #      2.2 %  tma_bad_speculation    
                                                  #      5.8 %  tma_frontend_bound     
                                                  #     73.3 %  tma_retiring             (88.73%)
             TopdownL1 (cpu_atom)                 #      4.5 %  tma_bad_speculation      (59.45%)
                                                  #      1.8 %  tma_frontend_bound       (59.45%)
                                                  #     28.7 %  tma_backend_bound      
                                                  #     28.7 %  tma_backend_bound_aux    (59.48%)
                                                  #     66.7 %  tma_retiring             (60.12%)

@thomaswue
Contributor Author

thomaswue commented Feb 1, 2024

Obviously, missed branches are bad. But what I did not expect is how bad they are. About 500 million branch misses are removed (which makes sense, because basically every 2nd row is a branch miss in the version with separate 8-byte and 16-byte cases). Getting rid of them costs about 17 billion (!) additional instructions, and the new version still takes more than 20% fewer cycles.
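The arithmetic can be checked against the two perf stat outputs quoted above; a back-of-envelope calculation that naively sums the cpu_core and cpu_atom counters, not an additional measurement:

```java
// Branch misses removed (cpu_core + cpu_atom, before vs. after):
long missesRemoved = (246_089_883L + 329_280_193L) - (14_111_445L + 19_354_582L);
// roughly the "500 million" quoted above

// Extra instructions paid for it:
long extraInsn = (49_328_285_361L + 68_071_427_732L)
               - (42_740_848_935L + 57_425_448_002L);
// about 17 billion

// And the combined cycle count still drops by more than 20%:
double cycleReduction =
        1.0 - (21_654_496_541.0 + 19_474_462_268.0)
            / (25_250_300_213.0 + 26_877_576_823.0);
```

That works out to roughly 30 extra instructions per eliminated miss, still a clear win given the cost of a pipeline flush.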

@gunnarmorling
Owner

Nice! Amazing to see this level of improvement at this point. Thanks a lot for participating in 1BRC, congrats on implementing one of the very top entries, and for keeping the updates coming over the whole month, @thomaswue!

Benchmark 1: timeout -v 300 ./calculate_average_thomaswue.sh 2>&1
  Time (mean ± σ):      1.532 s ±  0.014 s    [User: 0.002 s, System: 0.003 s]
  Range (min … max):    1.495 s …  1.545 s    10 runs

Summary
  thomaswue: trimmed mean 1.535478878755, raw times 1.49517163788,1.53168578688,1.5320476678800001,1.53248459588,1.53386057488,1.54497609988,1.53989302288,1.53518686588,1.5412673778800001,1.53740513788

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:01.535 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.2-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary, uses Unsafe |

@gunnarmorling gunnarmorling merged commit 241d42c into gunnarmorling:main Feb 1, 2024
1 check passed
@jerrinot
Contributor

jerrinot commented Feb 1, 2024

@thomaswue congratulations!

@thomaswue
Contributor Author

Wow, cool, thank you, that is even faster on the evaluation server than expected! Nice to see how the solution can very much compete while keeping the core Result data structure in the style of a Java class.

As referenced in the credit and co-author section, I got a lot of inspiration and borrowed ideas from @merykitty, @mukel, @artsiomkorzun, @jerrinot, and @abeobk. Thank you guys, it was a blast, I enjoyed it a lot (as you could see)! And I also learned a lot. Some of those learnings will go into future Graal compiler optimizations and tunings for sure!

@franz1981

Which translates to: know the pattern of your data and your domain, and you can get an immensely faster program.
It would be nice to see now, with the right tradeoffs applied, how well a high-level Java program could perform without relying on Unsafe,

i.e. the right stride to parse, branchless parsing, and decent locality (without Unsafe)

@thomaswue
Contributor Author

It should be possible to go without unsafe and without performance loss, but it will require some really careful crafting to make sure all bounds checks are combined/moved to mitigate their effect. The memory API also has a few more checks than just the bounds (thread owner, scope, temporal check). I tried to use it and looked at the compiler graph.

We will see if the Graal compiler can maybe optimize those specially or if some API additions might be necessary.

The other aspect we will add to Graal for sure is auto-vectorization of the search loop. That would avoid all that SWAR/mask complexity and produce decent vector code for the search automatically. We already do that for the String indexOf methods, but not for general user code.

@franz1981

franz1981 commented Feb 1, 2024

The other aspect we will add to Graal for sure is auto-vectorization of the search loop

that is indeed a very nice thing which C2 can achieve, sometimes, under certain conditions (which is an interesting combination of words :P ), but I still believe that having an intrinsic for a declarative MemorySegment::indexOf(byte) API would be a good idea as well, unless auto-vectorization can always kick in regardless of conditions like inlining level, etc.
There are many cases where such things are needed (I've implemented SWAR manually on Netty's off-heap/on-heap buffers exactly because I couldn't rely on ANY compiler to do it right, always), and I think others in the Java middleware data-processing landscape would benefit from it (Elasticsearch?).
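No such method exists in java.lang.foreign today; a hypothetical scalar shape of the proposed API (my sketch of the contract, where an intrinsic would replace the loop with vector code) might look like:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical declarative search API as discussed above; MemorySegment has no
// indexOf today. A JIT intrinsic could lower this scalar loop to SIMD code.
static long indexOf(MemorySegment segment, long fromOffset, byte needle) {
    for (long i = fromOffset; i < segment.byteSize(); i++) {
        if (segment.get(ValueLayout.JAVA_BYTE, i) == needle) {
            return i;
        }
    }
    return -1; // not found
}
```

For example, searching a segment holding the bytes `ab;cd` for `(byte) ';'` from offset 0 returns 2; a declarative entry point like this is what would let the runtime guarantee the optimization regardless of inlining depth.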

@thomaswue
Contributor Author

Yes, I think the best approach is to implement it as a general compiler transform and offer optional utility methods that if used are guaranteed to trigger the optimization.

@tivrfoa
Contributor

tivrfoa commented Feb 1, 2024

Obviously, missed branches are bad. But what I did not expect is how bad they are. About 500 million branch misses are removed (which makes sense, because basically every 2nd row is a branch miss in the version with separate 8-byte and 16-byte cases). Getting rid of them costs about 17 billion (!) additional instructions, and the new version still takes more than 20% fewer cycles.

That's very interesting!!!

The tradeoff between instruction count and branch misses is probably one of the reasons why some results differ so much depending on the hardware.

Also, OS configuration might play a role, because of the patches against the Meltdown and Spectre vulnerabilities.

@gunnarmorling
Owner

gunnarmorling commented Feb 11, 2024

Hey @thomaswue, @merykitty, and @mukel!

Congrats again on being in the Top 3 of the One Billion Row Challenge!

To celebrate this amazing achievement, I would like to send you a 1BRC t-shirt and coffee mug. To claim your prize, fill out this form by Feb 18. After submitting the form, please provide a comment with the random value you've specified in the form, so that I know it is you who submitted it.

All data entered will solely be used in relation to processing this shipment. Shipments can be sent to any country listed here. A big thank you to Decodable for sponsoring these prizes!

Thanks a lot for participating in 1BRC,

--Gunnar

@gunnarmorling
Owner

Hey @mtopolnik, thx for replying. I've tagged you on your own PR with the right link to the Top 20 form. Nice try to get that Top 3 t-shirt though ;)

@mtopolnik
Contributor

Oops... I got notifications for a bunch of them and had all of them open in the browser.

@thomaswue
Contributor Author

Thank you so much @gunnarmorling! Looking forward to proudly wearing that #1brc shirt in some of my next talks ;-).

Number is 9597 9145

@mukel
Contributor

mukel commented Feb 13, 2024

Hi @gunnarmorling, my random tokens: 3728 28670


8 participants