
One last improvement for thomaswue #702

Merged: 5 commits merged on Feb 1, 2024

Conversation

thomaswue
Contributor

@thomaswue thomaswue commented Jan 31, 2024

Check List:

  • You have run ./mvnw verify and the project builds successfully
  • Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
  • All formatting changes by the build are committed
  • Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
  • Output matches that of calculate_average_baseline.sh
  • For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (line number)
  • Execution time: 0.31
  • Execution time of reference implementation: 120.37

OK, so I can't let @jerrinot have all the fun alone ;-). I adopted the approach that combines the <8 and 1-16 cases to avoid branch misprediction, which was the biggest improvement, and then also the masking instead of bit shifting, which added another few percent. Added @jerrinot and @abeobk to the credits section. This should now be in the ballpark of @jerrinot's submission, if the results from my machine don't lie. The major differences to his solution are that I am using Result objects for the hash table instead of unsafe access and that I am unrolling 3 times. It also avoids the name length check by having the delimiter in the word mask.
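The "delimiter in the word mask" idea can be sketched as follows; this is a generic SWAR illustration under my own naming, not the exact code from the submission (0x3B is ';', the field delimiter in the challenge input):

```java
// Generic SWAR sketch (illustration, not the PR code): returns a long with the
// high bit set in every byte of `word` that equals ';' (0x3B).
static long semicolonMatchBits(long word) {
    long x = word ^ 0x3B3B3B3B3B3B3B3BL;                 // matching bytes become 0x00
    return (x - 0x0101010101010101L) & ~x & 0x8080808080808080L;
}

// The byte index of the first delimiter then falls out without a length check
// or a branch; returns 8 when the word contains no ';' at all.
static int firstSemicolonIndex(long matchBits) {
    return Long.numberOfTrailingZeros(matchBits) >>> 3;
}
```

For a little-endian word holding the bytes `ab;cdefg`, the match bits flag byte 2 and the index comes out as 2.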

@thomaswue
Contributor Author

@gunnarmorling Added one last code layout change that improves by 1-2%, but that is really it now ;-).

@jerrinot mentioned this pull request Jan 31, 2024
@jerrinot
Contributor

@thomaswue challenge accepted!
also: can we both agree that if you put null bytes into a string then you deserve to suffer? :)

@royvanrijn
Contributor

Wait, now everybody switches to processing 16 bytes? I was experimenting with that more than a week ago haha. 🤦🏻‍♂️

@jerrinot
Contributor

jerrinot commented Feb 1, 2024

@royvanrijn :)))) I have no idea how you managed to do so well while benchmarking on an M2! 🙇 🪄 I found perf results on ARM not to be representative at all.

I included the 16-byte processing in this PR. @mtopolnik had been playing with a similar thing even before. It performed quite well from the start, but 2 major things were holding it back:

  1. Bit-twiddling increased the total instruction count significantly (as visible in the perf stat attached to the PR). This was eventually solved by using @abeobk's lookup tables.
  2. The state was too large and the compiler resorted to stack spilling. @franz1981 and @thomaswue helped me to identify it.
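One form such a lookup table can take (my own illustration, not @abeobk's actual code): instead of deriving a byte mask with shift-heavy bit twiddling, the position of the first delimiter indexes a small precomputed table of masks.

```java
// Illustrative lookup table (assumption, not the actual PR code): KEEP_MASK[p]
// keeps the p low-order bytes of a word, i.e. everything before a delimiter
// found at byte index p; index 8 means "no delimiter in this word".
static final long[] KEEP_MASK = new long[9];
static {
    for (int pos = 0; pos < 8; pos++) {
        KEEP_MASK[pos] = (1L << (8 * pos)) - 1;
    }
    KEEP_MASK[8] = -1L; // keep all 8 bytes
}

// Branch-free truncation of a word at the first delimiter:
static long maskedWord(long word, int firstDelimiterIndex) {
    return word & KEEP_MASK[firstDelimiterIndex];
}
```

A table load replaces a data-dependent shift, which is one way to cut the bit-twiddling instruction count.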

@thomaswue
Contributor Author

Based on the various experiments around 16-byte processing, I also thought it would not help that much. But it seems that for my solution it was the last missing piece. It is amazing how, once other bottlenecks are solved, a certain optimization can suddenly become super critical.

It completely changes the bottlenecks identified by perf stat from a lot of bad speculation:

Performance counter stats for './target/CalculateAverage_thomaswue_image':
      12,061.39 msec task-clock                       #   30.313 CPUs utilized             
              681      context-switches                 #   56.461 /sec                      
               99      cpu-migrations                   #    8.208 /sec                      
          220,203      page-faults                      #   18.257 K/sec                     
   25,250,300,213      cpu_core/cycles/                 #    2.093 GHz                         (93.88%)
   26,877,576,823      cpu_atom/cycles/                 #    2.228 GHz                         (46.36%)
   42,740,848,935      cpu_core/instructions/           #    1.69  insn per cycle              (93.88%)
   57,425,448,002      cpu_atom/instructions/           #    2.27  insn per cycle              (55.29%)
    3,938,912,502      cpu_core/branches/               #  326.572 M/sec                       (93.88%)
    5,271,137,390      cpu_atom/branches/               #  437.026 M/sec                       (55.61%)
      246,089,883      cpu_core/branch-misses/          #    6.25% of all branches             (93.88%)
      329,280,193      cpu_atom/branch-misses/          #    8.36% of all branches             (56.92%)
            TopdownL1 (cpu_core)                 #     12.8 %  tma_backend_bound      
                                                 #     24.9 %  tma_bad_speculation    
                                                 #      7.9 %  tma_frontend_bound     
                                                 #     54.4 %  tma_retiring             (93.88%)
            TopdownL1 (cpu_atom)                 #     39.8 %  tma_bad_speculation      (57.63%)
                                                 #      6.6 %  tma_frontend_bound       (57.99%)
                                                 #     12.9 %  tma_backend_bound      
                                                 #     12.9 %  tma_backend_bound_aux    (58.20%)
                                                 #     39.6 %  tma_retiring             (57.62%)

      0.397893788 seconds time elapsed

      0.000000000 seconds user
      0.006337000 seconds sys

To almost no bad speculation and much higher instructions per cycle and almost no branch misses anymore:

 Performance counter stats for './target/CalculateAverage_thomaswue_image':
             9,778.04 msec task-clock                       #   30.927 CPUs utilized             
               811      context-switches                 #   82.941 /sec                      
                81      cpu-migrations                   #    8.284 /sec                      
           218,792      page-faults                      #   22.376 K/sec                     
    21,654,496,541      cpu_core/cycles/                 #    2.215 GHz                         (88.73%)
    19,474,462,268      cpu_atom/cycles/                 #    1.992 GHz                         (54.36%)
    49,328,285,361      cpu_core/instructions/           #    2.28  insn per cycle              (88.73%)
    68,071,427,732      cpu_atom/instructions/           #    3.14  insn per cycle              (59.45%)
     3,690,652,954      cpu_core/branches/               #  377.443 M/sec                       (88.73%)
     5,032,375,583      cpu_atom/branches/               #  514.661 M/sec                       (59.45%)
        14,111,445      cpu_core/branch-misses/          #    0.38% of all branches             (88.73%)
        19,354,582      cpu_atom/branch-misses/          #    0.52% of all branches             (59.46%)
             TopdownL1 (cpu_core)                 #     18.7 %  tma_backend_bound      
                                                  #      2.2 %  tma_bad_speculation    
                                                  #      5.8 %  tma_frontend_bound     
                                                  #     73.3 %  tma_retiring             (88.73%)
             TopdownL1 (cpu_atom)                 #      4.5 %  tma_bad_speculation      (59.45%)
                                                  #      1.8 %  tma_frontend_bound       (59.45%)
                                                  #     28.7 %  tma_backend_bound      
                                                  #     28.7 %  tma_backend_bound_aux    (59.48%)
                                                  #     66.7 %  tma_retiring             (60.12%)

@thomaswue
Contributor Author

thomaswue commented Feb 1, 2024

Obviously, missed branches are bad. But what I did not expect is how bad they are. About 500 million branch misses are removed (which makes sense, because basically every 2nd row is a branch miss in the version with separate 8-byte and 16-byte cases). Getting rid of them costs about 17 billion (!) additional instructions, and the new version still takes more than 20% fewer cycles.
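The arithmetic can be checked against the two perf stat outputs quoted above; a back-of-envelope calculation that naively sums the cpu_core and cpu_atom counters, not an additional measurement:

```java
// Branch misses removed (cpu_core + cpu_atom, before vs. after):
long missesRemoved = (246_089_883L + 329_280_193L) - (14_111_445L + 19_354_582L);
// roughly the "500 million" quoted above

// Extra instructions paid for it:
long extraInsn = (49_328_285_361L + 68_071_427_732L)
               - (42_740_848_935L + 57_425_448_002L);
// about 17 billion

// And the combined cycle count still drops by more than 20%:
double cycleReduction =
        1.0 - (21_654_496_541.0 + 19_474_462_268.0)
            / (25_250_300_213.0 + 26_877_576_823.0);
```

That works out to roughly 30 extra instructions per eliminated miss, still a clear win given the cost of a pipeline flush.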

@gunnarmorling
Owner

Nice! Amazing to see this level of improvement at this point. Thanks a lot for participating in 1BRC, congrats on implementing one of the very top entries, and for keeping the updates coming over the whole month, @thomaswue!

Benchmark 1: timeout -v 300 ./calculate_average_thomaswue.sh 2>&1
  Time (mean ± σ):      1.532 s ±  0.014 s    [User: 0.002 s, System: 0.003 s]
  Range (min … max):    1.495 s …  1.545 s    10 runs

Summary
  thomaswue: trimmed mean 1.535478878755, raw times 1.49517163788,1.53168578688,1.5320476678800001,1.53248459588,1.53386057488,1.54497609988,1.53989302288,1.53518686588,1.5412673778800001,1.53740513788

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:01.535 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.2-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary, uses Unsafe |

@gunnarmorling gunnarmorling merged commit 241d42c into gunnarmorling:main Feb 1, 2024
1 check passed
@jerrinot
Contributor

jerrinot commented Feb 1, 2024

@thomaswue congratulations!

@thomaswue
Contributor Author

Wow, cool, thank you, that is even faster on the evaluation server than expected! Nice to see how the solution can very much compete while keeping the core Result data structure in the style of a Java class.

As referenced in the credit and co-author section, I got a lot of inspiration and borrowed ideas from @merykitty, @mukel, @artsiomkorzun, @jerrinot, and @abeobk. Thank you guys, it was a blast, I enjoyed it a lot (as you could see)! And I also learned a lot. Some of those learnings will go into future Graal compiler optimizations and tunings for sure!

@franz1981

Which translates to: know the pattern of your data and your domain, and you can get an immensely faster program.
It would be nice to see now, with the right tradeoffs applied, how well a high-level Java program could perform without relying on Unsafe,

i.e. the right stride to parse, branchless parsing, and decent locality (without Unsafe)

@thomaswue
Contributor Author

It should be possible to go without unsafe and without performance loss, but it will require some really careful crafting to make sure all bounds checks are combined/moved to mitigate their effect. The memory API also has a few more checks than just the bounds (thread owner, scope, temporal check). I tried to use it and looked at the compiler graph.

We will see if the Graal compiler can maybe optimize those specially or if some API additions might be necessary.

The other aspect we will add to Graal for sure is auto-vectorization of the search loop. That would avoid all that SWAR/mask complexity and produce decent vector code for the search automatically. We already do that for the String indexOf methods, but not for general user code.

@franz1981

franz1981 commented Feb 1, 2024

The other aspect we will add to Graal for sure is auto-vectorization of the search loop

that is indeed a very nice thing which C2 can achieve, sometimes, under certain conditions (which is an interesting combination of words :P ), but I still believe that having an intrinsic for a declarative MemorySegment::indexOf(byte) API would be a good idea as well, unless auto-vectorization can always kick in regardless of conditions like inlining level, etc.
There are many cases where such things are needed (I've implemented SWAR manually on Netty's off-heap/on-heap buffers exactly because I couldn't rely on ANY compiler to do it right, always), and I think others in the Java middleware data-processing landscape would benefit from it (Elasticsearch?).
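No such method exists in java.lang.foreign today; a hypothetical scalar shape of the proposed API (my sketch of the contract, where an intrinsic would replace the loop with vector code) might look like:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical declarative search API as discussed above; MemorySegment has no
// indexOf today. A JIT intrinsic could lower this scalar loop to SIMD code.
static long indexOf(MemorySegment segment, long fromOffset, byte needle) {
    for (long i = fromOffset; i < segment.byteSize(); i++) {
        if (segment.get(ValueLayout.JAVA_BYTE, i) == needle) {
            return i;
        }
    }
    return -1; // not found
}
```

For example, searching a segment holding the bytes `ab;cd` for `(byte) ';'` from offset 0 returns 2; a declarative entry point like this is what would let the runtime guarantee the optimization regardless of inlining depth.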

@thomaswue
Contributor Author

Yes, I think the best approach is to implement it as a general compiler transform and offer optional utility methods that if used are guaranteed to trigger the optimization.

@tivrfoa
Contributor

tivrfoa commented Feb 1, 2024

Obviously, missed branches are bad. But what I did not expect is how bad they are. About 500 million branch misses are removed (which makes sense, because basically every 2nd row is a branch miss in the version with separate 8-byte and 16-byte cases). Getting rid of them costs about 17 billion (!) additional instructions, and the new version still takes more than 20% fewer cycles.

That's very interesting!!!

The tradeoff between instruction count and branch misses is probably one of the reasons why some results differ so much depending on the hardware.

Also, OS configuration might play a role, because of the patches against the Meltdown and Spectre vulnerabilities.

@gunnarmorling
Owner

gunnarmorling commented Feb 11, 2024

Hey @thomaswue, @merykitty, and @mukel!

Congrats again on being in the Top 3 of the One Billion Row Challenge!

To celebrate this amazing achievement, I would like to send you a 1BRC t-shirt and coffee mug. To claim your prize, fill out this form by Feb 18. After submitting the form, please provide a comment with the random value you've specified in the form, so that I know it is you who submitted it.

All data entered will solely be used in relation to processing this shipment. Shipments can be sent to any country listed here. A big thank you to Decodable for sponsoring these prizes!

Thanks a lot for participating in 1BRC,

--Gunnar

@gunnarmorling
Owner

Hey @mtopolnik, thx for replying. I've tagged you on your own PR with the right link to the Top 20 form. Nice try to get that Top 3 t-shirt though ;)

@mtopolnik
Contributor

Oops... I got notifications for a bunch of them and had all of them open in the browser.

@thomaswue
Contributor Author

Thank you so much @gunnarmorling! Looking forward to proudly wearing that #1brc shirt in some of my next talks ;-).

Number is 9597 9145

@mukel
Contributor

mukel commented Feb 13, 2024

Hi @gunnarmorling, my random tokens: 3728 28670


8 participants