-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Avoid inserting null elements into join hash table when nulls are treated as unequal #6943
[REVIEW] Avoid inserting null elements into join hash table when nulls are treated as unequal #6943
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
Here is the null benchmark running without this change. I will update this as it progresses, but the benchmark is laughable.
And with this change:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work
Thanks for the suggestions! Co-authored-by: Jake Hemstad <jhemstad@nvidia.com>
The reason it is so slow without this change is because when nulls are not equal but are inserted into the map, you end up with a huge chain in the hash map. Every time you do an insert of a null value, you end up having to serially iterate that chain. Each slot you visit in the chain requires an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice speedup! Could use a doc clarification.
Codecov Report
@@ Coverage Diff @@
## branch-0.18 #6943 +/- ##
===============================================
+ Coverage 81.58% 82.02% +0.43%
===============================================
Files 96 96
Lines 15920 16275 +355
===============================================
+ Hits 12989 13350 +361
+ Misses 2931 2925 -6
Continue to review full report at Codecov.
|
Thanks for the suggestions. Co-authored-by: nvdbaranec <56695930+nvdbaranec@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have better random number generators. (not that randomness quality matters much for this benchmark, but it's good to be consistent across benchmarks)
Co-authored-by: Mark Harris <mharris@nvidia.com>
This change mirrors what is done in
groupby
to eliminate null-containing columns from the join hash table if nulls not equal is set. This prevents absolute runaway of the process. I added benchmarks for joins with nulls and I can't even get it to finish without these changes. The 195ms test without nulls takes 2,000,000ms to complete and the larger tests I haven't had the patience to even see complete. With this change, the timings are faster than without nulls proportional to the % of nulls. Meaning half the table is nulls means the query is twice as fast as the non-null version, which makes sense.closes #6052