Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Avoid inserting null elements into join hash table when nulls are treated as unequal #6943

Merged
merged 19 commits into from
Dec 11, 2020

Conversation

hyperbolic2346
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 commented Dec 8, 2020

This change mirrors what is done in groupby to eliminate null-containing columns from the join hash table if nulls not equal is set. This prevents absolute runaway of the process. I added benchmarks for joins with nulls and I can't even get it to finish without these changes. The 195ms test without nulls takes 2,000,000ms to complete and the larger tests I haven't had the patience to even see complete. With this change, the timings are faster than without nulls proportional to the % of nulls. Meaning half the table is nulls means the query is twice as fast as the non-null version, which makes sense.

closes #6052

@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner December 8, 2020 19:37
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@hyperbolic2346
Copy link
Contributor Author

hyperbolic2346 commented Dec 8, 2020

Here is the null benchmark running without this change. I will update this as it progresses, but the benchmark is laughable.

Running ./cpp/build/release/gbenchmarks/JOIN_BENCH
Run on (12 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1024 KiB (x6)
  L3 Unified 8448 KiB (x1)
Load Average: 4.03, 4.33, 4.34
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------
Join<int32_t, int32_t>/join_32bit/100000/100000/manual_time                  0.241 ms        0.257 ms         2912
Join<int32_t, int32_t>/join_32bit/100000/400000/manual_time                  0.319 ms        0.334 ms         2177
Join<int32_t, int32_t>/join_32bit/100000/1000000/manual_time                 0.583 ms        0.601 ms         1200
Join<int32_t, int32_t>/join_32bit/10000000/10000000/manual_time               19.0 ms         19.0 ms           37
Join<int32_t, int32_t>/join_32bit/10000000/40000000/manual_time               51.4 ms         51.4 ms           14
Join<int32_t, int32_t>/join_32bit/10000000/100000000/manual_time              72.0 ms         72.0 ms           10
Join<int32_t, int32_t>/join_32bit/100000000/100000000/manual_time              195 ms          195 ms            4
Join<int32_t, int32_t>/join_32bit/80000000/240000000/manual_time               337 ms          337 ms            2
Join<int64_t, int64_t>/join_64bit/50000000/50000000/manual_time               99.2 ms         99.3 ms            7
Join<int64_t, int64_t>/join_64bit/40000000/120000000/manual_time               172 ms          172 ms            4
Join<int32_t, int32_t>/join_32bit_nulls/100000/100000/manual_time              406 ms          406 ms            2
Join<int32_t, int32_t>/join_32bit_nulls/100000/400000/manual_time             1124 ms         1124 ms            1
Join<int32_t, int32_t>/join_32bit_nulls/100000/1000000/manual_time            2712 ms         2712 ms            1
Join<int32_t, int32_t>/join_32bit_nulls/10000000/10000000/manual_time      2660784 ms      2660781 ms            1
...

And with this change:

Running ./cpp/build/release/gbenchmarks/JOIN_BENCH
Run on (12 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1024 KiB (x6)
  L3 Unified 8448 KiB (x1)
Load Average: 4.00, 4.00, 4.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------------------------------
Benchmark                                                                        Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------------
Join<int32_t, int32_t>/join_32bit/100000/100000/manual_time                  0.241 ms        0.257 ms         2864
Join<int32_t, int32_t>/join_32bit/100000/400000/manual_time                  0.320 ms        0.335 ms         2181
Join<int32_t, int32_t>/join_32bit/100000/1000000/manual_time                 0.584 ms        0.602 ms         1197
Join<int32_t, int32_t>/join_32bit/10000000/10000000/manual_time               19.0 ms         19.0 ms           37
Join<int32_t, int32_t>/join_32bit/10000000/40000000/manual_time               51.4 ms         51.4 ms           14
Join<int32_t, int32_t>/join_32bit/10000000/100000000/manual_time              72.0 ms         72.0 ms           10
Join<int32_t, int32_t>/join_32bit/100000000/100000000/manual_time              195 ms          195 ms            4
Join<int32_t, int32_t>/join_32bit/80000000/240000000/manual_time               337 ms          337 ms            2
Join<int64_t, int64_t>/join_64bit/50000000/50000000/manual_time               99.3 ms         99.3 ms            7
Join<int64_t, int64_t>/join_64bit/40000000/120000000/manual_time               172 ms          172 ms            4
Join<int32_t, int32_t>/join_32bit_nulls/100000/100000/manual_time            0.211 ms        0.228 ms         3316
Join<int32_t, int32_t>/join_32bit_nulls/100000/400000/manual_time            0.230 ms        0.246 ms         3023
Join<int32_t, int32_t>/join_32bit_nulls/100000/1000000/manual_time           0.341 ms        0.355 ms         2056
Join<int32_t, int32_t>/join_32bit_nulls/10000000/10000000/manual_time         4.23 ms         4.25 ms          165
Join<int32_t, int32_t>/join_32bit_nulls/10000000/40000000/manual_time         10.1 ms         10.2 ms           69
Join<int32_t, int32_t>/join_32bit_nulls/10000000/100000000/manual_time        22.9 ms         22.9 ms           51
Join<int32_t, int32_t>/join_32bit_nulls/100000000/100000000/manual_time       41.9 ms         42.0 ms           17
Join<int32_t, int32_t>/join_32bit_nulls/80000000/240000000/manual_time        66.7 ms         66.7 ms           10
Join<int64_t, int64_t>/join_64bit_nulls/50000000/50000000/manual_time         21.4 ms         21.4 ms           33
Join<int64_t, int64_t>/join_64bit_nulls/40000000/120000000/manual_time        33.9 ms         33.9 ms           21

cpp/src/join/hash_join.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work

Thanks for the suggestions!

Co-authored-by: Jake Hemstad <jhemstad@nvidia.com>
@jrhemstad
Copy link
Contributor

jrhemstad commented Dec 8, 2020

The reason it is so slow without this change is because when nulls are not equal but are inserted into the map, you end up with a huge chain in the hash map. Every time you do an insert of a null value, you end up having to serially iterate that chain. Each slot you visit in the chain requires an atomicCAS, and if multiple threads are trying to iterate the same chain concurrently, you end up with a ton of contention.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice speedup! Could use a doc clarification.

cpp/benchmarks/join/join_benchmark.cu Outdated Show resolved Hide resolved
cpp/src/join/hash_join.cu Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Dec 9, 2020

Codecov Report

Merging #6943 (39403b2) into branch-0.18 (252f478) will increase coverage by 0.43%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #6943      +/-   ##
===============================================
+ Coverage        81.58%   82.02%   +0.43%     
===============================================
  Files               96       96              
  Lines            15920    16275     +355     
===============================================
+ Hits             12989    13350     +361     
+ Misses            2931     2925       -6     
Impacted Files Coverage Δ
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_json.py 100.00% <0.00%> (ø)
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <0.00%> (ø)
python/cudf/cudf/utils/applyutils.py 98.74% <0.00%> (+0.02%) ⬆️
python/cudf/cudf/core/join/join.py 92.44% <0.00%> (+0.03%) ⬆️
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 252f478...39403b2. Read the comment docs.

@hyperbolic2346 hyperbolic2346 added bug Something isn't working libcudf++ non-breaking Non-breaking change labels Dec 9, 2020
cpp/src/join/hash_join.cu Outdated Show resolved Hide resolved
@jrhemstad jrhemstad changed the title [REVIEW] Fixing huge performance penalty of nulls not equal joins [REVIEW] Avoid inserting null elements into join hash table when nulls are treated as unequal Dec 9, 2020
cpp/benchmarks/join/join_benchmark.cu Outdated Show resolved Hide resolved
cpp/src/join/hash_join.cu Outdated Show resolved Hide resolved
cpp/src/join/hash_join.cu Outdated Show resolved Hide resolved
Thanks for the suggestions.

Co-authored-by: nvdbaranec <56695930+nvdbaranec@users.noreply.github.com>
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have better random number generators. (not that randomness quality matters much for this benchmark, but it's good to be consistent across benchmarks)

cpp/benchmarks/join/join_benchmark.cu Outdated Show resolved Hide resolved
cpp/benchmarks/join/join_benchmark.cu Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] null keys in a join with null not equal causes performance problems
5 participants