Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix all null list column with missing child column in JSON reader #17348

Merged
merged 25 commits into from
Dec 6, 2024

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Nov 16, 2024

Fixes #17341 and fixes #17349.

  • expose make_empty_column
  • fix all null list with empty nested child on missing list child column when prune columns is enabled
  • add unit test

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann requested a review from a team as a code owner November 16, 2024 05:36
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 16, 2024
@karthikeyann karthikeyann requested a review from ttnghia November 16, 2024 05:36
@karthikeyann karthikeyann added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Nov 16, 2024
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
rmm::device_buffer{},
rmm::device_buffer{},
0,
std::move(child_columns));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?
Could this utility be used instead?

std::unique_ptr<column> make_empty_lists_column(data_type child_type,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works only for fixed width type child.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is to avoid any purge/has_nonempty_nulls calls.

rmm::device_buffer{},
rmm::device_buffer{},
0,
std::move(child_columns));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do different than the factory call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discovered through profiling that make_lists_column and make_structs_column introduce significant overhead by unnecessarily calling to purge_non_empty_nulls and superimpose_nulls_no_sanitize. These sanitization may be needed in general, however, here we know that the child(ren) of JSON output column cannot have non-null rows if the parent is null (due to invalid input) thus we want to avoid such overhead by calling to the column constructor directly.

Copy link
Contributor

@ttnghia ttnghia Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karthikeyann Please add comments to clarify the reason why we have this change:

// Do not use `cudf::make_lists_column` since we do not need to call `purge_nonempty_nulls`
// on the child column as it does not have non-empty nulls.

and

// Do not use `cudf::make_structs_column` since we do not need to call `superimpose_nulls`
// on the children columns.

Copy link
Contributor

@ttnghia ttnghia Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed a related issue for this: #17356

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we could fix the make_structs_column and make_lists_column to not call these utilities for empty columns.
Or the utilities could be fixed to no-op for empty columns perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either fix is good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm still not seeing how make_structs_column is calling superimpose_nulls when the creating an empty column. The logic in the factory function only calls superimpose_nulls if a non-empty null-mask is passed:

if (!null_mask.is_empty()) {
for (auto& child : child_columns) {
child = structs::detail::superimpose_nulls(static_cast<bitmask_type const*>(null_mask.data()),
null_count,
std::move(child),
stream,
mr);
}
}

Copy link
Contributor

@ttnghia ttnghia Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I was wrong on this: it is unnecessary here (for constructing empty column) since the input column has empty null mask.

Copy link

codecov bot commented Nov 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (branch-25.02@c791f80). Learn more about missing BASE report.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-25.02   #17348   +/-   ##
===============================================
  Coverage                ?   84.15%           
===============================================
  Files                   ?      231           
  Lines                   ?    32928           
  Branches                ?        0           
===============================================
  Hits                    ?    27712           
  Misses                  ?     5216           
  Partials                ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@lamarrr lamarrr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

thrust::for_each_n(
rmm::exec_policy_nosync(stream),
thrust::make_counting_iterator<size_type>(0),
list_children_end - thrust::make_zip_iterator(node_ids.begin(), parent_col_ids.begin()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use thrust::distance ?

Suggested change
list_children_end - thrust::make_zip_iterator(node_ids.begin(), parent_col_ids.begin()),
thrust::distance(thrust::make_zip_iterator(node_ids.begin(), parent_col_ids.begin(), list_children_end),

Copy link
Contributor

@ttnghia ttnghia Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem to cause CI to fail:

 /home/coder/cudf/cpp/src/io/json/host_tree_algorithms.cu(953): error: no instance of function template "thrust::distance" matches the argument list
              argument types are: (thrust::zip_iterator<thrust::tuple<cudf::size_type *, cudf::size_type *, thrust::zip_iterator<thrust::tuple<cudf::io::json::NodeIndexT *, cudf::io::json::NodeIndexT *>>>>, lambda [](cudf::size_type)->void)
        thrust::distance(thrust::make_zip_iterator(node_ids.begin(), parent_col_ids.begin(), list_children_end),
        ^
  /home/coder/cudf/cpp/build/conda/cuda-12.5/release/_deps/cccl-src/thrust/thrust/cmake/../../thrust/detail/distance.inl(38): note #3327-D: candidate function template "thrust::distance" failed deduction
    distance(InputIterator first, InputIterator last)
    ^
  
  /home/coder/cudf/cpp/src/io/json/host_tree_algorithms.cu(969): error: expected a ")"
        });

Essentially there is a missing ) for make_zip_iterator:

--thrust::distance(thrust::make_zip_iterator(..., parent_col_ids.begin(), list_children_end),
++thrust::distance(thrust::make_zip_iterator(..., parent_col_ids.begin()), list_children_end),

Copy link
Contributor

@ttnghia ttnghia Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that the same expression is used to compute num_list_children below this line.

Copy link
Contributor

@ttnghia ttnghia Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can reduce code duplicate by something like this:

auto const list_children_begin = thrust::make_zip_iterator(node_ids.begin(), parent_col_ids.begin());
// replace all `thrust::make_zip_iterator` by `list_children_begin`.
auto const num_list_children = thrust::distance(list_children_begin, list_children_end);
// use num_list_children

karthikeyann and others added 2 commits December 4, 2024 12:16
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
@karthikeyann
Copy link
Contributor Author

/ok to test

@karthikeyann
Copy link
Contributor Author

/ok to test

@karthikeyann
Copy link
Contributor Author

/ok to test

@karthikeyann
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 14b4321 into branch-25.02 Dec 6, 2024
104 of 105 checks passed
@ttnghia ttnghia deleted the fix-json_pruned_empty_list_column branch December 7, 2024 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
5 participants