Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whitespace normalization of nested column coerced as string column in JSONL inputs #16759

Merged

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Sep 5, 2024

Description

Addresses #15280

Whitespace normalization is expected to remove unquoted whitespace characters in JSON lines inputs. However, in the cases where the JSON line is invalid due to an unquoted whitespace occurring in between numbers or literals, the existing normalization implementation is incorrect since it removes these invalidating whitespaces and makes the line valid.

This PR implements the normalization as a post-processing step on only nested columns forced as string columns.
Idea:

  1. Create a single buffer by concatenating the rows of the string column. Create segment offsets and lengths array for concatenated buffer
  2. Run a complementary whitespace normalization FST i.e. NOP for non-whitespace and quoted whitespace characters, and output indices of unquoted whitespace characters
  3. Update segment lengths based on the number of output indices between segment offsets
  4. Remove characters at output indices from concatenated buffer.
  5. Return updated buffer, segment lengths and updated segment offsets

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shrshi shrshi added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Sep 5, 2024
@shrshi shrshi marked this pull request as ready for review September 6, 2024 00:27
@shrshi shrshi requested a review from a team as a code owner September 6, 2024 00:27
@shrshi shrshi requested review from vyasr and davidwendt September 6, 2024 00:27
@vuule vuule self-requested a review September 6, 2024 21:04
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No major concerns

cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/include/cudf/io/detail/json.hpp Outdated Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
@shrshi shrshi requested a review from karthikeyann September 6, 2024 23:23
Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am seeing some test failures with this patch where some of the data being read is not being normalized. I think that the issue is coming from the idea that the data needs to be "mixed" when in reality we want to normalize the white space for any row in a column where the output is a string but the input included LIST or STRUCT parts. For example an input of

{"data": {"A": 0, "B": 1}}

with a requested schema of data STRING results in

{"A": 0, "B": 1}

As if the normalization didn't happen. But if I add in more lines to make it a mixed type, then it passes.

{"data": {"A": 0, "B": 1}}
{"data": "test"}

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That fixed all of the issues I was having and all of the tests for white space normalization now pass.

@vuule vuule self-requested a review September 10, 2024 17:00
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having UI troble, flushing the comments to try to fix it

cpp/tests/io/json/json_whitespace_normalization_test.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/include/cudf/io/detail/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/detail/json.hpp Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Sep 17, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Sep 17, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Sep 18, 2024

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff!

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
only one suggestion on memory usage for stencil.

cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
LGTM 👍

@shrshi
Copy link
Contributor Author

shrshi commented Sep 18, 2024

/merge

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Sep 19, 2024
@karthikeyann
Copy link
Contributor

/ok to test

@karthikeyann
Copy link
Contributor

Posting offline discussion here: (to avoid upper_bound and atomics).
Use a permutation iterator instead outbuf_indices.data(), to directly scatter to stencil.
Use the original FST (output non-whitespace), and use a permutation iterator to create stencil.
then use HistogramRange to create the offsets?

auto out = thrust::make_transform_output_iterator(
    thrust::discard_iterator{},
    [stencil=stencil.begin()](auto const& i) -> bool { stencil[i] = true; return true; });

@karthikeyann
Copy link
Contributor

/ok to test

karthikeyann added a commit to karthikeyann/cudf that referenced this pull request Sep 19, 2024
@rapids-bot rapids-bot bot merged commit 30e3946 into rapidsai:branch-24.10 Sep 19, 2024
97 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants