-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Whitespace normalization of nested column coerced as string column in JSONL inputs #16759
Whitespace normalization of nested column coerced as string column in JSONL inputs #16759
Conversation
…i/cudf into json-whitespace-normalization-post
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No major concerns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am seeing some test failures with this patch where some of the data being read is not being normalized. I think that the issue is coming from the idea that the data needs to be "mixed" when in reality we want to normalize the white space for any row in a column where the output is a string but the input included LIST or STRUCT parts. For example an input of
{"data": {"A": 0, "B": 1}}
with a requested schema of data STRING
results in
{"A": 0, "B": 1}
As if the normalization didn't happen. But if I add in more lines to make it a mixed type, then it passes.
{"data": {"A": 0, "B": 1}}
{"data": "test"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That fixed all of the issues I was having and all of the tests for white space normalization now pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having UI troble, flushing the comments to try to fix it
…i/cudf into json-whitespace-normalization-post
/ok to test |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
only one suggestion on memory usage for stencil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
LGTM 👍
/merge |
/ok to test |
Posting offline discussion here: (to avoid
|
/ok to test |
Description
Addresses #15280
Whitespace normalization is expected to remove unquoted whitespace characters in JSON lines inputs. However, in the cases where the JSON line is invalid due to an unquoted whitespace occurring in between numbers or literals, the existing normalization implementation is incorrect since it removes these invalidating whitespaces and makes the line valid.
This PR implements the normalization as a post-processing step on only nested columns forced as string columns.
Idea:
Checklist