Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12995: [C++] Add validation to CSV options #10505

Closed
wants to merge 1 commit into from

Conversation

n3world
Copy link
Contributor

@n3world n3world commented Jun 10, 2021

No description provided.

@github-actions
Copy link

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!

if (ARROW_PREDICT_FALSE((delimiter < ' ' && delimiter != '\t') || delimiter > '~')) {
return Status::Invalid(
"ParseOptions: delimiter must be a printable ascii char or '\\t': 0x",
std::setfill('0'), std::setw(2), std::hex, static_cast<uint16_t>(delimiter));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't think I understand this check. Is there something in the CSV parser that currently prevents using other delimiters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the characters before are terminal control characters and the only ascii character after ~ is del. While these characters could be used in theory for a text file they would be very strange couldn't be displayed on a terminal. These are the characters this is excluding https://flaviocopes.com/non-printable-ascii-characters/ . Also, I don't think the user could specify values like \n or \r since that would break the parser since they have to be used for end of line. Currently the python layer requires all character values to be between 1 - 127 inclusive.
I kept tab because tab separated values is a format which is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's our duty to guard against dubious values. Perhaps some weird formats use a form-feed character as a delimiter, who knows?

The only think that may be reasonable may be to forbid \n and \r. Otherwise we should just let the user choose whatever they like.

"ParseOptions: delimiter must be a printable ascii char or '\\t': 0x",
std::setfill('0'), std::setw(2), std::hex, static_cast<uint16_t>(delimiter));
}
if (ARROW_PREDICT_FALSE(quoting && (quote_char < ' ' || quote_char > '~'))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same questions below.

@@ -58,6 +58,7 @@ cdef class ReadOptions(_Weakrefable):
How much bytes to process at a time from the input stream.
This will determine multi-threading granularity as well as
the size of individual record batches or table chunks.
Minimum valid value for block size is 1KB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean 1B, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, I wanted 1KB but tripped over tests. Guess I missed a spot

@n3world n3world force-pushed the ARROW-12995-Validate_csv_opts branch 3 times, most recently from cdda435 to 92726b7 Compare June 15, 2021 21:28
@pitrou pitrou force-pushed the ARROW-12995-Validate_csv_opts branch from 92726b7 to 942d5f7 Compare June 16, 2021 09:31
@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

Rebased, will merge if CI is green.

@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

AppVeyor failure is unrelated (https://issues.apache.org/jira/browse/ARROW-13090).

@pitrou pitrou closed this in 4d19225 Jun 16, 2021
@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

Thank you again @n3world !

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
Closes apache#10505 from n3world/ARROW-12995-Validate_csv_opts

Authored-by: Nate Clark <nate@neworld.us>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@n3world n3world deleted the ARROW-12995-Validate_csv_opts branch July 21, 2021 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants