-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12995: [C++] Add validation to CSV options #10505
Conversation
7ab0c8d
to
5dc61de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this!
cpp/src/arrow/csv/options.cc
Outdated
if (ARROW_PREDICT_FALSE((delimiter < ' ' && delimiter != '\t') || delimiter > '~')) { | ||
return Status::Invalid( | ||
"ParseOptions: delimiter must be a printable ascii char or '\\t': 0x", | ||
std::setfill('0'), std::setw(2), std::hex, static_cast<uint16_t>(delimiter)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't think I understand this check. Is there something in the CSV parser that currently prevents using other delimiters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of the characters before
are terminal control characters and the only ascii character after ~
is del
. While these characters could be used in theory for a text file they would be very strange couldn't be displayed on a terminal. These are the characters this is excluding https://flaviocopes.com/non-printable-ascii-characters/ . Also, I don't think the user could specify values like \n
or \r
since that would break the parser since they have to be used for end of line. Currently the python layer requires all character values to be between 1 - 127 inclusive.
I kept tab because tab separated values is a format which is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's our duty to guard against dubious values. Perhaps some weird formats use a form-feed character as a delimiter, who knows?
The only think that may be reasonable may be to forbid \n
and \r
. Otherwise we should just let the user choose whatever they like.
cpp/src/arrow/csv/options.cc
Outdated
"ParseOptions: delimiter must be a printable ascii char or '\\t': 0x", | ||
std::setfill('0'), std::setw(2), std::hex, static_cast<uint16_t>(delimiter)); | ||
} | ||
if (ARROW_PREDICT_FALSE(quoting && (quote_char < ' ' || quote_char > '~'))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same questions below.
python/pyarrow/_csv.pyx
Outdated
@@ -58,6 +58,7 @@ cdef class ReadOptions(_Weakrefable): | |||
How much bytes to process at a time from the input stream. | |||
This will determine multi-threading granularity as well as | |||
the size of individual record batches or table chunks. | |||
Minimum valid value for block size is 1KB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean 1B, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops, I wanted 1KB but tripped over tests. Guess I missed a spot
cdda435
to
92726b7
Compare
92726b7
to
942d5f7
Compare
Rebased, will merge if CI is green. |
AppVeyor failure is unrelated (https://issues.apache.org/jira/browse/ARROW-13090). |
Thank you again @n3world ! |
Closes apache#10505 from n3world/ARROW-12995-Validate_csv_opts Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Antoine Pitrou <antoine@python.org>
No description provided.