-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JNI bindings to write CSV #12425
JNI bindings to write CSV #12425
Conversation
This change adds JNI bindings to write tables out as CSV, to either the filesystem or memory. The Java Table class now has additional methods: 1. Table.writeCSVToFile(): Writes the current table out to the specified file on the filesystem. 2. Table.writeCSVToBuffer(): Writes the current table out to a HostBufferConsumer. These calls are analogous to cudf::io::write_csv(). Current limitations: 1. The cudf::io::csv_writer_options interface binds the CSV options tightly to the Table being written. This makes it a little clumsy to write multiple Tables to the same HostBufferConsumer, because each could be written with different, contradictory options. 2. cudf::io::write_csv(file_name) overwrites the specified file, if it exists. There currently isn't a way to keep a file open, and write multiple tables to it; each write call overwrites the previous file.
I'm working on a workaround for this, now. Should have it shortly. |
Be very careful when using the |
Yes, sir. That's what prompted me to write a chunked writer. It will reside in JNI for now. I can see about moving it to |
The forked branch for this PR did not have the GH Actions CI files included because it was outdated, so I merged the latest changes into this branch to start CI. |
1. Added setter to change Table instance in csv_writer_options. 2. Plumbing for new chunked writer. 3. Tests.
91cc650
to
7fa0204
Compare
Sorry I missed your update, @ajschmidt8. This new update might be out of date as well. I'll upmerge and push very shortly. |
Re: the CI test failures:
Accessing the CUDF repo on GitHub hit rate limiting. :] |
Rerun tests |
* | ||
* @param table Table to be written | ||
*/ | ||
void set_table(table_view const& table) { _table = table; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. As mentioned offline, we might want to look into separating the sink, input table and writer options to facilitate easier reuse of options. Not in scope for this PR.
Gently remind that the changed files need to be updated copyright year ;) |
:)) Good catch. |
Codecov ReportBase: 86.58% // Head: 85.69% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-23.02 #12425 +/- ##
================================================
- Coverage 86.58% 85.69% -0.89%
================================================
Files 155 155
Lines 24368 24797 +429
================================================
+ Hits 21098 21249 +151
- Misses 3270 3548 +278
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
|
Waiting on rapidsai/rapids-cmake#337 to re-kick the build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the changes look fine to me. The one issue that I have is that for the hive text file format the cudf CSV write is not going to be compatible. By default hive does not do any special escapes for string values. CUDF does escaping.
cudf/cpp/src/io/csv/writer_impl.cu
Line 75 in 8d1e2cc
struct escape_strings_fn { |
Should we spend some time to make this optional? Or am I missing what this is for?
Thank you for catching that. I will try tackle that in a follow-up issue, specifically to make the escaping optional. |
1. Removed TODO. 2. Changed default null value to empty string.
/merge |
I've merged this. Thank you to all who reviewed. |
Description
This change adds JNI bindings to write tables out as CSV, to either the filesystem or memory.
The Java
Table
class now has additional methods:Table.writeCSVToFile()
: Writes the current table out to the specified file on the filesystem.Table.writeCSVToBuffer()
: Writes the current table out to aHostBufferConsumer
.These calls are analogous to
cudf::io::write_csv()
.Current limitations:
TheThis has now been addressed. Chunked CSV writing is now permitted (albeit only the host memory buffers).cudf::io::csv_writer_options
interface binds the CSV options tightly to theTable
being written. This makes it a little clumsy to write multipleTable
s to the sameHostBufferConsumer
, because each could be written with different, contradictory options.cudf::io::write_csv(file_name)
overwrites the specified file, if it exists. There currently isn't a way to keep a file open, and write multiple tables to it; each write call overwrites the previous file.Checklist