Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JNI bindings to write CSV #12425

Merged
merged 27 commits into from
Jan 5, 2023
Merged

Conversation

mythrocks
Copy link
Contributor

@mythrocks mythrocks commented Dec 20, 2022

Description

This change adds JNI bindings to write tables out as CSV, to either the filesystem or memory.

The Java Table class now has additional methods:

  1. Table.writeCSVToFile(): Writes the current table out to the specified file on the filesystem.
  2. Table.writeCSVToBuffer(): Writes the current table out to a HostBufferConsumer.

These calls are analogous to cudf::io::write_csv().

Current limitations:

  1. The cudf::io::csv_writer_options interface binds the CSV options tightly to the Table being written. This makes it a little clumsy to write multiple Tables to the same HostBufferConsumer, because each could be written with different, contradictory options. This has now been addressed. Chunked CSV writing is now permitted (albeit only the host memory buffers).
  2. cudf::io::write_csv(file_name) overwrites the specified file, if it exists. There currently isn't a way to keep a file open, and write multiple tables to it; each write call overwrites the previous file.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mythrocks mythrocks added feature request New feature or request Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Dec 20, 2022
@mythrocks mythrocks self-assigned this Dec 20, 2022
@mythrocks mythrocks marked this pull request as draft December 20, 2022 00:32
@mythrocks mythrocks requested a review from a team as a code owner December 20, 2022 00:32
@github-actions github-actions bot added the Java Affects Java cuDF API. label Dec 20, 2022
This change adds JNI bindings to write tables out as CSV, to either the filesystem or memory.

The Java Table class now has additional methods:

1. Table.writeCSVToFile(): Writes the current table out to the specified file on the filesystem.
2. Table.writeCSVToBuffer(): Writes the current table out to a HostBufferConsumer.
These calls are analogous to cudf::io::write_csv().

Current limitations:

1. The cudf::io::csv_writer_options interface binds the CSV options tightly to the Table being written. This makes it a little clumsy to write multiple Tables to the same HostBufferConsumer, because each could be written with different, contradictory options.
2. cudf::io::write_csv(file_name) overwrites the specified file, if it exists. There currently isn't a way to keep a file open, and write multiple tables to it; each write call overwrites the previous file.
@mythrocks
Copy link
Contributor Author

This makes it a little clumsy to write multiple Tables to the same HostBufferConsumer

I'm working on a workaround for this, now. Should have it shortly.

@revans2
Copy link
Contributor

revans2 commented Dec 20, 2022

This makes it a little clumsy to write multiple Tables to the same HostBufferConsumer

I'm working on a workaround for this, now. Should have it shortly.

Be very careful when using the HostBufferConsumer because we don't want to write out the header multiple times. Also the others have checks in the C++ code when doing chunked writes that the column types/etc match between multiple invocations. IF we are going to create our own chunked writer API, we should add in our own corresponding checks too.

@mythrocks
Copy link
Contributor Author

This makes it a little clumsy to write multiple Tables to the same HostBufferConsumer

I'm working on a workaround for this, now. Should have it shortly.

Be very careful when using the HostBufferConsumer because we don't want to write out the header multiple times. Also the others have checks in the C++ code when doing chunked writes that the column types/etc match between multiple invocations. IF we are going to create our own chunked writer API, we should add in our own corresponding checks too.

Yes, sir. That's what prompted me to write a chunked writer. It will reside in JNI for now. I can see about moving it to cpp/ afterwards, if there's interest.

@ajschmidt8
Copy link
Member

ajschmidt8 commented Dec 20, 2022

The forked branch for this PR did not have the GH Actions CI files included because it was outdated, so I merged the latest changes into this branch to start CI.

1. Added setter to change Table instance in csv_writer_options.
2. Plumbing for new chunked writer.
3. Tests.
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 21, 2022
@mythrocks
Copy link
Contributor Author

Sorry I missed your update, @ajschmidt8.

This new update might be out of date as well. I'll upmerge and push very shortly.

@mythrocks mythrocks marked this pull request as ready for review December 28, 2022 18:36
@mythrocks mythrocks requested a review from a team as a code owner December 28, 2022 18:36
@mythrocks mythrocks mentioned this pull request Dec 31, 2022
3 tasks
@mythrocks
Copy link
Contributor Author

Re: the CI test failures:

/opt/conda/bin/git -c protocol.version=2 fetch --prune --progress --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
  Error: fatal: unable to access 'https://github.com/rapidsai/cudf/': The requested URL returned error: 429
  The process '/opt/conda/bin/git' failed with exit code 128

Accessing the CUDF repo on GitHub hit rate limiting. :]

@mythrocks
Copy link
Contributor Author

Rerun tests

@mythrocks mythrocks requested a review from ttnghia January 3, 2023 18:03
*
* @param table Table to be written
*/
void set_table(table_view const& table) { _table = table; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. As mentioned offline, we might want to look into separating the sink, input table and writer options to facilitate easier reuse of options. Not in scope for this PR.

@ttnghia
Copy link
Contributor

ttnghia commented Jan 3, 2023

Gently remind that the changed files need to be updated copyright year ;)

@mythrocks
Copy link
Contributor Author

Gently remind that the changed files need to be updated copyright year ;)

:)) Good catch.
I thought I was safe because the commits are from last year, but we're doing squashed commits. I'll update.

@mythrocks mythrocks requested a review from vuule January 3, 2023 22:17
@codecov
Copy link

codecov bot commented Jan 3, 2023

Codecov Report

Base: 86.58% // Head: 85.69% // Decreases project coverage by -0.88% ⚠️

Coverage data is based on head (8d9b374) compared to base (b6dccb3).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-23.02   #12425      +/-   ##
================================================
- Coverage         86.58%   85.69%   -0.89%     
================================================
  Files               155      155              
  Lines             24368    24797     +429     
================================================
+ Hits              21098    21249     +151     
- Misses             3270     3548     +278     
Impacted Files Coverage Δ
python/cudf/cudf/_version.py 1.41% <0.00%> (-98.59%) ⬇️
python/cudf/cudf/core/buffer/spill_manager.py 72.50% <0.00%> (-7.50%) ⬇️
python/cudf/cudf/core/buffer/spillable_buffer.py 90.04% <0.00%> (-2.81%) ⬇️
python/cudf/cudf/utils/dtypes.py 77.77% <0.00%> (-1.69%) ⬇️
python/cudf/cudf/options.py 86.11% <0.00%> (-1.59%) ⬇️
python/cudf/cudf/core/single_column_frame.py 94.30% <0.00%> (-1.27%) ⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py 98.38% <0.00%> (-1.01%) ⬇️
python/dask_cudf/dask_cudf/io/parquet.py 91.81% <0.00%> (-0.59%) ⬇️
python/cudf/cudf/core/multiindex.py 91.66% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/algorithms.py 90.00% <0.00%> (-0.48%) ⬇️
... and 37 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mythrocks
Copy link
Contributor Author

mythrocks commented Jan 4, 2023

Conflicting files
java/src/main/native/src/TableJni.cpp

I'm on it now. Resolved now.

@mythrocks
Copy link
Contributor Author

Waiting on rapidsai/rapids-cmake#337 to re-kick the build.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the changes look fine to me. The one issue that I have is that for the hive text file format the cudf CSV write is not going to be compatible. By default hive does not do any special escapes for string values. CUDF does escaping.

struct escape_strings_fn {

Should we spend some time to make this optional? Or am I missing what this is for?

java/src/main/java/ai/rapids/cudf/CSVWriterOptions.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/CSVWriterOptions.java Outdated Show resolved Hide resolved
java/src/main/native/src/csv_chunked_writer.hpp Outdated Show resolved Hide resolved
@mythrocks
Copy link
Contributor Author

CUDF does escaping.

Thank you for catching that. I will try tackle that in a follow-up issue, specifically to make the escaping optional.

@mythrocks mythrocks requested a review from revans2 January 5, 2023 00:28
@mythrocks mythrocks added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Jan 5, 2023
@mythrocks
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 5d2cf8c into rapidsai:branch-23.02 Jan 5, 2023
@mythrocks
Copy link
Contributor Author

I've merged this. Thank you to all who reviewed.
A couple of follow-up PRs will, well, follow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants