-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Support newlines_in_values CSV option
#11533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @connec -- this PR looks very nice to me (and a very nice first contribution🏅)
I think the only thing missing is an end to end test showing that with this change DataFusion can actually read CSV files with newlines in them
Here are the instructions for such end to end tets https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest
Maybe you could add an example data file in https://github.com/apache/datafusion/tree/main/datafusion/core/tests/data
And the write a test that defined the external table with the appropriate options and read from it
Ideally you should be able to follow the model of one of the existing test files, perhaps
datafusion/datafusion/sqllogictest/test_files/csv_files.slt
Lines 18 to 27 in 382bf4f
| # create_external_table_with_quote_escape | |
| statement ok | |
| CREATE EXTERNAL TABLE csv_with_quote ( | |
| c1 VARCHAR, | |
| c2 VARCHAR | |
| ) STORED AS CSV | |
| LOCATION '../core/tests/data/quote.csv' | |
| OPTIONS ('format.quote' '~', | |
| 'format.delimiter' ',', | |
| 'format.has_header' 'true'); |
cc @2010YOUY01
2010YOUY01
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the new feature support! It is a great UX improvement given 'newlines in quotes' is defined in CSV standard
I also find this PR's design is good and very easy to read, I think it's good to go with extra integration tests for CSVs with newlines_in_values fields as @alamb mentioned.
|
I'm trying to provide more context for the rationale for this PR's implementation: Currently, DataFusion's parallel CSV scan will first split the file range evenly, and adjust the range border by probing So this PR's approach (explicitly adding one |
|
Thanks both for the rapid feedback. I'll take a look and update the PR. |
|
I've made some documentation updates and added/fixed sqllogictests. There's the new clippy issue around having too many arguments for |
I think we should suppress the clippy error for now in this PR and we can file a ticket to improve things as a follow on PR I would personally suggest a builder pattern like let csv_exec = CsvExec::builder()
.with_has_header(true)
.with_newlines_in_values(true)
...
.build()?Following the model of https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExecBuilder.html |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @connec -- this looks good to me 👍
I think once we have the CI passing this PR is good to go from my perspective.
| @@ -0,0 +1,13 @@ | |||
| id,message | |||
| 1,"hello | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
I think that should be the fixes needed for CI. |
I filed #11565 to track improving the clippy lint |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @connec and @2010YOUY01 👌
|
🤔 the slt test seems to be failing on windows (likely due to newline nonsense). https://github.com/apache/datafusion/actions/runs/10014892577/job/27695257736 We should probably figure out a way to not run that test on windows. Gah! Is this something you have time to look at @connec ? If not, I think we could comment out the slt test to get this PR in and then sort out the testing as a follow on PR. Note to reduce the feedback loop cycle time, I merged this branch to main and pushed the commit (which I think means the CI should automatically run now on new commits). Once we have merged this PR I think your subsequent PRs will also have the CI run automatically. |
|
I'd guess it might be related to there being a Edit: ran out of time today, but I can continue tomorrow. |
This is a bit of a stab in the dark, but it might fix multiline tests on Windows.
|
Unfortunately it seems like the workflows still need approval @alamb.
|
The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.
|
The issue is caused by the line endings in the For now, I've added a |
|
🚀 |
* feat!: support `newlines_in_values` CSV option This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change. * docs: normalise `newlines_in_values` documentation * test: add/fix sqllogictests for `newlines_in_values` * docs: document `datafusion.catalog.newlines_in_values` * fix: typo in config.md * chore: suppress lint on too many arguments for `CsvExec::new` * fix: always checkout `*.slt` with LF line endings This is a bit of a stab in the dark, but it might fix multiline tests on Windows. * fix: always checkout `newlines_in_values.csv` with `LF` line endings The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat!: support `newlines_in_values` CSV option This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change. * docs: normalise `newlines_in_values` documentation * test: add/fix sqllogictests for `newlines_in_values` * docs: document `datafusion.catalog.newlines_in_values` * fix: typo in config.md * chore: suppress lint on too many arguments for `CsvExec::new` * fix: always checkout `*.slt` with LF line endings This is a bit of a stab in the dark, but it might fix multiline tests on Windows. * fix: always checkout `newlines_in_values.csv` with `LF` line endings The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #11472.
Rationale for this change
This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators.
With the current implementation, this behaviour can only be controlled by the session-level
datafusion.optimizer.repartition_file_scansanddatafusion.optimizer.repartition_file_min_sizesettings.What changes are included in this PR?
This commit introduces a
newlines_in_valuesoption toCsvOptionsand plumbs it through toCsvExec, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis.I've added
newlines_in_valuesusing similar conventions tohas_header, withCsvOptionsusing anOption<bool>and a default value coming fromdatafusion::common::config::CatalogOptions.For now, in the interests of being surgical, I've just added a new argument to
CsvExec::new, which is now triggeringclippy::too_many_arguments. Before going any further I wanted to see if this was overall a good approach, but I'm happy to refactor this into an options struct or similar.Are these changes tested?
Yes – a new test has been added alongside the existing tests for file scan repartioning in
datafusion/core/src/datasource/file_format/csv.rs.Are there any user-facing changes?
datafusion::common::config::CatalogOptions::newlines_in_values: boolfield, default:false.datafusion::common::config::CsvOptions::newlines_in_values: Option<bool>field, default:None.datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values: boolfield, default:false.newlines_in_values: boolargument todatafusion::datasource::physical_plan::CsvExec::new.datafusion::common::config::CsvOptions::with_newlines_in_valuesmethod.datafusion::datasource::file_format::csv::CsvFormat::with_newlines_in_valuesmethod.datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_valuesmethod.datafusion::datasource::physical_plan::CsvExec::newlines_in_valuesmethod.newlines_in_valuesto relevant proto files.