Support `newlines_in_values` CSV option #11533

connec · 2024-07-18T14:22:40Z

Which issue does this PR close?

Rationale for this change

This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can only be controlled by the session-level datafusion.optimizer.repartition_file_scans and datafusion.optimizer.repartition_file_min_size settings.

What changes are included in this PR?

This commit introduces a newlines_in_values option to CsvOptions and plumbs it through to CsvExec, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis.

I've added newlines_in_values using similar conventions to has_header, with CsvOptions using an Option<bool> and a default value coming from datafusion::common::config::CatalogOptions.

For now, in the interests of being surgical, I've just added a new argument to CsvExec::new, which is now triggering clippy::too_many_arguments. Before going any further I wanted to see if this was overall a good approach, but I'm happy to refactor this into an options struct or similar.

Are these changes tested?

Yes – a new test has been added alongside the existing tests for file scan repartioning in datafusion/core/src/datasource/file_format/csv.rs.

Are there any user-facing changes?

Breaking: Add public datafusion::common::config::CatalogOptions::newlines_in_values: bool field, default: false.
Breaking: Add public datafusion::common::config::CsvOptions::newlines_in_values: Option<bool> field, default: None.
Breaking: Add public datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values: bool field, default: false.
Breaking: Add newlines_in_values: bool argument to datafusion::datasource::physical_plan::CsvExec::new.
Add public datafusion::common::config::CsvOptions::with_newlines_in_values method.
Add public datafusion::datasource::file_format::csv::CsvFormat::with_newlines_in_values method.
Add public datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values method.
Add public datafusion::datasource::physical_plan::CsvExec::newlines_in_values method.
Add newlines_in_values to relevant proto files.

This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change.

alamb

Thank you @connec -- this PR looks very nice to me (and a very nice first contribution🏅)

I think the only thing missing is an end to end test showing that with this change DataFusion can actually read CSV files with newlines in them

Here are the instructions for such end to end tets https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

Maybe you could add an example data file in https://github.com/apache/datafusion/tree/main/datafusion/core/tests/data

And the write a test that defined the external table with the appropriate options and read from it

Ideally you should be able to follow the model of one of the existing test files, perhaps

datafusion/datafusion/sqllogictest/test_files/csv_files.slt

Lines 18 to 27 in 382bf4f

    
           # create_external_table_with_quote_escape 
        
           statement ok 
        
           CREATE EXTERNAL TABLE csv_with_quote ( 
        
           c1 VARCHAR, 
        
           c2 VARCHAR 
        
           ) STORED AS CSV 
        
           LOCATION '../core/tests/data/quote.csv' 
        
           OPTIONS ('format.quote' '~', 
        
                    'format.delimiter' ',', 
        
                    'format.has_header' 'true');

cc @2010YOUY01

datafusion/common/src/config.rs

datafusion/core/src/datasource/file_format/csv.rs

2010YOUY01

Thank you for the new feature support! It is a great UX improvement given 'newlines in quotes' is defined in CSV standard

I also find this PR's design is good and very easy to read, I think it's good to go with extra integration tests for CSVs with newlines_in_values fields as @alamb mentioned.

datafusion/common/src/config.rs

2010YOUY01 · 2024-07-19T04:47:51Z

I'm trying to provide more context for the rationale for this PR's implementation:

Currently, DataFusion's parallel CSV scan will first split the file range evenly, and adjust the range border by probing \n forward.
It's super hard to support newliens_in_values in parallel scan in this way: executor have to probe backwards to determine if the current field is qutoed, and it don't know when to stop.

So this PR's approach (explicitly adding one CsvOption) is reasonable, maybe in the future we can parse the whole file during planning to find out the correct split points, if it can help performance in some use cases.

connec · 2024-07-19T09:31:52Z

Thanks both for the rapid feedback. I'll take a look and update the PR.

connec · 2024-07-19T11:41:54Z

I've made some documentation updates and added/fixed sqllogictests.

There's the new clippy issue around having too many arguments for CsvExec::new, which I can either fix with a new Options struct of some kind, or suppress and fix in a follow-up.

alamb · 2024-07-19T21:28:34Z

There's the new clippy issue around having too many arguments for CsvExec::new, which I can either fix with a new Options struct of some kind, or suppress and fix in a follow-up.

I think we should suppress the clippy error for now in this PR and we can file a ticket to improve things as a follow on PR

I would personally suggest a builder pattern like

let csv_exec = CsvExec::builder()
  .with_has_header(true)
  .with_newlines_in_values(true)
  ...
  .build()?

Following the model of https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExecBuilder.html

alamb

Thanks @connec -- this looks good to me 👍

I think once we have the CI passing this PR is good to go from my perspective.

alamb · 2024-07-19T21:29:40Z

datafusion/core/tests/data/newlines_in_values.csv

@@ -0,0 +1,13 @@
+id,message
+1,"hello


connec · 2024-07-19T22:07:42Z

I think that should be the fixes needed for CI.

alamb · 2024-07-20T09:25:45Z

I think we should suppress the clippy error for now in this PR and we can file a ticket to improve things as a follow on PR

I filed #11565 to track improving the clippy lint

alamb

Thanks again @connec and @2010YOUY01 👌

alamb · 2024-07-20T10:33:54Z

🤔 the slt test seems to be failing on windows (likely due to newline nonsense).

https://github.com/apache/datafusion/actions/runs/10014892577/job/27695257736

We should probably figure out a way to not run that test on windows. Gah!

Is this something you have time to look at @connec ? If not, I think we could comment out the slt test to get this PR in and then sort out the testing as a follow on PR.

Note to reduce the feedback loop cycle time, I merged this branch to main and pushed the commit (which I think means the CI should automatically run now on new commits). Once we have merged this PR I think your subsequent PRs will also have the CI run automatically.

connec · 2024-07-20T10:51:53Z

I'd guess it might be related to there being a \r on windows? I have some time to look so I'll see if I can resolve it.

Edit: ran out of time today, but I can continue tomorrow.

This is a bit of a stab in the dark, but it might fix multiline tests on Windows.

connec · 2024-07-20T17:24:57Z

Unfortunately it seems like the workflows still need approval @alamb.

~~I'm not entirely sure if .gitattributes change will fix the slts (I don't have a Windows system to test on) but 🤞~~ I've got ahold of a windows system and this hasn't worked. I'll do some more debugging when I can.

The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.

connec · 2024-07-20T19:24:25Z

The issue is caused by the line endings in the newlines_in_values.csv file being converted to CRLF on windows. While either line ending is accepted as a CSV record terminator by default, newlines in values are not normalised at all. This means that the value returned from the query on Windows includes CRLF rather than LF. Since newlines in the .slt file itself are normalised, this then leads to the diff.

For now, I've added a .gitattributes entry to checkout the newlines_in_values.csv file with LF line endings always, which resolves the problem on my windows machine at least.

alamb · 2024-07-21T23:42:00Z

🚀
Thanks again @connec

* feat!: support `newlines_in_values` CSV option This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change. * docs: normalise `newlines_in_values` documentation * test: add/fix sqllogictests for `newlines_in_values` * docs: document `datafusion.catalog.newlines_in_values` * fix: typo in config.md * chore: suppress lint on too many arguments for `CsvExec::new` * fix: always checkout `*.slt` with LF line endings This is a bit of a stab in the dark, but it might fix multiline tests on Windows. * fix: always checkout `newlines_in_values.csv` with `LF` line endings The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

github-actions bot added the core Core DataFusion crate label Jul 18, 2024

alamb reviewed Jul 18, 2024

View reviewed changes

datafusion/common/src/config.rs Outdated Show resolved Hide resolved

datafusion/core/src/datasource/file_format/csv.rs Outdated Show resolved Hide resolved

2010YOUY01 reviewed Jul 19, 2024

View reviewed changes

datafusion/common/src/config.rs Outdated Show resolved Hide resolved

connec added 2 commits July 19, 2024 12:30

docs: normalise newlines_in_values documentation

e05ca0e

test: add/fix sqllogictests for newlines_in_values

9ca9065

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 19, 2024

docs: document datafusion.catalog.newlines_in_values

34dcdb0

github-actions bot added the documentation Improvements or additions to documentation label Jul 19, 2024

alamb reviewed Jul 19, 2024

View reviewed changes

datafusion/core/tests/data/newlines_in_values.csv

@@ -0,0 +1,13 @@

id,message

1,"hello

Copy link

Contributor

alamb Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

connec added 2 commits July 19, 2024 22:55

fix: typo in config.md

8c2d98d

chore: suppress lint on too many arguments for CsvExec::new

ed0075d

alamb mentioned this pull request Jul 20, 2024

Fix clippy lint for the number of arguments to CsvExec::new() #11565

Closed

alamb approved these changes Jul 20, 2024

View reviewed changes

Merge remote-tracking branch 'apache/main' into csv-newlines-in-values

356f46b

connec added 2 commits July 20, 2024 18:21

fix: always checkout *.slt with LF line endings

b9cc96b

This is a bit of a stab in the dark, but it might fix multiline tests on Windows.

Merge remote-tracking branch 'origin/main' into csv-newlines-in-values

4d06432

alamb merged commit 2587df0 into apache:main Jul 21, 2024

connec deleted the csv-newlines-in-values branch July 22, 2024 07:23

connec mentioned this pull request Jul 24, 2024

Add CsvExecBuilder for creating CsvExec #11633

Merged

	# create_external_table_with_quote_escape
	statement ok
	CREATE EXTERNAL TABLE csv_with_quote (
	c1 VARCHAR,
	c2 VARCHAR
	) STORED AS CSV
	LOCATION '../core/tests/data/quote.csv'
	OPTIONS ('format.quote' '~',
	'format.delimiter' ',',
	'format.has_header' 'true');

Support newlines_in_values CSV option #11533

Support newlines_in_values CSV option #11533

Uh oh!

Conversation

connec commented Jul 18, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

alamb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

2010YOUY01 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

2010YOUY01 commented Jul 19, 2024

Uh oh!

connec commented Jul 19, 2024

Uh oh!

connec commented Jul 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Jul 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Jul 19, 2024

Choose a reason for hiding this comment

Uh oh!

connec commented Jul 19, 2024

Uh oh!

alamb commented Jul 20, 2024

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb commented Jul 20, 2024

Uh oh!

connec commented Jul 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

connec commented Jul 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

connec commented Jul 20, 2024

Uh oh!

alamb commented Jul 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Support `newlines_in_values` CSV option #11533

Support `newlines_in_values` CSV option #11533

alamb left a comment •

edited

Loading

connec commented Jul 19, 2024 •

edited

Loading

alamb commented Jul 19, 2024 •

edited

Loading

connec commented Jul 20, 2024 •

edited

Loading

connec commented Jul 20, 2024 •

edited

Loading