Skip to content

Conversation

@connec
Copy link
Contributor

@connec connec commented Jul 18, 2024

Which issue does this PR close?

Closes #11472.

Rationale for this change

This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can only be controlled by the session-level datafusion.optimizer.repartition_file_scans and datafusion.optimizer.repartition_file_min_size settings.

What changes are included in this PR?

This commit introduces a newlines_in_values option to CsvOptions and plumbs it through to CsvExec, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis.

I've added newlines_in_values using similar conventions to has_header, with CsvOptions using an Option<bool> and a default value coming from datafusion::common::config::CatalogOptions.

For now, in the interests of being surgical, I've just added a new argument to CsvExec::new, which is now triggering clippy::too_many_arguments. Before going any further I wanted to see if this was overall a good approach, but I'm happy to refactor this into an options struct or similar.

Are these changes tested?

Yes – a new test has been added alongside the existing tests for file scan repartioning in datafusion/core/src/datasource/file_format/csv.rs.

Are there any user-facing changes?

  • Breaking: Add public datafusion::common::config::CatalogOptions::newlines_in_values: bool field, default: false.
  • Breaking: Add public datafusion::common::config::CsvOptions::newlines_in_values: Option<bool> field, default: None.
  • Breaking: Add public datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values: bool field, default: false.
  • Breaking: Add newlines_in_values: bool argument to datafusion::datasource::physical_plan::CsvExec::new.
  • Add public datafusion::common::config::CsvOptions::with_newlines_in_values method.
  • Add public datafusion::datasource::file_format::csv::CsvFormat::with_newlines_in_values method.
  • Add public datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values method.
  • Add public datafusion::datasource::physical_plan::CsvExec::newlines_in_values method.
  • Add newlines_in_values to relevant proto files.

This significantly simplifies the UX when dealing with large CSV files
that must support newlines in (quoted) values. By default, large CSV
files will be repartitioned into multiple parallel range scans. This is
great for performance in the common case but when large CSVs contain
newlines in values the parallel scan will fail due to splitting on
newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can be controlled by the
session-level `datafusion.optimizer.repartition_file_scans` and
`datafusion.optimizer.repartition_file_min_size` settings.

This commit introduces a `newlines_in_values` option to `CsvOptions` and
plumbs it through to `CsvExec`, which includes it in the test for whether
parallel execution is supported. This provides a convenient and
searchable way to disable file scan repartitioning on a per-CSV basis.

BREAKING CHANGE: This adds new public fields to types with all public
fields, which is a breaking change.
@github-actions github-actions bot added the core Core DataFusion crate label Jul 18, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @connec -- this PR looks very nice to me (and a very nice first contribution🏅)

I think the only thing missing is an end to end test showing that with this change DataFusion can actually read CSV files with newlines in them

Here are the instructions for such end to end tets https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

Maybe you could add an example data file in https://github.com/apache/datafusion/tree/main/datafusion/core/tests/data

And the write a test that defined the external table with the appropriate options and read from it

Ideally you should be able to follow the model of one of the existing test files, perhaps

# create_external_table_with_quote_escape
statement ok
CREATE EXTERNAL TABLE csv_with_quote (
c1 VARCHAR,
c2 VARCHAR
) STORED AS CSV
LOCATION '../core/tests/data/quote.csv'
OPTIONS ('format.quote' '~',
'format.delimiter' ',',
'format.has_header' 'true');

cc @2010YOUY01

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the new feature support! It is a great UX improvement given 'newlines in quotes' is defined in CSV standard

I also find this PR's design is good and very easy to read, I think it's good to go with extra integration tests for CSVs with newlines_in_values fields as @alamb mentioned.

@2010YOUY01
Copy link
Contributor

I'm trying to provide more context for the rationale for this PR's implementation:

Currently, DataFusion's parallel CSV scan will first split the file range evenly, and adjust the range border by probing \n forward.
It's super hard to support newliens_in_values in parallel scan in this way: executor have to probe backwards to determine if the current field is qutoed, and it don't know when to stop.

So this PR's approach (explicitly adding one CsvOption) is reasonable, maybe in the future we can parse the whole file during planning to find out the correct split points, if it can help performance in some use cases.

@connec
Copy link
Contributor Author

connec commented Jul 19, 2024

Thanks both for the rapid feedback. I'll take a look and update the PR.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 19, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 19, 2024
@connec
Copy link
Contributor Author

connec commented Jul 19, 2024

I've made some documentation updates and added/fixed sqllogictests.

There's the new clippy issue around having too many arguments for CsvExec::new, which I can either fix with a new Options struct of some kind, or suppress and fix in a follow-up.

@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

There's the new clippy issue around having too many arguments for CsvExec::new, which I can either fix with a new Options struct of some kind, or suppress and fix in a follow-up.

I think we should suppress the clippy error for now in this PR and we can file a ticket to improve things as a follow on PR

I would personally suggest a builder pattern like

let csv_exec = CsvExec::builder()
  .with_has_header(true)
  .with_newlines_in_values(true)
  ...
  .build()?

Following the model of https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExecBuilder.html

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @connec -- this looks good to me 👍

I think once we have the CI passing this PR is good to go from my perspective.

@@ -0,0 +1,13 @@
id,message
1,"hello
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@connec
Copy link
Contributor Author

connec commented Jul 19, 2024

I think that should be the fixes needed for CI.

@alamb
Copy link
Contributor

alamb commented Jul 20, 2024

I think we should suppress the clippy error for now in this PR and we can file a ticket to improve things as a follow on PR

I filed #11565 to track improving the clippy lint

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @connec and @2010YOUY01 👌

@alamb
Copy link
Contributor

alamb commented Jul 20, 2024

🤔 the slt test seems to be failing on windows (likely due to newline nonsense).

https://github.com/apache/datafusion/actions/runs/10014892577/job/27695257736

We should probably figure out a way to not run that test on windows. Gah!

Is this something you have time to look at @connec ? If not, I think we could comment out the slt test to get this PR in and then sort out the testing as a follow on PR.

Note to reduce the feedback loop cycle time, I merged this branch to main and pushed the commit (which I think means the CI should automatically run now on new commits). Once we have merged this PR I think your subsequent PRs will also have the CI run automatically.

@connec
Copy link
Contributor Author

connec commented Jul 20, 2024

I'd guess it might be related to there being a \r on windows? I have some time to look so I'll see if I can resolve it.

Edit: ran out of time today, but I can continue tomorrow.

connec added 2 commits July 20, 2024 18:21
This is a bit of a stab in the dark, but it might fix multiline tests on
Windows.
@connec
Copy link
Contributor Author

connec commented Jul 20, 2024

Unfortunately it seems like the workflows still need approval @alamb.

I'm not entirely sure if .gitattributes change will fix the slts (I don't have a Windows system to test on) but 🤞 I've got ahold of a windows system and this hasn't worked. I'll do some more debugging when I can.

The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.
@connec
Copy link
Contributor Author

connec commented Jul 20, 2024

The issue is caused by the line endings in the newlines_in_values.csv file being converted to CRLF on windows. While either line ending is accepted as a CSV record terminator by default, newlines in values are not normalised at all. This means that the value returned from the query on Windows includes CRLF rather than LF. Since newlines in the .slt file itself are normalised, this then leads to the diff.

For now, I've added a .gitattributes entry to checkout the newlines_in_values.csv file with LF line endings always, which resolves the problem on my windows machine at least.

@alamb
Copy link
Contributor

alamb commented Jul 21, 2024

🚀
Thanks again @connec

@alamb alamb merged commit 2587df0 into apache:main Jul 21, 2024
@connec connec deleted the csv-newlines-in-values branch July 22, 2024 07:23
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
* feat!: support `newlines_in_values` CSV option

This significantly simplifies the UX when dealing with large CSV files
that must support newlines in (quoted) values. By default, large CSV
files will be repartitioned into multiple parallel range scans. This is
great for performance in the common case but when large CSVs contain
newlines in values the parallel scan will fail due to splitting on
newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can be controlled by the
session-level `datafusion.optimizer.repartition_file_scans` and
`datafusion.optimizer.repartition_file_min_size` settings.

This commit introduces a `newlines_in_values` option to `CsvOptions` and
plumbs it through to `CsvExec`, which includes it in the test for whether
parallel execution is supported. This provides a convenient and
searchable way to disable file scan repartitioning on a per-CSV basis.

BREAKING CHANGE: This adds new public fields to types with all public
fields, which is a breaking change.

* docs: normalise `newlines_in_values` documentation

* test: add/fix sqllogictests for `newlines_in_values`

* docs: document `datafusion.catalog.newlines_in_values`

* fix: typo in config.md

* chore: suppress lint on too many arguments for `CsvExec::new`

* fix: always checkout `*.slt` with LF line endings

This is a bit of a stab in the dark, but it might fix multiline tests on
Windows.

* fix: always checkout `newlines_in_values.csv` with `LF` line endings

The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
* feat!: support `newlines_in_values` CSV option

This significantly simplifies the UX when dealing with large CSV files
that must support newlines in (quoted) values. By default, large CSV
files will be repartitioned into multiple parallel range scans. This is
great for performance in the common case but when large CSVs contain
newlines in values the parallel scan will fail due to splitting on
newlines within quotes rather than actual line terminators.

With the current implementation, this behaviour can be controlled by the
session-level `datafusion.optimizer.repartition_file_scans` and
`datafusion.optimizer.repartition_file_min_size` settings.

This commit introduces a `newlines_in_values` option to `CsvOptions` and
plumbs it through to `CsvExec`, which includes it in the test for whether
parallel execution is supported. This provides a convenient and
searchable way to disable file scan repartitioning on a per-CSV basis.

BREAKING CHANGE: This adds new public fields to types with all public
fields, which is a breaking change.

* docs: normalise `newlines_in_values` documentation

* test: add/fix sqllogictests for `newlines_in_values`

* docs: document `datafusion.catalog.newlines_in_values`

* fix: typo in config.md

* chore: suppress lint on too many arguments for `CsvExec::new`

* fix: always checkout `*.slt` with LF line endings

This is a bit of a stab in the dark, but it might fix multiline tests on
Windows.

* fix: always checkout `newlines_in_values.csv` with `LF` line endings

The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for newlines_in_values to CsvOptions

3 participants