Support newlines_in_values CSV option (apache#11533)
* feat!: support `newlines_in_values` CSV option
This significantly simplifies the UX when dealing with large CSV files
that must support newlines in (quoted) values. By default, large CSV
files are repartitioned into multiple parallel range scans. This is
great for performance in the common case, but when a large CSV contains
newlines in values, the parallel scan fails because it splits on
newlines within quotes rather than on actual line terminators.
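The failure mode can be reproduced outside DataFusion. The following is a minimal Python sketch, not DataFusion code: the sample data and variable names are made up for illustration. It compares splitting on every newline (roughly what a range scan assumes when it looks for record boundaries) with a quote-aware CSV parser.

```python
import csv
import io

# Illustrative data: the second record has a newline inside a quoted value.
data = 'id,note\n1,"first line\nsecond line"\n2,plain\n'

# Naive approach: treat every newline as a record boundary.
naive_lines = data.strip("\n").split("\n")

# Quote-aware approach: the parser keeps the quoted newline inside the value.
parsed_rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_lines))   # 4 "records", one of them cut in half
print(len(parsed_rows))   # 3 records: header plus two rows
print(repr(parsed_rows[1][1]))
```

A range scan that starts at the boundary between `first line` and `second line"` would begin in the middle of a record, which is why such files cannot safely be split by byte range.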
With the current implementation, this behaviour can be controlled by the
session-level `datafusion.optimizer.repartition_file_scans` and
`datafusion.optimizer.repartition_file_min_size` settings.
This commit introduces a `newlines_in_values` option to `CsvOptions` and
plumbs it through to `CsvExec`, which includes it in the test for whether
parallel execution is supported. This provides a convenient and
searchable way to disable file scan repartitioning on a per-CSV basis.
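As a rough mental model of how the three settings interact, here is a Python sketch. It is an assumption-laden simplification, not the actual `CsvExec` logic: the `ScanSettings` class, the `can_scan_in_parallel` function, and the default values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScanSettings:
    # Hypothetical stand-ins for the session-level settings named above;
    # the default values here are illustrative only.
    repartition_file_scans: bool = True
    repartition_file_min_size: int = 10 * 1024 * 1024
    # Hypothetical stand-in for the new per-CSV option.
    newlines_in_values: bool = False


def can_scan_in_parallel(file_size_bytes: int, settings: ScanSettings) -> bool:
    """Return True if a CSV file may be split into parallel range scans."""
    if settings.newlines_in_values:
        # Quoted newlines make byte-range splitting unsafe for this file.
        return False
    if not settings.repartition_file_scans:
        return False
    return file_size_bytes >= settings.repartition_file_min_size


big_file = 500 * 1024 * 1024
print(can_scan_in_parallel(big_file, ScanSettings()))
print(can_scan_in_parallel(big_file, ScanSettings(newlines_in_values=True)))
```

The point of the per-CSV flag in this model is that one file can opt out of repartitioning while other scans in the same session keep the session-level defaults.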
BREAKING CHANGE: This adds new public fields to types with all public
fields, which is a breaking change.
* docs: normalise `newlines_in_values` documentation
* test: add/fix sqllogictests for `newlines_in_values`
* docs: document `datafusion.catalog.newlines_in_values`
* fix: typo in config.md
* chore: suppress lint on too many arguments for `CsvExec::new`
* fix: always checkout `*.slt` with LF line endings
This is a bit of a stab in the dark, but it might fix multiline tests on
Windows.
* fix: always checkout `newlines_in_values.csv` with `LF` line endings
The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.
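The suspected mechanism can be illustrated with a short Python sketch. It uses Python's `csv` module rather than the DataFusion reader, and the sample data is made up, so it only shows that a CRLF inside a quoted value survives parsing as a different string from its LF counterpart.

```python
import csv
import io

# Illustrative file contents as checked out with LF line endings.
lf_text = 'id,note\n1,"a\nb"\n'
# The same contents after every LF has been converted to CRLF on checkout.
crlf_text = lf_text.replace("\n", "\r\n")


def read_rows(text):
    # newline="" stops the stream from translating line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))


lf_rows = read_rows(lf_text)
crlf_rows = read_rows(crlf_text)

print(repr(lf_rows[1][1]))    # 'a\nb'
print(repr(crlf_rows[1][1]))  # 'a\r\nb'
print(lf_rows == crlf_rows)   # False
```

The record terminators are absorbed by the parser in both cases, so the header rows match, but the quoted value differs by one `\r`, which is enough to produce a diff against expected output written with LF.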
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>