fix: parallel parquet can underflow when max_record_batch_rows < execution.batch_size #9737
Conversation
Thank you @devinjdangelo -- this code looks good to me. I think there are some small issues with the test to fix, but otherwise I think this is good to go.
🙏
I ran the test without the changes in this PR and it fails like:

```text
     Running unittests src/lib.rs (target/debug/deps/datafusion-4cbfc61ad6017be4)

thread 'dataframe::parquet::tests::write_parquet_with_small_rg_size' panicked at datafusion/core/src/datasource/file_format/parquet.rs:885:33:
attempt to subtract with overflow
stack backtrace:
```
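The panic is Rust's debug-build overflow check on unsigned subtraction. A minimal standalone illustration (my own example; the variable names just mirror the config values from the PR title, not the actual code at parquet.rs:885):

```rust
fn main() {
    // Stand-in values: the target row group size is smaller than the
    // execution batch size, as in the bug report.
    let max_record_batch_rows: usize = 5;
    let batch_size: usize = 10;

    // usize cannot represent a negative result, so in a debug build this
    // panics with "attempt to subtract with overflow".
    let _remaining = max_record_batch_rows - batch_size;
}
```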
```rust
        }
        let output_path = "file://local/test.parquet";

        for rg_size in (1..7).step_by(5) {
```
My reading of the docs and my playground experiments suggest this is the same as `[1, 6]` -- is that the intent? Or did you mean 1, 5, 10, 15, 20, 25, 30, 35?
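A quick playground check (my snippet, not part of the PR) shows what the iterator yields:

```rust
fn main() {
    // `(1..7).step_by(5)` starts at 1 and advances by 5, stopping before 7,
    // so it produces exactly two values.
    let sizes: Vec<usize> = (1..7).step_by(5).collect();
    assert_eq!(sizes, vec![1, 6]);
}
```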
Yes, `[1, 6]` is all I meant, and that would be clearer... I originally wanted to loop over more rg sizes, but the test was slow. If we streamline the test, we could actually range over more values here.
This now loops over `0..10` with `datafusion.execution.batch_size` set to 10.
```rust
            .await?;

            // Check that file actually used the correct rg size
            let file = std::fs::File::open(tmp_dir.into_path().join("test.parquet"))?;
```
Calling `into_path()` here I think means the file won't be cleaned up. I think calling `path()` would ensure the file is cleaned up.
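For context, a minimal sketch of the difference, assuming `tmp_dir` is a `TempDir` from the `tempfile` crate:

```rust
use tempfile::TempDir;

fn main() -> std::io::Result<()> {
    let tmp_dir = TempDir::new()?;

    // `path()` borrows: `tmp_dir` keeps ownership, and the directory
    // (plus anything written into it) is removed when `tmp_dir` drops.
    let cleaned = tmp_dir.path().join("test.parquet");
    println!("{}", cleaned.display());

    // `into_path()` consumes the TempDir and disables the drop-time
    // cleanup, so the directory persists on disk after the test:
    // let persisted = tmp_dir.into_path().join("test.parquet");

    Ok(())
}
```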
I fixed this in the new test and a preexisting one.
```rust
        let mut test_df = test_util::test_table().await?;
        // make the test data larger so there are multiple batches
        for _ in 0..7 {
            test_df = test_df.clone().union(test_df)?;
```
When I ran this test it takes more than 22 seconds on my laptop. I wonder if we really need to generate so much data -- maybe we can try slicing up the batch (or else maybe use larger rg_sizes):

```shell
$ cargo test --lib -p datafusion -- write_parquet_with_small_rg_size
...
    Finished test [unoptimized + debuginfo] target(s) in 0.16s
     Running unittests src/lib.rs (target/debug/deps/datafusion-4cbfc61ad6017be4)

running 1 test
test dataframe::parquet::tests::write_parquet_with_small_rg_size ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 651 filtered out; finished in 22.31s
```
We should be able to trigger the issue with less data by lowering `execution.batch_size` to something small like 10 rows.
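One way to do that, sketched below with `SessionConfig::with_batch_size` (the test may wire it up differently):

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Shrink the execution batch size so a small table still produces
    // multiple record batches and exercises the splitting code path.
    let config = SessionConfig::new().with_batch_size(10);
    let _ctx = SessionContext::new_with_config(config);
}
```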
The test with a batch size of 10 still panics on main and passes with this PR, and it now runs in 0.41 seconds.
Thanks @devinjdangelo -- I also verified the test now runs quickly and still panics without the fix.
Thank you so much
```diff
@@ -150,7 +152,7 @@ mod tests {
         .await?;

         // Check that file actually used the specified compression
-        let file = std::fs::File::open(tmp_dir.into_path().join("test.parquet"))?;
+        let file = std::fs::File::open(tmp_dir.path().join("test.parquet"))?;
```
👍
Thank you for the drive-by cleanup
Which issue does this PR close?
Closes #9736
Rationale for this change
See the issue.
What changes are included in this PR?
The parallel parquet writer can now handle the case where max_record_batch_rows < execution.batch_size by iteratively splitting the record batch rather than assuming it only needs to be split once.
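A minimal sketch of the iterative-splitting idea (my illustration, not the PR's actual code; `split_batch` and `max_rows` are hypothetical names):

```rust
use datafusion::arrow::record_batch::RecordBatch;

/// Split `batch` into chunks of at most `max_rows` rows. Looping until the
/// offset reaches the end handles batches more than twice `max_rows` long,
/// which is where a single fixed split can drive a subtraction below zero.
fn split_batch(batch: &RecordBatch, max_rows: usize) -> Vec<RecordBatch> {
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < batch.num_rows() {
        let len = max_rows.min(batch.num_rows() - offset);
        // `slice` is zero-copy: the chunk shares the underlying buffers.
        chunks.push(batch.slice(offset, len));
        offset += len;
    }
    chunks
}
```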
Are these changes tested?
Yes, a new test was added that panics prior to this PR.
Are there any user-facing changes?
No, just the bug fix.