Make the BatchSerializer behind Arc to avoid unnecessary struct creation #8666

metesynnada · 2023-12-28T12:24:50Z

Which issue does this PR close?

Closes #.

Rationale for this change

Currently, the serializer is re-created for each RecordBatch, which degrades performance when batch sizes are small.

The duplicate() method is called here

https://github.com/apache/arrow-datafusion/blob/1737d49185e9e37c15aa432342604ee559a1069d/datafusion/core/src/datasource/file_format/write/orchestration.rs#L51-L82

where it is defined as follows (for CSV, it is used for the header; for JSON, it is just a deep clone):

https://github.com/apache/arrow-datafusion/blob/1737d49185e9e37c15aa432342604ee559a1069d/datafusion/core/src/datasource/file_format/csv.rs#L432-L450

This also makes the internal buffer useless, since the buffer is re-created for each batch in this setup.
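
A minimal sketch of the pattern described above, using only the standard library. `ToySerializer`, its `duplicate` method, and the 4096-byte capacity are hypothetical stand-ins chosen for illustration, not DataFusion's actual types; the sketch only shows that duplicating per batch builds a new struct and a new buffer every time, so the buffer's capacity is never reused.

```rust
/// Hypothetical stand-in for a serializer that owns a reusable buffer.
struct ToySerializer {
    buffer: Vec<u8>,
}

impl ToySerializer {
    fn new() -> Self {
        // The capacity is an arbitrary illustrative value.
        Self { buffer: Vec::with_capacity(4096) }
    }

    /// Mimics a per-batch `duplicate()`: a brand-new struct with a brand-new buffer.
    fn duplicate(&self) -> Self {
        Self::new()
    }

    /// Writes one toy "batch" (a list of rows) into the buffer and returns a copy of the bytes.
    fn serialize(&mut self, rows: &[&str]) -> Vec<u8> {
        self.buffer.clear();
        for row in rows {
            self.buffer.extend_from_slice(row.as_bytes());
            self.buffer.push(b'\n');
        }
        self.buffer.clone()
    }
}

/// Serializes every batch through a fresh duplicate.
/// Returns the concatenated output and the number of serializers constructed.
fn serialize_with_duplicates(batches: &[Vec<&str>]) -> (Vec<u8>, usize) {
    let template = ToySerializer::new();
    let mut created = 1;
    let mut out = Vec::new();
    for batch in batches {
        // One new serializer (and one new buffer) per batch.
        let mut serializer = template.duplicate();
        created += 1;
        out.extend(serializer.serialize(batch));
    }
    (out, created)
}

fn main() {
    let batches = vec![vec!["a,1"], vec!["b,2"], vec!["c,3"]];
    let (out, created) = serialize_with_duplicates(&batches);
    println!("{} bytes written, {} serializers created", out.len(), created);
}
```

With many small batches, the number of serializers constructed grows linearly with the number of batches, which is the overhead this PR targets.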

What changes are included in this PR?

Renamed the BatchSerializer trait (to SerializationSchema).
Made the trait methods take immutable references.
Made the type SerializerType = Arc<dyn SerializationSchema>.
Set the CSV header option to false for every batch after the first.
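
The changes listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `ToyBatch` stands in for RecordBatch, the output is a plain `Vec<u8>` so the sketch needs no external crates, and the `initial` flag is one assumed way to tell the serializer whether it is handling the first batch.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a RecordBatch: rows of already-stringified fields.
type ToyBatch = Vec<Vec<String>>;

/// Sketch of a serializer trait whose method takes `&self`,
/// so a single instance can be shared behind an `Arc`.
trait BatchSerializer: Sync + Send {
    /// `initial` (an assumption of this sketch) is true only for the first batch.
    fn serialize(&self, batch: &ToyBatch, initial: bool) -> Vec<u8>;
}

struct CsvSerializer {
    header: Vec<String>,
    write_header: bool,
}

impl BatchSerializer for CsvSerializer {
    fn serialize(&self, batch: &ToyBatch, initial: bool) -> Vec<u8> {
        let mut out = String::new();
        // The header is emitted at most once, for the first batch.
        if self.write_header && initial {
            out.push_str(&self.header.join(","));
            out.push('\n');
        }
        for row in batch {
            out.push_str(&row.join(","));
            out.push('\n');
        }
        out.into_bytes()
    }
}

/// Serializes all batches with one shared serializer; no per-batch duplication.
fn write_all(serializer: Arc<dyn BatchSerializer>, batches: &[ToyBatch]) -> Vec<u8> {
    let mut out = Vec::new();
    for (i, batch) in batches.iter().enumerate() {
        out.extend(serializer.serialize(batch, i == 0));
    }
    out
}

fn main() {
    let serializer: Arc<dyn BatchSerializer> = Arc::new(CsvSerializer {
        header: vec!["key".to_string(), "value".to_string()],
        write_header: true,
    });
    let batches = vec![
        vec![vec!["a".to_string(), "1".to_string()]],
        vec![vec!["b".to_string(), "2".to_string()]],
    ];
    let bytes = write_all(serializer, &batches);
    print!("{}", String::from_utf8_lossy(&bytes));
}
```

Because `serialize` takes `&self`, the serializer holds no per-batch mutable state, which is what makes sharing it behind an `Arc` possible.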

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No.

ozankabak · 2023-12-28T14:15:13Z

@tustvold, I would appreciate it if you could take a look.

ozankabak

LGTM but having one more reviewer would be good to make sure we are not losing any performance due to allocation-related reasons

alamb

Thank you @metesynnada -- I think this PR looks like a good step forward to me. The only thing I think should be answered is why the trait is renamed. I also left some other suggestions but I don't think any are necessary.

@devinjdangelo I think you wrote a non-trivial chunk of this code -- do you have any thoughts on this PR?

datafusion/core/src/datasource/file_format/write/mod.rs

alamb · 2023-12-28T19:47:18Z

@@ -149,15 +145,14 @@ impl<W: AsyncWrite + Unpin + Send> AsyncWrite for AbortableWrite<W> {

 /// A trait that defines the methods required for a RecordBatch serializer.
 #[async_trait]
-pub trait BatchSerializer: Unpin + Send {
+pub trait SerializationSchema: Sync + Send {


What is the rationale for renaming this trait? It doesn't seem directly related to a Schema 🤔 I think the original name BatchSerializer better matches what the trait does
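
Separately from the naming question, the hunk above also changes the bound from `Unpin + Send` to `Sync + Send`. A likely reason, offered here as an assumption rather than something the PR states, is that `Arc<T>` is only `Send` when `T` is both `Send` and `Sync`, so a serializer shared across tasks behind an `Arc` needs both. The sketch below uses hypothetical names (`Serializer`, `Upper`) and plain threads instead of async tasks:

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical trait; the `Sync + Send` supertraits make `Arc<dyn Serializer>` sendable.
trait Serializer: Sync + Send {
    fn serialize(&self, row: &str) -> String;
}

/// Toy implementation that upper-cases its input.
struct Upper;

impl Serializer for Upper {
    fn serialize(&self, row: &str) -> String {
        row.to_uppercase()
    }
}

/// Serializes each row on its own thread, sharing one serializer instance.
fn serialize_in_parallel(serializer: Arc<dyn Serializer>, rows: Vec<String>) -> Vec<String> {
    let handles: Vec<_> = rows
        .into_iter()
        .map(|row| {
            // Cloning the Arc only bumps a reference count.
            let shared = Arc::clone(&serializer);
            // This compiles only because `dyn Serializer` is `Send + Sync`.
            thread::spawn(move || shared.serialize(&row))
        })
        .collect();
    // Joining in spawn order preserves the input order.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let rows = vec!["a,1".to_string(), "b,2".to_string()];
    let out = serialize_in_parallel(Arc::new(Upper), rows);
    println!("{:?}", out);
}
```

Dropping `Sync` from the trait bound in this sketch makes the `thread::spawn` call fail to compile, since the closure would no longer be `Send`.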

datafusion/core/src/datasource/file_format/write/orchestration.rs

alamb · 2023-12-28T19:51:35Z

@@ -171,9 +164,9 @@ pub(crate) async fn stateless_serialize_and_write_files(
                // this thread, so we cannot clean it up (hence any_abort_errors is true)
                any_errors = true;
                any_abort_errors = true;
-                triggering_error = Some(DataFusionError::Internal(format!(
+                triggering_error = Some(internal_datafusion_err!(


👍 nice cleanup
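
For readers unfamiliar with this kind of cleanup, the sketch below shows the general idea: a macro that wraps `format!` and the error variant in one call. `ToyError` and `internal_toy_err!` are hypothetical names defined here for illustration; they are not DataFusion's `DataFusionError` or `internal_datafusion_err!`, whose exact expansion is not shown in this PR.

```rust
/// Hypothetical error type with a single internal-error variant.
#[derive(Debug, PartialEq)]
enum ToyError {
    Internal(String),
}

/// Hypothetical macro: forwards its arguments to `format!` and wraps the result.
macro_rules! internal_toy_err {
    ($($arg:tt)*) => {
        ToyError::Internal(format!($($arg)*))
    };
}

/// The verbose form: variant and `format!` spelled out at the call site.
fn verbose(task: usize) -> ToyError {
    ToyError::Internal(format!("task {} failed", task))
}

/// The concise form: the macro hides the wrapping.
fn concise(task: usize) -> ToyError {
    internal_toy_err!("task {} failed", task)
}

fn main() {
    // Both forms build the same error value.
    assert_eq!(verbose(3), concise(3));
    println!("{:?}", concise(3));
}
```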

alamb · 2023-12-28T20:00:05Z

> LGTM but having one more reviewer would be good to make sure we are not losing any performance due to allocation-related reasons

Since the previous version of the code was cloning the serializer anyway, and the API returns Bytes (a read-only structure), I don't think this PR is doing more allocations than before. Arguably it is better in that it is clearer now how allocations are performed.
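
The allocation argument can be illustrated with the standard library alone. In this sketch, `ToySerializer` is a hypothetical stand-in for serializer state: cloning an `Arc` hands out another pointer to the same allocation, while a deep clone copies the underlying storage each time.

```rust
use std::sync::Arc;

/// Hypothetical serializer state; stands in for whatever a real serializer holds.
#[derive(Clone)]
struct ToySerializer {
    header: Vec<String>,
}

fn toy() -> ToySerializer {
    ToySerializer { header: vec!["key".to_string(), "value".to_string()] }
}

/// Hands out `n` shared handles.
/// Returns the strong count and whether every handle points at the original allocation.
fn share(n: usize) -> (usize, bool) {
    let original = Arc::new(toy());
    let handles: Vec<Arc<ToySerializer>> = (0..n).map(|_| Arc::clone(&original)).collect();
    let all_same = handles.iter().all(|h| Arc::ptr_eq(h, &original));
    (Arc::strong_count(&original), all_same)
}

/// Makes `n` deep clones and reports whether any of them reuses the original's storage.
fn deep_copies_share_storage(n: usize) -> bool {
    let original = toy();
    let copies: Vec<ToySerializer> = (0..n).map(|_| original.clone()).collect();
    copies
        .iter()
        .any(|copy| copy.header.as_ptr() == original.header.as_ptr())
}

fn main() {
    let (count, all_same) = share(3);
    println!("strong count = {}, same allocation = {}", count, all_same);
    println!("deep copies share storage = {}", deep_copies_share_storage(3));
}
```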

ozankabak · 2023-12-28T20:10:25Z

Thanks for reviewing, @alamb. Your suggestions make sense and we will apply them. I will consult with @metesynnada about the name choice. Maybe he has a good reason (that I cannot think of right now) why he thinks the name should change. If not, we will stick with the old name.

ozankabak · 2023-12-29T12:37:05Z

I just talked to @metesynnada, and the names in this PR simply follow Flink's naming convention. I reverted to the old names for the scope of this PR; we may take up naming in the future as a separate topic of discussion.

devinjdangelo · 2023-12-29T18:00:23Z

Sorry I'm late to the discussion, but yes, this LGTM too. I agree that BatchSerializer makes more sense to me as a name.

metesynnada added 2 commits December 28, 2023 15:10

Make the BatchSerializer behind Arc

122e875

Commenting

9442f18

github-actions bot added the core Core DataFusion crate label Dec 28, 2023

metesynnada requested review from mustafasrepo, alamb and ozankabak December 28, 2023 12:27

Review

ef73390

ozankabak approved these changes Dec 28, 2023

alamb reviewed Dec 28, 2023

alamb mentioned this pull request Dec 28, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 25, 2023 #8655

Closed

7 tasks

ozankabak added 3 commits December 28, 2023 23:35

Merge branch 'apache_main' into upstream/serialization

15f9fb6

Incorporate review suggestions

225253b

Use old names

2428463

ozankabak merged commit b85a397 into apache:main Dec 29, 2023

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed
