Skip to content

Conversation

@westhide
Copy link
Contributor

@westhide westhide commented Mar 20, 2025

Which issue does this PR close?

Rationale for this change

Support serde for FileScanConfig batch_size

When FileScanConfig call open(..), will set batch_size for file_source, so this PR add batch_size property for struct FileScanConfig to support serde batch_size to proto.

impl DataSource for FileScanConfig {
fn open(
&self,
partition: usize,
context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream> {
let object_store = context.runtime_env().object_store(&self.object_store_url)?;
let source = self
.file_source
.with_batch_size(context.session_config().batch_size())
.with_schema(Arc::clone(&self.file_schema))
.with_projection(self);
let opener = source.create_file_opener(object_store, self, partition);
let stream = FileStream::new(self, partition, opener, source.metrics())?;
Ok(Box::pin(stream))
}

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added proto Related to proto crate datasource Changes to the datasource crate labels Mar 20, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @westhide

Should we remove the batch_size from JSON source too?

batch_size: Option<usize>,

reserved 10;

datafusion_common.Constraints constraints = 11;
optional uint32 batch_size = 12;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we can use uint64 as batch_size is a usize 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@westhide westhide force-pushed the main-serde-batch_size branch from 0959545 to 27676a0 Compare March 20, 2025 16:07
@westhide
Copy link
Contributor Author

Thank you @westhide

Should we remove the batch_size from JSON source too?

batch_size: Option<usize>,

Can we add context: Arc<TaskContext> to the create_file_opener args? The default batch_size is from context.session_config().batch_size(). If we remove batch_size property in JsonSource, we need to get the default value by context

pub trait FileSource: Send + Sync {
    /// Creates a `dyn FileOpener` based on given parameters
    fn create_file_opener(
        &self,
        object_store: Arc<dyn ObjectStore>,
        base_config: &FileScanConfig,
        partition: usize,
    ) -> Arc<dyn FileOpener>;
    // ...
}

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @westhide and @alamb for review

@alamb alamb merged commit 3269f01 into apache:main Mar 21, 2025
27 checks passed
@westhide westhide deleted the main-serde-batch_size branch March 22, 2025 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants