Avoid per-batch field lookups in SchemaMapping #6563
let rows_num = batch.num_rows();
let mapped_batch = mapping.map_batch(batch).unwrap();
let projected = batch.project(&projection).unwrap();
This highlights the major change: the schema adapter now assumes that the projection it outputs has already been applied to the file_schema batches.
}

/// Creates a `SchemaMapping` that can be used to cast or map the columns from the file schema to the table schema.
///
/// If the provided `file_schema` contains columns of a different type to the expected
/// `table_schema`, the method will attempt to cast the array data from the file schema
/// to the table schema where possible.
///
/// Returns a [`SchemaMapping`] that can be applied to the output batch
/// along with an ordered list of columns to project from the file
The ordering is important because `parquet::ProjectionMask` is not order-preserving.
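To illustrate the point about ordering, here is a minimal sketch of how the mapping and the ordered projection could be computed once per file. It uses plain `std` types rather than arrow, and the `Schema` alias and this standalone `map_schema` function are simplified stand-ins for illustration, not the actual code in this PR. Type checks and casting are left out.

```rust
// Hypothetical stand-in for a schema: an ordered list of column names.
// The real code works on arrow schemas and also checks whether types can be cast.
type Schema = Vec<&'static str>;

/// Returns, for each table column, its index in the *projected* file batch
/// (`None` if the file does not contain it), together with the ordered list
/// of file column indices to project.
fn map_schema(table: &Schema, file: &Schema) -> (Vec<Option<usize>>, Vec<usize>) {
    let mut field_mappings = vec![None; table.len()];
    let mut projection = Vec::new();

    // Walk the file schema in file order, so the projected columns keep
    // ascending file indices regardless of the table's column order.
    for (file_idx, name) in file.iter().enumerate() {
        if let Some(table_idx) = table.iter().position(|t| t == name) {
            // The column's position inside the projected batch is the
            // number of columns selected so far.
            field_mappings[table_idx] = Some(projection.len());
            projection.push(file_idx);
        }
    }
    (field_mappings, projection)
}
```

Because the projection is built in file order, the stored indices stay valid even if the reader returns the selected columns in file order rather than in the order they were requested.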
Thank you @tustvold -- this looks like a nice cleanup to me
.zip(&self.field_mappings)
.map(|(field, file_idx)| match file_idx {
Some(batch_idx) => cast(&batch_cols[*batch_idx], field.data_type()),
None => Ok(new_null_array(field.data_type(), batch_rows)),
👍
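For readers without the full diff, the per-batch step in the hunk above can be sketched with plain `std` types. This is a simplified model under stated assumptions: `Column` is a hypothetical stand-in for an arrow array, this free-standing `map_batch` is not the real method, and the `cast` call is replaced by a plain clone.

```rust
// Hypothetical stand-in for an arrow array: a nullable integer column.
type Column = Vec<Option<i64>>;

/// Builds the table-ordered columns from an already-projected file batch.
/// `field_mappings` holds, per table column, the index into `batch_cols`,
/// or `None` when the file has no such column.
fn map_batch(
    field_mappings: &[Option<usize>],
    batch_cols: &[Column],
    batch_rows: usize,
) -> Vec<Column> {
    field_mappings
        .iter()
        .map(|file_idx| match file_idx {
            // Direct index into the batch; no name lookup per batch.
            // (The real code casts to the table field's type here.)
            Some(batch_idx) => batch_cols[*batch_idx].clone(),
            // Missing columns are filled with nulls of the batch length.
            None => vec![None; batch_rows],
        })
        .collect()
}
```

The only per-batch work is indexing and null-filling; resolving column names happens once, when the mapping is created.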
Which issue does this PR close?
Closes #.
Rationale for this change
Follow-up to #6458. This reworks the mapping logic to avoid column lookups on every batch.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?