Conversation
…erg-rust into feature/gb/file-column
…erg-rust into feature/gb/file-column
This reverts commit adf0da0.
…erg-rust into feature/gb/file-column
| // The _file column will be a RunEndEncoded array with the file path | ||
| if let Some(run_array) = file_col | ||
| .as_any() | ||
| .downcast_ref::<arrow_array::RunArray<arrow_array::types::Int32Type>>() |
There was a problem hiding this comment.
perhaps just downcast to the int array and iterate through values, instead of having to separate run array values from run array run ends
There was a problem hiding this comment.
Also shouldn't we not have REE here?
There was a problem hiding this comment.
Regarding the not REE stuff: No, I think in a single PR, we should have REE, and if we want to disable it we should have a separate one.
There was a problem hiding this comment.
ah, ok, so you'd follow up with that right after this PR?
| { | ||
| record_batch_transformer_builder = | ||
| record_batch_transformer_builder.with_partition(partition_spec, partition_data); | ||
| record_batch_transformer_builder.with_partition(partition_spec, partition_data)?; |
There was a problem hiding this comment.
do you want to do this for incremental scan as well?
There was a problem hiding this comment.
We do not have partition information (yet? not sure if it is needed at all) in the incremental scan
There was a problem hiding this comment.
Would we get wrong results because of it? If table had partitions and some partition transforms for example?
There was a problem hiding this comment.
I am not sure, this partition stuff has only been added recently. We may just add the same logic to the incremental tasks, but we first need to understand what's the actual issue
|
|
||
| for column_name in column_names.iter() { | ||
| // Handle metadata columns (like "_file") | ||
| if is_metadata_column_name(column_name) { |
There was a problem hiding this comment.
I think it already shows that it's a burden to have to maintain same features in two places (not only code but also tests)...at some point we should maybe reconsider how to consolidate this
| /// The Arrow Field definition for the metadata column, or an error if not a metadata field | ||
| pub fn get_metadata_field(field_id: i32) -> Result<Arc<Field>> { | ||
| match field_id { | ||
| RESERVED_FIELD_ID_FILE => Ok(Arc::clone(file_field())), |
There was a problem hiding this comment.
what about other fields? Also should they have static lazy field?
There was a problem hiding this comment.
It depends on what the use of these fields is. Since this is a constant arrow field we can create it. But I leave the others until they are actually used.
There was a problem hiding this comment.
Maybe we should include it for the sake of #10 (comment). Otherwise it's really tactile and error-prone
| pub fn get_metadata_field(field_id: i32) -> Result<Arc<Field>> { | ||
| match field_id { | ||
| RESERVED_FIELD_ID_FILE => Ok(Arc::clone(file_field())), | ||
| _ if is_metadata_field(field_id) => { |
There was a problem hiding this comment.
If we include other fields as suggested in the comment above, this will never fire. I think we have to extend is_metadata_field to include all metadata fields (a range), and that way we would throw this proper error.
There was a problem hiding this comment.
Yeah I left implementation of the other fields out intentionally, hence this fires. The functions here do not have to return actual fields for all constants, rather metadata_columns.rs is in general a module for collecting all sorts of constants/helpers. It is for example not clear whether you would ever want a field for _pos, because this comes from the arrow reader. However, we still want to collect the constants here...
There was a problem hiding this comment.
We talked offline and agreed to remove non used constants, and for those that are used we should introduce Lazy Fields. Then we can address this comment I think.
…iceberg-rust into feature/gb/file-column-inc
vustef
left a comment
There was a problem hiding this comment.
Thanks Gerald.
I'm not particularly clear on metadata column fields listing, but otherwise looks good. We should follow up with disabling REE right after this PR though, unless you want to pick Arrow.jl battle to support REE there.
Closes RAI-43688
Closes RAI-43694
Upstream PR is here: apache#1824