arrow/schema:new func `convert_schema` for `ArrowSchemaConverter` #539

AndreMouche · 2024-08-12T10:02:13Z

This PR does two things for the ArrowSchemaConverter struct:

- For performance: remove unnecessary match operations that could have a negative impact on performance.
- Refactor the struct to make it easier to review and maintain, with functions such as:
  - convert_schema: converts an Arrow schema to an Iceberg schema.
  - convert_field: converts an Arrow field to an Iceberg field.
  - convert_type: converts an Arrow data type to an Iceberg type.
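
For readers unfamiliar with this layout, the three-function split listed above can be sketched roughly as follows. This is only an illustration under loud assumptions: `ArrowType`, `ArrowField`, `IcebergType`, and `IcebergField` are hypothetical toy types, not the arrow-rs or iceberg-rust APIs, and the bodies are not taken from this PR. Details a real converter needs, such as field IDs and error handling, are left out.

```rust
// Toy stand-ins for Arrow and Iceberg types (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Int32,
    Utf8,
    List(Box<ArrowField>),
    Struct(Vec<ArrowField>),
}

#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    data_type: ArrowType,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
enum IcebergType {
    Int,
    String,
    List(Box<IcebergField>),
    Struct(Vec<IcebergField>),
}

#[derive(Debug, Clone, PartialEq)]
struct IcebergField {
    name: String,
    field_type: IcebergType,
    required: bool,
}

/// Convert a schema, modelled here as a list of top-level fields.
fn convert_schema(fields: &[ArrowField]) -> Vec<IcebergField> {
    fields.iter().map(convert_field).collect()
}

/// Convert one field: keep the name, convert the type, map nullable to optional.
fn convert_field(field: &ArrowField) -> IcebergField {
    IcebergField {
        name: field.name.clone(),
        field_type: convert_type(&field.data_type),
        required: !field.nullable,
    }
}

/// Convert a data type, recursing into nested lists and structs.
fn convert_type(data_type: &ArrowType) -> IcebergType {
    match data_type {
        ArrowType::Int32 => IcebergType::Int,
        ArrowType::Utf8 => IcebergType::String,
        ArrowType::List(element) => IcebergType::List(Box::new(convert_field(element))),
        ArrowType::Struct(children) => IcebergType::Struct(convert_schema(children)),
    }
}
```

In this sketch each function matches on the input once and calls the others directly, which is the kind of direct recursion the description contrasts with a visitor-based traversal.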

Meanwhile, I don't fully understand the ArrowSchemaVisitor trait. It seems to keep some intermediate state for functions like before_XXX and after_XXX. Could anyone please provide an example of it?


liurenjie1024 · 2024-08-17T02:55:31Z

Hi @AndreMouche, thanks for your contribution. I have some concerns about this PR:

> for performance: remove unnecessary match operations that could have a negative impact on performance.

Are there any performance numbers comparing these two methods?

Also, please note that ArrowSchemaVisitor is a framework for processing an ArrowSchema with the visitor pattern; it is not designed only for the Iceberg schema converter. We will use it later when processing Parquet files.
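
To illustrate why a visitor framework can serve more than one consumer, here is a minimal sketch in which a single traversal function drives two unrelated visitors. Everything in it is an assumption made for illustration: `ToyType`, `ToyVisitor`, `visit`, `LeafCounter`, and `Describer` are hypothetical names, and the trait is a cut-down imitation of the post-order callback style, not the real ArrowSchemaVisitor.

```rust
// Toy stand-in for an Arrow data type (hypothetical, not arrow-rs's `DataType`).
enum ToyType {
    Int32,
    Utf8,
    List(Box<ToyType>),
    Struct(Vec<(String, ToyType)>),
}

// Simplified visitor: each callback runs after the children were visited
// and receives their results.
trait ToyVisitor {
    type T;
    fn r#struct(&mut self, names: &[&str], results: Vec<Self::T>) -> Self::T;
    fn list(&mut self, element: Self::T) -> Self::T;
    fn primitive(&mut self, p: &ToyType) -> Self::T;
}

// The traversal is written once and shared by every visitor.
fn visit<V: ToyVisitor>(ty: &ToyType, visitor: &mut V) -> V::T {
    match ty {
        ToyType::List(element) => {
            let element_result = visit(element, visitor);
            visitor.list(element_result)
        }
        ToyType::Struct(fields) => {
            let names: Vec<&str> = fields.iter().map(|(n, _)| n.as_str()).collect();
            let mut results = Vec::with_capacity(fields.len());
            for (_, field_type) in fields {
                results.push(visit(field_type, visitor));
            }
            visitor.r#struct(&names, results)
        }
        primitive => visitor.primitive(primitive),
    }
}

/// Visitor 1: counts primitive leaves.
struct LeafCounter;

impl ToyVisitor for LeafCounter {
    type T = usize;
    fn r#struct(&mut self, _names: &[&str], results: Vec<usize>) -> usize {
        results.into_iter().sum()
    }
    fn list(&mut self, element: usize) -> usize {
        element
    }
    fn primitive(&mut self, _p: &ToyType) -> usize {
        1
    }
}

/// Visitor 2: renders a textual description of the type.
struct Describer;

impl ToyVisitor for Describer {
    type T = String;
    fn r#struct(&mut self, names: &[&str], results: Vec<String>) -> String {
        let parts: Vec<String> = names
            .iter()
            .zip(results)
            .map(|(n, r)| format!("{}: {}", n, r))
            .collect();
        format!("struct<{}>", parts.join(", "))
    }
    fn list(&mut self, element: String) -> String {
        format!("list<{}>", element)
    }
    fn primitive(&mut self, p: &ToyType) -> String {
        match p {
            ToyType::Int32 => "int".to_string(),
            ToyType::Utf8 => "string".to_string(),
            _ => unreachable!("visit only passes primitive types here"),
        }
    }
}
```

Calling `visit(&ty, &mut LeafCounter)` and `visit(&ty, &mut Describer)` on the same value reuses the traversal unchanged; only the callbacks differ.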

AndreMouche · 2024-08-19T10:09:23Z

> Are there any performance numbers comparing these two methods?

Comparison of benchmark results between the current PR branch and master:
current branch:

➜  iceberg-rust git:(convert_arrow_filed_to_iceberg_field) ✗ cargo bench --package iceberg --lib -- arrow::schema::tests::bench_complex_arrow_schema_to_iceberg --exact --show-output 
    Finished `bench` profile [optimized] target(s) in 0.67s
     Running unittests src/lib.rs (target/release/deps/iceberg-ba39310710f5b54e)

running 1 test
test arrow::schema::tests::bench_complex_arrow_schema_to_iceberg ... bench:  29,431,166.70 ns/iter (+/- 1,098,685.87)

successes:

successes:

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 456 filtered out; finished in 9.02s

➜  iceberg-rust git:(convert_arrow_filed_to_iceberg_field) ✗ cargo bench --package iceberg --lib -- arrow::schema::tests::bench_complex_arrow_schema_to_iceberg --exact --show-output 
    Finished `bench` profile [optimized] target(s) in 0.56s
     Running unittests src/lib.rs (target/release/deps/iceberg-ba39310710f5b54e)

running 1 test
test arrow::schema::tests::bench_complex_arrow_schema_to_iceberg ... bench:  29,581,220.90 ns/iter (+/- 775,051.14)

successes:

successes:

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 456 filtered out; finished in 8.94s

master @ #567

➜  iceberg-rust git:(bench_mark_arrow_schema) cargo bench --package iceberg --lib -- arrow::schema::tests::bench_complex_arrow_schema_to_iceberg --exact --show-output 
    Finished `bench` profile [optimized] target(s) in 0.32s
     Running unittests src/lib.rs (target/release/deps/iceberg-ba39310710f5b54e)

running 1 test
test arrow::schema::tests::bench_complex_arrow_schema_to_iceberg ... bench:  30,535,066.60 ns/iter (+/- 640,812.07)

successes:

successes:

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 456 filtered out; finished in 9.26s

➜  iceberg-rust git:(bench_mark_arrow_schema) cargo bench --package iceberg --lib -- arrow::schema::tests::bench_complex_arrow_schema_to_iceberg --exact --show-output 
    Finished `bench` profile [optimized] target(s) in 0.32s
     Running unittests src/lib.rs (target/release/deps/iceberg-ba39310710f5b54e)

running 1 test
test arrow::schema::tests::bench_complex_arrow_schema_to_iceberg ... bench:  30,707,408.30 ns/iter (+/- 1,031,892.12)

successes:

successes:

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 456 filtered out; finished in 9.30s
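
As a rough aid for reading the four runs above, the relative difference can be computed from the reported ns/iter values. The snippet below is only arithmetic on the numbers already shown; the helper names `mean` and `relative_improvement_pct` and the two constants are made up for this illustration. With two runs per branch and a reported +/- spread of roughly 0.6M to 1.1M ns per run, the result should be read as indicative at best.

```rust
/// Arithmetic mean of a slice of samples.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Improvement of `new` over `old`, as a percentage of `old`.
fn relative_improvement_pct(old: f64, new: f64) -> f64 {
    (old - new) / old * 100.0
}

// ns/iter values copied from the benchmark output above.
const PR_BRANCH_NS: [f64; 2] = [29_431_166.70, 29_581_220.90];
const MASTER_NS: [f64; 2] = [30_535_066.60, 30_707_408.30];
```

Evaluating `relative_improvement_pct(mean(&MASTER_NS), mean(&PR_BRANCH_NS))` gives a gap of a little over 3.6%, that is, about 1.1M ns per iteration, which is the same order of magnitude as the per-run spread.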

AndreMouche · 2024-08-19T10:34:18Z

> Also, please note that ArrowSchemaVisitor is a framework for processing an ArrowSchema with the visitor pattern; it is not designed only for the Iceberg schema converter. We will use it later when processing Parquet files.

Makes sense to me.

However, I still have a few questions about it.

Now it seems that we could use ArrowSchemaConverter in the visitor pattern. Could we make visit_type and visit_list functions of ArrowSchemaConverter?
Is ArrowSchemaConverter designed as a tool to convert only Arrow schemas, or as a tool that can convert both schemas and data types?

Meanwhile, about the ArrowSchemaVisitor interface:

iceberg-rust/crates/iceberg/src/arrow/schema.rs

Lines 96 to 108 in a1ec0fa

    
    fn schema(&mut self, schema: &ArrowSchema, values: Vec<Self::T>) -> Result<Self::U>;

    /// Called after struct's fields visited.
    fn r#struct(&mut self, fields: &Fields, results: Vec<Self::T>) -> Result<Self::T>;

    /// Called after list fields visited.
    fn list(&mut self, list: &DataType, value: Self::T) -> Result<Self::T>;

    /// Called after map's key and value fields visited.
    fn map(&mut self, map: &DataType, key_value: Self::T, value: Self::T) -> Result<Self::T>;

    /// Called when see a primitive type.
    fn primitive(&mut self, p: &DataType) -> Result<Self::T>;

For example, we already know that the list should be a List struct, so why not take the list's element_field as the argument directly?
Meanwhile, since ArrowSchemaVisitor is a general interface, could you please make the definition of each interface function clearer? That would be very helpful for newcomers like me. Thanks.

AndreMouche · 2024-08-23T06:10:51Z

friendly ping @Xuanwo @liurenjie1024

liurenjie1024 · 2024-09-03T00:50:26Z

Hi @AndreMouche, sorry for the late reply.

It seems the benchmark results show no critical performance change?

> Now it seems that we could use ArrowSchemaConverter in the visitor pattern. Could we make visit_type and visit_list functions of ArrowSchemaConverter?

Sorry, I don't get your point. visit_type/visit_list are functions of SchemaVisitor; how could we make them part of ArrowSchemaConverter?

> For example, we already know that the list should be a List struct, so why not take the list's element_field as the argument directly?

I think this is a good suggestion. However, there are several kinds of lists in Arrow, for example List, LargeList, and FixedSizeList, so I think the current design is a tradeoff that spares us from dealing with each variant separately.
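
The tradeoff described above can be sketched as follows, assuming a toy enum in place of the real Arrow type. `ToyDataType` and `list_callback` are hypothetical names invented for this example; they only imitate the shape of a callback that receives the whole data type, as `fn list(&mut self, list: &DataType, value: Self::T)` does in the quoted trait.

```rust
// Toy mirror of three Arrow list variants (hypothetical; the real arrow-rs
// `DataType` is not used here).
#[derive(Debug, PartialEq)]
enum ToyDataType {
    Int32,
    List(Box<ToyDataType>),
    LargeList(Box<ToyDataType>),
    FixedSizeList(Box<ToyDataType>, i32),
}

/// One callback covers every list variant because it takes the whole data
/// type rather than an element field extracted from one specific variant.
fn list_callback(list: &ToyDataType, element_result: String) -> Result<String, String> {
    match list {
        ToyDataType::List(_)
        | ToyDataType::LargeList(_)
        | ToyDataType::FixedSizeList(_, _) => Ok(format!("list<{}>", element_result)),
        other => Err(format!("expected a list type, got {:?}", other)),
    }
}
```

An implementation that needs the element field can still match on the variant inside the callback, while the traversal code does not have to know which list flavor it is handling.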

> Meanwhile, since ArrowSchemaVisitor is a general interface, could you please make the definition of each interface function clearer? That would be very helpful for newcomers like me. Thanks.

I totally agree that we should add some docs for new contributors.

AndreMouche added 4 commits August 12, 2024 17:46

arrow/schema:new funcation convert_fields to convert arrow field to i…

e767086

…ceberg field Signed-off-by: Shirly <AndreMouche@126.com>

fix clippy

ad9a348

Signed-off-by: Shirly <AndreMouche@126.com>

arrow/schema: refactor ArrowSchemaConverter

53fc8e1

Signed-off-by: Shirly <AndreMouche@126.com>

fix clippy

4bab395

Signed-off-by: Shirly <AndreMouche@126.com>

AndreMouche force-pushed the convert_arrow_filed_to_iceberg_field branch from 4b84ead to 4bab395 Compare August 12, 2024 13:58

AndreMouche changed the title ~~arrow/schema:new func convert_fields to convert arrow field to iceberg field~~ arrow/schema:new func convert_schema for ArrowSchemaConverter Aug 12, 2024
