Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arrow/schema:new func convert_schema for ArrowSchemaConverter #539

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

AndreMouche
Copy link
Contributor

@AndreMouche AndreMouche commented Aug 12, 2024

This PR do the following two things for ArrowSchemaConverter struct:

  • for performance: remove unnecessary match operations that could have a negative impact on performance.
  • refactor the struct to make it more friendly for the reviewer and maintain, such as
    • convert_schema: to convert arrow schema to iceberg schema.
    • convert_field : to convert arrow field to iceberg field
    • convert_type : to convert arrow data type to iceberg type.

Meanwhile, I have no idea about the trait ArrowSchemaVisitor, it seems try to keep some middle variable for functions like before_XXX and after_XXX, could anyone please provide some example for it?

…ceberg field

Signed-off-by: Shirly <AndreMouche@126.com>
Signed-off-by: Shirly <AndreMouche@126.com>
Signed-off-by: Shirly <AndreMouche@126.com>
Signed-off-by: Shirly <AndreMouche@126.com>
@AndreMouche AndreMouche force-pushed the convert_arrow_filed_to_iceberg_field branch from 4b84ead to 4bab395 Compare August 12, 2024 13:58
@AndreMouche AndreMouche changed the title arrow/schema:new func convert_fields to convert arrow field to iceberg field arrow/schema:new func convert_schema for ArrowSchemaConverter Aug 12, 2024
@liurenjie1024
Copy link
Collaborator

liurenjie1024 commented Aug 17, 2024

Hi, @AndreMouche Thanks for your contribution, I have some concerns for this pr:

for performance: remove unnecessary match operations that could have a negative impact on performance.

Is there any performance number to comprare these two methods?

Also please note that ArrowSchemaVisitor is a framework for processing ArrowSchema in visitor pattern, not only designed for iceberg schema converter. We will use it later when processing parquet files.

@AndreMouche
Copy link
Contributor Author

Is there any performance number to comprare these two methods?

Comparison of bench test results between current PR and master branches:
current branch:

➜  iceberg-rust git:(convert_arrow_filed_to_iceberg_field) ✗ cargo bench --package iceberg --lib -- arrow::schema::tests::bench_complex_arrow_schema_to_iceberg --exact --show-output 
    Finished `bench` profile [optimized] target(s) in 0.67s
     Running unittests src/lib.rs (target/release/deps/iceberg-ba39310710f5b54e)

running 1 test
test arrow::schema::tests::bench_complex_arrow_schema_to_iceberg ... bench:  29,431,166.70 ns/iter (+/- 1,098,685.87)

successes:

successes:

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 456 filtered out; finished in 9.02s

➜  iceberg-rust git:(convert_arrow_filed_to_iceberg_field) ✗ cargo bench --package iceberg --lib -- arrow::schema::tests::bench_complex_arrow_schema_to_iceberg --exact --show-output 
    Finished `bench` profile [optimized] target(s) in 0.56s
     Running unittests src/lib.rs (target/release/deps/iceberg-ba39310710f5b54e)

running 1 test
test arrow::schema::tests::bench_complex_arrow_schema_to_iceberg ... bench:  29,581,220.90 ns/iter (+/- 775,051.14)

successes:

successes:

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 456 filtered out; finished in 8.94s

master @ #567

➜  iceberg-rust git:(bench_mark_arrow_schema) cargo bench --package iceberg --lib -- arrow::schema::tests::bench_complex_arrow_schema_to_iceberg --exact --show-output 
    Finished `bench` profile [optimized] target(s) in 0.32s
     Running unittests src/lib.rs (target/release/deps/iceberg-ba39310710f5b54e)

running 1 test
test arrow::schema::tests::bench_complex_arrow_schema_to_iceberg ... bench:  30,535,066.60 ns/iter (+/- 640,812.07)

successes:

successes:

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 456 filtered out; finished in 9.26s

➜  iceberg-rust git:(bench_mark_arrow_schema) cargo bench --package iceberg --lib -- arrow::schema::tests::bench_complex_arrow_schema_to_iceberg --exact --show-output 
    Finished `bench` profile [optimized] target(s) in 0.32s
     Running unittests src/lib.rs (target/release/deps/iceberg-ba39310710f5b54e)

running 1 test
test arrow::schema::tests::bench_complex_arrow_schema_to_iceberg ... bench:  30,707,408.30 ns/iter (+/- 1,031,892.12)

successes:

successes:

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 456 filtered out; finished in 9.30s

@AndreMouche
Copy link
Contributor Author

Also please note that ArrowSchemaVisitor is a framework for processing ArrowSchema in visitor pattern, not only designed for iceberg schema converter. We will use it later when processing parquet files.****

make sense to me.

However, I still have few question about it.

Now it seems that we could use ArrowSchemaConverter in visitor pattern, could we make visit_type, visit_list as the as the function to ArrowSchemaConverter ?
ArrowSchemaConverter is designed as a tool to convert arrow_chema only, or a tool can convert shema and data types?

Meanwhile, about the interface ArrowSchemaVisitor

fn schema(&mut self, schema: &ArrowSchema, values: Vec<Self::T>) -> Result<Self::U>;
/// Called after struct's fields visited.
fn r#struct(&mut self, fields: &Fields, results: Vec<Self::T>) -> Result<Self::T>;
/// Called after list fields visited.
fn list(&mut self, list: &DataType, value: Self::T) -> Result<Self::T>;
/// Called after map's key and value fields visited.
fn map(&mut self, map: &DataType, key_value: Self::T, value: Self::T) -> Result<Self::T>;
/// Called when see a primitive type.
fn primitive(&mut self, p: &DataType) -> Result<Self::T>;

For example, we already know list should be a struct List, why we do not take the element_field of the list as the arguments directly?
Meanwhile, since ArrowSchemaVisitor is a general interface, could you please make the definitions of each interface function be made clearer? which will be very helpful for new newcomers like me, thanks.

@AndreMouche
Copy link
Contributor Author

friendly ping @Xuanwo @liurenjie1024

@liurenjie1024
Copy link
Collaborator

Hi, @AndreMouche Sorry for late reply.

It seems that the benchmark result shows that there is no critial performance change?

Now it seems that we could use ArrowSchemaConverter in visitor pattern, could we make visit_type, visit_list as the as the function to ArrowSchemaConverter ?

Sory I don't get your point. visit_type/visit_list are functions of SchemaVisitor, how could we make it part of ArrowSchemaConverter.

For example, we already know list should be a struct List, why we do not take the element_field of the list as the arguments directly?

I think this is a good suggestion. However, there are several kinds of lists in arrow, for example List, LargeList, FixedSizeList, so I think current design is a tradeoff so that we don't need to deal with all variants.

Meanwhile, since ArrowSchemaVisitor is a general interface, could you please make the definitions of each interface function be made clearer? which will be very helpful for new newcomers like me, thanks.

I totally agree that we should add some doc for new contributors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants