Skip to content

Conversation

@etseidl
Copy link
Contributor

@etseidl etseidl commented Nov 4, 2025

Which issue does this PR close?

Rationale for this change

An appreciable slowdown in Parquet writing was noticed. At least part of the slowdown seems to stem from changes to basic::LogicalType which caused cloning the enum to take make longer than previously.

What changes are included in this PR?

This adds a new logical_type_ref call to BasicTypeInfo and ColumnDescriptor. Unlike the existing logical_type which returns a cloned Option<LogicalType>, the new methods return Option<&LogicalType>. This new function is used in place of logical_type internally.

Are these changes tested?

Should be covered by existing tests.

Are there any user-facing changes?

No.

A further optimization would be to change ColumnOrder::get_sort_order to take an Option<&LogicalType>, but that is a breaking API change.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 4, 2025
@etseidl etseidl changed the title Log type ref perf: Speed up Parquet file writing Nov 4, 2025
@alamb alamb changed the title perf: Speed up Parquet file writing perf: Speed up Parquet file writing (10%, back to speed of 56) Nov 5, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl

I verified this is about 10% faster in the tpchgen-cli benchmarl

A further optimization would be to change ColumnOrder::get_sort_order to take an Option<&LogicalType>, but that is a breaking API change.

I tried this out and didn't see any noticable performance change in my test, but I will make a PR anyways to discuss

@alamb
Copy link
Contributor

alamb commented Nov 5, 2025

I wonder if we should add a comment to LogicalType that it isn't trivially copy'able (and remove the Copy in later breaking release)?

@etseidl
Copy link
Contributor Author

etseidl commented Nov 5, 2025

Thanks @mbrobbel and @alamb!

I wonder if we should add a comment to LogicalType that it isn't trivially copy'able (and remove the Copy in later breaking release)?

I did add a note to the getters that the value is cloned. I just wonder do we want to deprecate the cloning getters? It would be nice if eventually logical_type() returned Option<&LogicalType> and we could remove logical_type_ref().

@mbrobbel
Copy link
Member

mbrobbel commented Nov 5, 2025

Thanks @mbrobbel and @alamb!

I wonder if we should add a comment to LogicalType that it isn't trivially copy'able (and remove the Copy in later breaking release)?

I did add a note to the getters that the value is cloned. I just wonder do we want to deprecate the cloning getters? It would be nice if eventually logical_type() returned Option<&LogicalType> and we could remove logical_type_ref().

+1 for logical_type() -> Option<&LogicalType>

@etseidl
Copy link
Contributor Author

etseidl commented Nov 5, 2025

+1 for logical_type() -> Option<&LogicalType>

What would the mechanics of such a change be? Do we deprecate with a note that the API will change in a future release? Or do we just change the signature in 58.0.0 and note the breaking change?

@alamb
Copy link
Contributor

alamb commented Nov 5, 2025

What would the mechanics of such a change be? Do we deprecate with a note that the API will change in a future release? Or do we just change the signature in 58.0.0 and note the breaking change?

I think deprecate with a note is a good idea

To be clear:

  1. Deprecate logical_type() in 57.1.0
  2. In 58.0.0: Change the signature in 58.0.0 to logical_type() -> Option<&LogicalType> and remove Copy from LogicalType

@alamb
Copy link
Contributor

alamb commented Nov 5, 2025

I tried this out and didn't see any noticable performance change in my test, but I will make a PR anyways to discuss
#8789

@etseidl
Copy link
Contributor Author

etseidl commented Nov 5, 2025

Given #8789 is open I don't think this needs any further work. Thanks again for the quick reviews.

@etseidl etseidl merged commit 40300ca into apache:main Nov 5, 2025
18 checks passed
@etseidl etseidl deleted the log_type_ref branch November 5, 2025 22:02
etseidl added a commit to etseidl/arrow-rs that referenced this pull request Nov 7, 2025
…e#8786)

# Which issue does this PR close?

- Partial fix for apache#8783.

# Rationale for this change

An appreciable slowdown in Parquet writing was noticed. At least part of
the slowdown seems to stem from changes to `basic::LogicalType` which
caused cloning the enum to take make longer than previously.

# What changes are included in this PR?

This adds a new `logical_type_ref` call to `BasicTypeInfo` and
`ColumnDescriptor`. Unlike the existing `logical_type` which returns a
cloned `Option<LogicalType>`, the new methods return
`Option<&LogicalType>`. This new function is used in place of
`logical_type` internally.

# Are these changes tested?

Should be covered by existing tests.

# Are there any user-facing changes?
No.

A further optimization would be to change `ColumnOrder::get_sort_order`
to take an `Option<&LogicalType>`, but that is a breaking API change.
alamb added a commit that referenced this pull request Nov 11, 2025
…`, deprecate `get_logical_type` (#8789)

# Which issue does this PR close?

- Builds on #8786 from @etseidl 

# Rationale for this change

@etseidl  suggested:
> A further optimization would be to change ColumnOrder::get_sort_order
to take an Option<&LogicalType>, but that is a breaking API change.

# What changes are included in this PR?

1. Deprecate `ColumnOrder::get_sort_order`
2. Add ColumnOrder::sort_order_for_type` that returns the same value

# Are these changes tested?

Yes by CI

Note I was not able to observe any changes in benchmark performance with
this, so I am not sure it it worth doing

# Are there any user-facing changes?
There is a new API but I think there not many people use this (low
level) API so the disruption should be minimal

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants