Statistic: data_size should be in ColumnStatistics. #7548

Open
jackwener opened this issue Sep 13, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

Member

jackwener commented Sep 13, 2023

Is your feature request related to a problem or challenge?

No response

Describe the solution you'd like

Currently, we use total_byte_size to store the byte size of the whole output.

But that isn't good enough; a better way is to put avg_data_size/total_data_size into ColumnStatistics.

A total_byte_size on the statistics is of little use, because it is hard to propagate during statistics derivation.
But if we use a per-column avg_data_size, we can propagate it to other plans.

Spark:

case class ColumnStat(
    ....
    avgLen: Option[Long] = None,
    maxLen: Option[Long] = None,
    ...

Presto:

public final class ColumnStatistics
{
    private final Estimate nullsFraction;
    private final Estimate distinctValuesCount;
    private final Estimate dataSize;
    private final Optional<DoubleRange> range;
}
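
The propagation argument above can be illustrated with a minimal Rust sketch. All names here (`ColumnStats`, `PlanStats`, `filter`, `project`) are hypothetical and are not the project's actual API; the sketch only assumes that each column carries an optional average value size in bytes. A filter changes the row count but leaves the per-column average untouched, a projection just picks columns, and the total byte size can then be derived on demand instead of being stored.

```rust
/// Hypothetical per-column statistics; `avg_data_size` is the field proposed here.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    /// Average size in bytes of one value, if known.
    avg_data_size: Option<f64>,
}

/// Hypothetical plan-level statistics: a row count plus per-column statistics.
#[derive(Debug, Clone, PartialEq)]
struct PlanStats {
    num_rows: Option<f64>,
    columns: Vec<ColumnStats>,
}

impl PlanStats {
    /// A filter scales the row count; the average size of a value is unchanged.
    fn filter(&self, selectivity: f64) -> PlanStats {
        PlanStats {
            num_rows: self.num_rows.map(|n| n * selectivity),
            columns: self.columns.clone(),
        }
    }

    /// A projection keeps the row count and selects a subset of the columns.
    fn project(&self, indices: &[usize]) -> PlanStats {
        PlanStats {
            num_rows: self.num_rows,
            columns: indices.iter().map(|&i| self.columns[i].clone()).collect(),
        }
    }

    /// Total byte size derived on demand; `None` if any input is unknown.
    fn total_byte_size(&self) -> Option<f64> {
        let rows = self.num_rows?;
        let row_width = self
            .columns
            .iter()
            .map(|c| c.avg_data_size)
            .sum::<Option<f64>>()?;
        Some(rows * row_width)
    }
}
```

With only a stored total, the projection step would have no way to know how many of the bytes belong to the dropped columns, which is the difficulty described above.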

Describe alternatives you've considered

No response

Additional context

No response

@jackwener jackwener added the enhancement New feature or request label Sep 13, 2023
@jackwener jackwener changed the title Statistic: date_size should be in column/expression. Statistic: date_size should be in ColumnStatistics. Sep 13, 2023
@alamb alamb changed the title Statistic: date_size should be in ColumnStatistics. Statistic: data_size should be in ColumnStatistics. Sep 16, 2023
Contributor

AdamGS commented Dec 5, 2024

I know this is a pretty old issue, but I would also be interested in it and capable of doing the implementation if there's interest by the maintainers.

Member

findepi commented Dec 7, 2024

@AdamGS +1 from me.
The average data size sounds most logical from the optimizer's perspective
(I was involved in the introduction of ColumnStatistics.dataSize in Presto/Trino, but the reality is that the value is so often divided by #rows ...)
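
As a rough illustration of that division, here is a small Rust sketch of turning a total per-column data size into an average. The function name and the choice to divide by non-null rows only are assumptions made for this example, not the behavior of any particular engine.

```rust
/// Hypothetical helper: converts a total column `data_size` (bytes) into an
/// average size per non-null value. Assumes nulls contribute no bytes.
fn avg_data_size(data_size: f64, num_rows: f64, nulls_fraction: f64) -> Option<f64> {
    let non_null_rows = num_rows * (1.0 - nulls_fraction);
    if non_null_rows > 0.0 {
        Some(data_size / non_null_rows)
    } else {
        // No non-null rows: the average is undefined rather than zero.
        None
    }
}
```

Storing the average directly avoids repeating this conversion at every consumer and keeps the value stable when an operator changes the row count.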
