Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

@alamb

Is your feature request related to a problem or challenge?

This has come up a few times, most recently in discussions with @berkaysynnada on apache/arrow-rs#5037 (comment)

Use case 1: for large binary/string columns, formats like Parquet allow storing a truncated value that does not actually appear in the data. Because min/max values are stored in the metadata, truncating them keeps the size of the metadata down.

For a string column with very long values, it takes much less space to store a short value slightly lower than the actual minimum as the "minimum" statistics value, and one slightly higher than the actual maximum as the "maximum" statistics value.

For example:

| actual min in data | actual max in data | "min" value in statistics | "max" value in statistics |
| --- | --- | --- | --- |
| aaa......z | qqq......q | a | r |
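To make the truncation concrete, here is a minimal sketch of how such bounds could be derived. The helpers `truncate_min` and `truncate_max` are hypothetical names, not part of parquet or DataFusion, and they work on raw bytes; a real writer for string columns would also have to respect UTF-8 character boundaries.

```rust
/// Hypothetical helper: a lower bound for `value` of at most `max_len` bytes.
/// Any prefix of a byte string compares <= the string itself, so plain
/// truncation is enough.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Hypothetical helper: an upper bound for `value` of at most `max_len` bytes.
/// Truncate, then increment the last byte that is not 0xFF and drop the bytes
/// after it. Returns `None` if every kept byte is 0xFF, in which case no
/// shorter upper bound exists.
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        // Already short enough: the exact value is its own upper bound.
        return Some(value.to_vec());
    }
    let mut out = value[..max_len].to_vec();
    while let Some(last) = out.pop() {
        if last < 0xFF {
            out.push(last + 1);
            return Some(out);
        }
    }
    None
}
```

With `max_len = 1`, a minimum starting with `aaa` is stored as `a` and a maximum starting with `qqq` is stored as `r`, matching the table above. Neither stored value appears in the data, but both are still valid bounds.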

A similar use case arises when applying a Filter, as described by @korowa on #5646 (comment), and we have a similar one in IOx: the operator may remove values, but it won't decrease the minimum value or increase the maximum value in any column.

Currently Precision only represents Exact and Inexact; there is no way to represent "inexact, but bounded above/below".
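The filter case can be illustrated with a small sketch. `MinMax`, `exact_min_max`, and `still_bounds` are hypothetical names used only for this illustration, and the column is assumed to hold `i64` values.

```rust
/// Hypothetical min/max statistics for an integer column.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MinMax {
    min: i64,
    max: i64,
}

/// Exact statistics computed by scanning the values (None for empty input).
fn exact_min_max(values: &[i64]) -> Option<MinMax> {
    let min = *values.iter().min()?;
    let max = *values.iter().max()?;
    Some(MinMax { min, max })
}

/// True if the input statistics still bound the output statistics, i.e. the
/// operator did not decrease the minimum or increase the maximum.
fn still_bounds(input: MinMax, output: MinMax) -> bool {
    input.min <= output.min && output.max <= input.max
}
```

For an input of `[1, 5, 9, 12]` filtered down to `[5, 9]`, the input statistics (min 1, max 12) are no longer exact for the output, but they remain valid lower and upper bounds. That is the information the statistics currently cannot carry.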

Describe the solution you'd like

Per @berkaysynnada, I propose changing Precision::Inexact to a new variant, Precision::Between, which would store an Interval giving the known lower and upper bounds of the value.

enum Precision {
  ...
  /// The value is known to be in the specified interval
  Between(Interval)
}

This is quite a general formulation, and it can describe "how" inexact the values are.

This would have the benefit of being very expressive (Intervals can represent open/closed bounds, etc.).
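As a rough sketch of how `Between` might be used, the following stand-in types show a minimum statistic passing through a filter. Everything here is an assumption made for illustration: `Interval` is reduced to closed `i64` bounds (the real one would be far more general), `Precision` is a simplified non-generic enum, and `bounds` and `min_after_filter` are hypothetical helpers.

```rust
/// Simplified stand-in for an interval: closed bounds over i64 (assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    lower: i64,
    upper: i64,
}

/// Simplified stand-in for the proposed enum, with `Inexact` replaced by
/// `Between` as suggested above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(i64),
    /// The value is known to be in the specified interval
    Between(Interval),
    Absent,
}

impl Precision {
    /// Interval guaranteed to contain the true value, if anything is known.
    fn bounds(&self) -> Option<Interval> {
        match self {
            Precision::Exact(v) => Some(Interval { lower: *v, upper: *v }),
            Precision::Between(i) => Some(*i),
            Precision::Absent => None,
        }
    }
}

/// Minimum statistic after an operator that may only remove rows: the new
/// minimum is at least the old minimum and at most the old maximum
/// (assuming at least one row survives).
fn min_after_filter(input_min: Precision, input_max: Precision) -> Precision {
    match (input_min.bounds(), input_max.bounds()) {
        (Some(lo), Some(hi)) => Precision::Between(Interval {
            lower: lo.lower,
            upper: hi.upper,
        }),
        _ => Precision::Absent,
    }
}
```

Under these assumptions, an exact input min of 1 and max of 12 yields `Between` with bounds 1 and 12 for the output min, rather than an `Inexact(1)` that carries no guarantee.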

Describe alternatives you've considered

There is also a possibility of introducing a simpler, but more limited version of these statistics, like:

enum Precision {
  // The value is known to be within the range (it is at most this large for Max, or at least this large for Min)
  // but the actual values may be lower/higher.
  Bounded(ScalarValue)
}
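To show why even the simpler `Bounded` variant would be useful, here is a sketch of a pruning check against a max statistic. The enum is a simplified stand-in that holds `i64` instead of `ScalarValue`, and `can_skip_gt` is a hypothetical helper, not an existing DataFusion function.

```rust
/// Simplified stand-in for the alternative design, using i64 values.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(i64),
    Inexact(i64),
    /// For a Max statistic: the true maximum is at most this value.
    Bounded(i64),
    Absent,
}

/// Can a container be skipped for the predicate `col > threshold`, given its
/// max statistic? Only statistics that guarantee an upper bound allow this.
fn can_skip_gt(max: &Precision, threshold: i64) -> bool {
    match max {
        // True max <= m, so if m <= threshold no row can satisfy col > threshold.
        Precision::Exact(m) | Precision::Bounded(m) => *m <= threshold,
        // An inexact or missing max gives no guarantee, so nothing can be skipped.
        Precision::Inexact(_) | Precision::Absent => false,
    }
}
```

In this sketch, a `Bounded` max is as good as an `Exact` one for pruning, whereas an `Inexact` max can never be used to skip data.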

Additional context

No response

Labels

enhancement: New feature or request
