Description
Is your feature request related to a problem or challenge?
This has come up a few times, most recently in discussions with @berkaysynnada on apache/arrow-rs#5037 (comment)
Use case 1: for large binary/string columns, formats like Parquet allow storing a truncated value that does not actually appear in the data. Because these values are stored in the min/max metadata, truncating them keeps the metadata small.
For a string column with very long values, it takes much less space to store a short value slightly lower than the actual minimum as the "minimum" statistics value, and a short value slightly higher than the actual maximum as the "maximum" statistics value.
For example:
| actual min in data | actual max in data | "min" value in statistics | "max" value in statistics |
|---|---|---|---|
| aaa......z | qqq......q | a | r |
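A minimal sketch of how a writer might derive such truncated bounds, assuming values are compared as raw bytes. `truncate_min` and `truncate_max` are hypothetical helpers written for illustration; they are not functions from Parquet or DataFusion.

```rust
/// Hypothetical helper: a prefix of `min` of at most `max_len` bytes.
/// A prefix always sorts at or before the full value, so it is a valid lower bound.
fn truncate_min(min: &[u8], max_len: usize) -> Vec<u8> {
    min[..max_len.min(min.len())].to_vec()
}

/// Hypothetical helper: a value of at most `max_len` bytes that sorts at or
/// after `max`, or `None` if no such value can be built.
fn truncate_max(max: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if max.len() <= max_len {
        // Already short enough: the exact maximum can be stored as-is.
        return Some(max.to_vec());
    }
    let mut prefix = max[..max_len].to_vec();
    // Drop trailing 0xFF bytes (they cannot be incremented), then bump the
    // last remaining byte so the result sorts after every value sharing the prefix.
    while let Some(last) = prefix.pop() {
        if last < u8::MAX {
            prefix.push(last + 1);
            return Some(prefix);
        }
    }
    None
}
```

With a one-byte limit, `truncate_min(b"aaa......z", 1)` gives `a` and `truncate_max(b"qqq......q", 1)` gives `r`, matching the table. Neither value appears in the data, but both are still valid bounds.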
There is a similar use case when applying a Filter, as described by @korowa on #5646 (comment). We have a similar one in IOx, where an operator may remove values but will never decrease the minimum value or increase the maximum value of any column.
Currently, `Precision` only represents `Exact` and `Inexact`; there is no way to represent "inexact, but bounded above/below".
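To make the gap concrete, here is a sketch using a simplified, standalone stand-in for `Precision` (the real type lives in DataFusion and has more methods). `can_prune_less_than` is a hypothetical consumer invented for this example, not DataFusion's pruning logic.

```rust
/// Simplified stand-in for DataFusion's `Precision<T>`.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    /// Demote an exact value to an inexact one, e.g. after a filter.
    fn to_inexact(self) -> Self {
        match self {
            Precision::Exact(v) => Precision::Inexact(v),
            other => other,
        }
    }
}

/// Hypothetical check for a predicate `col < threshold`: can the min
/// statistic alone prove that no row matches?
fn can_prune_less_than(min: &Precision<i64>, threshold: i64) -> bool {
    match min {
        // The true minimum is known, so no row matches if it is >= threshold.
        Precision::Exact(m) => *m >= threshold,
        // An inexact value carries no guarantee about which side of the true
        // minimum it lies on, so nothing can be proven.
        Precision::Inexact(_) | Precision::Absent => false,
    }
}
```

A filter can only raise the minimum, so `Exact(10)` is still a valid lower bound afterwards. Once it has been demoted to `Inexact(10)`, though, that guarantee is no longer visible to the consumer.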
Describe the solution you'd like
Per @berkaysynnada, I propose changing `Precision::Inexact` to a new variant, `Precision::Between`, which would store an `Interval` of known min/maxes of the value.
enum Precision {
    ...
    /// The value is known to be in the specified interval
    Between(Interval)
}
This is quite a general formulation: it can describe "how" inexact the values are, and it is very expressive (an `Interval` can represent open/closed bounds, etc.).
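A rough sketch of what this could look like, assuming a generic value type and a simplified `Interval` with closed bounds only. Both types below are illustrative stand-ins, not DataFusion's actual `Interval` or `Precision`, and the `lower_bound`/`upper_bound` accessors are hypothetical.

```rust
/// Hypothetical closed interval `[lower, upper]`.
#[derive(Debug, Clone, PartialEq)]
struct Interval<T> {
    lower: T,
    upper: T,
}

/// Sketch of the proposed shape: `Between` takes the place of `Inexact`.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    /// The value is known exactly
    Exact(T),
    /// The value is known to be in the specified interval
    Between(Interval<T>),
    /// Nothing is known about the value
    Absent,
}

impl<T> Precision<T> {
    /// A value the statistic is guaranteed to be at or above, if any.
    fn lower_bound(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) => Some(v),
            Precision::Between(i) => Some(&i.lower),
            Precision::Absent => None,
        }
    }

    /// A value the statistic is guaranteed to be at or below, if any.
    fn upper_bound(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) => Some(v),
            Precision::Between(i) => Some(&i.upper),
            Precision::Absent => None,
        }
    }
}
```

Under this sketch, a min statistic of `Between(Interval { lower: 10, upper: 15 })` says the true minimum lies somewhere in `[10, 15]`, so a consumer can still rely on 10 as a lower bound.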
Describe alternatives you've considered
There is also a possibility of introducing a simpler, but more limited version of these statistics, like:
enum Precision {
    // The value is known to be within the range (it is at most this large for Max, or at least this large for Min)
    // but the actual values may be lower/higher.
    Bounded(ScalarValue)
}
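As a sketch of how a consumer might use the simpler variant, assume `Bounded` is added alongside the existing variants (the snippet above does not say whether `Inexact` would remain) and a generic value type stands in for `ScalarValue`. The two helper functions are hypothetical.

```rust
/// Simplified stand-in for `Precision` with the extra `Bounded` variant.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    /// A guaranteed bound: for Min the true value is at least this large,
    /// for Max the true value is at most this large.
    Bounded(T),
    Inexact(T),
    Absent,
}

/// Hypothetical check on a Min statistic: is every value >= `threshold`?
fn min_is_at_least(min: &Precision<i64>, threshold: i64) -> bool {
    match min {
        Precision::Exact(v) | Precision::Bounded(v) => *v >= threshold,
        Precision::Inexact(_) | Precision::Absent => false,
    }
}

/// Hypothetical check on a Max statistic: is every value <= `threshold`?
fn max_is_at_most(max: &Precision<i64>, threshold: i64) -> bool {
    match max {
        Precision::Exact(v) | Precision::Bounded(v) => *v <= threshold,
        Precision::Inexact(_) | Precision::Absent => false,
    }
}
```

The trade-off in this sketch is that the direction of the bound is implied by which statistic holds it (min or max) rather than stored in the value itself, which is what makes it simpler but less expressive than an interval.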
Additional context
No response