-
Notifications
You must be signed in to change notification settings - Fork 3.7k
GH-45639: [C++][Statistics] Add ARROW:average_byte_width:exact and ARROW:average_byte_width:approximate statistics attributes to arrow::ArrayStatistics #46385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
average_byte_width average_byte_width_exact
aa0ecc9
to
a4c1bb4
Compare
@kou, I’ve converted this pull request to a draft in order to clarify whether any modifications are required on the Parquet side. |
@@ -131,7 +137,9 @@ struct ARROW_EXPORT ArrayStatistics { | |||
bool Equals(const ArrayStatistics& other) const { | |||
return null_count == other.null_count && distinct_count == other.distinct_count && | |||
min == other.min && is_min_exact == other.is_min_exact && max == other.max && | |||
is_max_exact == other.is_max_exact; | |||
is_max_exact == other.is_max_exact && | |||
average_byte_width == other.average_byte_width && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not use ==
for floating point numbers.
We need something like
arrow/cpp/src/arrow/compare.cc
Lines 85 to 94 in 5989387
if (x == y) { | |
return Flags::signed_zeros_equal || (std::signbit(x) == std::signbit(y)); | |
} | |
if (Flags::nans_equal && std::isnan(x) && std::isnan(y)) { | |
return true; | |
} | |
if (Flags::approximate && (fabs(x - y) <= epsilon)) { | |
return true; | |
} | |
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the following questions regarding the implementation:
1-Should I first create a separate issue to address the min
and max
member variables, given that they can be double
values?
2- I plan to add arrow::EqualOptions
to the Equal
method. This would make the operator==
redundant. Is it acceptable to remove operator==
in this case?
3-When comparing two arrow::Array
objects for equality, should their arrow::ArrayStatistics
also be checked?( the current implementation does not check ArrayStatistics)
4-Should I implement a separate ApproximateEqual
method to handle EqualOptions::atol,
or should this logic be integrated into the Equal
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Ah, I didn't notice it. Yes.
- Yes.
- (I think) No. I think
arrow::Array::Equals()
should check only data. For example, metadata in its data type isn't used. But we may want to discuss this in a separated issue. - Yes or no. Let's discuss it as a separated issue. I think that adding
floating_approximate
toEqualOptions
and using onlyEquals
is better but others may think differ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I start to create an issue to solve the problem of min
and max
first
@kou thank you for your feedback, I will look into it soon |
Rationale for this change
Add
ARROW:average_byte_width:exact
andARROW:average_byte_width:approximate
statistics attributes toarrow::ArrayStatistics
What changes are included in this PR?
Add
average_byte_width
andis_average_byte_width_exact
member variables toarrow::ArrayStatistics
Are these changes tested?
Yes, I run the relevant unit tests
Are there any user-facing changes?
Yes