Skip to content

GH-45639: [C++][Statistics] Add ARROW:average_byte_width:exact and ARROW:average_byte_width:approximate statistics attributes to arrow::ArrayStatistics #46385

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

andishgar
Copy link
Contributor

@andishgar andishgar commented May 10, 2025

Rationale for this change

Add ARROW:average_byte_width:exact and ARROW:average_byte_width:approximate statistics attributes to arrow::ArrayStatistics

What changes are included in this PR?

Add average_byte_width and is_average_byte_width_exact member variables to arrow::ArrayStatistics

Are these changes tested?

Yes, I run the relevant unit tests

Are there any user-facing changes?

Yes

Copy link

⚠️ GitHub issue #45639 has been automatically assigned in GitHub to PR creator.

@andishgar andishgar marked this pull request as draft May 10, 2025 06:46
average_byte_width
average_byte_width_exact
@andishgar andishgar force-pushed the add-byte-width-attribute branch from aa0ecc9 to a4c1bb4 Compare May 10, 2025 06:51
@andishgar andishgar changed the title GH-45639 : [C++][Statistics] Add ARROW:average_byte_width:exact and ARROW:average_byte_width:approximate statistics attributes to arrow::ArrayStatistic GH-45639 : [C++][Statistics] Add ARROW:average_byte_width:exact and ARROW:average_byte_width:approximate statistics attributes to arrow::ArrayStatistics May 10, 2025
@andishgar
Copy link
Contributor Author

@kou, I’ve converted this pull request to a draft in order to clarify whether any modifications are required on the Parquet side.

@kou kou changed the title GH-45639 : [C++][Statistics] Add ARROW:average_byte_width:exact and ARROW:average_byte_width:approximate statistics attributes to arrow::ArrayStatistics GH-45639: [C++][Statistics] Add ARROW:average_byte_width:exact and ARROW:average_byte_width:approximate statistics attributes to arrow::ArrayStatistics May 10, 2025
@@ -131,7 +137,9 @@ struct ARROW_EXPORT ArrayStatistics {
bool Equals(const ArrayStatistics& other) const {
return null_count == other.null_count && distinct_count == other.distinct_count &&
min == other.min && is_min_exact == other.is_min_exact && max == other.max &&
is_max_exact == other.is_max_exact;
is_max_exact == other.is_max_exact &&
average_byte_width == other.average_byte_width &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use == for floating point numbers.
We need something like

if (x == y) {
return Flags::signed_zeros_equal || (std::signbit(x) == std::signbit(y));
}
if (Flags::nans_equal && std::isnan(x) && std::isnan(y)) {
return true;
}
if (Flags::approximate && (fabs(x - y) <= epsilon)) {
return true;
}
return false;
instead.

Copy link
Contributor Author

@andishgar andishgar May 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the following questions regarding the implementation:

1-Should I first create a separate issue to address the min and max member variables, given that they can be double values?

2- I plan to add arrow::EqualOptions to the Equal method. This would make the operator== redundant. Is it acceptable to remove operator== in this case?

3-When comparing two arrow::Array objects for equality, should their arrow::ArrayStatistics also be checked?( the current implementation does not check ArrayStatistics)

4-Should I implement a separate ApproximateEqual method to handle EqualOptions::atol, or should this logic be integrated into the Equal method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Ah, I didn't notice it. Yes.
  2. Yes.
  3. (I think) No. I think arrow::Array::Equals() should check only data. For example, metadata in its data type isn't used. But we may want to discuss this in a separated issue.
  4. Yes or no. Let's discuss it as a separated issue. I think that adding floating_approximate to EqualOptions and using only Equals is better but others may think differ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I start to create an issue to solve the problem of min and max first

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels May 11, 2025
@andishgar
Copy link
Contributor Author

@kou thank you for your feedback, I will look into it soon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants