[Data] Fixing aggregation protocol to be appropriately associative #50757

Merged
39 commits merged on Feb 21, 2025

@alexeykudinkin alexeykudinkin commented Feb 20, 2025

Why are these changes needed?

This is a follow-up to #50585 that fixes its combination sequence, which was not properly associative.

The primary hurdle to implementing properly associative aggregation in the presence of null values is being able to distinguish between:

  • Empty accumulator
  • Accumulator holding single value that is null

To achieve that, the following semantics are established.

Case of ignore_nulls=True:

  • Combination protocol

    • If the current accumulator is null (i.e., empty), return the new accumulator
    • If the new accumulator is null (i.e., empty), return the current accumulator
    • Otherwise, combine the current and new accumulators
  • Identity (zero) value is None (i.e., simulating an 'empty' sequence)
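
The ignore_nulls=True rules above can be sketched as a small wrapper. This is only an illustration under stated assumptions: the names `combine_ignoring_nulls` and `safe_max` are hypothetical and are not taken from this PR, and `max` stands in for an arbitrary associative combine function.

```python
from functools import reduce
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def combine_ignoring_nulls(
    combine: Callable[[T, T], T],
) -> Callable[[Optional[T], Optional[T]], Optional[T]]:
    """Wrap `combine` so that None acts as the identity (an empty accumulator).

    Hypothetical helper for illustration; not the PR's actual implementation.
    """

    def _combine(cur: Optional[T], new: Optional[T]) -> Optional[T]:
        if cur is None:  # current accumulator is empty
            return new
        if new is None:  # new accumulator is empty
            return cur
        return combine(cur, new)

    return _combine


safe_max = combine_ignoring_nulls(max)

# Partial accumulators from four blocks; None marks a block with no non-null values.
partials = [None, 3, None, 7]

# Fold left to right, starting from the identity (None).
left_to_right = reduce(safe_max, partials, None)

# Combine the same partials with a different grouping.
regrouped = safe_max(
    safe_max(partials[0], partials[1]),
    safe_max(partials[2], partials[3]),
)
```

Because None behaves as a true identity here, both groupings yield the same result, which is the associativity property the description is after.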

Case of ignore_nulls=False:

  • Combination protocol

    • If the new accumulator is null (i.e., it has a null in its sequence, because we're NOT ignoring nulls), return it
    • If the current accumulator is null (i.e., it had a null in the prior sequence, because we're NOT ignoring nulls), return it
    • Otherwise, combine the current and new accumulators
  • Identity (zero) value is the actual zero value for the operation (0 for count and sum, -inf for max, +inf for min, etc.)
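
For contrast, the ignore_nulls=False rules can be sketched the same way. Again, this is a minimal illustration under assumptions: `combine_propagating_nulls`, `strict_max`, and `MAX_ZERO` are hypothetical names, not identifiers from this PR.

```python
import math
from functools import reduce
from typing import Callable, Optional


def combine_propagating_nulls(
    combine: Callable[[float, float], float],
) -> Callable[[Optional[float], Optional[float]], Optional[float]]:
    """Wrap `combine` so that a null on either side poisons the result.

    Hypothetical helper for illustration; not the PR's actual implementation.
    """

    def _combine(cur: Optional[float], new: Optional[float]) -> Optional[float]:
        if new is None:  # new sequence contained a null
            return new
        if cur is None:  # prior sequence contained a null
            return cur
        return combine(cur, new)

    return _combine


strict_max = combine_propagating_nulls(max)

# With nulls NOT ignored, the identity is the operation's real zero value.
MAX_ZERO = -math.inf

clean = reduce(strict_max, [3.0, 7.0, 5.0], MAX_ZERO)
with_null = reduce(strict_max, [3.0, None, 5.0], MAX_ZERO)

# A different grouping of the same inputs still yields None.
regrouped = strict_max(strict_max(MAX_ZERO, 3.0), strict_max(None, 5.0))
```

Here None can no longer double as the identity, since it already means "a null was seen"; this is why the identity has to be an actual zero value such as -inf for max.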

Changes

  • Revisited aggregation protocol rebasing it onto _Optional
  • Added explicit aggregation protocol test
  • Fixed Quantile finalization seq
  • Fixed Std/Mean incorrectly handling NaNs

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@alexeykudinkin alexeykudinkin requested a review from a team as a code owner February 20, 2025 06:40
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Feb 20, 2025
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…impls

Fixed Pandas/Arrow impls;

…ntainer)

"""

def _safe_zero_factory(_):
    if ignore_nulls:
Contributor

Nit: we can move the ignore_nulls check out of this function, so it doesn't need to be evaluated for every single record.

Contributor Author

Good call! Fixed for combination seq as well
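
A minimal sketch of the hoisting suggested above, assuming the zero-factory is built inside a closure over `ignore_nulls`. The name `make_zero_factory` and its signature are hypothetical and not taken from the PR; the point is only that the branch runs once, when the factory is built, rather than on every call.

```python
from typing import Any, Callable, Optional


def make_zero_factory(
    zero_factory: Callable[[Any], Any],
    ignore_nulls: bool,
) -> Callable[[Any], Optional[Any]]:
    """Choose the zero-factory once, instead of testing `ignore_nulls` per call.

    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    if ignore_nulls:
        # None simulates an empty accumulator.
        return lambda _key: None
    # Otherwise use the operation's actual zero value.
    return zero_factory


sum_zero_ignoring = make_zero_factory(lambda _key: 0, ignore_nulls=True)
sum_zero_strict = make_zero_factory(lambda _key: 0, ignore_nulls=False)
```

The same pattern would apply to the combination step: select the null-handling variant up front and return it, so the per-record path carries no extra branch.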

    Unique: lambda col, ignore_nulls: set(pac.unique(col).to_pylist()),
}

return _map[agg_cls]
Contributor

We need to add the following at the end of this file:

if __name__ == "__main__":
    sys.exit(pytest.main(["-v", __file__]))

Contributor Author

@alexeykudinkin alexeykudinkin Feb 21, 2025

Ah, keep forgetting about it

    return _safe_finalize


def _null_safe_combine(
Contributor

Can you also add some unit tests for these wrappers?

Contributor Author

These are all tested explicitly in test_null_safe_aggregation_protocol

…ecution

@raulchen raulchen merged commit e1f8524 into master Feb 21, 2025
5 checks passed
@raulchen raulchen deleted the ak/aggr-nan-fix branch February 21, 2025 05:02