andygrove
Member

@andygrove andygrove commented Jun 4, 2025

Which issue does this PR close?

Part of #1824

Rationale for this change

Comet is not compatible with Spark for aggregate queries that use count(distinct) on a column containing NaN values. This appears to be caused by a bug in DataFusion (apache/datafusion#16254).
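The description above does not say how the results differ, so the following is only an illustrative sketch of why NaN is a hazard for distinct counting, not a reproduction of the DataFusion bug. Under IEEE 754, NaN is not equal to itself, while Spark SQL's documented NaN semantics group all NaN values together in aggregations. The function names and sample data are hypothetical.

```python
import math


def count_distinct_naive(values):
    """Distinct count that relies on plain equality.

    Separately created NaN objects compare unequal, so each one is
    kept as its own "distinct" value.
    """
    return len({v for v in values if v is not None})


def count_distinct_nan_grouped(values):
    """Distinct count that treats every NaN as the same value.

    This mirrors Spark's documented behaviour for aggregations; it is
    an assumption-labelled sketch, not Comet or DataFusion code.
    """
    seen = set()
    has_nan = False
    for v in values:
        if v is None:
            continue  # count(distinct) ignores NULLs
        if isinstance(v, float) and math.isnan(v):
            has_nan = True  # collapse all NaNs into a single bucket
        else:
            seen.add(v)
    return len(seen) + (1 if has_nan else 0)


# Hypothetical column: two NaNs, two ordinary values, one NULL.
values = [1.0, float("nan"), float("nan"), 2.0, None]
naive = count_distinct_naive(values)
grouped = count_distinct_nan_grouped(values)
```

With this sample column the two functions disagree, which is the kind of mismatch a Spark compatibility test would flag.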

What changes are included in this PR?

Explain the bug in the compatibility guide and ignore the Spark SQL test.

Note that the Spark SQL test currently passes only because the query falls back to Spark, but this will no longer be the case once the COMET_SHUFFLE_FALLBACK_TO_COLUMNAR config is removed.

We should eventually fix the bug, but let's at least document it for now.

How are these changes tested?

Comment on lines -32 to -37
# Compatibility Guide

Comet aims to provide consistent results with the version of Apache Spark that is being used.

This guide offers information about areas of functionality where there are known differences.

Member Author

This section was duplicated.

@andygrove andygrove changed the title from "chore: Update documentation and ignore Spark SQL tests for known issue" to "chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate" Jun 4, 2025
@andygrove andygrove marked this pull request as ready for review June 4, 2025 22:46
@codecov-commenter

codecov-commenter commented Jun 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.40%. Comparing base (f09f8af) to head (6f90a2b).
Report is 242 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1847      +/-   ##
============================================
+ Coverage     56.12%   59.40%   +3.28%     
- Complexity      976     1151     +175     
============================================
  Files           119      130      +11     
  Lines         11743    12663     +920     
  Branches       2251     2374     +123     
============================================
+ Hits           6591     7523     +932     
+ Misses         4012     3930      -82     
- Partials       1140     1210      +70     

@andygrove
Member Author

I reported the bug in DataFusion yesterday and there is already a fix: apache/datafusion#16256.

Perhaps this will be included in the 48.0.0 release, so moving this to draft for now.

@andygrove andygrove marked this pull request as draft June 5, 2025 13:21
@andygrove andygrove marked this pull request as ready for review June 6, 2025 18:58
@andygrove
Member Author

Thanks for the review, @parthchandra. I will go ahead and merge this, then re-enable the tests once we upgrade to DataFusion 48.

@andygrove andygrove merged commit cdfdc21 into apache:main Jun 6, 2025
78 checks passed
@andygrove andygrove deleted the count-distinct-nan-in-aggregates branch June 6, 2025 18:59
YanivKunda pushed a commit to YanivKunda/datafusion-comet that referenced this pull request Jun 8, 2025
3 participants