chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate #1847

andygrove · 2025-06-04T22:32:48Z

Which issue does this PR close?

Part of #1824

Rationale for this change

Comet is not compatible with Spark for aggregate queries that use the aggregate expression count(distinct) on a column that contains NaN values. This appears to be a bug in DataFusion (apache/datafusion#16254).
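
To make the incompatibility concrete, here is a minimal Python sketch of two ways to count distinct values. Spark's documented NaN semantics treat NaN as equal to NaN, so every NaN in a column collapses into a single distinct value. The second function, `count_distinct_ieee`, is a hypothetical illustration of what happens when values are compared with plain IEEE 754 equality instead. Both function names are made up for this example, and the sketch does not describe DataFusion's actual code path; the real root cause is tracked in apache/datafusion#16254.

```python
import math


def count_distinct_spark_like(values):
    """Count distinct non-NULL values, treating all NaNs as one value.

    This mirrors Spark's documented NaN semantics (NaN = NaN is true).
    """
    seen = set()
    for v in values:
        if v is None:
            continue  # count(distinct ...) ignores NULLs
        # Normalize every NaN to a single sentinel key.
        seen.add("NaN" if math.isnan(v) else v)
    return len(seen)


def count_distinct_ieee(values):
    """Hypothetical counter that relies on IEEE 754 equality.

    NaN never compares equal to anything, including itself, so each NaN
    is counted as a new distinct value.
    """
    distinct = []
    for v in values:
        if v is None:
            continue
        if not any(v == other for other in distinct):
            distinct.append(v)
    return len(distinct)


# Illustrative column: two NaNs, a repeated 1.0, and a NULL.
column = [1.0, float("nan"), float("nan"), None, 1.0]

print(count_distinct_spark_like(column))  # 2
print(count_distinct_ieee(column))        # 3
```

Any engine that does not normalize NaN before deduplicating can disagree with Spark in this way, which is why the affected tests need to be ignored until the upstream fix is available.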

What changes are included in this PR?

Explain the bug in the compatibility guide and ignore the Spark SQL test.

Note that the Spark SQL test currently passes only because the query falls back to Spark, but this will no longer be the case once the COMET_SHUFFLE_FALLBACK_TO_COLUMNAR config is removed.

We should eventually fix the bug, but let's at least document it for now.

How are these changes tested?

andygrove · 2025-06-04T22:33:11Z

docs/source/user-guide/compatibility.md

-# Compatibility Guide
-
-Comet aims to provide consistent results with the version of Apache Spark that is being used.
-
-This guide offers information about areas of functionality where there are known differences.
-


This section was duplicated.

codecov-commenter · 2025-06-04T23:09:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.40%. Comparing base (f09f8af) to head (6f90a2b).
Report is 242 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1847      +/-   ##
============================================
+ Coverage     56.12%   59.40%   +3.28%     
- Complexity      976     1151     +175     
============================================
  Files           119      130      +11     
  Lines         11743    12663     +920     
  Branches       2251     2374     +123     
============================================
+ Hits           6591     7523     +932     
+ Misses         4012     3930      -82     
- Partials       1140     1210      +70


andygrove · 2025-06-05T13:21:28Z

I reported the bug in DataFusion yesterday and there is already a fix: apache/datafusion#16256.

Perhaps this will be included in the 48.0.0 release, so moving this to draft for now.

andygrove · 2025-06-06T18:59:13Z

Thanks for the review @parthchandra. I will go ahead and merge this, then re-enable the tests once we upgrade to DataFusion 48.


andygrove added 2 commits June 4, 2025 16:24

ignore test for 3.5.5

2cc00c2

update docs

0fda09b

andygrove commented Jun 4, 2025


3.4.3

c4219cd

andygrove changed the title ~~chore: Update documentation and ignore Spark SQL tests for known issue~~ chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate Jun 4, 2025

andygrove added 2 commits June 4, 2025 16:43

3.5.4

eb69d50

4.0.0-preview1

6f90a2b

andygrove marked this pull request as ready for review June 4, 2025 22:46

andygrove requested review from kazuyukitanimura and parthchandra June 5, 2025 00:24

andygrove marked this pull request as draft June 5, 2025 13:21

parthchandra approved these changes Jun 5, 2025


andygrove marked this pull request as ready for review June 6, 2025 18:58

andygrove merged commit cdfdc21 into apache:main Jun 6, 2025
78 checks passed

andygrove deleted the count-distinct-nan-in-aggregates branch June 6, 2025 18:59

YanivKunda pushed a commit to YanivKunda/datafusion-comet that referenced this pull request Jun 8, 2025

chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate (apache#1847)

281d003

YanivKunda pushed a commit to YanivKunda/datafusion-comet that referenced this pull request Jun 8, 2025

chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate (apache#1847)

7e187f8
