-
Notifications
You must be signed in to change notification settings - Fork 239
chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate #1847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate #1847
Conversation
# Compatibility Guide | ||
|
||
Comet aims to provide consistent results with the version of Apache Spark that is being used. | ||
|
||
This guide offers information about areas of functionality where there are known differences. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thsi section was duplicated
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1847 +/- ##
============================================
+ Coverage 56.12% 59.40% +3.28%
- Complexity 976 1151 +175
============================================
Files 119 130 +11
Lines 11743 12663 +920
Branches 2251 2374 +123
============================================
+ Hits 6591 7523 +932
+ Misses 4012 3930 -82
- Partials 1140 1210 +70 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
I reported the bug in DataFusion yesterday and there is already a fix apache/datafusion#16256 Perhaps this will be included in the 48.0.0 release, so moving this to draft for now. |
Thanks for the review @parthchandra. I will go ahead and merge this and then re-enable the tests once we upgrade to DataFusion 48 |
…e with count distinct on NaN in aggregate (apache#1847)
…e with count distinct on NaN in aggregate (apache#1847)
Which issue does this PR close?
Part of #1824
Rationale for this change
Comet is not compatible with Spark for aggregate queries that use the aggregate expression
count(distinct)
on a column that containsNaN
values. This appears to be a bug in DataFusion (apache/datafusion#16254).What changes are included in this PR?
Explain the bug in the compatibility guide and ignore the Spark SQL test.
Note that the Spark SQL test currently passes only because the query falls back to Spark, but this will no longer be the case once the
COMET_SHUFFLE_FALLBACK_TO_COLUMNAR
config is removed.We should eventually fix the bug, but let's at least document it for now.
How are these changes tested?