Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IS NOT NULL predicate rewrite is incorrect #9230

Closed
alamb opened this issue Feb 14, 2024 · 6 comments · Fixed by #9232
Closed

IS NOT NULL predicate rewrite is incorrect #9230

alamb opened this issue Feb 14, 2024 · 6 comments · Fixed by #9232
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Feb 14, 2024

Describe the bug

The logic introduced in #9208 is (very subtly) incorrect as I found while upgrading to use it in InfluxDB

To Reproduce

The data is like this

name: cpu
+---------------------+--------------+
| time                | usage_system |
+---------------------+--------------+
| 1970-01-01T00:01:00 | 99.8         |
| 1970-01-01T00:01:00 | 89.5         |
| 1970-01-01T00:01:10 | 88.6         |
| 1970-01-01T00:01:10 | 99.7         |
| 1970-01-01T00:01:20 |              |
| 1970-01-01T00:01:20 |              |
| 1970-01-01T00:01:30 | 83.4         |
| 1970-01-01T00:01:30 |              |
| 1970-01-01T00:01:40 | 87.7         |
| 1970-01-01T00:01:40 |              |
| 1970-01-01T00:01:50 |              |
| 1970-01-01T00:01:50 |              |
| 1970-01-01T00:02:00 |              |
| 1970-01-01T00:02:00 | 99.9         |
| 1970-01-01T00:02:10 | 89.8         |
| 1970-01-01T00:02:10 | 99.8         |
| 1970-01-01T00:02:20 |              |
| 1970-01-01T00:02:20 | 99.9         |
| 1970-01-01T00:02:30 |              |
| 1970-01-01T00:02:30 |              |
| 1970-01-01T00:02:40 |              |
| 1970-01-01T00:02:40 |              |
| 1970-01-01T00:02:50 | 89.8         |
| 1970-01-01T00:02:50 |              |
| 1970-01-01T00:03:00 | 90.0         |
| 1970-01-01T00:03:00 | 99.8         |
| 1970-01-01T00:03:10 |              |
| 1970-01-01T00:03:10 | 99.8         |
+---------------------+--------------+

Note it has BOTH 14 null values and 14 non null values for usage_system and non null values

The original predicate was

usage_system@3 IS NOT NULL AND time@1 <= 1707945813397450000

And the rewritten predicate is (now)

usage_system_null_count@0 = 0 AND time_min@1 <= 1707945813397450000

Thus, the input data statistics look like this

+-------------------------+---------------------+
| usage_system_null_count | time_min            |
+-------------------------+---------------------+
| 14                      | 1970-01-01T00:01:00 |
+-------------------------+---------------------+

With these statistics this predicate evaluates to false and the data is pruned,

Expected behavior

The data should not be pruned because there are rows for which IS NOT NULL evaluates to true (there are non null values in the data)

Additional context

The pruning predicate needs to return false only if "there are no rows that could possibly match the predicate", which for IS NOT NULL means that there are only null values in the data, but the current implementation checks if there are any null values in the data.

So in this case, that the original predicate needs to be rewritten to

usage_system_null_count@0 = usage_system_row_count@0  AND time_min@1 <= 1707945813397450000

We of course don't have row_count information yet (but @appletreeisyellow) is working on adding in #9171

This kind of subtle logic bug is another reason I think we should be seriously considering the range based analysis described in #7887

@alamb
Copy link
Contributor Author

alamb commented Feb 14, 2024

I believe @appletreeisyellow plans to revert #9208 and then work on this feature as part of #9231

@viirya
Copy link
Member

viirya commented Feb 14, 2024

Ah, right. The IS NOT NULL predicate should disallow row groups which contains ALL nulls, not disallow which contains any nulls.

@alamb
Copy link
Contributor Author

alamb commented Feb 14, 2024

Yeah I am embarassed I didn't catch it in review

@viirya
Copy link
Member

viirya commented Feb 14, 2024

I feel a little strange when reviewing the PR, and thought I missed something. That is! 😄

@alamb
Copy link
Contributor Author

alamb commented Feb 14, 2024

This logic is very subtle. As I mentioned above, I am really thinking that to make the pruning predicate work more generally we need a different approach than this particular style of rewrite (I am partial to #7887)

@appletreeisyellow
Copy link
Contributor

Thanks for catching it so quickly @alamb and thanks for the quick review on the revert PR #9232 @alamb and @viirya!

The pruning predicate needs to return false only if "there are no rows that could possibly match the predicate", which for IS NOT NULL means that there are only null values in the data, but the current implementation checks if there are any null values in the data.

📌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants