`IS NOT NULL` predicate rewrite is incorrect #9230

alamb · 2024-02-14T21:45:36Z

Describe the bug

The logic introduced in #9208 is (very subtly) incorrect as I found while upgrading to use it in InfluxDB

To Reproduce

The data is like this

name: cpu
+---------------------+--------------+
| time                | usage_system |
+---------------------+--------------+
| 1970-01-01T00:01:00 | 99.8         |
| 1970-01-01T00:01:00 | 89.5         |
| 1970-01-01T00:01:10 | 88.6         |
| 1970-01-01T00:01:10 | 99.7         |
| 1970-01-01T00:01:20 |              |
| 1970-01-01T00:01:20 |              |
| 1970-01-01T00:01:30 | 83.4         |
| 1970-01-01T00:01:30 |              |
| 1970-01-01T00:01:40 | 87.7         |
| 1970-01-01T00:01:40 |              |
| 1970-01-01T00:01:50 |              |
| 1970-01-01T00:01:50 |              |
| 1970-01-01T00:02:00 |              |
| 1970-01-01T00:02:00 | 99.9         |
| 1970-01-01T00:02:10 | 89.8         |
| 1970-01-01T00:02:10 | 99.8         |
| 1970-01-01T00:02:20 |              |
| 1970-01-01T00:02:20 | 99.9         |
| 1970-01-01T00:02:30 |              |
| 1970-01-01T00:02:30 |              |
| 1970-01-01T00:02:40 |              |
| 1970-01-01T00:02:40 |              |
| 1970-01-01T00:02:50 | 89.8         |
| 1970-01-01T00:02:50 |              |
| 1970-01-01T00:03:00 | 90.0         |
| 1970-01-01T00:03:00 | 99.8         |
| 1970-01-01T00:03:10 |              |
| 1970-01-01T00:03:10 | 99.8         |
+---------------------+--------------+

Note it has BOTH 14 null values and 14 non null values for usage_system and non null values

The original predicate was

usage_system@3 IS NOT NULL AND time@1 <= 1707945813397450000

And the rewritten predicate is (now)

usage_system_null_count@0 = 0 AND time_min@1 <= 1707945813397450000

Thus, the input data statistics look like this

+-------------------------+---------------------+
| usage_system_null_count | time_min            |
+-------------------------+---------------------+
| 14                      | 1970-01-01T00:01:00 |
+-------------------------+---------------------+

With these statistics this predicate evaluates to false and the data is pruned,

Expected behavior

The data should not be pruned because there are rows for which IS NOT NULL evaluates to true (there are non null values in the data)

Additional context

The pruning predicate needs to return false only if "there are no rows that could possibly match the predicate", which for IS NOT NULL means that there are only null values in the data, but the current implementation checks if there are any null values in the data.

So in this case, that the original predicate needs to be rewritten to

usage_system_null_count@0 = usage_system_row_count@0  AND time_min@1 <= 1707945813397450000

We of course don't have row_count information yet (but @appletreeisyellow) is working on adding in #9171

This kind of subtle logic bug is another reason I think we should be seriously considering the range based analysis described in #7887

The text was updated successfully, but these errors were encountered:

alamb · 2024-02-14T21:56:51Z

I believe @appletreeisyellow plans to revert #9208 and then work on this feature as part of #9231

viirya · 2024-02-14T21:58:08Z

Ah, right. The IS NOT NULL predicate should disallow row groups which contains ALL nulls, not disallow which contains any nulls.

alamb · 2024-02-14T21:59:25Z

Yeah I am embarassed I didn't catch it in review

viirya · 2024-02-14T21:59:33Z

I feel a little strange when reviewing the PR, and thought I missed something. That is! 😄

alamb · 2024-02-14T22:00:59Z

This logic is very subtle. As I mentioned above, I am really thinking that to make the pruning predicate work more generally we need a different approach than this particular style of rewrite (I am partial to #7887)

appletreeisyellow · 2024-02-14T22:07:06Z

Thanks for catching it so quickly @alamb and thanks for the quick review on the revert PR #9232 @alamb and @viirya!

The pruning predicate needs to return false only if "there are no rows that could possibly match the predicate", which for IS NOT NULL means that there are only null values in the data, but the current implementation checks if there are any null values in the data.

📌

alamb added the bug Something isn't working label Feb 14, 2024

This was referenced Feb 14, 2024

Consolidate interval analysies from Interval and PruningPredicate #7887

Open

chore(pruning): Support IS NOT NULL predicates in PruningPredicate #9208

Merged

Support IS NOT NULL predicates in PruningPredicate #9231

Closed

appletreeisyellow mentioned this issue Feb 14, 2024

Revert "chore(pruning): Support IS NOT NULL predicates in PruningPredicate (#9208)" #9232

Merged

alamb closed this as completed in #9232 Feb 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`IS NOT NULL` predicate rewrite is incorrect #9230

`IS NOT NULL` predicate rewrite is incorrect #9230

alamb commented Feb 14, 2024

alamb commented Feb 14, 2024 •

edited

Loading

viirya commented Feb 14, 2024

alamb commented Feb 14, 2024

viirya commented Feb 14, 2024

alamb commented Feb 14, 2024

appletreeisyellow commented Feb 14, 2024

IS NOT NULL predicate rewrite is incorrect #9230

IS NOT NULL predicate rewrite is incorrect #9230

Comments

alamb commented Feb 14, 2024

Describe the bug

To Reproduce

Expected behavior

Additional context

alamb commented Feb 14, 2024 • edited Loading

viirya commented Feb 14, 2024

alamb commented Feb 14, 2024

viirya commented Feb 14, 2024

alamb commented Feb 14, 2024

appletreeisyellow commented Feb 14, 2024

`IS NOT NULL` predicate rewrite is incorrect #9230

`IS NOT NULL` predicate rewrite is incorrect #9230

alamb commented Feb 14, 2024 •

edited

Loading