Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NULL in list simplifications #8688

Closed
alamb opened this issue Dec 30, 2023 · 4 comments · Fixed by #8691
Closed

Add NULL in list simplifications #8688

alamb opened this issue Dec 30, 2023 · 4 comments · Fixed by #8691
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Dec 30, 2023

Is your feature request related to a problem or challenge?

SELECT .. WHERE NULL IN (1,2,3) and SELECT ... WHERE x in (NULL, 2, 3) are always NULL (and thus will filter out all rows). However, DataFusion will still try and evaluate a predicate:

DataFusion CLI v34.0.0
❯ create table t(x int) as values (1), (2), (3);
0 rows in set. Query took 0.003 seconds.

Note there is a FilterExec with a non trivial expression in both of the following queries:

❯ explain select x from t where x IN (null, 2, 3);
+---------------+---------------------------------------------------------------+
| plan_type     | plan                                                          |
+---------------+---------------------------------------------------------------+
| logical_plan  | Filter: t.x = Int32(NULL) OR t.x = Int32(2) OR t.x = Int32(3) |
|               |   TableScan: t projection=[x]                                 |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                   |
|               |   FilterExec: x@0 = NULL OR x@0 = 2 OR x@0 = 3                |
|               |     MemoryExec: partitions=1, partition_sizes=[1]             |
|               |                                                               |
+---------------+---------------------------------------------------------------+
2 rows in set. Query took 0.002 seconds.

❯ explain select x from t where null IN (x, 2, 3);
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: Int64(NULL) IN ([CAST(t.x AS Int64), Int64(2), Int64(3)])                                                                                                                                                                                                                                                                                                                                   |
|               |   TableScan: t projection=[x]                                                                                                                                                                                                                                                                                                                                                                       |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                         |
|               |   FilterExec: NULL IN ([CastExpr { expr: Column { name: "x", index: 0 }, cast_type: Int64, cast_options: CastOptions { safe: false, format_options: FormatOptions { safe: true, null: "", date_format: None, datetime_format: None, timestamp_format: None, timestamp_tz_format: None, time_format: None, duration_format: Pretty } } }, Literal { value: Int64(2) }, Literal { value: Int64(3) }]) |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                                                                                                                                                                                                                                                                                                                                   |
|               |                                                                                                                                                                                                                                                                                                                                                                                                     |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

In both cases the predicate could have been reduced to a single NULL

Describe the solution you'd like

I would like to extend the ExprSimplifier rules to handle the case of NULL IN (...) and when the InList contains NULL

Here are some similar rules
https://github.com/apache/arrow-datafusion/blob/cc3042a6343457036770267f921bb3b6e726956c/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L474-L549

Describe alternatives you've considered

No response

Additional context

These types of expressions are sometimes generated programmatically during rewrites in IOx.

It also came up with discussions with @yahoNanJing on #8669

I think this would be a good first issue as the patterns exist already and the need is well defined.

@alamb alamb added enhancement New feature or request good first issue Good for newcomers labels Dec 30, 2023
@alamb alamb changed the title Add NULL in list simplifcation Add NULL in list simplifications Dec 30, 2023
@asimsedhain
Copy link
Contributor

I would like to take a look at this.

If my understanding is correct, the first case i.e. SELECT .. WHERE NULL IN (…) would simplify to SELECT .. WHERE False.

As for the second case, SELECT .. WHERE x IN (NULL, …) would be simplified to SELECT .. WHERE x IN (…) as it could still return rows for the other exprs in the list.

asimsedhain added a commit to asimsedhain/arrow-datafusion that referenced this issue Dec 31, 2023
asimsedhain added a commit to asimsedhain/arrow-datafusion that referenced this issue Dec 31, 2023
@alamb
Copy link
Contributor Author

alamb commented Dec 31, 2023

I would like to take a look at this.

Thank you !

If my understanding is correct, the first case i.e. SELECT .. WHERE NULL IN (…) would simplify to SELECT .. WHERE False.

I think it actually would simplify to SELECT .. WHERE null (which is technically different than WHERE false but will have the same effect of returning no rows). This is explained a little more in #8689

Using datafusion-cli, you can check it:

❯ create table t (x int) as values (1), (2);
0 rows in set. Query took 0.007 seconds.

❯ select NULL IN (1, 2);
+----------------------------------------------------------------------+
| NULL IN (Map { iter: Iter([Literal(Int64(1)), Literal(Int64(2))]) }) |
+----------------------------------------------------------------------+
|                                                                      |
+----------------------------------------------------------------------+
1 row in set. Query took 0.002 seconds.

As for the second case, SELECT .. WHERE x IN (NULL, …) would be simplified to SELECT .. WHERE x IN (…) as it could still return rows for the other exprs in the list.

I actually tried this out and it seems like this is NOT a correct simplification (sorry for my confusion). So maybe we can't simplify such select lists 🤔

DataFusion CLI v34.0.0
❯ create table t (x int) as values (1), (2);
0 rows in set. Query took 0.006 seconds.

❯ select x IN (NULL, 2) from t;
+-----------------------------------------------------------------+
| t.x IN (Map { iter: Iter([Literal(NULL), Literal(Int64(2))]) }) |
+-----------------------------------------------------------------+
|                                                                 |
| true                                                            |
+-----------------------------------------------------------------+
2 rows in set. Query took 0.002 seconds.

❯ select x IN (2) from t;
+--------------------------------------------------+
| t.x IN (Map { iter: Iter([Literal(Int64(2))]) }) |
+--------------------------------------------------+
| false                                            |
| true                                             |
+--------------------------------------------------+
2 rows in set. Query took 0.001 seconds.

❯

Also, the same in postgres:

postgres=# create table t (x int);
CREATE TABLE
postgres=# insert into t values (1), (2);
INSERT 0 2

postgres=# select x in (NULL, 2) from t;
 ?column?
----------

 t
(2 rows)

postgres=# select x in (2) from t;
 ?column?
----------
 f
 t
(2 rows)

asimsedhain added a commit to asimsedhain/arrow-datafusion that referenced this issue Dec 31, 2023
asimsedhain added a commit to asimsedhain/arrow-datafusion that referenced this issue Dec 31, 2023
@asimsedhain
Copy link
Contributor

I see how they are different for the first case. Thank you for explaining.

As for the second case, I agree we might not be able to simplify the list. Any expr in (null, e1, e2...) will be equivalent to expr=null or expr=e1 or .... We cannot really convert the expresion into null without evaluating expr=e1.

Updated the PR to only address the first case.

alamb pushed a commit that referenced this issue Jan 1, 2024
@Ted-Jiang
Copy link
Member

I would like to take a look at this.

Thank you !

If my understanding is correct, the first case i.e. SELECT .. WHERE NULL IN (…) would simplify to SELECT .. WHERE False.

I think it actually would simplify to SELECT .. WHERE null (which is technically different than WHERE false but will have the same effect of returning no rows). This is explained a little more in #8689

Using datafusion-cli, you can check it:

❯ create table t (x int) as values (1), (2);
0 rows in set. Query took 0.007 seconds.

❯ select NULL IN (1, 2);
+----------------------------------------------------------------------+
| NULL IN (Map { iter: Iter([Literal(Int64(1)), Literal(Int64(2))]) }) |
+----------------------------------------------------------------------+
|                                                                      |
+----------------------------------------------------------------------+
1 row in set. Query took 0.002 seconds.

As for the second case, SELECT .. WHERE x IN (NULL, …) would be simplified to SELECT .. WHERE x IN (…) as it could still return rows for the other exprs in the list.

I actually tried this out and it seems like this is NOT a correct simplification (sorry for my confusion). So maybe we can't simplify such select lists 🤔

DataFusion CLI v34.0.0
❯ create table t (x int) as values (1), (2);
0 rows in set. Query took 0.006 seconds.

❯ select x IN (NULL, 2) from t;
+-----------------------------------------------------------------+
| t.x IN (Map { iter: Iter([Literal(NULL), Literal(Int64(2))]) }) |
+-----------------------------------------------------------------+
|                                                                 |
| true                                                            |
+-----------------------------------------------------------------+
2 rows in set. Query took 0.002 seconds.

❯ select x IN (2) from t;
+--------------------------------------------------+
| t.x IN (Map { iter: Iter([Literal(Int64(2))]) }) |
+--------------------------------------------------+
| false                                            |
| true                                             |
+--------------------------------------------------+
2 rows in set. Query took 0.001 seconds.

❯

Also, the same in postgres:

postgres=# create table t (x int);
CREATE TABLE
postgres=# insert into t values (1), (2);
INSERT 0 2

postgres=# select x in (NULL, 2) from t;
 ?column?
----------

 t
(2 rows)

postgres=# select x in (2) from t;
 ?column?
----------
 f
 t
(2 rows)

Thanks for explain this! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants