Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: struct field don't push down to TableScan #8774

Merged
merged 6 commits into from
Jan 8, 2024
Merged

Conversation

haohuaijin
Copy link
Contributor

Which issue does this PR close?

Closes #8735

Rationale for this change

when debug #8735, I find not only the case in #8735 don't push down, we have other case don't push down
for example

DataFusion CLI v34.0.0
❯ create table t(a int, b int, c int);
0 rows in set. Query took 0.014 seconds.

❯ explain select -a from t;
+---------------+-------------------------------------------------+
| plan_type     | plan                                            |
+---------------+-------------------------------------------------+
| logical_plan  | Projection: (- t.a)                             |
|               |   TableScan: t projection=[a, b, c]             |
| physical_plan | ProjectionExec: expr=[(- a@0) as (- t.a)]       |
|               |   MemoryExec: partitions=1, partition_sizes=[0] |
|               |                                                 |
+---------------+-------------------------------------------------+
2 rows in set. Query took 0.016 seconds.

What changes are included in this PR?

add more case to outer_columns_helper.

Are these changes tested?

yes, add more tests

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 6, 2024
@@ -149,7 +149,7 @@ select count(*) from m1 where tag_id = '1000' and time < '2024-01-03T14:46:35+01
10

query RRR
select min(f5), max(f5), avg(f5) from m2 where tag_id = '1000' and time < '2024-01-03T14:46:35+01:00' group by type;
select min(f5), max(f5), avg(f5) from m2 where tag_id = '1000' and time < '2024-01-03T14:46:35+01:00' group by type order by min(f5);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just make result stable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A different fix is here: #8769

@haohuaijin
Copy link
Contributor Author

haohuaijin commented Jan 6, 2024

the root reason maybe in below code
https://github.com/apache/arrow-datafusion/blob/4e4059a68455fbc14f04902c76acbcd258b7f2ef/datafusion/optimizer/src/optimize_projections.rs#L771-L775

I'm not sure why we handle it this way. I tried commenting out this line, and all the tests still passed(the only test don't pass is not related to this, and I comment above in dictionary.slt.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @haohuaijin -- given this PR fixes the bug and includes good test coverage ❤️ I think we could merge it as is.

However, I had some suggestions to improve the code even more -- do you think we should try to do those as part of this PR or maybe as a follow on PR?

| Expr::IsNotUnknown(expr)
| Expr::IsNotNull(expr)
| Expr::IsNull(expr)
| Expr::Negative(expr) => outer_columns_helper(expr, columns),
Expr::Column(_) | Expr::Literal(_) | Expr::Wildcard { .. } => true,
_ => false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about removing this catchall (and and thus explicitly enumerating all Expr types so this type of bug can't happen again)?

Suggested change
_ => false,

An alternate idea would be to remove this explicit walk down the tree entirely and use https://docs.rs/datafusion/latest/datafusion/logical_expr/utils/fn.expr_to_columns.html instead?

This pushdown code may predate the more generaic tree walks

Copy link
Contributor Author

@haohuaijin haohuaijin Jan 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Already removed the catchall.

An alternate idea would be to remove this explicit walk down the tree entirely and use https://docs.rs/datafusion/latest/datafusion/logical_expr/utils/fn.expr_to_columns.html instead?

the outer_columns_helper function is used to collect all outer_ref_column that expr_to_columns function doesn't return these columns. Because expr_to_columns use TreeNode to walks the expr tree, but TreeNode don't walk the outer_ref_column
https://github.com/apache/arrow-datafusion/blob/dd4263f843e093c807d63edf73a571b1ba2669b5/datafusion/expr/src/tree_node/expr.rs#L69-L77

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the outer_columns_helper function is used to collect all outer_ref_column that expr_to_columns function doesn't return these columns. Because expr_to_columns use TreeNode to walks the expr tree, but TreeNode don't walk the outer_ref_column

I see -- thank you for the explanation, which makes sense. It fiddled a little with the code, and I think I have a cleanup here that uses the TreeNode walking to avoid having to enumerate Expr variants #8787

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @haohuaijin

| Expr::IsNotNull(expr)
| Expr::IsNull(expr)
| Expr::Negative(expr) => outer_columns_helper(expr, columns),
Expr::Column(_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Jan 8, 2024
@alamb alamb merged commit cc42894 into apache:main Jan 8, 2024
22 checks passed
@haohuaijin haohuaijin deleted the fix branch January 9, 2024 00:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: Unneeded fields pushed to TableProvider if struct field is part of query
3 participants