Commit b089137

alamb, jackwener, and Weijun-H authored
Minor: Improve documentation for Filter Pushdown (#8023)
* Minor: Improve documentation for Filter Pushdown
* Update datafusion/optimizer/src/push_down_filter.rs
  Co-authored-by: jakevin <jakevingoo@gmail.com>
* Apply suggestions from code review
* Update datafusion/optimizer/src/push_down_filter.rs
  Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

---------

Co-authored-by: jakevin <jakevingoo@gmail.com>
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
1 parent d2671cd commit b089137

File tree

1 file changed

+82
-19
lines changed

datafusion/optimizer/src/push_down_filter.rs

Lines changed: 82 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,8 @@
1212
// specific language governing permissions and limitations
1313
// under the License.
1414

15-
//! Push Down Filter optimizer rule ensures that filters are applied as early as possible in the plan
15+
//! [`PushDownFilter`] Moves filters so they are applied as early as possible in
16+
//! the plan.
1617
1718
use crate::optimizer::ApplyOrder;
1819
use crate::utils::{conjunction, split_conjunction, split_conjunction_owned};
@@ -33,31 +34,93 @@ use itertools::Itertools;
3334
use std::collections::{HashMap, HashSet};
3435
use std::sync::Arc;
3536

36-
/// Push Down Filter optimizer rule pushes filter clauses down the plan
37+
/// Optimizer rule for pushing (moving) filter expressions down in a plan so
38+
/// they are applied as early as possible.
39+
///
3740
/// # Introduction
38-
/// A filter-commutative operation is an operation whose result of filter(op(data)) = op(filter(data)).
39-
/// An example of a filter-commutative operation is a projection; a counter-example is `limit`.
4041
///
41-
/// The filter-commutative property is column-specific. An aggregate grouped by A on SUM(B)
42-
/// can commute with a filter that depends on A only, but does not commute with a filter that depends
43-
/// on SUM(B).
42+
/// The goal of this rule is to improve query performance by eliminating
43+
/// redundant work.
44+
///
45+
/// For example, given a plan that sorts all values where `a > 10`:
46+
///
47+
/// ```text
48+
/// Filter (a > 10)
49+
/// Sort (a, b)
50+
/// ```
51+
///
52+
/// A better plan is to filter the data *before* the Sort, which sorts fewer
53+
/// rows and therefore does less work overall:
54+
///
55+
/// ```text
56+
/// Sort (a, b)
57+
/// Filter (a > 10) <-- Filter is moved before the sort
58+
/// ```
59+
///
60+
/// However, it is not always possible to push filters down. For example, given a
61+
/// plan that finds the top 3 values and then keeps only those that are greater
62+
/// than 10, if the filter is pushed below the limit it would produce a
63+
/// different result.
64+
///
65+
/// ```text
66+
/// Filter (a > 10) <-- can not move this Filter before the limit
67+
/// Limit (fetch=3)
68+
/// Sort (a, b)
69+
/// ```
70+
///
71+
///
72+
/// More formally, a filter-commutative operation is an operation `op` that
73+
/// satisfies `filter(op(data)) = op(filter(data))`.
74+
///
75+
/// The filter-commutative property is plan and column-specific. A filter on `a`
76+
/// can be pushed through an `Aggregate(group_by = [a], agg = [SUM(b)])`. However, a
77+
/// filter on `SUM(b)` can not be pushed through the same aggregate.
78+
///
79+
/// # Handling Conjunctions
80+
///
81+
/// It is possible to only push down **part** of a filter expression if it is
82+
/// connected with `AND`s (more formally if it is a "conjunction").
83+
///
84+
/// For example, given the following plan:
85+
///
86+
/// ```text
87+
/// Filter(a > 10 AND SUM(b) < 5)
88+
/// Aggregate(group_by = [a], agg = [SUM(b)])
89+
/// ```
90+
///
91+
/// The `a > 10` is commutative with the `Aggregate` but `SUM(b) < 5` is not.
92+
/// Therefore it is possible to only push part of the expression, resulting in:
93+
///
94+
/// ```text
95+
/// Filter(SUM(b) < 5)
96+
/// Aggregate(group_by = [a], agg = [SUM(b)])
97+
/// Filter(a > 10)
98+
/// ```
99+
///
100+
/// # Handling Column Aliases
44101
///
45-
/// This optimizer commutes filters with filter-commutative operations to push the filters
46-
/// the closest possible to the scans, re-writing the filter expressions by every
47-
/// projection that changes the filter's expression.
102+
/// This optimizer must sometimes handle re-writing filter expressions when they
103+
/// are pushed, for example if there is a projection that aliases `a+1` to `"b"`:
48104
///
49-
/// Filter: b Gt Int64(10)
50-
/// Projection: a AS b
105+
/// ```text
106+
/// Filter (b > 10)
107+
/// Projection: [a+1 AS "b"] <-- changes the name of `a+1` to `b`
108+
/// ```
51109
///
52-
/// is optimized to
110+
/// To apply the filter prior to the `Projection`, all references to `b` must be
111+
/// rewritten to `a+1`:
53112
///
54-
/// Projection: a AS b
55-
/// Filter: a Gt Int64(10) <--- changed from b to a
113+
/// ```text
114+
/// Projection: [a+1 AS "b"]
115+
/// Filter: (a + 1 > 10) <--- changed from b to a + 1
116+
/// ```
117+
/// # Implementation Notes
56118
///
57-
/// This performs a single pass through the plan. When it passes through a filter, it stores that filter,
58-
/// and when it reaches a node that does not commute with it, it adds the filter to that place.
59-
/// When it passes through a projection, it re-writes the filter's expression taking into account that projection.
60-
/// When multiple filters would have been written, it `AND` their expressions into a single expression.
119+
/// This implementation performs a single pass through the plan, "pushing" down
120+
/// filters. When it passes through a filter, it stores that filter, and when it
121+
/// reaches a plan node that does not commute with that filter, it adds the
122+
/// filter to that place. When it passes through a projection, it re-writes the
123+
/// filter's expression taking into account that projection.
61124
#[derive(Default)]
62125
pub struct PushDownFilter {}
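
The two rewrites described in the new doc comment ("Handling Conjunctions" and "Handling Column Aliases") can be illustrated with a small standalone sketch. It is not DataFusion's implementation: `Expr`, `col`, `lit`, `bin`, `split_for_aggregate`, `rewrite_alias`, and `demo` are hypothetical names invented for this example, and the aggregate output `SUM(b)` is modeled as a plain column named `"SUM(b)"`.

```rust
use std::collections::HashSet;

/// Toy expression type (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Binary(Box<Expr>, &'static str, Box<Expr>),
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(value: i64) -> Expr {
    Expr::Literal(value)
}

fn bin(l: Expr, op: &'static str, r: Expr) -> Expr {
    Expr::Binary(Box::new(l), op, Box::new(r))
}

/// Flatten nested `AND`s into a list of conjuncts.
fn split_conjunction(expr: Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::Binary(l, "AND", r) => {
            split_conjunction(*l, out);
            split_conjunction(*r, out);
        }
        other => out.push(other),
    }
}

/// Collect every column name referenced by `expr`.
fn columns(expr: &Expr, out: &mut HashSet<String>) {
    match expr {
        Expr::Column(name) => {
            out.insert(name.clone());
        }
        Expr::Literal(_) => {}
        Expr::Binary(l, _, r) => {
            columns(l, out);
            columns(r, out);
        }
    }
}

/// Split a predicate sitting above an aggregate into `(pushed, kept)`.
/// A conjunct is pushed only if it references group-by columns exclusively.
fn split_for_aggregate(predicate: Expr, group_by: &[&str]) -> (Vec<Expr>, Vec<Expr>) {
    let mut conjuncts = Vec::new();
    split_conjunction(predicate, &mut conjuncts);
    conjuncts.into_iter().partition(|e| {
        let mut cols = HashSet::new();
        columns(e, &mut cols);
        cols.iter().all(|c| group_by.contains(&c.as_str()))
    })
}

/// Replace every reference to column `alias` with the aliased `definition`.
fn rewrite_alias(expr: &Expr, alias: &str, definition: &Expr) -> Expr {
    match expr {
        Expr::Column(name) if name == alias => definition.clone(),
        Expr::Binary(l, op, r) => Expr::Binary(
            Box::new(rewrite_alias(l, alias, definition)),
            *op,
            Box::new(rewrite_alias(r, alias, definition)),
        ),
        other => other.clone(),
    }
}

/// Runs both examples from the doc comment.
fn demo() {
    // Filter(a > 10 AND SUM(b) < 5) above Aggregate(group_by = [a], agg = [SUM(b)])
    let predicate = bin(
        bin(col("a"), ">", lit(10)),
        "AND",
        bin(col("SUM(b)"), "<", lit(5)),
    );
    let (pushed, kept) = split_for_aggregate(predicate, &["a"]);
    assert_eq!(pushed, vec![bin(col("a"), ">", lit(10))]);
    assert_eq!(kept, vec![bin(col("SUM(b)"), "<", lit(5))]);

    // Filter (b > 10) above Projection: [a+1 AS "b"]
    let definition = bin(col("a"), "+", lit(1));
    let rewritten = rewrite_alias(&bin(col("b"), ">", lit(10)), "b", &definition);
    assert_eq!(rewritten, bin(definition, ">", lit(10)));
}
```

Calling `demo()` from a `main` function or a unit test exercises both cases. The sketch only decides which conjuncts may move and how they are rewritten; how the real rule traverses the plan and re-inserts filters is described in the "Implementation Notes" section of the doc comment.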