Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Mar 8, 2016

What changes were proposed in this pull request?

This PR improve the codegen of Filter by:

  1. filter out the rows early if it have null value in it that will cause the condition result in null or false. After this, we could simplify the condition, because the input are not nullable anymore.
  2. Split the condition as conjunctive predicates, then check them one by one.

Here is a piece of generated code for Filter in TPCDS Q55:

/* 109 */       /*** CONSUME: Filter ((((isnotnull(d_moy#149) && isnotnull(d_year#147)) && (d_moy#149 = 11)) && (d_year#147 = 1999)) && isnotnull(d_date_sk#141)) */
/* 110 */       /* input[0, int] */
/* 111 */       boolean project_isNull2 = rdd_row.isNullAt(0);
/* 112 */       int project_value2 = project_isNull2 ? -1 : (rdd_row.getInt(0));
/* 113 */       /* input[1, int] */
/* 114 */       boolean project_isNull3 = rdd_row.isNullAt(1);
/* 115 */       int project_value3 = project_isNull3 ? -1 : (rdd_row.getInt(1));
/* 116 */       /* input[2, int] */
/* 117 */       boolean project_isNull4 = rdd_row.isNullAt(2);
/* 118 */       int project_value4 = project_isNull4 ? -1 : (rdd_row.getInt(2));
/* 119 */
/* 120 */       if (project_isNull3) continue;
/* 121 */       if (project_isNull4) continue;
/* 122 */       if (project_isNull2) continue;
/* 123 */
/* 124 */       /* (input[1, int] = 11) */
/* 125 */       boolean filter_value6 = false;
/* 126 */       filter_value6 = project_value3 == 11;
/* 127 */       if (!filter_value6) continue;
/* 128 */
/* 129 */       /* (input[2, int] = 1999) */
/* 130 */       boolean filter_value9 = false;
/* 131 */       filter_value9 = project_value4 == 1999;
/* 132 */       if (!filter_value9) continue;
/* 133 */
/* 134 */       filter_metricValue1.add(1);
/* 135 */
/* 136 */       /*** CONSUME: Project [d_date_sk#141] */
/* 137 */
/* 138 */       project_rowWriter1.write(0, project_value2);
/* 139 */       append(project_result1.copy());

How was this patch tested?

Existing tests.

@SparkQA
Copy link

SparkQA commented Mar 8, 2016

Test build #52686 has finished for PR 11585 at commit edc934f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Filter(condition: Expression, child: SparkPlan)

@nongli
Copy link
Contributor

nongli commented Mar 8, 2016

@sameeragarwal how does this relate to what you are working on?

@davies
Copy link
Contributor Author

davies commented Mar 9, 2016

@nongli It seems that once we have this, we don't need #11511 anymore, right?

@sameeragarwal
Copy link
Member

@nongli @davies I think part of the rationale for having #11511 was to avoid having to extract the operator's (and its child's) constraints during the physical planning/ codegen phase. Could we not just rewrite the code generated by filter such that all isNotNull attributes are converted to ${input(i).isNull? Along with #11372, #11511 would help us achieve the same goal.

@sameeragarwal
Copy link
Member

Also @davies, if you prefer your current approach, we then need to propagate the constraints in the physical plan as well (currently they are only propagated in the logical plan).

@davies
Copy link
Contributor Author

davies commented Mar 9, 2016

After offline discussion with @sameeragarwal , we agreed that 1) the constraints should only used in logical plan 2) we should use the IsNotNull() inserted by #11372, 3) we do not need to re-order the conditions anymore, just do that in Filter. 4) The IsNotNull from join ares pushed down into it's children, join requires nullability propagated from it's children, that's another story.

I will update this PR based on these, and revert #11511

cc @yhuai @nongli

@SparkQA
Copy link

SparkQA commented Mar 9, 2016

Test build #52757 has finished for PR 11585 at commit 4317853.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 10, 2016

Test build #52780 has finished for PR 11585 at commit 2ec7f5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

""".stripMargin
}.mkString("\n")

// Reset the isNull to false for the not-null columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment why

Davies Liu added 2 commits March 10, 2016 13:22
@SparkQA
Copy link

SparkQA commented Mar 10, 2016

Test build #52858 has finished for PR 11585 at commit b2906e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Copy link
Contributor

nongli commented Mar 11, 2016

LGTM

@asfgit asfgit closed this in 020ff8c Mar 11, 2016
@davies
Copy link
Contributor Author

davies commented Mar 11, 2016

Merged into master

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
## What changes were proposed in this pull request?

This PR improve the codegen of Filter by:

1. filter out the rows early if it have null value in it that will cause the condition result in null or false. After this, we could simplify the condition, because the input are not nullable anymore.

2. Split the condition as conjunctive predicates, then check them one by one.

Here is a piece of generated code for Filter in TPCDS Q55:
```java
/* 109 */       /*** CONSUME: Filter ((((isnotnull(d_moy#149) && isnotnull(d_year#147)) && (d_moy#149 = 11)) && (d_year#147 = 1999)) && isnotnull(d_date_sk#141)) */
/* 110 */       /* input[0, int] */
/* 111 */       boolean project_isNull2 = rdd_row.isNullAt(0);
/* 112 */       int project_value2 = project_isNull2 ? -1 : (rdd_row.getInt(0));
/* 113 */       /* input[1, int] */
/* 114 */       boolean project_isNull3 = rdd_row.isNullAt(1);
/* 115 */       int project_value3 = project_isNull3 ? -1 : (rdd_row.getInt(1));
/* 116 */       /* input[2, int] */
/* 117 */       boolean project_isNull4 = rdd_row.isNullAt(2);
/* 118 */       int project_value4 = project_isNull4 ? -1 : (rdd_row.getInt(2));
/* 119 */
/* 120 */       if (project_isNull3) continue;
/* 121 */       if (project_isNull4) continue;
/* 122 */       if (project_isNull2) continue;
/* 123 */
/* 124 */       /* (input[1, int] = 11) */
/* 125 */       boolean filter_value6 = false;
/* 126 */       filter_value6 = project_value3 == 11;
/* 127 */       if (!filter_value6) continue;
/* 128 */
/* 129 */       /* (input[2, int] = 1999) */
/* 130 */       boolean filter_value9 = false;
/* 131 */       filter_value9 = project_value4 == 1999;
/* 132 */       if (!filter_value9) continue;
/* 133 */
/* 134 */       filter_metricValue1.add(1);
/* 135 */
/* 136 */       /*** CONSUME: Project [d_date_sk#141] */
/* 137 */
/* 138 */       project_rowWriter1.write(0, project_value2);
/* 139 */       append(project_result1.copy());
```

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes apache#11585 from davies/gen_filter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants