
[SPARK-18593][SQL] JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL #16021


Closed
wants to merge 6 commits

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Nov 27, 2016

What changes were proposed in this pull request?

In Apache Spark 1.6.x, JDBCRDD returns incorrect results for filters on CHAR columns of PostgreSQL. The root cause is that PostgreSQL returns space-padded strings for CHAR values, so the post-processing filter `Filter (a#0 = A)` evaluates to false. Spark 2.0.0 removes that post-filter because the predicate is already handled inside the database via `PushedFilters: [EqualTo(a,A)]`. This PR backports the relevant commits to fix the problem in Spark 1.6.x. All credit goes to the original authors.

PostgreSQL Table & Query

postgres=# \d t_char
       Table "public.t_char"
 Column |     Type      | Modifiers 
--------+---------------+-----------
 a      | character(10) | 

postgres=# \d t_varchar
          Table "public.t_varchar"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 a      | character varying(10) | 

postgres=# select * from t_char where a='A';  -- it works
     a      
------------
 A         
(1 row)

postgres=# select * from t_char where a='A         ';
     a      
------------
 A         
(1 row)

Spark 1.6.3 Result

scala> val t_char = sqlContext.read.option("user", "postgres").option("password", "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_char", new java.util.Properties())
t_char: org.apache.spark.sql.DataFrame = [a: string]

scala> val t_varchar = sqlContext.read.option("user", "postgres").option("password", "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_varchar", new java.util.Properties())
t_varchar: org.apache.spark.sql.DataFrame = [a: string]

scala> t_char.show
+----------+
|         a|
+----------+
|A         |
|AA        |
|AAA       |
+----------+

scala> t_varchar.show
+---+
|  a|
+---+
|  A|
| AA|
|AAA|
+---+

scala> t_char.filter(t_char("a")==="A").show  // This is the wrong result
+---+
|  a|
+---+
+---+

scala> t_char.filter(t_char("a")==="A         ").show
+----------+
|         a|
+----------+
|A         |
+----------+

scala> t_varchar.filter(t_varchar("a")==="A").show
+---+
|  a|
+---+
|  A|
+---+

scala> t_char.filter(t_char("a")==="A").explain
== Physical Plan ==
Filter (a#0 = A)
+- Scan JDBCRelation(jdbc:postgresql://localhost:5432/postgres,t_char,[Lorg.apache.spark.Partition;@2f65c341,{user=postgres, password=rootpass})[a#0] PushedFilters: [EqualTo(a,A)]
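Besides switching the column type to TEXT or VARCHAR, there are query-level workarounds on 1.6.x. A minimal sketch against the `t_char` DataFrame above (assuming the session from the transcript):

```scala
import org.apache.spark.sql.functions.rtrim

// Trim the CHAR padding before comparing. Note rtrim keeps the filter from
// being pushed down, so it is evaluated (correctly) on the Spark side.
t_char.filter(rtrim(t_char("a")) === "A").show()

// Or pad the literal to the CHAR(10) width so both the pushed-down predicate
// and the Spark-side post-filter see matching values.
t_char.filter(t_char("a") === "A".padTo(10, ' ')).show()
```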

How was this patch tested?

N/A (this PR backports existing commits, which bring their original tests with them).

maropu and others added 5 commits November 27, 2016 02:34
[SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions

No tests were done for `JDBCRDD#compileFilter`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10409 from maropu/AddTestsInJdbcRdd.

(cherry picked from commit 8c1b867)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
[SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC

This is rework from #10386 and add more tests and LIKE push-down support.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10468 from maropu/SupportMorePushdownInJdbc.

(cherry picked from commit 5c2682b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
[SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown for JDBCRDD and add few filters

This patch refactors the filter pushdown for JDBCRDD and also adds a few filters.

The added filters are basically from #10468, with some refactoring. Test cases are from #10468.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10470 from viirya/refactor-jdbc-filter.

(cherry picked from commit ad5b7cf)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
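For context, these commits converge on an `Option`-based filter compiler: each supported source filter becomes a SQL `WHERE` fragment, and unsupported ones compile to `None`. A simplified sketch (value quoting and the full case list are elided, so this is illustrative rather than the exact branch-1.6 code):

```scala
import org.apache.spark.sql.sources._

// Illustrative sketch: compile a data source Filter into a SQL WHERE fragment,
// returning None for anything the JDBC source cannot handle.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)           => Some(s"$attr = '$value'")
  case GreaterThan(attr, value)       => Some(s"$attr > '$value'")
  case IsNull(attr)                   => Some(s"$attr IS NULL")
  case IsNotNull(attr)                => Some(s"$attr IS NOT NULL")
  case StringStartsWith(attr, prefix) => Some(s"$attr LIKE '$prefix%'")
  case In(attr, values) =>
    Some(s"$attr IN (${values.map(v => s"'$v'").mkString(", ")})")
  case And(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) AND ($r)"
  case Or(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) OR ($r)"
  case _ => None // unhandled filters remain as a Spark-side Filter
}
```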
[SPARK-10180][SQL] JDBC datasource are not processing EqualNullSafe filter

This PR follows #8391.
The previous PR fixed JDBCRDD to support null-safe equality comparison for the JDBC datasource. This PR fixes the problem that the comparison can actually return null, which causes an error when that value is used.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>

Closes #8743 from HyukjinKwon/SPARK-10180.

(cherry picked from commit 94f7a12)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
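For intuition, null-safe equality (Catalyst's `<=>`) must be TRUE when both sides are NULL and FALSE when exactly one side is, which a plain `=` cannot express under SQL's three-valued logic. A sketch of the rewrite (hypothetical helper; literal quoting elided):

```scala
// Sketch: compile a null-safe equality to standard SQL. `lit` stands for an
// already-quoted literal. Truth table: NULL <=> NULL is TRUE, NULL <=> 'x'
// is FALSE, and otherwise it behaves like ordinary equality.
def compileEqualNullSafe(col: String, lit: String): String =
  s"(NOT ($col != $lit OR $col IS NULL OR $lit IS NULL) " +
    s"OR ($col IS NULL AND $lit IS NULL))"
```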
[SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter

Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

Current plan:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```

This patch enables the plan below:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10427 from maropu/RemoveFilterInJdbcScan.

(cherry picked from commit 6f710f9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
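The mechanism behind the improved plan is the relation's `unhandledFilters` hook: the data source reports back only the filters it could not compile, so Spark keeps a post-scan `Filter` for those alone and drops it entirely when everything is pushed down. Roughly (a sketch reusing the `Option`-based compiler shape above, not the exact patch):

```scala
import org.apache.spark.sql.sources.Filter

// Sketch: report exactly the filters that could NOT be turned into SQL.
// An empty result tells Spark the scan output needs no re-filtering.
def unhandledFilters(filters: Array[Filter],
                     compileFilter: Filter => Option[String]): Array[Filter] =
  filters.filter(f => compileFilter(f).isEmpty)
```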
@dongjoon-hyun
Member Author

cc @maropu , @viirya , @HyukjinKwon , @rxin .

This is a backport to resolve SPARK-18593 in branch-1.6.

Although this is a correctness issue, there is a workaround: users can declare the column as TEXT or VARCHAR instead. In addition, I'm not sure we will have an Apache Spark 1.6.4 release.

I'm also not sure whether branch-1.6 can take this kind of commit to fix an issue.

Please let me know your opinion.

@dongjoon-hyun dongjoon-hyun changed the title JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL [SPARK-18593][SQL] JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL Nov 27, 2016
@@ -210,6 +220,16 @@ class JDBCSuite extends SparkFunSuite
assert(df2.collect.toSet === Set(Row("mary", 2)))
Member Author

@dongjoon-hyun dongjoon-hyun Nov 27, 2016


Here, the part of the original commit that involves WholeStageCodegenExec is omitted in this backport.
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L256-L266

@SparkQA

SparkQA commented Nov 27, 2016

Test build #69201 has finished for PR 16021 at commit 59b7e4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Nov 27, 2016

@dongjoon-hyun Thanks for letting me know! I'm not familiar with the backport policy in Spark, but if we can, I think it'd be better to do so, since it seems some users still use Spark 1.6.x. By the way, do we need to strip the padding spaces of PostgreSQL CHAR in JdbcDialect (or somewhere similar) when loading these data into Spark? It seems to me that this PostgreSQL-specific behavior will confuse users when they process the loaded data in Spark.

@@ -165,8 +165,64 @@ private[sql] object JDBCRDD extends Logging {
* @return A Catalyst schema corresponding to columns in the given order.
*/
private def pruneSchema(schema: StructType, columns: Array[String]): StructType = {
-    val fieldMap = Map(schema.fields map { x => x.metadata.getString("name") -> x }: _*)
+    val fieldMap = Map(schema.fields.map(x => x.metadata.getString("name") -> x): _*)
     new StructType(columns map { name => fieldMap(name) })
Member


Is this change related?

} else ""
}
private val filterWhereClause: String =
filters.map(JDBCRDD.compileFilter).flatten.mkString(" AND ")
Member

@HyukjinKwon HyukjinKwon Nov 27, 2016


Thank you for cc'ing me @dongjoon-hyun. I believe we need to backport b22b20d too.

Member


This is good. It is a correctness fix, which makes it all the more worth backporting.

@HyukjinKwon
Member

HyukjinKwon commented Nov 27, 2016

I am fine with this as far as the change I proposed in my earlier PR is concerned, except for the comment above.

@viirya
Member

viirya commented Nov 27, 2016

@dongjoon-hyun Thanks for cc'ing me. Combined with b22b20d, I think it is worth backporting this to branch-1.6.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 27, 2016

Thank you, @maropu , @HyukjinKwon , @viirya . I'll update it like that.

@maropu , I think we don't need to remove the padding here, at least for this issue.

We don't expect an Apache Spark 1.6.4 release, but at least this PR offers a solution for the community still using Apache Spark 1.6.x.

Let's wait together for a community decision.

@SparkQA

SparkQA commented Nov 27, 2016

Test build #69206 has finished for PR 16021 at commit c896231.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Oops. I didn't replace `spark` with `sqlContext`. I'll fix it soon.

@SparkQA

SparkQA commented Nov 27, 2016

Test build #69207 has finished for PR 16021 at commit 96ae0da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

[SPARK-15916][SQL] JDBC filter push down should respect operator precedence

This PR fixes the problem that operator precedence is mishandled when pushing where-clause expressions down to the JDBC layer.

**Case 1:**

For the sql `select * from table where (a or b) and c`, the where-clause is wrongly converted to the JDBC where-clause `a or (b and c)` after filter push-down. The consequence is that JDBC may return fewer or more rows than expected.

**Case 2:**

For the sql `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using where-clauses:
```
spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause", ...))
```

Unit test.

This PR also closes #13640.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Sean Zhong <seanzhong@databricks.com>

Closes #13743 from clockfly/SPARK-15916.

(cherry picked from commit ebb9a3b)
Signed-off-by: Cheng Lian <lian@databricks.com>
(cherry picked from commit b22b20d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
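The essence of the fix is to wrap every compiled fragment in parentheses before joining them, so the generated SQL keeps the filter tree's grouping. A minimal sketch:

```scala
// Fragments compiled from the filter tree of `(a OR b) AND c`.
val fragments = Seq("a OR b", "c")

fragments.mkString(" AND ")
// => "a OR b AND c", which SQL parses as a OR (b AND c) -- wrong

fragments.map(f => s"($f)").mkString(" AND ")
// => "(a OR b) AND (c)" -- preserves the intended grouping
```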
@SparkQA

SparkQA commented Nov 27, 2016

Test build #69208 has finished for PR 16021 at commit cf8fd94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Nov 27, 2016

Actually, @dongjoon-hyun, this is rather a question: should the title also list the other JIRAs this PR backports, or is it enough that they are described in this JIRA, SPARK-18593?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 27, 2016

Only your commits will land in branch-1.6 with their original titles. Since this PR itself contains nothing new, I think it will disappear completely when a committer merges it.

And that's only if they allow this to land on branch-1.6. :)

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 27, 2016

Hmm, @HyukjinKwon , the Spark merge script might not actually work the way I described.

In that case, the committer should backport these 6 commits separately instead of merging this PR.

We can discuss that later, when it's needed. What I want is just to have these 6 commits land separately in branch-1.6, if possible.

@HyukjinKwon
Member

(I was just curious, as I remember tracking changes down with the blame button. No strong opinion.)

@dongjoon-hyun
Member Author

Hi, @rxin , @gatorsmile , @hvanhovell , @srowen .

Could you give your opinion? What I hoped for here is to have these 6 commits in branch-1.6.

But if backports are not appropriate for this issue under Apache Spark policy, I will just mark SPARK-18593 as Resolved with fix version 2.0.0 and close this PR.

I know it's not a good time for you to look at this. Sorry for asking.

@rxin
Contributor

rxin commented Nov 30, 2016

Do these 6 commits have any conflicts? If not, I can just cherry-pick them.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 30, 2016

Oh, thank you for considering it. There are a few trivial conflicts.
One is the WholeStageCodegenExec test case below, which I omitted here.

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L256-L266

Also, the last commit uses new APIs that need to be replaced with the old ones.

@rxin
Contributor

rxin commented Nov 30, 2016

If you don't mind, it'd be better to have separate pull requests so we can link them easily across branches.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 30, 2016

Sure! I think the authors would agree with that too, if it's allowed.

Hi, @maropu , @HyukjinKwon , @viirya .
Could you proceed with backporting yours to branch-1.6?
If you're busy, I can also create them for you sequentially.

Thank you for the decision, @rxin .

@rxin
Contributor

rxin commented Nov 30, 2016

If you can tell me the ones that don't conflict, I can just cherry-pick them without pull requests.

You can also submit pull requests for those that conflict, btw. The original authors don't need to do it.

@gatorsmile
Member

uh, we are fixing another bug in this area. See the PR: #15662

@dongjoon-hyun
Member Author

That seems to be for master (2.2)?

@dongjoon-hyun
Member Author

I see, @rxin. I will split them and tell you. I will also make separate PRs, keeping the original authors.

I'll ping you in a few hours!

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 1, 2016

Hi, @rxin .

Not all 6 commits are cleanly cherry-pickable. In particular, the last 2 commits have inevitable conflicts due to some wide-ranging commits (import-order and syntax cleanups) and API changes.

However, I found the following clean, minimal cherry-pick sequence for the first 4 commits.

$ git cherry-pick -x 28112657ea5919451291c21b4b8e1eb3db0ec8d4
$ git cherry-pick -x 0f6936b5f1c9b0be1c33b98ffb62a72ae0c3e2a8
$ git cherry-pick -x 7f443a6879fa33ca8adb682bd85df2d56fb5fcda
$ git cherry-pick -x 2aad2d372469aaf2773876cae98ef002fef03aa3
$ git cherry-pick -x 554d840a9ade79722c96972257435a05e2aa9d88
$ git cherry-pick -x 8c1b867cee816d0943184c7b485cd11e255d8130
$ git cherry-pick -x 5c2682b0c8fd2aeae2af1adb716ee0d5f8b85135
$ git cherry-pick -x ad5b7cfcca7a5feb83b9ed94b6e725c6d789579b
$ git cherry-pick -x 94f7a12b3c8e4a6ecd969893e562feb7ffba4c24
$ git diff HEAD~9 --stat
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala | 102 +++++++++++++++++++++++++++++++++++---------------------
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala                     |  68 ++++++++++++++++++++++++++++++++-----
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala                  |  15 +++++++++
 3 files changed, 139 insertions(+), 46 deletions(-)

All of these commits touch only the three files above. If you don't mind, could you cherry-pick them? Sorry for the growing number of backport targets; I thought a clean cherry-pick was more important than the number of commits.

Their titles are as follows.

  1. [SPARK-12236][SQL] JDBC filter tests all pass if filters are not really pushed down.
  2. [SPARK-12249][SQL] JDBC non-equality comparison operator not pushed down.
  3. [SPARK-12314][SQL] isnull operator not pushed down for JDBC datasource.
  4. [SPARK-12315][SQL] isnotnull operator not pushed down for JDBC datasource.
  5. Style fix for the previous 3 JDBC filter push down commits. (@rxin)
  6. [SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions
  7. [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC
  8. [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown for JDBCRDD and add few filters (Liang-Chi Hsieh)
  9. [SPARK-10180][SQL] JDBC datasource are not processing EqualNullSafe filter

After your cherry-picking, I will create two PRs for the remaining commits with inevitable conflicts.

@dongjoon-hyun
Member Author

Or, may I create just 6 PRs sequentially?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 2, 2016

Ping, @rxin . :)

@gatorsmile
Member

I can help you review the backported PRs. Please backport the PRs one by one and cc me.

@dongjoon-hyun
Member Author

Thank you, @gatorsmile. I will create them one by one like that!

@dongjoon-hyun
Member Author

Thank you all so much for your attention and opinions here.

It turns out that what I asked for is not allowed here. Please refer to #16127 for more details, too.

I'm closing this now.

Sorry for any inconvenience.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-18593 branch January 7, 2019 07:03