
[SPARK-18593][SQL] JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL #16021


Closed
wants to merge 6 commits

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Nov 27, 2016

What changes were proposed in this pull request?

In Apache Spark 1.6.x, JDBCRDD returns incorrect results for filters on CHAR columns of PostgreSQL. The root cause is that PostgreSQL returns space-padded strings for CHAR values, so the post-processing filter `Filter (a#0 = A)` evaluates to false. Spark 2.0.0 removes that post-filter because the predicate is already handled inside the database via `PushedFilters: [EqualTo(a,A)]`. This PR backports the relevant commits to fix the problem in Spark 1.6.x. All credit goes to the original authors.

PostgreSQL Table & Query

postgres=# \d t_char
       Table "public.t_char"
 Column |     Type      | Modifiers 
--------+---------------+-----------
 a      | character(10) | 

postgres=# \d t_varchar
          Table "public.t_varchar"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 a      | character varying(10) | 

postgres=# select * from t_char where a='A';  -- it works
     a      
------------
 A         
(1 row)

postgres=# select * from t_char where a='A         ';
     a      
------------
 A         
(1 row)

Spark 1.6.3 Result

scala> val t_char = sqlContext.read.option("user", "postgres").option("password", "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_char", new java.util.Properties())
t_char: org.apache.spark.sql.DataFrame = [a: string]

scala> val t_varchar = sqlContext.read.option("user", "postgres").option("password", "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_varchar", new java.util.Properties())
t_varchar: org.apache.spark.sql.DataFrame = [a: string]

scala> t_char.show
+----------+
|         a|
+----------+
|A         |
|AA        |
|AAA       |
+----------+

scala> t_varchar.show
+---+
|  a|
+---+
|  A|
| AA|
|AAA|
+---+

scala> t_char.filter(t_char("a")==="A").show  // This is the wrong result
+---+
|  a|
+---+
+---+

scala> t_char.filter(t_char("a")==="A         ").show
+----------+
|         a|
+----------+
|A         |
+----------+

scala> t_varchar.filter(t_varchar("a")==="A").show
+---+
|  a|
+---+
|  A|
+---+

scala> t_char.filter(t_char("a")==="A").explain
== Physical Plan ==
Filter (a#0 = A)
+- Scan JDBCRelation(jdbc:postgresql://localhost:5432/postgres,t_char,[Lorg.apache.spark.Partition;@2f65c341,{user=postgres, password=rootpass})[a#0] PushedFilters: [EqualTo(a,A)]
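Besides switching the column type to TEXT or VARCHAR, there are query-level workarounds on 1.6.x. A minimal sketch against the `t_char` DataFrame above (assuming the session from the transcript):

```scala
import org.apache.spark.sql.functions.rtrim

// Trim the CHAR padding before comparing. Note rtrim keeps the filter from
// being pushed down, so it is evaluated (correctly) on the Spark side.
t_char.filter(rtrim(t_char("a")) === "A").show()

// Or pad the literal to the CHAR(10) width so both the pushed-down predicate
// and the Spark-side post-filter see matching values.
t_char.filter(t_char("a") === "A".padTo(10, ' ')).show()
```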

How was this patch tested?

N/A (this PR backports existing commits, which bring their original tests with them).

maropu and others added 5 commits November 27, 2016 02:34
[SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions

No tests were done for `JDBCRDD#compileFilter`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10409 from maropu/AddTestsInJdbcRdd.

(cherry picked from commit 8c1b867)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
[SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC

This is rework from #10386 and add more tests and LIKE push-down support.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10468 from maropu/SupportMorePushdownInJdbc.

(cherry picked from commit 5c2682b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
[SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown for JDBCRDD and add few filters

This patch refactors the filter pushdown for JDBCRDD and also adds a few filters.

The added filters are basically from #10468, with some refactoring. Test cases are from #10468.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10470 from viirya/refactor-jdbc-filter.

(cherry picked from commit ad5b7cf)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
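For context, these commits converge on an `Option`-based filter compiler: each supported source filter becomes a SQL `WHERE` fragment, and unsupported ones compile to `None`. A simplified sketch (value quoting and the full case list are elided, so this is illustrative rather than the exact branch-1.6 code):

```scala
import org.apache.spark.sql.sources._

// Illustrative sketch: compile a data source Filter into a SQL WHERE fragment,
// returning None for anything the JDBC source cannot handle.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)           => Some(s"$attr = '$value'")
  case GreaterThan(attr, value)       => Some(s"$attr > '$value'")
  case IsNull(attr)                   => Some(s"$attr IS NULL")
  case IsNotNull(attr)                => Some(s"$attr IS NOT NULL")
  case StringStartsWith(attr, prefix) => Some(s"$attr LIKE '$prefix%'")
  case In(attr, values) =>
    Some(s"$attr IN (${values.map(v => s"'$v'").mkString(", ")})")
  case And(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) AND ($r)"
  case Or(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) OR ($r)"
  case _ => None // unhandled filters remain as a Spark-side Filter
}
```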
[SPARK-10180][SQL] JDBC datasource are not processing EqualNullSafe filter

This PR follows #8391.
The previous PR fixed JDBCRDD to support null-safe equality comparison for the JDBC datasource. This PR fixes the problem that the comparison can actually return null, which causes an error when that value is used.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>

Closes #8743 from HyukjinKwon/SPARK-10180.

(cherry picked from commit 94f7a12)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
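For intuition, null-safe equality (Catalyst's `<=>`) must be TRUE when both sides are NULL and FALSE when exactly one side is, which a plain `=` cannot express under SQL's three-valued logic. A sketch of the rewrite (hypothetical helper; literal quoting elided):

```scala
// Sketch: compile a null-safe equality to standard SQL. `lit` stands for an
// already-quoted literal. Truth table: NULL <=> NULL is TRUE, NULL <=> 'x'
// is FALSE, and otherwise it behaves like ordinary equality.
def compileEqualNullSafe(col: String, lit: String): String =
  s"(NOT ($col != $lit OR $col IS NULL OR $lit IS NULL) " +
    s"OR ($col IS NULL AND $lit IS NULL))"
```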
[SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter

Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

Current plan:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```

This patch enables the plan below:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10427 from maropu/RemoveFilterInJdbcScan.

(cherry picked from commit 6f710f9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
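The mechanism behind the improved plan is the relation's `unhandledFilters` hook: the data source reports back only the filters it could not compile, so Spark keeps a post-scan `Filter` for those alone and drops it entirely when everything is pushed down. Roughly (a sketch reusing the `Option`-based compiler shape above, not the exact patch):

```scala
import org.apache.spark.sql.sources.Filter

// Sketch: report exactly the filters that could NOT be turned into SQL.
// An empty result tells Spark the scan output needs no re-filtering.
def unhandledFilters(filters: Array[Filter],
                     compileFilter: Filter => Option[String]): Array[Filter] =
  filters.filter(f => compileFilter(f).isEmpty)
```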
@dongjoon-hyun
Member Author

cc @maropu , @viirya , @HyukjinKwon , @rxin .

This is a backport to resolve SPARK-18593 in branch-1.6.

Although this is a correctness issue, there is a workaround: users can declare the column as TEXT or VARCHAR instead. In addition, I'm not sure we will have an Apache Spark 1.6.4 release.

I'm also not sure whether branch-1.6 can take this kind of commit to fix an issue.

Please let me know your opinion.

@dongjoon-hyun dongjoon-hyun changed the title JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL [SPARK-18593][SQL] JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL Nov 27, 2016
@@ -210,6 +220,16 @@ class JDBCSuite extends SparkFunSuite
assert(df2.collect.toSet === Set(Row("mary", 2)))
Member Author

@dongjoon-hyun dongjoon-hyun Nov 27, 2016


Here, the part of the original commit that involves WholeStageCodegenExec is omitted in this backport.
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L256-L266

@SparkQA

SparkQA commented Nov 27, 2016

Test build #69201 has finished for PR 16021 at commit 59b7e4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Nov 27, 2016

@dongjoon-hyun Thanks for letting me know! I'm not familiar with the backport policy in Spark, but if we can, I think it'd be better to do so, since it seems some users still use Spark 1.6.x. By the way, do we need to strip the padding spaces of PostgreSQL CHAR in JdbcDialect (or somewhere similar) when loading these data into Spark? It seems to me that this PostgreSQL-specific behavior will confuse users when they process the loaded data in Spark.

@@ -165,8 +165,64 @@ private[sql] object JDBCRDD extends Logging {
* @return A Catalyst schema corresponding to columns in the given order.
*/
private def pruneSchema(schema: StructType, columns: Array[String]): StructType = {
-    val fieldMap = Map(schema.fields map { x => x.metadata.getString("name") -> x }: _*)
+    val fieldMap = Map(schema.fields.map(x => x.metadata.getString("name") -> x): _*)
     new StructType(columns map { name => fieldMap(name) })
Member


Is this change related?

} else ""
}
private val filterWhereClause: String =
filters.map(JDBCRDD.compileFilter).flatten.mkString(" AND ")
Member

@HyukjinKwon HyukjinKwon Nov 27, 2016


Thank you for cc'ing me @dongjoon-hyun. I believe we need to backport b22b20d too.

Member


This is good. It is a correctness fix, which makes it all the more worth backporting.

@HyukjinKwon
Member

HyukjinKwon commented Nov 27, 2016

I am fine with this as far as the change I proposed in my earlier PR is concerned, except for the comment above.

@viirya
Member

viirya commented Nov 27, 2016

@dongjoon-hyun Thanks for cc'ing me. Combined with b22b20d, I think it is worth backporting this to branch-1.6.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 27, 2016

Thank you, @maropu , @HyukjinKwon , @viirya . I'll update it like that.

@maropu , I think we don't need to remove the padding here, at least for this issue.

We don't expect an Apache Spark 1.6.4 release, but at least this PR offers a solution for the community still using Apache Spark 1.6.x.

Let's wait together for a community decision.

@SparkQA

SparkQA commented Nov 27, 2016

Test build #69206 has finished for PR 16021 at commit c896231.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Oops. I didn't replace `spark` with `sqlContext`. I'll fix it soon.

@SparkQA

SparkQA commented Nov 27, 2016

Test build #69207 has finished for PR 16021 at commit 96ae0da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

[SPARK-15916][SQL] JDBC filter push down should respect operator precedence

This PR fixes the problem that operator precedence is mishandled when pushing where-clause expressions down to the JDBC layer.

**Case 1:**

For the sql `select * from table where (a or b) and c`, the where-clause is wrongly converted to the JDBC where-clause `a or (b and c)` after filter push-down. The consequence is that JDBC may return fewer or more rows than expected.

**Case 2:**

For the sql `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using where-clauses:
```
spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause", ...))
```

Unit test.

This PR also closes #13640.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Sean Zhong <seanzhong@databricks.com>

Closes #13743 from clockfly/SPARK-15916.

(cherry picked from commit ebb9a3b)
Signed-off-by: Cheng Lian <lian@databricks.com>
(cherry picked from commit b22b20d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
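The essence of the fix is to wrap every compiled fragment in parentheses before joining them, so the generated SQL keeps the filter tree's grouping. A minimal sketch:

```scala
// Fragments compiled from the filter tree of `(a OR b) AND c`.
val fragments = Seq("a OR b", "c")

fragments.mkString(" AND ")
// => "a OR b AND c", which SQL parses as a OR (b AND c) -- wrong

fragments.map(f => s"($f)").mkString(" AND ")
// => "(a OR b) AND (c)" -- preserves the intended grouping
```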
@SparkQA

SparkQA commented Nov 27, 2016

Test build #69208 has finished for PR 16021 at commit cf8fd94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Nov 27, 2016

Actually, @dongjoon-hyun, this is rather a question: should the title also list the other JIRAs this PR backports, or is it enough that they are described in this JIRA, SPARK-18593?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 27, 2016

Only your commits will land in branch-1.6 with their original titles. Since this PR itself contains nothing new, I think it will disappear completely when a committer merges it.

And that's only if they allow this to land on branch-1.6. :)

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 27, 2016

Hmm, @HyukjinKwon , the Spark merge script might not actually work the way I described.

In that case, the committer should backport these 6 commits separately instead of merging this PR.

We can discuss that later, when it's needed. What I want is just to have these 6 commits land separately in branch-1.6, if possible.

@HyukjinKwon
Member

(I was just curious, as I remember tracking changes down with the blame button. No strong opinion.)

@dongjoon-hyun
Member Author

Hi, @rxin , @gatorsmile , @hvanhovell , @srowen .

Could you give your opinion? What I hoped for here is to have these 6 commits in branch-1.6.

But if backports are not appropriate for this issue under Apache Spark policy, I will just mark SPARK-18593 as Resolved with fix version 2.0.0 and close this PR.

I know it's not a good time for you to look at this. Sorry for asking.

@rxin
Contributor

rxin commented Nov 30, 2016

Do these 6 commits have any conflicts? If not, I can just cherry-pick them.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 30, 2016

Oh, thank you for considering it. There are a few trivial conflicts.
One is the WholeStageCodegenExec test case below, which I omitted here.

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L256-L266

Also, the last commit uses new APIs that need to be replaced with the old ones.

@rxin
Contributor

rxin commented Nov 30, 2016

If you don't mind, it'd be better to have separate pull requests so we can link them easily across branches.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 30, 2016

Sure! I think the authors would agree with that too, if it's allowed.

Hi, @maropu , @HyukjinKwon , @viirya .
Could you proceed with backporting yours to branch-1.6?
If you're busy, I can also create them for you sequentially.

Thank you for the decision, @rxin .

@rxin
Contributor

rxin commented Nov 30, 2016

If you can tell me the ones that don't conflict, I can just cherry-pick them without pull requests.

You can also submit pull requests for those that conflict, btw. The original authors don't need to do it.

@gatorsmile
Member

uh, we are fixing another bug in this area. See the PR: #15662

@dongjoon-hyun
Member Author

That seems to be for master (2.2)?

@dongjoon-hyun
Member Author

I see, @rxin. I will split them and tell you. I will also make separate PRs, keeping the original authors.

I'll ping you in a few hours!

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 1, 2016

Hi, @rxin .

Not all 6 commits are cleanly cherry-pickable. In particular, the last 2 commits have inevitable conflicts due to some wide-ranging commits (import-order and syntax cleanups) and API changes.

However, I found the following clean, minimal cherry-pick sequence for the first 4 commits.

$ git cherry-pick -x 28112657ea5919451291c21b4b8e1eb3db0ec8d4
$ git cherry-pick -x 0f6936b5f1c9b0be1c33b98ffb62a72ae0c3e2a8
$ git cherry-pick -x 7f443a6879fa33ca8adb682bd85df2d56fb5fcda
$ git cherry-pick -x 2aad2d372469aaf2773876cae98ef002fef03aa3
$ git cherry-pick -x 554d840a9ade79722c96972257435a05e2aa9d88
$ git cherry-pick -x 8c1b867cee816d0943184c7b485cd11e255d8130
$ git cherry-pick -x 5c2682b0c8fd2aeae2af1adb716ee0d5f8b85135
$ git cherry-pick -x ad5b7cfcca7a5feb83b9ed94b6e725c6d789579b
$ git cherry-pick -x 94f7a12b3c8e4a6ecd969893e562feb7ffba4c24
$ git diff HEAD~9 --stat
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala | 102 +++++++++++++++++++++++++++++++++++---------------------
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala                     |  68 ++++++++++++++++++++++++++++++++-----
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala                  |  15 +++++++++
 3 files changed, 139 insertions(+), 46 deletions(-)

All of these commits touch only the three files above. If you don't mind, could you cherry-pick them? Sorry for the growing number of backport targets; I thought a clean cherry-pick was more important than the number of commits.

Their titles are as follows.

  1. [SPARK-12236][SQL] JDBC filter tests all pass if filters are not really pushed down.
  2. [SPARK-12249][SQL] JDBC non-equality comparison operator not pushed down.
  3. [SPARK-12314][SQL] isnull operator not pushed down for JDBC datasource.
  4. [SPARK-12315][SQL] isnotnull operator not pushed down for JDBC datasource.
  5. Style fix for the previous 3 JDBC filter push down commits. (@rxin)
  6. [SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions
  7. [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC
  8. [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown for JDBCRDD and add few filters (Liang-Chi Hsieh)
  9. [SPARK-10180][SQL] JDBC datasource are not processing EqualNullSafe filter

After your cherry-picking, I will create two PRs for the remaining commits with inevitable conflicts.

@dongjoon-hyun
Member Author

Or, may I create just 6 PRs sequentially?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 2, 2016

Ping, @rxin . :)

@gatorsmile
Member

I can help you review the backported PRs. Please backport the PRs one by one and cc me.

@dongjoon-hyun
Member Author

Thank you, @gatorsmile. I will create them one by one like that!

@dongjoon-hyun
Member Author

Thank you all so much for your attention and opinions here.

It turns out that what I asked for is not allowed here. Please refer to #16127 for more details, too.

I'm closing this now.

Sorry for any inconvenience.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-18593 branch January 7, 2019 07:03