[SPARK-18593][SQL] JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL #16021
Conversation
… for JDBCRDD and add few filters

This patch refactors the filter pushdown for JDBCRDD and also adds a few filters. The added filters are basically from #10468 with some refactoring. Test cases are from #10468.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10470 from viirya/refactor-jdbc-filter.

(cherry picked from commit ad5b7cf)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ilter

This PR follows #8391. The previous PR fixed JDBCRDD to support null-safe equality comparison for the JDBC datasource. This PR fixes the problem that the comparison can actually return null, which causes an error when the value of that comparison is used.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>

Closes #8743 from HyukjinKwon/SPARK-10180.

(cherry picked from commit 94f7a12)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
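For illustration, here is a minimal sketch of the three-valued-logic pitfall this fix addresses. It is a simplified stand-in, not the exact expression the patch generates; the column `a` and the value `'x'` are hypothetical.

```scala
// Naive compilation of Catalyst's null-safe equality (a <=> 'x') into a
// JDBC WHERE fragment:
val naive = "a = 'x'"
// Under SQL three-valued logic this evaluates to NULL (not FALSE) when
// a IS NULL, and a NULL comparison result breaks the surrounding clause.

// A form that always evaluates to TRUE or FALSE, never NULL: both sides
// are equal, or both are NULL.
val nullSafe =
  "(a IS NOT NULL AND 'x' IS NOT NULL AND a = 'x') OR (a IS NULL AND 'x' IS NULL)"
```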
…ng unnecessary Spark Filter

Input: `SELECT * FROM jdbcTable WHERE col0 = 'xxx'`

Current plan:

```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```

This patch enables the plan below:

```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10427 from maropu/RemoveFilterInJdbcScan.

(cherry picked from commit 6f710f9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
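To see the effect of this commit, one can inspect the plan of a pushed-down filter. A minimal sketch against the 1.6 DataFrame API follows; the connection settings are hypothetical, and `sqlContext` is assumed to be in scope (e.g. in the 1.6 shell).

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "maropu")                  // hypothetical credentials
props.setProperty("driver", "org.postgresql.Driver")

val df = sqlContext.read
  .jdbc("jdbc:postgresql:postgres", "testRel", props)
  .filter("col0 = 'xxx'")

// With this patch, the physical plan shows only the JDBC scan with
// PushedFilters: [EqualTo(col0,xxx)] and no extra Filter node above it.
df.explain(true)
```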
cc @maropu , @viirya , @HyukjinKwon , @rxin . This is a backport to resolve SPARK-18593 in branch-1.6. Although this is a correctness issue, there is a workaround: users can use TEXT or VARCHAR instead. In addition, I'm not sure we will have Apache Spark 1.6.4, and I'm not sure if this kind of backport is acceptable. Please let me know your opinion.
```diff
@@ -210,6 +220,16 @@ class JDBCSuite extends SparkFunSuite
     assert(df2.collect.toSet === Set(Row("mary", 2)))
```
Here, some of the original commit involving WholeStageCodegenExec is omitted during the backport.
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L256-L266
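For reference, a rough sketch of the kind of assertion the omitted lines make (that the executed plan retains no Spark-side Filter), adapted here to the 1.6-era physical operator; this is not the omitted code itself.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.Filter

// Assert that a predicate was fully pushed down to JDBC, i.e. no
// Spark-side physical Filter node remains in the executed plan.
def assertNoSparkFilter(df: DataFrame): Unit = {
  val filters = df.queryExecution.executedPlan.collect { case f: Filter => f }
  assert(filters.isEmpty, s"Unexpected Spark-side filters: $filters")
}
```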
Test build #69201 has finished for PR 16021 at commit
@dongjoon-hyun Thanks for letting me know! I'm not familiar with the backport policy in Spark, but if we can, I think it'd be better to do so because it seems some users still use Spark 1.6.x. By the way, do we need to remove the padding spaces that PostgreSQL adds?
```diff
@@ -165,8 +165,64 @@ private[sql] object JDBCRDD extends Logging {
    * @return A Catalyst schema corresponding to columns in the given order.
    */
   private def pruneSchema(schema: StructType, columns: Array[String]): StructType = {
-    val fieldMap = Map(schema.fields map { x => x.metadata.getString("name") -> x }: _*)
-    new StructType(columns map { name => fieldMap(name) })
+    val fieldMap = Map(schema.fields.map(x => x.metadata.getString("name") -> x): _*)
```
Is this change related?
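For what it's worth, the two styles are behaviorally identical; a tiny self-contained check (using `x.name` in place of the metadata lookup, for brevity):

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", IntegerType)))

// Infix/brace style versus dotted style: both build the same name -> field map.
val infix  = Map(schema.fields map { x => x.name -> x }: _*)
val dotted = Map(schema.fields.map(x => x.name -> x): _*)
assert(infix == dotted)
```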
```scala
    } else ""
  }

  private val filterWhereClause: String =
    filters.map(JDBCRDD.compileFilter).flatten.mkString(" AND ")
```
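To illustrate what this line produces, here is a simplified stand-in for `JDBCRDD.compileFilter` (not Spark's actual implementation; quoting and type handling are omitted):

```scala
import org.apache.spark.sql.sources._

// Returns None for filters that cannot be compiled to SQL; those stay
// as Spark-side filters instead of being pushed down.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)     => Some(s"$attr = '$value'")
  case GreaterThan(attr, value) => Some(s"$attr > '$value'")
  case _                        => None
}

val filters: Array[Filter] = Array(EqualTo("col0", "xxx"), GreaterThan("col1", 1))
// Compiled fragments are ANDed into the pushed-down WHERE clause:
// "col0 = 'xxx' AND col1 > '1'"
val filterWhereClause = filters.flatMap(compileFilter).mkString(" AND ")
```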
Thank you for cc'ing me @dongjoon-hyun. I believe we need to backport b22b20d too.
This is good. This is a correctness fix, so it may be even more important to backport.

I am fine with this in terms of the change I proposed in my PR before, except for the comment above.
@dongjoon-hyun Thanks for cc'ing me. Combined with b22b20d, I think it is worth backporting this to branch-1.6.
Thank you, @maropu , @HyukjinKwon , @viirya . I'll update it like that. @maropu , I think we don't need to remove that here, at least for this issue. We don't expect Apache Spark 1.6.4, but at least this PR presents a solution for the community using Apache Spark 1.6.x. Let's wait together for a community decision.
Test build #69206 has finished for PR 16021 at commit
Oops. I didn't replace …
Test build #69207 has finished for PR 16021 at commit
…edence

This PR fixes the problem that the precedence order is messed up when pushing a where-clause expression to the JDBC layer.

**Case 1:** For the SQL `select * from table where (a or b) and c`, the where-clause is wrongly converted to the JDBC where-clause `a or (b and c)` after filter push down. The consequence is that JDBC may return fewer or more rows than expected.

**Case 2:** For the SQL `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using the where-clause:

```
spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause"...)
```

Tested by unit test. This PR also closes #13640.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Sean Zhong <seanzhong@databricks.com>

Closes #13743 from clockfly/SPARK-15916.

(cherry picked from commit ebb9a3b)
Signed-off-by: Cheng Lian <lian@databricks.com>
(cherry picked from commit b22b20d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
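The essence of the precedence fix, as a minimal sketch (simplified, not the actual patch): each compiled predicate is parenthesized before being joined with AND, so an OR inside one predicate cannot regroup with its neighbor.

```scala
// Two pushed-down predicates, one containing an OR.
val compiled = Seq("a = 1 OR b = 2", "c = 3")

// Without parentheses, AND binds tighter than OR, so the intended
// "(a = 1 OR b = 2) AND c = 3" is parsed as "a = 1 OR (b = 2 AND c = 3)".
val wrong = compiled.mkString(" AND ")

// Parenthesizing each predicate preserves the intended grouping:
// "(a = 1 OR b = 2) AND (c = 3)"
val right = compiled.map(p => s"($p)").mkString(" AND ")
```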
Test build #69208 has finished for PR 16021 at commit
Actually @dongjoon-hyun, this is rather a question: should we include the other JIRAs this PR backports in the title, or is it enough that they are described in this JIRA, SPARK-18593?
Only your commits will be in branch-1.6. And, only SPARK-18593 will be in the title, if they allow this to land on branch-1.6.
Hmm. Er, @HyukjinKwon , maybe the Spark script might not work like I mentioned. Anyway, in that case, the committer should backport these 6 commits separately instead of this PR. For this issue, I think we can talk about it later (when it's needed). What I want is just to have these 6 commits separately in branch-1.6.
(I was just curious, as I just remember tracking things down with the blame button. It's no strong opinion.)
Hi, @rxin , @gatorsmile , @hvanhovell , @srowen . Could you give some opinion? What I hoped for here is having these 6 commits in branch-1.6. But, if backports are not proper for this issue by the Apache Spark policy, I will just mark SPARK-18593 as Resolved with fix version 2.0.0 and close this PR. I know it's not a good time for you to see this. Sorry for asking this.
Do these 6 commits have any conflicts? If not, I can just cherry-pick them.
Oh, thank you for considering it. There are a few trivial conflicts. Also, the last commit has new APIs which need to be replaced with old ones.
If you don't mind, it'd be better to have separate pull requests so we can link them easily across branches.
Sure! I think the authors also agree with that if it's allowed. Hi, @maropu , @HyukjinKwon , @viirya . Thank you for the decision, @rxin .
If you can tell me the ones that don't conflict, I can just cherry-pick them without pull requests. You can also submit pull requests for those that conflict btw. Don't need the original authors to.
uh, we are fixing another bug in this area. See the PR: #15662
That seems to be for master (2.2)?
I see, @rxin. I will split them and tell you. Also, I will make separate PRs keeping the original authors. I'll ping you a few hours later!
Hi, @rxin . Not all 6 commits are cherry-pickable. In particular, the last 2 commits have inevitable conflicts due to some wide commits (about …). But I can find the following clean minimal cherry-pick sequence for the first 4 commits.
All are related commits in the above three files. If you don't mind, could you cherry-pick the above? Sorry for the growing number of backport targets; I thought a clean cherry-pick is more important than the number of commits. The following are their titles.
After your cherry-picking, I will create two PRs for the remaining commits with inevitable conflicts.
Or, may I create just 6 PRs sequentially?
Ping, @rxin . :)
I can help you review the backported PRs. Please backport the PRs one by one and cc me.
Thank you, @gatorsmile. I will create them like that!
Thank you all so much for giving your attention and opinions here. I asked for something which is not allowed here. Please refer to #16127 for more details, too. I'm closing this now. Sorry for the inconvenience.
What changes were proposed in this pull request?
In Apache Spark 1.6.x, JDBCRDD returns incorrect results for filters on CHAR columns of PostgreSQL. The root cause is that PostgreSQL returns a space-padded string as the result, so the post-processing filter `Filter (a#0 = A)` is evaluated as false. Spark 2.0.0 removes the post filter because it is already handled inside the database by `PushedFilters: [EqualTo(a,A)]`. This PR backports the relevant commits to fix this problem in Spark 1.6.x. All credit should go to the original authors.

PostgreSQL Table & Query
Spark 1.6.3 Result
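As a hedged reproduction sketch (the table name, column, and credentials are hypothetical, and `sqlContext` is assumed to be in scope), the mismatch arises like this:

```scala
import java.util.Properties

// Assumed PostgreSQL setup:  CREATE TABLE strings_test (a CHAR(10));
//                            INSERT INTO strings_test VALUES ('A');
val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("driver", "org.postgresql.Driver")

val df = sqlContext.read.jdbc("jdbc:postgresql:postgres", "strings_test", props)

// PostgreSQL pads CHAR(10) values, so the JDBC driver returns "A         ".
// The pushed-down WHERE a = 'A' matches on the PostgreSQL side (CHAR
// comparison ignores trailing pad spaces), but Spark 1.6.x re-applies
// Filter (a#0 = A) against the padded string and wrongly drops the row.
df.filter("a = 'A'").show()  // empty in Spark 1.6.3 before this backport
```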
How was this patch tested?
N/A (This is a backport of the commits).