Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32592][SQL] Make DataFrameReader.table take the specified options #29535

Closed
wants to merge 9 commits into from

Conversation

huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

pass specified options in DataFrameReader.table to JDBCTableCatalog.loadTable

Why are the changes needed?

Currently, DataFrameReader.table ignores the specified options. The options specified like the following are lost.

    val df = spark.read
      .option("partitionColumn", "id")
      .option("lowerBound", "0")
      .option("upperBound", "3")
      .option("numPartitions", "2")
      .table("h2.test.people")

We need to make DataFrameReader.table take the specified options.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually test for now. Will add a test after V2 JDBC read is implemented.

@SparkQA
Copy link

SparkQA commented Aug 25, 2020

Test build #127860 has finished for PR 29535 at commit b54cc55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

cc @MaxGekk @cloud-fan @viirya

@@ -822,6 +821,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
*/
def table(tableName: String): DataFrame = {
assertNoSpecifiedSchema("table")
for ((k, v) <- this.extraOptions)
sparkSession.conf.set(k, v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to clearly define the behavior here.

I think the options don't make sense if we read temp view, view, or v1 tables in the session catalog. For v1 table, they are created with CREATE TABLE ... USING ... OPTIONS ..., so the options are already there and it's confusing to overwrite them per scan.

We do need to keep the scan options when reading v2 tables, but I don't think session conf is the right channel. I think we should put the options in UnresolvedRelation and apply these options when we resolve UnresolvedRelation to v2 relations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Thanks for your comment.
I took a look at the code. I think I need to change the following:

Screen Shot 2020-08-25 at 6 03 49 PM

Screen Shot 2020-08-25 at 6 02 38 PM

Screen Shot 2020-08-25 at 6 02 50 PM

I need to modify lots of files because of the above changes. Could you please check if this looks OK to you before I continue? Thank you very much!

Copy link
Contributor

@cloud-fan cloud-fan Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We don't need to change SparkSession.table. We can create UnresolvedRelation in DataFrameReader.table directly.
  2. This is options to scan the v2 table, not v2 table properties. We don't need to change TableCatalog.loadTable, but pass the scan options to Table.newScanBuilder. DataSourceV2Relation already takes the scan options, we just need to create DataSourceV2Relation with scan options in ResolveTables and ResolveRelations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we already pass options from DataFrameReader to DataSourceV2Relation in load API. DataFrameReader.table misses this part.

@SparkQA
Copy link

SparkQA commented Aug 26, 2020

Test build #127942 has finished for PR 29535 at commit 05b55cb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 27, 2020

Test build #127943 has finished for PR 29535 at commit 0bc7698.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

@cloud-fan @viirya Could you please check one more time? Thanks!

@@ -42,7 +42,9 @@ class UnresolvedException[TreeType <: TreeNode[_]](tree: TreeType, function: Str
* @param multipartIdentifier table name
*/
case class UnresolvedRelation(
multipartIdentifier: Seq[String]) extends LeafNode with NamedRelation {
multipartIdentifier: Seq[String],
options: CaseInsensitiveMap[String] = CaseInsensitiveMap[String](Map.empty))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use CaseInsensitiveStringMap since it's only for v2?

sparkSession.table(tableName)
val multipartIdentifier =
sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName)
Dataset.ofRows(sparkSession, UnresolvedRelation(multipartIdentifier, extraOptions))
Copy link
Contributor

@cloud-fan cloud-fan Aug 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should create CaseInsensitiveStringMap here:

new CaseInsensitiveStringMap(extraOptions.toMap.asJava)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toMap is required to get the original map (keys are not lowercased)

@@ -42,7 +42,9 @@ class UnresolvedException[TreeType <: TreeNode[_]](tree: TreeType, function: Str
* @param multipartIdentifier table name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add options to param doc?

@huaxingao
Copy link
Contributor Author

Fixed. Thank you! @cloud-fan @viirya

@SparkQA
Copy link

SparkQA commented Aug 27, 2020

Test build #127948 has finished for PR 29535 at commit d56f92d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def lookupV2Relation(identifier: Seq[String]): Option[DataSourceV2Relation] =
private def lookupV2Relation(
identifier: Seq[String],
options: CaseInsensitiveStringMap = CaseInsensitiveStringMap.empty())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need the default value?

private def lookupRelation(identifier: Seq[String]): Option[LogicalPlan] = {
private def lookupRelation(
identifier: Seq[String],
options: CaseInsensitiveStringMap = CaseInsensitiveStringMap.empty())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -40,9 +41,12 @@ class UnresolvedException[TreeType <: TreeNode[_]](tree: TreeType, function: Str
* Holds the name of a relation that has yet to be looked up in a catalog.
*
* @param multipartIdentifier table name
* @param options options for this relation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

options to scan this relation. Only applicable to v2 table scan.

@cloud-fan
Copy link
Contributor

LGTM except 2 minor comments

@SparkQA
Copy link

SparkQA commented Aug 27, 2020

Test build #127959 has finished for PR 29535 at commit a50bf2f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Copy link
Contributor Author

retest this please

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, we don't have test for this change?

@SparkQA
Copy link

SparkQA commented Aug 27, 2020

Test build #127961 has finished for PR 29535 at commit a50bf2f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

it will be tested in the JDBC v2 PR, but yea, better to have a UT here with a fake v2 source.

@viirya
Copy link
Member

viirya commented Aug 27, 2020

@huaxingao You may need to sync with master.

@SparkQA
Copy link

SparkQA commented Aug 27, 2020

Test build #127962 has finished for PR 29535 at commit dcc88ca.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -40,7 +41,6 @@ import org.apache.spark.sql.execution.datasources.csv._
import org.apache.spark.sql.execution.datasources.jdbc._
import org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource
import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2Relation, DataSourceV2Utils}
import org.apache.spark.sql.internal.SQLConf
Copy link
Member

@viirya viirya Aug 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, @huaxingao, don't remove this. Other change uses SQLConf.

@SparkQA
Copy link

SparkQA commented Aug 27, 2020

Test build #127965 has finished for PR 29535 at commit f321560.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

We need to fix explain-aqe.sql

@huaxingao
Copy link
Contributor Author

For this explain formatted

EXPLAIN FORMATTED CREATE VIEW explain_view AS SELECT key, val FROM explain_temp1

Now it has

......

(3) UnresolvedRelation
Arguments: [explain_temp1], org.apache.spark.sql.util.CaseInsensitiveStringMap@1f

......

Do we just need to change the expected output to let the test pass? Or we need to fix the code to make the explain string to have more meaningful info than org.apache.spark.sql.util.CaseInsensitiveStringMap@1f?

@cloud-fan
Copy link
Contributor

@huaxingao good point. I think we should fix UnresolvedRelation to have a better format, by overriding verboseStringWithOperatorId

@huaxingao
Copy link
Contributor Author

I didn't run the whole sql test suites on my local. I will just take chance and test this on Jenkins. If the change causes other problems, I will fix tomorrow.

@SparkQA
Copy link

SparkQA commented Aug 28, 2020

Test build #127988 has finished for PR 29535 at commit 12d90bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -544,6 +545,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
case None => Nil
case Some(null) => Nil
case Some(any) => any :: Nil
case map: CaseInsensitiveStringMap =>
truncatedString(map.entrySet().toArray(), "[", ", ", "]", maxFields) :: Nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure this returns meaningful string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we should at least display the original map (keys are not lowercased)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use asCaseSensitiveMap call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the following

scala> Map("a" -> 1).asJava.entrySet.toArray.head
res1: Object = scala.collection.convert.Wrappers$MapWrapper$$anon$1$$anon$5$$anon$6@60

I think it's safer to generate the string key=value manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this:

scala> val options = Map(
     | "Key1" -> "Value1",
     | "key2" -> "vaue2"
     | )
options: scala.collection.immutable.Map[String,String] = Map(Key1 -> Value1, key2 -> vaue2)

scala> val map = new CaseInsensitiveStringMap(options.asJava)
map: org.apache.spark.sql.util.CaseInsensitiveStringMap = org.apache.spark.sql.util.CaseInsensitiveStringMap@9f174175

scala> map.asCaseSensitiveMap().entrySet().toArray()
res3: Array[Object] = Array(Key1=Value1, key2=vaue2)

I used case map: CaseInsensitiveStringMap for pattern matching, I guess we are probably OK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test case in ExplainSuite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will do. Have a good night!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can't have a UnresolvedRelation with a non empty CaseInsensitiveStringMap in explain? The only way to create a UnresolvedRelation with a non empty CaseInsensitiveStringMap is using DataFrameReader.table, but this is resolved to a V2Relation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df.explain(true) will explain the parsed plan, which contains UnresolvedRelation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a test. Thank you!

@SparkQA
Copy link

SparkQA commented Aug 31, 2020

Test build #128077 has finished for PR 29535 at commit b371912.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2020

Test build #128078 has finished for PR 29535 at commit a41c06a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Aug 31, 2020

Test build #128084 has finished for PR 29535 at commit a41c06a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 806140d Aug 31, 2020
@huaxingao
Copy link
Contributor Author

Thank you very much! @cloud-fan @viirya

@huaxingao huaxingao deleted the table_options branch August 31, 2020 16:02
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm too.

@HyukjinKwon
Copy link
Member

wait, I get confused here. We already defined a table with options. How does it work with the newly set options? are they merged?

@cloud-fan
Copy link
Contributor

it's table properties vs scan options

@HyukjinKwon
Copy link
Member

So UnresolvedReleation is shared for both cases but conditionally use the UnresolvedReleation.options only for Scan? that's very confusing.

@HyukjinKwon
Copy link
Member

this creates a myth that setting options will overwrite table properties. see also #34072

@cloud-fan
Copy link
Contributor

this creates a myth that setting options will overwrite table properties.

This is expected. Per-scan options have higher priority than table properties.

@@ -40,9 +41,12 @@ class UnresolvedException[TreeType <: TreeNode[_]](tree: TreeType, function: Str
* Holds the name of a relation that has yet to be looked up in a catalog.
*
* @param multipartIdentifier table name
* @param options options to scan this relation. Only applicable to v2 table scan.
Copy link
Member

@HyukjinKwon HyukjinKwon Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I just noticed 5e82548 added the merging behaviour for V1.

Okay, maybe we should fix the comments here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have a followup to fix the comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opened a PR #34075

Copy link
Member

@HyukjinKwon HyukjinKwon Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I created a PR without seeing your comment. Thanks @huaxingao.

huaxingao pushed a commit that referenced this pull request Sep 23, 2021
…ation

### What changes were proposed in this pull request?

This PR fixes the 'options' description on `UnresolvedRelation`. This comment was added in #29535 but not valid anymore because V1 also uses this `options` (and merge the options with the table properties) per #29712.

This PR can go through from `master` to `branch-3.1`.

### Why are the changes needed?

To make `UnresolvedRelation.options`'s description clearer.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Scala linter by `dev/linter-scala`.

Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
huaxingao pushed a commit that referenced this pull request Sep 23, 2021
…ation

### What changes were proposed in this pull request?

This PR fixes the 'options' description on `UnresolvedRelation`. This comment was added in #29535 but not valid anymore because V1 also uses this `options` (and merge the options with the table properties) per #29712.

This PR can go through from `master` to `branch-3.1`.

### Why are the changes needed?

To make `UnresolvedRelation.options`'s description clearer.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Scala linter by `dev/linter-scala`.

Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
(cherry picked from commit 0076eba)
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
huaxingao pushed a commit that referenced this pull request Sep 23, 2021
…ation

### What changes were proposed in this pull request?

This PR fixes the 'options' description on `UnresolvedRelation`. This comment was added in #29535 but not valid anymore because V1 also uses this `options` (and merge the options with the table properties) per #29712.

This PR can go through from `master` to `branch-3.1`.

### Why are the changes needed?

To make `UnresolvedRelation.options`'s description clearer.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Scala linter by `dev/linter-scala`.

Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
(cherry picked from commit 0076eba)
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants