
[SPARK-16401] [SQL] Data Source API: Enable Extending RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider #14075


Closed
gatorsmile wants to merge 3 commits into master from gatorsmile/dataSource

Conversation

gatorsmile
Member

@gatorsmile gatorsmile commented Jul 6, 2016

What changes were proposed in this pull request?

When users try to implement a data source by extending only `RelationProvider` and `CreatableRelationProvider`, they hit an error when the relation is resolved.

```Scala
spark.read
  .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .load()
  .write
  .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .save()
```

The error they hit looks like this:

```
org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
```

Actually, the bug fix is simple: `DataSource.createRelation(sparkSession.sqlContext, mode, options, data)` already returns a `BaseRelation`. We should not assign the schema to `userSpecifiedSchema`; that schema assignment only makes sense for data sources that extend `FileFormat`.

How was this patch tested?

Added a test case.

@gatorsmile gatorsmile changed the title Data Source API: Extend RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider [SPARK-16401] [SQL] Data Source API: Extend RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider Jul 6, 2016
@gatorsmile gatorsmile changed the title [SPARK-16401] [SQL] Data Source API: Extend RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider [SPARK-16401] [SQL] Data Source API: Enable Extending RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider Jul 6, 2016
@gatorsmile
Member Author

@rxin @marmbrus @yhuai Not sure whether we should backport to 2.0. Personally, this sounds like a critical bug to me.

@rxin
Contributor

rxin commented Jul 6, 2016

Is this a regression?

@gatorsmile
Member Author

I will give it a try tonight. My environment for 1.6 is at home.

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61864 has finished for PR 14075 at commit dd644f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JustinPihony
Contributor

@rxin It does look like this might have been a regression introduced by the initial creation of `DataSource` from `ResolvedDataSource`. That said, @marmbrus could speak to it better, as he was the one who wrote that code. Maybe there was a reason?

@gatorsmile
Member Author

@rxin This is a regression. I tried it in Spark 1.6, and it works well there. I think we need to fix it in Spark 2.0.

Thanks!

@rxin
Contributor

rxin commented Jul 7, 2016

Thanks - then we should merge this in 2.0.

@gatorsmile
Member Author

Sure, let me know whether I need to submit another PR for backporting to 2.0. Thanks!

@rxin
Contributor

rxin commented Jul 8, 2016

cc @cloud-fan and @liancheng for review.

sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation = {
LastOptions.parameters = parameters
LastOptions.schema = None
Contributor

Why assign values here? We don't check them in the test, right?

Member Author

That is copied from another default source; it is not needed in this test case. Let me clean it up. Thanks!

@cloud-fan
Contributor

Nice catch! The fix LGTM.

override def createRelation(
sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation = {
LastOptions.parameters = parameters
Contributor

hmmm, is it a convention to assign parameters to LastOptions? I think the test can work without it.

Member Author

Yeah, sure. Let me remove all these things. Thanks!

@SparkQA

SparkQA commented Jul 8, 2016

Test build #61984 has finished for PR 14075 at commit 405d5d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2016

Test build #61990 has finished for PR 14075 at commit fb8fd3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 9, 2016
… and CreatableRelationProvider without Extending SchemaRelationProvider

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14075 from gatorsmile/dataSource.

(cherry picked from commit 7374e51)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 7374e51 Jul 9, 2016
@cloud-fan
Contributor

thanks, merging to master and 2.0!
