[SPARK-27521][SQL] Move data source v2 to catalyst module #24416


Closed
cloud-fan wants to merge 12 commits into apache:master from cloud-fan:move

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Apr 19, 2019

What changes were proposed in this pull request?

We are currently in a strange state: some data source v2 interfaces (the catalog-related ones) are in sql/catalyst, while others (Table, ScanBuilder, DataReader, etc.) are in sql/core.

I don't see a reason to keep the data source v2 API split across 2 modules. If we have to pick one module, I think sql/catalyst is the one to go with.

The catalyst module already contains user-facing APIs such as DataType and Row. We also have to update Analyzer and SessionCatalog to support the new catalog plugin, and that work needs to live in the catalyst module.

This PR solves the problem we have in #24246.

How was this patch tested?

existing tests

@cloud-fan
Contributor Author

However we end up reorganizing the packages, I think there is no objection to moving data source v2 to sql/catalyst.

cc @rxin @rdblue @mccheah @gatorsmile @gengliangwang

@SparkQA

SparkQA commented Apr 19, 2019

Test build #104748 has finished for PR 24416 at commit c62aae3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @cloud-fan. Could you fix the build error?

[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/catalyst/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java:21: error: cannot find symbol
[error] import org.apache.spark.sql.sources.DataSourceRegister;
[error]                                     ^

@SparkQA

SparkQA commented Apr 22, 2019

Test build #104796 has finished for PR 24416 at commit afc2fec.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait DataSourceRegister

@SparkQA

SparkQA commented Apr 22, 2019

Test build #104808 has finished for PR 24416 at commit 0163dec.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait DataSourceRegister

@HeartSaVioR
Contributor

A reminder: the sql/catalyst module (implicitly) tends to be treated as non-public, so leaving classes public there was OK. If we decide to place public APIs in sql/catalyst and treat the module as public, there might be some spots we want to explicitly hide.

@mccheah
Contributor

mccheah commented Apr 24, 2019

@HeartSaVioR this has been discussed before - I think while that was previously the case, the current push on Data Source V2 means we need to revisit that position. There might be a different Maven module organization that is sensible, for example an sql-api or catalyst-api module.

@cloud-fan
Contributor Author

the sql/catalyst module (implicitly) tends to be treated as non-public

Actually, that is not quite true; the public DataType API, at least, is in sql/catalyst.

Member

@HyukjinKwon HyukjinKwon left a comment

@cloud-fan, can you double-check that the API doc is generated appropriately, and clarify in the PR description that this PR does not move classes under the catalyst package, org/apache/spark/sql/catalyst?

@rdblue
Contributor

rdblue commented May 1, 2019

Looks like this moved the v2 package and the vectorized package. Is it necessary to move ArrowColumnVector? That seems like an implementation class and not an interface.

Also, does it make sense to move DataSourceRegister? That seems like a part of v1 that we've just reused. Maybe we should add a method to TableProvider instead of moving the trait?
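To make that suggestion concrete, here is a hypothetical sketch contrasting the existing v1 trait with a short-name method declared on the v2 provider itself; the TableProviderWithName trait and its default are illustrative only, not an agreed-upon API.

```scala
// The existing v1 trait (org.apache.spark.sql.sources.DataSourceRegister)
// that this PR currently moves into catalyst.
trait DataSourceRegister {
  /** Short alias used to look the source up by name, e.g. "parquet". */
  def shortName(): String
}

// Hypothetical alternative sketched in the comment above: declare the short
// name on the v2 provider instead of reusing the v1 trait. Illustrative only.
trait TableProviderWithName {
  def shortName(): String = getClass.getCanonicalName
}
```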

@gengliangwang
Member

Looks like this moved the v2 package and the vectorized package. Is it necessary to move ArrowColumnVector? That seems like an implementation class and not an interface.

I think @cloud-fan moved the vectorized package because org.apache.spark.sql.vectorized.ColumnarBatch is used in PartitionReaderFactory.

@rdblue
Contributor

rdblue commented May 1, 2019

@gengliangwang, I agree. It looks like the entire package was moved. Is it necessary to move the entire package, or should this move just the interfaces?

@gengliangwang
Member

If it stays in the sql/core module, we can't use it from the catalyst module unless we create a new abstraction at the API level. Also, ColumnarBatch is a final class. Though ugly, the move seems necessary to me.

@rdblue
Contributor

rdblue commented May 1, 2019

Sounds fine to me.
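To spell out the coupling discussed above, a compile-only sketch follows. The class name is illustrative, and the package names are the current Spark 3 ones, which postdate this PR's org.apache.spark.sql.sources.v2 layout: the columnar read path returns readers of ColumnarBatch, so the module that defines PartitionReaderFactory must be able to see ColumnarBatch.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch only: the columnar read path of the v2 API hands back
// PartitionReader[ColumnarBatch], which is why the vectorized classes need to
// live in (or below) the module that defines PartitionReaderFactory.
class CouplingSketch extends PartitionReaderFactory {
  override def createReader(p: InputPartition): PartitionReader[InternalRow] =
    throw new UnsupportedOperationException("sketch only")

  override def createColumnarReader(p: InputPartition): PartitionReader[ColumnarBatch] =
    throw new UnsupportedOperationException("sketch only")
}
```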

@dongjoon-hyun
Member

Could you resolve the conflicts please? @cloud-fan

@SparkQA

SparkQA commented May 27, 2019

Test build #105818 has finished for PR 24416 at commit 4d56a2b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 27, 2019

Test build #105819 has finished for PR 24416 at commit ab4f40c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 27, 2019

Test build #105822 has finished for PR 24416 at commit c9e00bc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2019

Test build #106018 has finished for PR 24416 at commit 4b9c964.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -15,7 +15,7 @@
* limitations under the License.
*/

package org.apache.spark.sql.execution.arrow
package org.apache.spark.sql.util
Contributor Author

I don't want to add an execution package in catalyst, so I changed the package of this internal class during the move.

@@ -114,6 +114,10 @@
<version>2.7.3</version>
<type>jar</type>
</dependency>
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-vector</artifactId>
Contributor Author

Since ArrowColumnVector is moving to the catalyst module, I have to move the dependency as well. I think that's fine, as sql/core depends on sql/catalyst.

@SparkQA

SparkQA commented May 31, 2019

Test build #106020 has finished for PR 24416 at commit c1b5932.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

cloud-fan commented May 31, 2019

I moved a few public classes from sql/core to sql/catalyst without changing their package names. I think this is binary compatible, as sql/core depends on sql/catalyst.

I'm not sure if the MiMa failures are legitimate; do you have any ideas? @srowen @JoshRosen

@srowen
Member

srowen commented May 31, 2019

The MiMa warning is legit because those classes are no longer in core. catalyst depends on core, right, not the other way around? So anyone who only depends on core won't see these classes anymore. That could be fine as a breaking change for Spark 3, but it's a legit warning.

@rdblue
Contributor

rdblue commented May 31, 2019

sql/core depends on sql/catalyst, so the classes should always be there. When looking at sql/core on its own, the warning is legitimate. But it should be safe to make this change, because the classes are always available if you have all of the required dependencies.

@srowen
Member

srowen commented May 31, 2019

Am I crazy, or is it the other way around? I don't see a catalyst dependency from core, and wouldn't actually expect one. But yeah, I'm agreeing that this is an OK 'breaking' change to exclude.

@mccheah
Contributor

mccheah commented May 31, 2019

sql/core depends on sql/catalyst: https://github.com/apache/spark/blob/master/sql/core/pom.xml#L63

Though both are dependent on spark-core.

(Edit: mixed it up, fixed comment for accuracy)

@srowen
Member

srowen commented May 31, 2019

Oh, I kept thinking of and looking at core, not sql/core. Right. Well, same conclusion.
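For readers unfamiliar with the mechanics: an accepted change like this is typically recorded in project/MimaExcludes.scala. A rough sketch of what such entries look like, with class names that are illustrative rather than the exact exclusions added for this PR:

```scala
import com.typesafe.tools.mima.core.MissingClassProblem
import com.typesafe.tools.mima.core.ProblemFilters

// Tells MiMa that these classes did not disappear from the API surface;
// they only moved from the sql/core artifact to sql/catalyst.
val dataSourceV2MoveExcludes = Seq(
  ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.TableProvider"),
  ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.Table")
)
```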

@SparkQA

SparkQA commented Jun 1, 2019

Test build #106044 has finished for PR 24416 at commit 9220e78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

Member

@gatorsmile gatorsmile left a comment

LGTM

Also cc @zsxwing

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106194 has finished for PR 24416 at commit 9220e78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@gatorsmile gatorsmile closed this in 8b6232b Jun 5, 2019
@dongjoon-hyun
Member

Finally, Nice! Thank you, guys.

mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019

Closes apache#24416 from cloud-fan/move.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019
## What changes were proposed in this pull request?

`BaseStreamingSource` and `BaseStreamingSink` are used to unify the v1 and v2 streaming data source APIs in some code paths.

This PR removes these 2 interfaces and lets the v1 API extend the v2 API to keep API compatibility.

The motivation is apache#24416. We want to move data source v2 to the catalyst module, but `BaseStreamingSource` and `BaseStreamingSink` are in sql/core.

## How was this patch tested?

existing tests

Closes apache#24471 from cloud-fan/streaming.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019

Closes apache#24416 from cloud-fan/move.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@leobenkel

Hello,

I am trying to migrate from Spark 2.x to 3.x, and replacing these imports

import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter
import org.apache.spark.sql.sources.v2.{DataSourceOptions, StreamWriteSupport}

has been challenging. My investigation led me here; I read the code change and added sql-catalyst to my dependencies, but I can't figure out where those classes went.

Could anyone point me in the right direction? Thanks!

@cloud-fan
Contributor Author

DS v2 has evolved a lot from Spark 2 to 3. I'm afraid you may need to stay with Spark 2 or rewrite your v2 source entirely.

@leobenkel

leobenkel commented Sep 7, 2021

DS v2 has evolved a lot from Spark 2 to 3. I'm afraid you may need to stay with Spark 2 or rewrite your v2 source entirely.

Thank you for your answer!

I don't mind rewriting it. Do you have a tutorial I can follow? I wasn't able to find one myself.

@cloud-fan
Contributor Author

You can always find it in tests :P

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala

@leobenkel

Super, I will take a look. Thank you so much for your help! Have a wonderful rest of your day.
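For anyone else landing here with the same migration question, below is a rough sketch of the shape of a minimal Spark 3 batch-read source. All class names and values are illustrative, and the write/streaming paths follow the same Table/builder pattern; the DataSourceV2Suite linked above remains the authoritative set of examples.

```scala
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Illustrative Spark 3 data source v2 skeleton: a batch read source that
// produces the rows 0, 1, 2 in a single partition.
class ExampleSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("i", "int")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new ExampleTable(schema)
}

class ExampleTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "example"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  // A single object plays ScanBuilder, Scan, and Batch, as the built-in test
  // sources do for brevity.
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder with Scan with Batch {
      override def build(): Scan = this
      override def readSchema(): StructType = tableSchema
      override def toBatch: Batch = this
      override def planInputPartitions(): Array[InputPartition] =
        Array(RangePartition(0, 3))
      override def createReaderFactory(): PartitionReaderFactory = new ExampleReaderFactory
    }
}

case class RangePartition(start: Int, end: Int) extends InputPartition

class ExampleReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val RangePartition(start, end) = partition
    new PartitionReader[InternalRow] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
  }
}
```

With the sketch on the classpath, something like `spark.read.format(classOf[ExampleSource].getName).load()` should return a single-partition DataFrame of the rows 0, 1, 2 (modulo minor signature differences across 3.x releases).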
