[SPARK-27521][SQL] Move data source v2 to catalyst module #24416


Closed
cloud-fan wants to merge 12 commits into apache:master from cloud-fan:move

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Apr 19, 2019

What changes were proposed in this pull request?

We are currently in a strange state: some data source v2 interfaces (the catalog-related ones) are in sql/catalyst, while others (Table, ScanBuilder, DataReader, etc.) are in sql/core.

I don't see a reason to keep the data source v2 API split across 2 modules. If we have to pick one module, I think sql/catalyst is the one to go with.

The catalyst module already contains user-facing APIs such as DataType and Row. We also have to update Analyzer and SessionCatalog to support the new catalog plugin, and that work needs to live in the catalyst module.

This PR solves the problem we have in #24246.

How was this patch tested?

existing tests

@cloud-fan
Contributor Author

However we end up reorganizing the packages, I think there is no objection to moving data source v2 to sql/catalyst.

cc @rxin @rdblue @mccheah @gatorsmile @gengliangwang

@SparkQA

SparkQA commented Apr 19, 2019

Test build #104748 has finished for PR 24416 at commit c62aae3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @cloud-fan. Could you fix the build error?

[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/catalyst/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java:21: error: cannot find symbol
[error] import org.apache.spark.sql.sources.DataSourceRegister;
[error]                                     ^

@SparkQA

SparkQA commented Apr 22, 2019

Test build #104796 has finished for PR 24416 at commit afc2fec.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait DataSourceRegister

@SparkQA

SparkQA commented Apr 22, 2019

Test build #104808 has finished for PR 24416 at commit 0163dec.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait DataSourceRegister

@HeartSaVioR
Contributor

A reminder: the sql/catalyst module (implicitly) tends to be treated as non-public, so leaving classes public there was OK. If we decide to place public APIs in sql/catalyst and treat the module as public, there might be some spots we want to explicitly hide.

@mccheah
Contributor

mccheah commented Apr 24, 2019

@HeartSaVioR this has been discussed before - I think while that was previously the case, the current push on Data Source V2 means we need to revisit that position. There might be a different Maven module organization that is sensible, for example an sql-api or catalyst-api module.

@cloud-fan
Contributor Author

the sql/catalyst module (implicitly) tends to be treated as non-public

Actually, that is not quite true; the public DataType API, at least, is in sql/catalyst.

Member

@HyukjinKwon HyukjinKwon left a comment

@cloud-fan, can you double-check that the API doc is generated appropriately, and clarify in the PR description that this PR does not move classes under the catalyst package, org/apache/spark/sql/catalyst?

@rdblue
Contributor

rdblue commented May 1, 2019

Looks like this moved the v2 package and the vectorized package. Is it necessary to move ArrowColumnVector? That seems like an implementation class and not an interface.

Also, does it make sense to move DataSourceRegister? That seems like a part of v1 that we've just reused. Maybe we should add a method to TableProvider instead of moving the trait?
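To make that suggestion concrete, here is a hypothetical sketch contrasting the existing v1 trait with a short-name method declared on the v2 provider itself; the TableProviderWithName trait and its default are illustrative only, not an agreed-upon API.

```scala
// The existing v1 trait (org.apache.spark.sql.sources.DataSourceRegister)
// that this PR currently moves into catalyst.
trait DataSourceRegister {
  /** Short alias used to look the source up by name, e.g. "parquet". */
  def shortName(): String
}

// Hypothetical alternative sketched in the comment above: declare the short
// name on the v2 provider instead of reusing the v1 trait. Illustrative only.
trait TableProviderWithName {
  def shortName(): String = getClass.getCanonicalName
}
```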

@gengliangwang
Member

Looks like this moved the v2 package and the vectorized package. Is it necessary to move ArrowColumnVector? That seems like an implementation class and not an interface.

I think @cloud-fan moved the vectorized package because org.apache.spark.sql.vectorized.ColumnarBatch is used in PartitionReaderFactory.

@rdblue
Contributor

rdblue commented May 1, 2019

@gengliangwang, I agree. It looks like the entire package was moved. Is it necessary to move the entire package, or should this move just the interfaces?

@gengliangwang
Member

If it stays in the sql/core module, we can't use it from the catalyst module unless we create a new abstraction at the API level. Also, ColumnarBatch is a final class. Though ugly, the move seems necessary to me.

@rdblue
Contributor

rdblue commented May 1, 2019

Sounds fine to me.
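To spell out the coupling discussed above, a compile-only sketch follows. The class name is illustrative, and the package names are the current Spark 3 ones, which postdate this PR's org.apache.spark.sql.sources.v2 layout: the columnar read path returns readers of ColumnarBatch, so the module that defines PartitionReaderFactory must be able to see ColumnarBatch.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch only: the columnar read path of the v2 API hands back
// PartitionReader[ColumnarBatch], which is why the vectorized classes need to
// live in (or below) the module that defines PartitionReaderFactory.
class CouplingSketch extends PartitionReaderFactory {
  override def createReader(p: InputPartition): PartitionReader[InternalRow] =
    throw new UnsupportedOperationException("sketch only")

  override def createColumnarReader(p: InputPartition): PartitionReader[ColumnarBatch] =
    throw new UnsupportedOperationException("sketch only")
}
```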

@dongjoon-hyun
Member

Could you resolve the conflicts please? @cloud-fan

@SparkQA

SparkQA commented May 27, 2019

Test build #105818 has finished for PR 24416 at commit 4d56a2b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 27, 2019

Test build #105819 has finished for PR 24416 at commit ab4f40c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 27, 2019

Test build #105822 has finished for PR 24416 at commit c9e00bc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2019

Test build #106018 has finished for PR 24416 at commit 4b9c964.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -15,7 +15,7 @@
* limitations under the License.
*/

package org.apache.spark.sql.execution.arrow
package org.apache.spark.sql.util
Contributor Author

I don't want to add an execution package in catalyst, so I changed the package of this internal class during the move.

@@ -114,6 +114,10 @@
<version>2.7.3</version>
<type>jar</type>
</dependency>
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-vector</artifactId>
Contributor Author

Since ArrowColumnVector is moving to the catalyst module, I have to move the dependency as well. I think that's fine, as sql/core depends on sql/catalyst.

@SparkQA

SparkQA commented May 31, 2019

Test build #106020 has finished for PR 24416 at commit c1b5932.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

cloud-fan commented May 31, 2019

I moved a few public classes from sql/core to sql/catalyst without changing their package names. I think this is binary compatible, as sql/core depends on sql/catalyst.

I'm not sure if the MiMa failures are legitimate; do you have any ideas? @srowen @JoshRosen

@srowen
Member

srowen commented May 31, 2019

The MiMa warning is legit because those classes are no longer in core. catalyst depends on core, right, not the other way around? So anyone who only depends on core won't see these classes anymore. That could be fine as a breaking change for Spark 3, but it's a legit warning.

@rdblue
Contributor

rdblue commented May 31, 2019

sql/core depends on sql/catalyst, so the classes should always be there. When looking at sql/core on its own, the warning is legitimate. But it should be safe to make this change, because the classes are always available if you have all of the required dependencies.

@srowen
Member

srowen commented May 31, 2019

Am I crazy, or is it the other way around? I don't see a catalyst dependency from core, and wouldn't actually expect one. But yeah, I'm agreeing that this is an OK 'breaking' change to exclude.

@mccheah
Contributor

mccheah commented May 31, 2019

sql/core depends on sql/catalyst: https://github.com/apache/spark/blob/master/sql/core/pom.xml#L63

Though both are dependent on spark-core.

(Edit: mixed it up, fixed comment for accuracy)

@srowen
Member

srowen commented May 31, 2019

Oh, I kept thinking of and looking at core, not sql/core. Right. Well, same conclusion.
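For readers unfamiliar with the mechanics: an accepted change like this is typically recorded in project/MimaExcludes.scala. A rough sketch of what such entries look like, with class names that are illustrative rather than the exact exclusions added for this PR:

```scala
import com.typesafe.tools.mima.core.MissingClassProblem
import com.typesafe.tools.mima.core.ProblemFilters

// Tells MiMa that these classes did not disappear from the API surface;
// they only moved from the sql/core artifact to sql/catalyst.
val dataSourceV2MoveExcludes = Seq(
  ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.TableProvider"),
  ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.Table")
)
```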

@SparkQA

SparkQA commented Jun 1, 2019

Test build #106044 has finished for PR 24416 at commit 9220e78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

Member

@gatorsmile gatorsmile left a comment

LGTM

Also cc @zsxwing

@SparkQA

SparkQA commented Jun 5, 2019

Test build #106194 has finished for PR 24416 at commit 9220e78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@gatorsmile gatorsmile closed this in 8b6232b Jun 5, 2019
@dongjoon-hyun
Member

Finally, Nice! Thank you, guys.

mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019

Closes apache#24416 from cloud-fan/move.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019
## What changes were proposed in this pull request?

`BaseStreamingSource` and `BaseStreamingSink` are used to unify the v1 and v2 streaming data source APIs in some code paths.

This PR removes these 2 interfaces and lets the v1 API extend the v2 API to keep API compatibility.

The motivation is apache#24416. We want to move data source v2 to the catalyst module, but `BaseStreamingSource` and `BaseStreamingSink` are in sql/core.

## How was this patch tested?

existing tests

Closes apache#24471 from cloud-fan/streaming.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019

Closes apache#24416 from cloud-fan/move.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@leobenkel

Hello,

I am trying to migrate from Spark 2.x to 3.x, and replacing these imports

import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter
import org.apache.spark.sql.sources.v2.{DataSourceOptions, StreamWriteSupport}

has been challenging. My investigation led me here; I read the code change and added sql-catalyst to my dependencies, but I can't figure out where those classes went.

Could anyone point me in the right direction? Thanks!

@cloud-fan
Contributor Author

DS v2 has evolved a lot from Spark 2 to 3. I'm afraid you may need to stay with Spark 2 or rewrite your v2 source entirely.

@leobenkel

leobenkel commented Sep 7, 2021

DS v2 has evolved a lot from Spark 2 to 3. I'm afraid you may need to stay with Spark 2 or rewrite your v2 source entirely.

Thank you for your answer!

I don't mind rewriting it. Do you have a tutorial I can follow? I wasn't able to find one myself.

@cloud-fan
Contributor Author

You can always find it in tests :P

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala

@leobenkel

Super, I will take a look. Thank you so much for your help! Have a wonderful rest of your day.
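For anyone else landing here with the same migration question, below is a rough sketch of the shape of a minimal Spark 3 batch-read source. All class names and values are illustrative, and the write/streaming paths follow the same Table/builder pattern; the DataSourceV2Suite linked above remains the authoritative set of examples.

```scala
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Illustrative Spark 3 data source v2 skeleton: a batch read source that
// produces the rows 0, 1, 2 in a single partition.
class ExampleSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("i", "int")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new ExampleTable(schema)
}

class ExampleTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "example"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  // A single object plays ScanBuilder, Scan, and Batch, as the built-in test
  // sources do for brevity.
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder with Scan with Batch {
      override def build(): Scan = this
      override def readSchema(): StructType = tableSchema
      override def toBatch: Batch = this
      override def planInputPartitions(): Array[InputPartition] =
        Array(RangePartition(0, 3))
      override def createReaderFactory(): PartitionReaderFactory = new ExampleReaderFactory
    }
}

case class RangePartition(start: Int, end: Int) extends InputPartition

class ExampleReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val RangePartition(start, end) = partition
    new PartitionReader[InternalRow] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
  }
}
```

With the sketch on the classpath, something like `spark.read.format(classOf[ExampleSource].getName).load()` should return a single-partition DataFrame of the rows 0, 1, 2 (modulo minor signature differences across 3.x releases).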
