[SPARK-27521][SQL] Move data source v2 to catalyst module #24416

Conversation
No matter how we are going to change the package organization, I think there is no objection to moving data source v2 to sql/catalyst.
Test build #104748 has finished for PR 24416 at commit

Hi, @cloud-fan. Could you fix the build error?

Test build #104796 has finished for PR 24416 at commit

Test build #104808 has finished for PR 24416 at commit
As a reminder, the sql/catalyst module (implicitly) tends to be treated as non-public, so leaving classes public was OK. If we decide to place public APIs in sql/catalyst and treat the sql/catalyst module as public, there might be some spots we may want to explicitly hide.
@HeartSaVioR this has been discussed before - I think while that was previously the case, the current push to Data Source V2 necessitates us to change that opinion. There might be a different Maven module organization that is sensible, for example having an

Actually this is not true, at least the public
@cloud-fan, can you double check that the API doc is appropriately generated, and clarify in the PR description that this PR doesn't target moving classes under the catalyst package, org/apache/spark/sql/catalyst?
Looks like this moved the v2 package and the vectorized package. Is it necessary to move the entire package? Also, does it make sense to move

I think @cloud-fan moved the vectorized package because

@gengliangwang, I agree. It looks like the entire package was moved. Is it necessary to move the entire package, or should this move just the interfaces?

If it is in the

Sounds fine to me.

Could you resolve the conflicts please? @cloud-fan
Test build #105818 has finished for PR 24416 at commit

Test build #105819 has finished for PR 24416 at commit

Test build #105822 has finished for PR 24416 at commit

Test build #106018 has finished for PR 24416 at commit
```diff
@@ -15,7 +15,7 @@
  * limitations under the License.
  */

-package org.apache.spark.sql.execution.arrow
+package org.apache.spark.sql.util
```
I don't want to add an `execution` package in catalyst, so I changed the package for this internal class during the move.
```diff
@@ -114,6 +114,10 @@
       <version>2.7.3</version>
       <type>jar</type>
     </dependency>
+    <dependency>
+      <groupId>org.apache.arrow</groupId>
+      <artifactId>arrow-vector</artifactId>
```
Since `ArrowColumnVector` is moved to the catalyst module, I have to move the dependency as well. I think it's fine as sql/core depends on sql/catalyst.
Test build #106020 has finished for PR 24416 at commit
I moved a few public classes from sql/core to sql/catalyst, without changing the package name. I think this is binary compatible as sql/core depends on sql/catalyst. I'm not sure if the MiMa failures are legitimate; do you have any ideas? @srowen @JoshRosen
The MiMa warning is legit because those classes are no longer in core. catalyst depends on core, right, not the other way around? So anyone that only depends on core won't see these classes anymore. That could be fine as a breaking change for Spark 3, but it's a legit warning.
sql/core depends on sql/catalyst, so the classes should always be there. When looking at sql/core on its own, the warning is legitimate. But it should be safe to make this change because the classes are always available if you have all of the required dependencies.

Am I crazy, or is it the other way around? I don't see a catalyst dependency from core, and wouldn't actually expect that? But yeah, I'm agreeing that this is an OK 'breaking' change to exclude.
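For the record, accepted breaks like this are typically silenced via exclusion entries in Spark's `project/MimaExcludes.scala`. A hedged sketch of what such entries look like — the exact filter list added for this PR may differ, and the second class pattern is illustrative:

```scala
// Sketch of MimaExcludes-style entries (not the PR's actual list).
// These tell MiMa that the listed classes vanishing from sql/core is
// intentional: they moved to sql/catalyst, which sql/core depends on,
// so downstream users with both modules on the classpath are unaffected.
import com.typesafe.tools.mima.core._

object MimaExcludesSketch {
  lazy val excludes: Seq[ProblemFilter] = Seq(
    // A concrete class this PR moved to catalyst:
    ProblemFilters.exclude[MissingClassProblem](
      "org.apache.spark.sql.vectorized.ArrowColumnVector"),
    // Wildcard form, covering a whole relocated package (illustrative):
    ProblemFilters.exclude[Problem]("org.apache.spark.sql.sources.v2.*")
  )
}
```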
Though both are dependent on (Edit: mixed it up, fixed comment for accuracy)

Oh, I kept thinking and looking at
Test build #106044 has finished for PR 24416 at commit

retest this please
LGTM
Also cc @zsxwing
Test build #106194 has finished for PR 24416 at commit
Thanks! Merged to master.

Finally! Nice! Thank you, guys.
Currently we are in a strange status that some data source v2 interfaces (catalog related) are in sql/catalyst, and some data source v2 interfaces (Table, ScanBuilder, DataReader, etc.) are in sql/core. I don't see a reason to keep the data source v2 API in 2 modules. If we should pick one module, I think sql/catalyst is the one to go with. The catalyst module already has some user-facing stuff like DataType, Row, etc. And we have to update `Analyzer` and `SessionCatalog` to support the new catalog plugin, which needs to be in the catalyst module. This PR can solve the problem we have in apache#24246.

Tested with existing tests.

Closes apache#24416 from cloud-fan/move.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?

`BaseStreamingSource` and `BaseStreamingSink` are used to unify the v1 and v2 streaming data source APIs in some code paths. This PR removes these 2 interfaces, and lets the v1 API extend the v2 API to keep API compatibility. The motivation is apache#24416: we want to move data source v2 to the catalyst module, but `BaseStreamingSource` and `BaseStreamingSink` are in sql/core.

## How was this patch tested?

existing tests

Closes apache#24471 from cloud-fan/streaming.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Hello, I am trying to migrate from Spark 2.x to 3.x, and replacing those has been challenging. My investigation led me here; I read the code change and added

Could anyone point me in the right direction? Thanks!
DS v2 has evolved a lot from Spark 2 to 3. I'm afraid you may need to stay with Spark 2 or rewrite your v2 source entirely.

Thank you for your answer! I don't mind rewriting it. Would you have a tutorial I can follow? I wasn't able to find any by myself.

You can always find it in tests :P

Super, I will take a look. Thank you so much for your help! Have a wonderful rest of your day.
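For readers landing here with the same migration question: in Spark 3.x the read-side v2 entry points live under `org.apache.spark.sql.connector`. A rough, uncompiled sketch of a minimal batch source follows — the class names, option handling, and generated row contents are made up for illustration, not taken from this PR:

```scala
// Hedged sketch against the Spark 3.x DSv2 connector API.
// Hypothetical source that produces a single column `id` with values 0..9.
import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.types.{LongType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class SimpleProvider extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("id", LongType)

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new SimpleTable(schema)
}

class SimpleTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "simple"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.BATCH_READ).asJava

  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder {
      override def build(): Scan = new SimpleScan(tableSchema)
    }
}

class SimpleScan(tableSchema: StructType) extends Scan with Batch {
  override def readSchema(): StructType = tableSchema
  override def toBatch: Batch = this

  // One partition producing ids 0..9; a real source would split its data here.
  override def planInputPartitions(): Array[InputPartition] =
    Array(new InputPartition {})

  // Note: real factories and partitions must be serializable, since they
  // are shipped to executors; this sketch glosses over that.
  override def createReaderFactory(): PartitionReaderFactory =
    new PartitionReaderFactory {
      override def createReader(p: InputPartition): PartitionReader[InternalRow] =
        new PartitionReader[InternalRow] {
          private var i = -1L
          override def next(): Boolean = { i += 1; i < 10 }
          override def get(): InternalRow = InternalRow(i)
          override def close(): Unit = ()
        }
    }
}
```

If something like this works, it would be loaded with `spark.read.format(classOf[SimpleProvider].getName).load()`; the DSv2 test suites under sql/core are, as noted above, the most reliable reference for the current interfaces.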
## What changes were proposed in this pull request?

Currently we are in a strange status that some data source v2 interfaces (catalog related) are in sql/catalyst, and some data source v2 interfaces (Table, ScanBuilder, DataReader, etc.) are in sql/core.

I don't see a reason to keep the data source v2 API in 2 modules. If we should pick one module, I think sql/catalyst is the one to go with.

The catalyst module already has some user-facing stuff like DataType, Row, etc. And we have to update `Analyzer` and `SessionCatalog` to support the new catalog plugin, which needs to be in the catalyst module. This PR can solve the problem we have in #24246.

## How was this patch tested?

existing tests