
[SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use spark-proto uber jar to test the connect module #42236

Closed
wants to merge 15 commits into from

Conversation

LuciferYang
Contributor

@LuciferYang LuciferYang commented Jul 31, 2023

What changes were proposed in this pull request?

Before this pr, when we tested the connect module, Maven used the shaded spark-protobuf jar for testing, while SBT used the original jar, which led to inconsistent testing behavior, so some tests passed when using SBT but failed when using Maven:

Run

```
build/mvn clean install -DskipTests
build/mvn test -pl connector/connect/server
```

and the following two tests fail:

```
- from_protobuf_messageClassName *** FAILED ***
  org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could not load Protobuf class with name org.apache.spark.connect.proto.StorageLevel. org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with Protobuf classes needs to be shaded (com.google.protobuf.* --> org.sparkproject.spark_protobuf.protobuf.*).
  at org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417)
  at org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193)
  at org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151)
  at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
  at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
  at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
  at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
  at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
  at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)

- from_protobuf_messageClassName_options *** FAILED ***
  org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could not load Protobuf class with name org.apache.spark.connect.proto.StorageLevel. org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with Protobuf classes needs to be shaded (com.google.protobuf.* --> org.sparkproject.spark_protobuf.protobuf.*).
  at org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417)
  at org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193)
  at org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151)
  at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
  at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
  at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
  at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
  at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
  at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
```
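For context, the failure comes down to a type check along the following lines (a hedged sketch of the behavior described by the error message above, not the actual `ProtobufUtils` code): when the shaded spark-protobuf jar is on the classpath, the class named in `from_protobuf` must extend the relocated `Message` type, which the unshaded `org.apache.spark.connect.proto` classes do not.

```scala
// Hedged sketch of the check behind CANNOT_LOAD_PROTOBUF_CLASS (illustrative only).
import org.sparkproject.spark_protobuf.protobuf.Message

val clazz = Class.forName("org.apache.spark.connect.proto.StorageLevel")
if (!classOf[Message].isAssignableFrom(clazz)) {
  // The connect proto classes are generated against com.google.protobuf.*,
  // so this branch is taken when the shaded spark-protobuf jar is used for testing.
  throw new IllegalArgumentException(
    s"${clazz.getName} does not extend shaded Protobuf Message class ${classOf[Message].getName}")
}
```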

So this pr makes SBT also use the spark-proto uber jar (spark-protobuf-assembly-**-SNAPSHOT.jar) for the above tests and refactors the test cases so that they pass with both SBT and Maven after this pr.
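For illustration, the relocation in that uber jar corresponds to a shade rule along these lines; this is only a hedged sbt-assembly sketch (assuming the sbt-assembly plugin), not necessarily the exact settings in Spark's build:

```scala
// Sketch of a build.sbt / SparkBuild-style setting: relocate protobuf classes the same
// way the Maven shade plugin does for the spark-protobuf module.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "org.sparkproject.spark_protobuf.protobuf.@1").inAll
)
```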

Why are the changes needed?

Make the connect server module tests pass with both SBT and Maven.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass GitHub Actions
  • Manual check

```
build/mvn clean install -DskipTests
build/mvn test -pl connector/connect/server
```

All tests passed after this pr.

@LuciferYang LuciferYang changed the title [SPARK-43646][CONNECT] Test proto case [SPARK-43646][CONNECT] Make sbt use spark-proto uber jar to test the connect module Jul 31, 2023
@LuciferYang LuciferYang changed the title [SPARK-43646][CONNECT] Make sbt use spark-proto uber jar to test the connect module [SPARK-43646][CONNECT] Make sbt use spark-proto uber jar to test the connect module Jul 31, 2023
@LuciferYang LuciferYang changed the title [SPARK-43646][CONNECT] Make sbt use spark-proto uber jar to test the connect module [SPARK-43646][CONNECT] Make both sbt and maven use spark-proto uber jar to test the connet module Jul 31, 2023
@LuciferYang LuciferYang changed the title [SPARK-43646][CONNECT] Make both sbt and maven use spark-proto uber jar to test the connet module [SPARK-43646][CONNECT] Make both SBT and Maven use spark-proto uber jar to test the connet module Jul 31, 2023
@LuciferYang LuciferYang marked this pull request as ready for review July 31, 2023 10:24
@LuciferYang LuciferYang changed the title [SPARK-43646][CONNECT] Make both SBT and Maven use spark-proto uber jar to test the connet module [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use spark-proto uber jar to test the connet module Jul 31, 2023
@LuciferYang LuciferYang changed the title [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use spark-proto uber jar to test the connet module [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use spark-proto uber jar to test the connect module Aug 1, 2023
@LuciferYang LuciferYang changed the title [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use spark-proto uber jar to test the connect module [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use spark-proto uber jar to test the connect module Aug 1, 2023
@@ -3233,14 +3233,15 @@ class PlanGenerationTestSuite
"connect/common/src/test/resources/protobuf-tests/common.desc"

test("from_protobuf messageClassName") {
Contributor Author

If we deem this plan too intricate, we could delete or ignore the test cases `from_protobuf messageClassName` and `from_protobuf messageClassName options` to temporarily abandon this testing scenario.
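For reference, these two cases exercise the `messageClassName` overloads of `from_protobuf`, roughly as follows (an illustrative sketch assuming a binary column named `bytes`; the option shown is only an example, and the real suite builds the plans differently):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

// Illustrative only: parse a binary column using a compiled protobuf class name,
// without and with reader options.
val parsed = from_protobuf(col("bytes"), "org.apache.spark.connect.proto.StorageLevel")
val parsedWithOptions = from_protobuf(
  col("bytes"),
  "org.apache.spark.connect.proto.StorageLevel",
  java.util.Collections.singletonMap("recursive.fields.max.depth", "2"))
```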

Contributor

This is great! Thank you.

* * <p>
* * TODO(SPARK-44606): Generate this file and replace the package names in the file when testing.
*/
public final class TestProto {
Contributor Author

The task of generating Java files during testing and replacing all instances of com.google.protobuf. with org.sparkproject.spark_protobuf.protobuf. in the file will be carried forward as a follow-up.

Contributor

Could you update the doc comment to say why this is included (the current comment talks about a fix, but not the problem)? Looks like this uses the shaded version of the protobuf library.


@HyukjinKwon
Member

I like this but cc @grundprinzip and @zhenlineo too

@zhenlineo
Contributor

Would it be easier if we change maven to use the unshaded jar?

@LuciferYang
Contributor Author

LuciferYang commented Aug 4, 2023

Would it be easier if we change maven to use the unshaded jar?

```xml
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
  <shadedArtifactAttached>false</shadedArtifactAttached>
  <shadeTestJar>false</shadeTestJar>
  <artifactSet>
```

Since shadedArtifactAttached is configured to be false, the shaded jar replaces the main artifact and Maven no longer publishes the unshaded jar. Following Spark's previous practice, if both the unshaded and shaded jars are needed, the module has to be split in two, as with kafka-0-10 and kafka-0-10-assembly, or kinesis-asl and kinesis-asl-assembly. So, if we need Maven to publish the unshaded jar, we need to split the protobuf module into two modules: protobuf and protobuf-assembly. I had a previous PR that did this work: #41466.

@LuciferYang
Contributor Author

LuciferYang commented Aug 4, 2023

While I'm not certain whether it's reasonable, I still want to point out that relocating the content of the spark-protobuf module may result in a poorer user experience: in order to use this sql function, users have no choice but to relocate the content of the Java PB description files used in their business according to Spark's project rules (replace all com.google.protobuf. with org.sparkproject.spark_protobuf.protobuf. in their generated Java files; perhaps I misunderstood something?). Is this really user-friendly for existing data and businesses? Meanwhile, the spark-protobuf module won't be packaged into the spark-client tarball, so is it very risky to only publish unshaded jars? @rangadi @HyukjinKwon

@LuciferYang
Contributor Author

What suggestions does everyone have on how to proceed with fixing this issue? Should we have it fixed before the next 3.5.0 RC release?

@HyukjinKwon
Member

What are the standing comments? Otherwise, I would guess it's good to go?

@LuciferYang
Contributor Author

LuciferYang commented Aug 22, 2023

@HyukjinKwon Hmm... I think the fix should be fine. However, as Spark 3.5 no longer needs to support protobuf 2.5, I hope to make some changes to the Experimental APIs involved in this pr so that it no longer intrudes into user code.

The related viewpoints include:

#42236 (comment),
#42236 (comment),
#42236 (comment)

Of course, we could also choose to fix this issue for Spark 3.5 using this pr first and have more discussions or changes for this in Spark 4.0.

@LuciferYang
Contributor Author

LuciferYang commented Aug 24, 2023

4ba45e0 has automated the process of generating `.java` files from `.proto` files and replacing all instances of `com.google.protobuf.` with `org.sparkproject.spark_protobuf.protobuf.` in the generated `.java` files. Thus, the TODO task SPARK-44606 is no longer needed.
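The rewrite step amounts to something like the following (a hedged sketch with an assumed output directory, not the actual build logic in 4ba45e0):

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

// Sketch: after protoc generates the test sources, rewrite references to the protobuf
// package so that they match the shaded spark-protobuf jar. The directory below is
// assumed for illustration.
val generatedDir: Path = Paths.get("target/generated-test-sources/protobuf/java")
Files.walk(generatedDir).iterator().asScala
  .filter(_.toString.endsWith(".java"))
  .foreach { javaFile =>
    val rewritten = new String(Files.readAllBytes(javaFile), "UTF-8")
      .replace("com.google.protobuf.", "org.sparkproject.spark_protobuf.protobuf.")
    Files.write(javaFile, rewritten.getBytes("UTF-8"))
  }
```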

@LuciferYang
Contributor Author

@HyukjinKwon @rangadi Since there are new code changes in 4ba45e0, could you please take another look? If there are no new issues, I will merge this pr if GA passes.

@LuciferYang
Contributor Author

GA passed, and I have manually verified with Maven that the tests pass.

@HyukjinKwon
Member

@LuciferYang BTW, is this a real blocker? or test-only issue?

@LuciferYang
Contributor Author

@LuciferYang BTW, is this a real blocker? or test-only issue?

hmm... If we accept that Maven tests inevitably fail, then this is not a blocker

@LuciferYang
Contributor Author

@HyukjinKwon So I don't think this is a real blocker, but we should at least make the Maven tests pass in master/branch-3.5, even if it means removing these two cases.

@LuciferYang
Contributor Author

I will merge this pr today if there are no more comments. @rangadi @HyukjinKwon @hvanhovell @zhengruifeng @pan3793 @grundprinzip @zhenlineo

LuciferYang added a commit that referenced this pull request Aug 29, 2023
…o` uber jar to test the `connect` module


Closes #42236 from LuciferYang/protobuf-test.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit df63adf)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
@LuciferYang
Contributor Author

Merged into master and branch-3.5 to fix the Maven tests for the connect-server module.

@LuciferYang
Contributor Author

Thanks all

@MaxGekk
Member

MaxGekk commented Aug 31, 2023

After these changes, I cannot compile Spark as a Maven project in IntelliJ IDEA anymore. It fails with the following error. Any idea how to fix the issue?

```
java: package org.sparkproject.spark_protobuf.protobuf does not exist
```

@LuciferYang
Contributor Author

@MaxGekk I thought about it for a while, and I have no better way to solve this issue; indeed, it's impossible to make it perfect. I have decided to revert this pr and ignore the corresponding two test cases.
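A minimal sketch of what ignoring the two cases could look like in ScalaTest (illustrative; the actual follow-up change may differ):

```scala
// In PlanGenerationTestSuite: switch the two cases from test(...) to ignore(...)
// until Maven testing of spark-protobuf in the connect module is sorted out.
ignore("from_protobuf messageClassName") {
  // TODO: re-enable once SBT and Maven can share a working test setup.
}

ignore("from_protobuf messageClassName options") {
  // TODO: re-enable once SBT and Maven can share a working test setup.
}
```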

LuciferYang added a commit that referenced this pull request Aug 31, 2023
…ark-proto` uber jar to test the `connect` module"

### What changes were proposed in this pull request?
This reverts commit df63adf.

### Why are the changes needed?
As [reported](#42236 (comment)) by MaxGekk, the solution for #42236 is not perfect, and it breaks the usability of importing Spark as a Maven project into IDEA. On the other hand, if `mvn clean test` is executed, test failures will also occur like
```
[ERROR] [Error] /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46:  error: package org.sparkproject.spark_protobuf.protobuf does not exist
```
Therefore, this pr will revert the change of SPARK-43646, and `from_protobuf messageClassName` and `from_protobuf messageClassName options` in `PlanGenerationTestSuite` will be ignored in a follow-up.

At present, it is difficult to make the Maven testing of the `spark-protobuf` functionality in the `connect` module work well in all scenarios.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42746 from LuciferYang/Revert-SPARK-43646.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 723a0aa)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
LuciferYang added a commit that referenced this pull request Sep 1, 2023
…from_protobuf messageClassName options` in `PlanGenerationTestSuite`

### What changes were proposed in this pull request?
This pr aims to ignore `from_protobuf messageClassName` and `from_protobuf messageClassName options` in `PlanGenerationTestSuite` and remove the related golden files; after this change, `from_protobuf_messageClassName` and `from_protobuf_messageClassName_options` in `ProtoToParsedPlanTestSuite` will be ignored too.

### Why are the changes needed?
SPARK-43646 (#42236) makes both Maven and SBT use the shaded `spark-protobuf` module when testing the connect module; this allows `mvn clean install` and `mvn package test` to pass the tests. But if `mvn clean test` is executed directly, the error `package org.sparkproject.spark_protobuf.protobuf does not exist` will occur. This is because `mvn clean test` uses the classes directory of the `spark-protobuf` module directly for testing, without the `package` phase, hence protobuf is not shaded and relocated.

On the other hand, the change of SPARK-43646 breaks the usability of importing Spark as a Maven project into IDEA(#42236 (comment)).

So #42746 reverts the change of [SPARK-43646](https://issues.apache.org/jira/browse/SPARK-43646).

It's difficult to find a perfect solution to these Maven test issues now, as in certain scenarios the tests use the shaded `spark-protobuf` jar, such as with `mvn package test`, while in other scenarios they use the unshaded classes directory, such as with `mvn clean test`. So this pr ignores the relevant tests first and leaves a TODO (SPARK-45030) to re-enable these tests when we find a better solution.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42751 from LuciferYang/SPARK-45029.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>