feat: Create Spark 4 Client #3188
Conversation
@flyrain & @gh-yzou - This is the same thing as I did here (#3026) before I accidentally blew away my repo. I'd love to have a conversation about whether you would be fine merging this in, or whether you want to do the module separation first to clean it up. I know there is a user who wants this, and having a first version will help us be responsive to our users.
Pull request overview
This PR introduces support for Spark 4.0 by creating a new client that mirrors the existing Spark 3.5 implementation. The changes primarily involve copying and adapting existing code to work with Spark 4.0's interfaces and dependencies, with appropriate configuration changes to support Scala 2.13 exclusively for this version.
Key Changes:
- Created a complete Spark 4.0 client implementation in plugins/spark/v4.0/
- Added version-specific Scala configuration to support different Scala versions per Spark version (see the sketch after this list)
- Updated build configurations to handle ANTLR version conflicts between Spark/Delta dependencies
- Included comprehensive test suites (unit, integration, and regression tests)
- Added documentation and Docker setup for testing
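For illustration, the sketch below shows how a Gradle settings script could drive that version-specific Scala configuration from plugins/spark/spark-scala.properties. This is only a sketch: the property keys (sparkVersions, scalaVersions.<spark>) and the module-naming scheme are assumptions made for the example, not the exact code in this PR.

```kotlin
// settings.gradle.kts (sketch): derive Spark/Scala client modules from a properties file
import java.util.Properties

val sparkScalaProps =
  Properties().also { props ->
    rootDir.resolve("plugins/spark/spark-scala.properties").inputStream().use { props.load(it) }
  }

// Hypothetical keys, for example:
//   sparkVersions=3.5,4.0
//   scalaVersions.3.5=2.12,2.13
//   scalaVersions.4.0=2.13
val sparkVersions = sparkScalaProps.getProperty("sparkVersions").split(",").map { it.trim() }
for (sparkVersion in sparkVersions) {
  val scalaVersions =
    sparkScalaProps.getProperty("scalaVersions.$sparkVersion").split(",").map { it.trim() }
  for (scalaVersion in scalaVersions) {
    // One client module per Spark/Scala combination, e.g. :polaris-spark-4.0_2.13
    include(":polaris-spark-${sparkVersion}_$scalaVersion")
  }
}
```

With the properties file as the single source of truth, declaring Spark 4.0 with Scala 2.13 only becomes a one-line configuration change rather than a build-script edit.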
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| settings.gradle.kts | Added version-specific Scala version configuration |
| runtime/spark-tests/build.gradle.kts | Forced ANTLR 4.9.3 for Spark 3.5/Delta 3.3 compatibility |
| plugins/spark/v4.0/spark/* | Complete Spark 4.0 client implementation (catalog, REST, utilities) |
| plugins/spark/v4.0/integration/* | Integration tests for Spark 4.0 client |
| plugins/spark/v4.0/regtests/* | Regression test framework and scripts |
| plugins/spark/v3.5/integration/build.gradle.kts | Forced ANTLR 4.9.3 for consistency |
| plugins/spark/v3.5/README.md | New comprehensive documentation for Spark 3.5 |
| plugins/spark/v4.0/README.md | New comprehensive documentation for Spark 4.0 |
| plugins/spark/README.md | Updated to reference version-specific READMEs |
| plugins/spark/spark-scala.properties | Added Spark 4.0 with Scala 2.13 only |
| pluginlibs.versions.toml | Added Spark 4.0.1 version |
| CHANGELOG.md | Added entry for Spark 4.0 client |
Thanks a lot for working on it, Adam! This is a solid direction. Iceberg followed a very similar approach when introducing support for a new Spark version: copy the existing Spark client and adapt it accordingly. It keeps the implementation straightforward and reduces the risk of regressions, and we could delete the old versions once users have migrated to the new ones. One thing we might consider is preserving the original git commit history when copying the modules. That can make future debugging and backporting work easier, especially when comparing behaviors across Spark versions.
configurations.all {
  if (name != "checkstyle") {
    resolutionStrategy {
      // Spark 3.5 and Delta 3.3 require ANTLR 4.9.3
      force("org.antlr:antlr4-runtime:4.9.3")
    }
  }
}
Why do we have to force this version? What pulls in a different one, and why?
So, we have an issue where our integration tests rely on Spark 3.5. Spark 3.5 requires ANTLR 4.9.3, while Spark 4.0 relies on ANTLR 4.13.1.
This was one way to handle this, but there actually might be a cleaner way by putting this in pluginlibs.versions.toml. I'll try that.
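If it helps, here is a rough sketch of that version-catalog variant. It assumes an antlr-runtime version entry gets added to pluginlibs.versions.toml, so the key name and the generated pluginlibs.versions.antlr.runtime accessor are assumptions, not existing code:

```kotlin
// build.gradle.kts (sketch): pin ANTLR to the version declared once in pluginlibs.versions.toml
configurations.all {
  if (name != "checkstyle") {
    resolutionStrategy {
      // Spark 3.5 / Delta 3.3 expect ANTLR 4.9.3, while Spark 4.0 pulls in 4.13.1,
      // so force the runtime to the single catalog-managed version.
      force("org.antlr:antlr4-runtime:${pluginlibs.versions.antlr.runtime.get()}")
    }
  }
}
```

That keeps the pinned version next to the other Spark and Delta versions, so a future bump only touches the TOML file.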
If Spark dependencies clash with our main server (I assume this is the case), then leveraging the Apprunner is probably the cleanest solution. I'm fairly sure it was created for exactly this purpose (right, @snazy?). With Apprunner the server build is independent of the test environment, and the tests run the server in isolation, so they can put anything they need on the classpath without clashing with server dependencies.
I'm going to close this because of #3188 (comment), but I'll handle it after the first merge.
gh-yzou left a comment
Hi @adam-christian-software, thanks a lot for working on this! I took a quick look; I assume there are no actual source code changes, and most of the change is related to dependencies and READMEs, right?
plugins/spark/v4.0/spark/src/main/java/org/apache/polaris/spark/PolarisCatalog.java
plugins/spark/v4.0/spark/src/main/java/org/apache/polaris/spark/PolarisSparkCatalog.java
@@ -0,0 +1,83 @@
<!--
  Licensed to the Apache Software Foundation (ASF) under one
I don't see a getting-started for Spark 4.0. Do we plan to reuse the existing getting-started? Should we move it out into a common place?
Personally, I was thinking it made sense to move the getting-started out of the plugin and into the getting-started area where we keep the other docker-compose files.
What do you think?
For the Polaris client side, we mostly intend to treat it as an independent project; the only dependency we would like to keep on the main project is the interface part. I think it might be better to let it stay with the Polaris Spark plugin, but at the top level. What do you think?
Resolved conflict in plugins/spark/README.md by combining both the directory listing for Spark versions and the detailed documentation from upstream.
In order to preserve history from the previous client, I'm going to close this PR and open another one using a similar approach to:
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Context
This is the same change as #3026 but I accidentally destroyed my repo.
This change fixes #3021 by creating the Spark 4 client. Most of the changes simply copy what already exists and adapt it to match the proper interfaces.
I will carve out the overlapping business logic into a separate module later.
Test
Moved over the same testing framework that exists in the 3.5 plugin. In addition, I tested end-to-end with Parquet.
Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed). I need the updates in "test: Add Some Spark Client Tests and Update Documentation on Generic Tables" #3000 first.