Build: Move Spark version modules under spark directory #3256
Conversation
Thanks for the PR! There are a few aspects I'd like to discuss:
My original thought in #3237 was that we can define the rule that the shared code can continue to live, and any code that cannot build against all versions should then be refactored into different version modules. The intention is to minimize duplicated code across version modules, but I don't have enough context yet to know if this is worth the effort; maybe after refactoring the current code to build across Spark 2 and 3, we won't have much shared code anyway.
In #3237 I have the concept of a build version and a source version. The definition is that the build version is the actual Spark version we build against, and the source version is the smallest version that remains forward compatible until we create the next source version. So in this case, we will have source version v2.4 that can build Spark 2.4.x, and source version v3.0 that can build Spark 3.0.x and 3.1.x. And we will have a v3.2 for build versions 3.2.x and forward, until we can no longer build against the v3.2 source. The intention is again to minimize duplicated code by minimizing the number of version folders.
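To make the distinction concrete, here is a small hedged sketch of the proposed mapping; the version lists and variable names below are my own illustration, not configuration from the PR:

```groovy
// Illustrative only: each "source version" (the directory we compile) maps to the
// "build versions" (actual Spark releases) it is expected to build against.
def sourceToBuildVersions = [
    'v2.4': ['2.4.x'],
    'v3.0': ['3.0.x', '3.1.x'],     // v3.0 source stays forward compatible with 3.1.x
    'v3.2': ['3.2.x and forward']   // until something no longer builds against v3.2 source
]

sourceToBuildVersions.each { source, builds ->
  println "source ${source} -> build versions ${builds.join(', ')}"
}
```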
In #3237 I had another question that is just not yet shown in this PR: for example, for v3.2, what would the package names be? Also, combining this with the Scala version consideration, if we support that, should we also publish artifacts for all Scala versions we support, like Spark does?
throw new GradleException("Found unsupported Spark versions: " + (sparkVersions - knownSparkVersions))
}

if (sparkVersions.contains("3.0")) {
Do we even need to add 3.1 module? Seems like we should be fine with 3.0 only unless someone in the community wants to leverage some 3.1 APIs?
I'd probably just add 3.2 next and focus on that version.
+1 for just adding 3.2 next; that's why I was trying to introduce the concept of build and source versions.
Let me think about this a bit more. I think that we should be able to run the tests and integration tests against 3.1. I think the easy way to do that is to introduce a 3.1 build directory. Maybe it can share source with 3.0, though? I just want to avoid having extra ways of calling tests for just one build of Spark. Ideally, we'd add one set of actions to test each version of Spark individually.
I am fine either way.
*/

project(':iceberg-spark') {
  configurations.all {
Why do we need all these dependencies if the same is also defined for each specific version?
Is it supposed to include common configs?
Or is it just temporary and will be removed once we move the code into subfolders?
I didn't change the individual builds, other than to remove the Spark 3.1 parts of the Spark 3.0 build. We can refactor these as needed once the change is done.
But, I still think it's a good trade-off to avoid trying to share code. The version numbers may change across releases, or maybe Spark starts shading something and we no longer need to ship it. I think we should not worry about what's in some common config and just update the individual builds as needed.
One more thing: I don't think that this common Spark module needs to exist in the long term. For Spark 3.2, I think we should move all the code in iceberg-spark into the 3.2 build.
spark/build.gradle
}

// add enabled Spark version modules to the build
project.ext.defaultSparkVersions = "2.4,3.0"
Similar to the question above, can we just keep this part in this particular build.gradle?
Since we have parameterization in settings.gradle, do we really need this build file once the code is moved into subfolders?
We can probably share the default versions and known versions. But we do need the if statements here to load the correct build files when the projects are defined.
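For context, a minimal sketch of what that conditional loading in settings.gradle might look like; the project paths, directory names, and property name are illustrative assumptions, not the exact ones from this PR:

```groovy
// settings.gradle sketch (illustrative): only include a Spark version's projects
// when that version is enabled via a comma-separated system property.
String sparkVersionsProp = System.getProperty("sparkVersions", "2.4,3.0")
List<String> sparkVersions = sparkVersionsProp.split(",").collect { it.trim() }

if (sparkVersions.contains("3.0")) {
  include ':iceberg-spark:spark-3.0'
  project(':iceberg-spark:spark-3.0').projectDir = file('spark/v3.0/spark')
  project(':iceberg-spark:spark-3.0').name = 'iceberg-spark-3.0'
}
```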
I was confused initially, as I did not realize we still kept the common module. The more I look at it, the more it seems we should get rid of it.
I moved the known and default Spark version lists into gradle.properties so that they are shared.
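As a rough illustration of sharing the lists through gradle.properties (the property names here are assumptions, not necessarily the ones used in the PR):

```
# gradle.properties sketch -- names are illustrative assumptions.
# The systemProp. prefix exposes the values as JVM system properties, so both
# settings.gradle and build.gradle can read them with System.getProperty(...).
systemProp.knownSparkVersions=2.4,3.0
systemProp.defaultSparkVersions=2.4,3.0
```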
As for getting rid of the common module, I'm all for that. I just want to make this as small as possible.
While I am generally against sharing code between any versions of Spark, I am 50/50 here. We could probably have a common module for very basic things, but then we won't compile that code against the exact Spark version we will be running with. Since the main development will be done against the most recent version, I'd slightly prefer duplicating the classes, as then we won't have to reason about how public and stable the Spark APIs used in that common module are.
I'd vote for no compatibility between minor versions, only patch releases. We tried that in the past with 3.0 and 3.1, and it slowed down development and made life much harder. We are using Spark APIs that change frequently. I'd say having a module for each minor Spark version would be better. I think there will be major features in each upcoming minor Spark version that we will want to consume ASAP.
@RussellSpitzer @szehon-ho @flyrain, any thoughts?
settings.gradle
project(':pig').name = 'iceberg-pig'
project(':hive-metastore').name = 'iceberg-hive-metastore'
project(':nessie').name = 'iceberg-nessie'

String defaultSparkVersions = "2.4,3.0"
Looks like we are defining this both as a local variable and as a project property? Also, should we store this as a list? It seems a little odd that we split it to turn it into a list immediately after.
We need to look into how to deduplicate this. And we can avoid parsing it as well.
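A small sketch of the kind of cleanup being discussed here; this is hypothetical, not code from the PR, and the property names are assumptions:

```groovy
// Hypothetical cleanup: parse the comma-separated versions once and share only the
// parsed list, instead of keeping both a local String and a separate project property.
List<String> sparkVersions = System.getProperty("sparkVersions", "2.4,3.0")
    .split(",")
    .collect { it.trim() }

// Exposing the list via gradle.ext lets build scripts reuse it without re-splitting
// the string; the extension name is an assumption.
gradle.ext.sparkVersions = sparkVersions
```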
I think that compiling against different versions of Spark is beginning to be a significant risk due to changes across versions. That's only going to get worse when we start adding flags for the Scala version as well and continue to add more extensions that touch Spark internals. I don't think it's a good idea to try to refactor common classes out unless they don't have a compile dependency on Spark, and in that case we can move them to core like the
I think we're actually better off the other way. What we have now does some pretty crazy things to deal with multiple Spark versions and I think we'll end up with cleaner code if we just duplicate it. This is one of the reasons why I think we want these build changes.
Yes, but as a follow-up. This is just a PR to put a basic structure in place and we will continue to improve it in future commits.
We don't need to decide this now, since this doesn't introduce new modules yet. I'd like to focus on getting this in with minimal changes since it's already so large!
We should probably hold off on merging any other Spark-related PRs until this is in.
Yeah, since we're moving files around, this hits merge conflicts really quickly. We should try to get consensus around the file moves and structure as soon as possible and then get this in. That's partly why I wanted to make this first commit do as little as possible. We can add more refactoring later, but it's easiest to do this as multiple focused commits with only one big file move!
I don't have any additional questions. I agree that the risk of keeping a common Spark module looks high, especially with the Scala extensions; we can start 3.2 clean without a dependency on the common module and begin working on getting rid of it completely. And given Spark's current tendency toward backwards incompatibility across minor versions, a direct 1:1 source-to-build correspondence is good enough. Since we need to merge this PR ASAP to avoid conflicts piling up, I will defer some discussions to subsequent PRs.
Force-pushed from 3096147 to d2e3269.
Looks great, @rdblue! Thanks!
I assume the next task would be to get rid of the common Spark module?
Yeah, removing the common module is a good next step. Or just leaving it and moving on to a build for Spark 3.2. We should also decide how we want to handle 3.1. I'm leaning toward creating a module for it so we can possibly backport support for metadata columns, remove some of the reflection code, and eventually support a newer version of Scala.
A separate 3.1 sounds reasonable to me. It will follow the new naming, right? With the Spark and Scala versions in it?
Yes, I think it should follow the new naming conventions so that we can add the Scala version to it.
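For illustration only, since the exact names were not settled in this thread, a build.gradle snippet for a hypothetical Spark 3.1 module that embeds both versions in the artifact name could look like this (module, property names, and versions are assumptions):

```groovy
// Hypothetical sketch: produce an artifact named something like
// iceberg-spark-3.1_2.12-<iceberg-version>.jar.
String sparkMajorVersion = '3.1'
String scalaVersion = System.getProperty('scalaVersion', '2.12')

apply plugin: 'java'
archivesBaseName = "iceberg-spark-${sparkMajorVersion}_${scalaVersion}"
```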
Merged! Thanks for the reviews, @jackye1995 and @aokolnychyi!
This is another option for refactoring the Spark build. This is similar to #3237, but it also moves Spark projects into separate build.gradle files and allows building multiple Spark versions in parallel.
This removes Spark 3.1 testing from the Spark 3.0 project because 3.0 will be copied to produce a 3.1 build. Otherwise, Spark builds are unchanged to keep changes small.
There should also be further refactoring after these build changes to copy the contents of iceberg-spark into each version to ensure all classes are compiled against the correct version of Spark and Scala. This is the new layout:
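Roughly, a per-version structure along these lines; this is a hypothetical sketch consistent with the description above, and the directory names are illustrative assumptions rather than the PR's exact tree:

```
spark/
  build.gradle        <- shared Spark build logic
  v2.4/
    build.gradle      <- Spark 2.4 modules
  v3.0/
    build.gradle      <- Spark 3.0 modules (extensions, runtime, ...)
```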
Work to do after this:
- iceberg-spark: move code into version-specific modules
- spark3-extensions
- jmh.gradle