
Build: Move Spark version modules under spark directory #3256

Merged · 6 commits from refactor-spark-build into apache:master · Oct 15, 2021

Conversation

@rdblue (Contributor) commented Oct 8, 2021:

This is another option for refactoring the Spark build. This is similar to #3237, but it also moves Spark projects into separate build.gradle files and allows building multiple Spark versions in parallel.

This removes Spark 3.1 testing from the Spark 3.0 project because 3.0 will be copied to produce a 3.1 build. Otherwise, Spark builds are unchanged to keep changes small.

There should also be further refactoring after these build changes to copy the contents of iceberg-spark into each version to ensure all classes are compiled against the correct version of Spark and Scala.

This is the new layout:

├── spark
│   ├── v2.4
│   │   ├── spark2
│   │   └── spark-runtime
│   └── v3.0
│       ├── spark3
│       │   └── src
│       ├── spark3-extensions
│       │   └── src
│       └── spark3-runtime
│           └── src
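Assuming this layout, the wiring in settings.gradle might look roughly like the sketch below. This is illustrative only; the module names and paths are assumptions, not taken verbatim from the PR.

```groovy
// Hypothetical settings.gradle fragment: include each version's modules
// only when that Spark version is enabled, and point the project at its
// directory under spark/. Names and paths are illustrative.
List<String> sparkVersions = "2.4,3.0".split(",") as List

if (sparkVersions.contains("2.4")) {
  include ':iceberg-spark:spark2'
  project(':iceberg-spark:spark2').projectDir = file('spark/v2.4/spark2')
}

if (sparkVersions.contains("3.0")) {
  include ':iceberg-spark:spark3'
  project(':iceberg-spark:spark3').projectDir = file('spark/v3.0/spark3')
}
```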

Work to do after this:

  • Refactor iceberg-spark and move into version-specific modules
  • Add a v3.1 build
  • Remove modules that are no longer needed, like spark3-extensions
  • Apply similar refactoring to jmh.gradle

@rdblue changed the title from "Build: Move Spark version modules under spark directory." to "Build: Move Spark version modules under spark directory" on Oct 8, 2021
@jackye1995 (Contributor):

Thanks for the PR! There are a few aspects I'd like to discuss:

  1. should we continue to keep the existing iceberg-spark module as a common module across all versions

My original thought in #3237 was that we could define the rule that shared code can continue to live in the common module, and any code that cannot build against all versions should be refactored into version-specific modules. The intention is to minimize duplicated code across version modules, but I don't have enough context yet to know if this is worth the effort; maybe after refactoring the current code to build across Spark 2 and 3, we won't have much shared code anyway.

  2. should we have a v3.1 module if both 3.0 and 3.1 can be built against the 3.0 source

In #3237 I have the concept of a build version and source version.

The definition: the build version is the actual Spark version we build against, and the source version is the smallest version that remains forward compatible until we create the next source version.

So in this case, we would have source version v2.4 that builds Spark 2.4.x, and source version v3.0 that builds Spark 3.0.x and 3.1.x. We would then add a v3.2 source version for build versions 3.2.x and onward, until something no longer builds against the v3.2 source.

The intention is again to minimize duplicated code by minimizing the number of version folders.

  3. How would we handle the Scala version in this architecture?

In #3237 I had another scalaVersion system property to allow setting a flexible Scala version. Do you think that is something worth adding? It doesn't need to be refactored in this PR, just want to know what you think about Scala versioning.
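For context, a scalaVersion switch like the one described could look something like this in a version module's build file. This is a sketch under assumptions: the property name follows the idea in #3237, and the dependency line is illustrative.

```groovy
// Hypothetical build.gradle fragment: pick the Scala suffix for Spark
// artifacts from a -DscalaVersion=... system property, defaulting to 2.12.
String scalaVersion = System.getProperty("scalaVersion", "2.12")

dependencies {
  compileOnly "org.apache.spark:spark-hive_${scalaVersion}:3.0.3"
}
```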

  4. What would be the naming scheme for modules after v3.0?

This is not yet shown in this PR. For example, for v3.2, would the module names be something like :iceberg-spark:spark32-runtime, :iceberg-spark:spark3-2-runtime, or something else?

Also, combining this with the Scala version consideration: if we support that, should we also publish artifacts for all supported Scala versions, like Spark does?

throw new GradleException("Found unsupported Spark versions: " + (sparkVersions - knownSparkVersions))
}

if (sparkVersions.contains("3.0")) {
Contributor:

Do we even need to add a 3.1 module? It seems like we should be fine with 3.0 only, unless someone in the community wants to leverage some 3.1 APIs.

I'd probably just add 3.2 next and focus on that version.

Contributor:

+1 for just adding 3.2 next; that's why I was trying to introduce the concept of build and source versions.

@rdblue (Contributor, Author) commented Oct 11, 2021:

Let me think about this a bit more. I think that we should be able to run the tests and integration tests against 3.1. The easy way to do that is to introduce a 3.1 build directory. Maybe it can share source with 3.0, though? I just want to avoid having extra ways of invoking tests for just one build of Spark. Ideally, we'd add one set of actions to test each version of Spark individually.

Contributor:

I am fine either way.

*/

project(':iceberg-spark') {
configurations.all {
Contributor:

Why do we need all these dependencies if the same is also defined for each specific version?

Contributor:

Is it supposed to include common configs?

Contributor:

Or is it just temporary and will be removed once we move the code into subfolders?

Contributor (Author):

I didn't change the individual builds, other than to remove the Spark 3.1 parts of the Spark 3.0 build. We can refactor these as needed once the change is done.

But, I still think it's a good trade-off to avoid trying to share code. The version numbers may change across releases, or maybe Spark starts shading something and we no longer need to ship it. I think we should not worry about what's in some common config and just update the individual builds as needed.

Contributor (Author):

One more thing: I don't think that this common Spark module needs to exist in the long term. For Spark 3.2, I think we should move all the code in iceberg-spark into the 3.2 build.

}

// add enabled Spark version modules to the build
project.ext.defaultSparkVersions = "2.4,3.0"
Contributor:

Similar to the question above, can we just keep this part in this particular build.gradle?

Contributor:

Since we have parameterization in settings.gradle, do we really need this build file once the code is moved into subfolders?

Contributor (Author):

We can probably share the default versions and known versions. But we do need the if statements here to load the correct build files if the projects are defined.
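The pattern being described could be sketched roughly as follows; the paths and variable names are hypothetical, and the real build files may be organized differently.

```groovy
// Hypothetical root build.gradle fragment: apply a version's build file
// only when its projects were registered by settings.gradle.
if (sparkVersions.contains("2.4")) {
  apply from: file("${projectDir}/spark/v2.4/build.gradle")
}
if (sparkVersions.contains("3.0")) {
  apply from: file("${projectDir}/spark/v3.0/build.gradle")
}
```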

Contributor:

I was confused initially because I did not realize we still kept the common module. The more I look, the more it seems we should get rid of it.

Contributor (Author):

I moved the known and default Spark version lists into gradle.properties so that they are shared.

As for getting rid of the common module, I'm all for that. I just want to make this as small as possible.
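Sharing the lists through gradle.properties might look like the following; the key names are illustrative assumptions, not necessarily those used in the PR.

```properties
# Hypothetical gradle.properties entries. The systemProp. prefix makes
# Gradle expose them as system properties, so both settings.gradle and
# the root build can read them via System.getProperty(...).
systemProp.knownSparkVersions=2.4,3.0
systemProp.defaultSparkVersions=2.4,3.0
```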

@aokolnychyi (Contributor) commented Oct 11, 2021:

  1. should we continue to keep the existing iceberg-spark module as a common module across all versions

While I am generally against sharing code between any versions of Spark, I am 50/50 here. We could probably have a common module for very basic things, but then we won't compile that code against the exact Spark version we will be running with. Since the main development will be done against the most recent version, I'd slightly prefer duplicating the classes, as then we won't have to reason about how public and stable the Spark APIs used in that common module are.

The definition is that build version is the actual version we build Spark, and source version is the smallest version that would be forward compatible until we create the next source version.

I'd vote for no compatibility between minor versions, only patch releases. We tried that in the past with 3.0 and 3.1, and it slows down development and makes life much harder. We are using Spark APIs that change frequently. I'd say having a module for each minor Spark version would be better. I think there will be major features in each upcoming minor Spark version that we will want to consume ASAP.

@aokolnychyi (Contributor):

@RussellSpitzer @szehon-ho @flyrain, any thoughts?

settings.gradle Outdated
project(':pig').name = 'iceberg-pig'
project(':hive-metastore').name = 'iceberg-hive-metastore'
project(':nessie').name = 'iceberg-nessie'

String defaultSparkVersions = "2.4,3.0"
Member:

It looks like we are defining this both as a local variable and as a project property? Also, should we store this as a list? It seems a little odd that we split it to turn it into a list immediately afterward.

Contributor (Author):

We need to look into how to deduplicate this. And we can avoid parsing it as well.
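One way to deduplicate, sketched under assumptions (names are illustrative): define and parse the list once in settings.gradle, then share it with the root build through the Gradle object's extra properties instead of re-declaring it.

```groovy
// Hypothetical settings.gradle fragment: parse the version list once
// and expose it to the root build via gradle.ext.
List<String> sparkVersions =
    (System.getProperty("sparkVersions") ?: "2.4,3.0").split(",") as List
gradle.ext.sparkVersions = sparkVersions

// build.gradle can then read gradle.sparkVersions without re-parsing.
```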

@rdblue (Contributor, Author) commented Oct 12, 2021:

My original thought in #3237 was that we can define the rule that the shared code can continue to live, and any code that cannot build against all versions should then be refactored to different version modules.

I think that compiling against different versions of Spark is becoming a significant risk due to changes across versions. That's only going to get worse when we start adding flags for the Scala version as well and continue to add more extensions that touch Spark internals. I don't think it's a good idea to try to refactor common classes out unless they don't have a compile dependency on Spark, and in that case, we can move them to core like the Writer classes.

… The intention is again to minimize duplicated code by minimizing the number of version folders.

I think we're actually better off the other way. What we have now does some pretty crazy things to deal with multiple Spark versions and I think we'll end up with cleaner code if we just duplicate it. This is one of the reasons why I think we want these build changes.

In #3237 I had another scalaVersion system property to allow setting a flexible Scala version. Do you think that is something worth adding?

Yes, but as a follow-up. This is just a PR to put a basic structure in place and we will continue to improve it in future commits.

What would be the naming scheme for modules after v3.0?

We don't need to decide this now, since this doesn't introduce new modules yet. I'd like to focus on getting this in with minimal changes since it's already so large!

@aokolnychyi (Contributor):

We should probably hold off on merging any other Spark-related PRs until this is in.

@rdblue (Contributor, Author) commented Oct 12, 2021:

Yeah, it looks like since we're moving files around, this hits merge conflicts really quickly. We should try to get consensus around the file moves and structure as soon as possible and then get this in. That's partly why I wanted to make this first commit do as little as possible. We can add more refactoring later, but it's easiest to do this as multiple focused commits with only one big file move!

@jackye1995 (Contributor) left a comment:

I don't have any additional questions. I agree that the risk of keeping a common Spark module looks high, especially with the Scala extensions. We can start 3.2 clean without depending on the common module and begin working on getting rid of it completely. And given Spark's current tendency toward backwards incompatibility across minor versions, it's good enough to have a direct 1:1 source-to-build correspondence. Since we need to merge this PR ASAP to avoid conflicts piling up, I will defer some discussions to subsequent PRs.

@rdblue force-pushed the refactor-spark-build branch from 3096147 to d2e3269 on October 15, 2021.
@aokolnychyi (Contributor) left a comment:

Looks great, @rdblue! Thanks!

I assume the next task would be to get rid of the common Spark module?

@rdblue (Contributor, Author) commented Oct 15, 2021:

Yeah, removing the common module is a good next step. Or just leaving it and moving on to a build for Spark 3.2.

We should also decide how we want to handle 3.1. I'm leaning toward creating a module for it so we can possibly backport support for metadata columns, remove some of the reflection code, and eventually support a newer version of Scala.

@aokolnychyi (Contributor):

A separate 3.1 sounds reasonable to me. It will follow the new naming, right? With the Spark and Scala versions in it?

@rdblue (Contributor, Author) commented Oct 15, 2021:

Yes, I think it should follow new naming conventions so that we can add Scala version to it.

@rdblue merged commit 811af43 into apache:master on Oct 15, 2021.
@rdblue (Contributor, Author) commented Oct 15, 2021:

Merged! Thanks for the reviews, @jackye1995 and @aokolnychyi!

ajantha-bhat added a commit to ajantha-bhat/iceberg that referenced this pull request Oct 20, 2021
rdblue pushed a commit that referenced this pull request Oct 20, 2021
4 participants