
Build: Move Spark version modules under spark directory #3256

Merged · 6 commits from refactor-spark-build into apache:master · Oct 15, 2021

Conversation

@rdblue (Contributor) commented Oct 8, 2021:

This is another option for refactoring the Spark build. This is similar to #3237, but it also moves Spark projects into separate build.gradle files and allows building multiple Spark versions in parallel.

This removes Spark 3.1 testing from the Spark 3.0 project because 3.0 will be copied to produce a 3.1 build. Otherwise, Spark builds are unchanged to keep changes small.

There should also be further refactoring after these build changes to copy the contents of iceberg-spark into each version to ensure all classes are compiled against the correct version of Spark and Scala.

This is the new layout:

├── spark
│   ├── v2.4
│   │   ├── spark2
│   │   └── spark-runtime
│   └── v3.0
│       ├── spark3
│       │   └── src
│       ├── spark3-extensions
│       │   └── src
│       └── spark3-runtime
│           └── src
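Assuming this layout, the wiring in settings.gradle might look roughly like the sketch below. This is illustrative only; the module names and paths are assumptions, not taken verbatim from the PR.

```groovy
// Hypothetical settings.gradle fragment: include each version's modules
// only when that Spark version is enabled, and point the project at its
// directory under spark/. Names and paths are illustrative.
List<String> sparkVersions = "2.4,3.0".split(",") as List

if (sparkVersions.contains("2.4")) {
  include ':iceberg-spark:spark2'
  project(':iceberg-spark:spark2').projectDir = file('spark/v2.4/spark2')
}

if (sparkVersions.contains("3.0")) {
  include ':iceberg-spark:spark3'
  project(':iceberg-spark:spark3').projectDir = file('spark/v3.0/spark3')
}
```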

Work to do after this:

  • Refactor iceberg-spark and move into version-specific modules
  • Add a v3.1 build
  • Remove modules that are no longer needed, like spark3-extensions
  • Apply similar refactoring to jmh.gradle

@rdblue changed the title from "Build: Move Spark version modules under spark directory." to "Build: Move Spark version modules under spark directory" on Oct 8, 2021
@jackye1995 (Contributor):

Thanks for the PR! There are a few aspects I'd like to discuss:

  1. should we continue to keep the existing iceberg-spark module as a common module across all versions

My original thought in #3237 was that we could define the rule that shared code can continue to live in the common module, and any code that cannot build against all versions should be refactored into version-specific modules. The intention is to minimize duplicated code across version modules, but I don't have enough context yet to know if this is worth the effort; maybe after refactoring the current code to build across Spark 2 and 3, we won't have much shared code anyway.

  2. should we have a v3.1 module if both 3.0 and 3.1 can be built against the 3.0 source

In #3237 I have the concept of a build version and source version.

The definition: the build version is the actual Spark version we build against, and the source version is the smallest version that remains forward compatible until we create the next source version.

So in this case, we would have source version v2.4 that builds Spark 2.4.x, and source version v3.0 that builds Spark 3.0.x and 3.1.x. We would then add a v3.2 source version for build versions 3.2.x and onward, until something no longer builds against the v3.2 source.

The intention is again to minimize duplicated code by minimizing the number of version folders.

  3. How would we handle the Scala version in this architecture?

In #3237 I had another scalaVersion system property to allow setting a flexible Scala version. Do you think that is something worth adding? It doesn't need to be refactored in this PR, just want to know what you think about Scala versioning.
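For context, a scalaVersion switch like the one described could look something like this in a version module's build file. This is a sketch under assumptions: the property name follows the idea in #3237, and the dependency line is illustrative.

```groovy
// Hypothetical build.gradle fragment: pick the Scala suffix for Spark
// artifacts from a -DscalaVersion=... system property, defaulting to 2.12.
String scalaVersion = System.getProperty("scalaVersion", "2.12")

dependencies {
  compileOnly "org.apache.spark:spark-hive_${scalaVersion}:3.0.3"
}
```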

  4. What would be the naming scheme for modules after v3.0?

This is not yet shown in this PR. For example, for v3.2, would the module names be something like :iceberg-spark:spark32-runtime, :iceberg-spark:spark3-2-runtime, or something else?

Also, combining this with the Scala version consideration: if we support that, should we also publish artifacts for all supported Scala versions, like Spark does?

throw new GradleException("Found unsupported Spark versions: " + (sparkVersions - knownSparkVersions))
}

if (sparkVersions.contains("3.0")) {
Contributor:

Do we even need to add a 3.1 module? It seems like we should be fine with 3.0 only, unless someone in the community wants to leverage some 3.1 APIs.

I'd probably just add 3.2 next and focus on that version.

Contributor:

+1 for just adding 3.2 next; that's why I was trying to introduce the concept of build and source versions.

@rdblue (Contributor, Author) commented Oct 11, 2021:

Let me think about this a bit more. I think that we should be able to run the tests and integration tests against 3.1. The easy way to do that is to introduce a 3.1 build directory. Maybe it can share source with 3.0, though? I just want to avoid having extra ways of invoking tests for just one build of Spark. Ideally, we'd add one set of actions to test each version of Spark individually.

Contributor:

I am fine either way.

*/

project(':iceberg-spark') {
configurations.all {
Contributor:

Why do we need all these dependencies if the same is also defined for each specific version?

Contributor:

Is it supposed to include common configs?

Contributor:

Or is it just temporary and will be removed once we move the code into subfolders?

Contributor (Author):

I didn't change the individual builds, other than to remove the Spark 3.1 parts of the Spark 3.0 build. We can refactor these as needed once the change is done.

But, I still think it's a good trade-off to avoid trying to share code. The version numbers may change across releases, or maybe Spark starts shading something and we no longer need to ship it. I think we should not worry about what's in some common config and just update the individual builds as needed.

Contributor (Author):

One more thing: I don't think that this common Spark module needs to exist in the long term. For Spark 3.2, I think we should move all the code in iceberg-spark into the 3.2 build.

}

// add enabled Spark version modules to the build
project.ext.defaultSparkVersions = "2.4,3.0"
Contributor:

Similar to the question above, can we just keep this part in this particular build.gradle?

Contributor:

Since we have parameterization in settings.gradle, do we really need this build file once the code is moved into subfolders?

Contributor (Author):

We can probably share the default versions and known versions. But we do need the if statements here to load the correct build files if the projects are defined.
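The pattern being described could be sketched roughly as follows; the paths and variable names are hypothetical, and the real build files may be organized differently.

```groovy
// Hypothetical root build.gradle fragment: apply a version's build file
// only when its projects were registered by settings.gradle.
if (sparkVersions.contains("2.4")) {
  apply from: file("${projectDir}/spark/v2.4/build.gradle")
}
if (sparkVersions.contains("3.0")) {
  apply from: file("${projectDir}/spark/v3.0/build.gradle")
}
```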

Contributor:

I was confused initially because I did not realize we still kept the common module. The more I look, the more it seems we should get rid of it.

Contributor (Author):

I moved the known and default Spark version lists into gradle.properties so that they are shared.

As for getting rid of the common module, I'm all for that. I just want to make this as small as possible.
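Sharing the lists through gradle.properties might look like the following; the key names are illustrative assumptions, not necessarily those used in the PR.

```properties
# Hypothetical gradle.properties entries. The systemProp. prefix makes
# Gradle expose them as system properties, so both settings.gradle and
# the root build can read them via System.getProperty(...).
systemProp.knownSparkVersions=2.4,3.0
systemProp.defaultSparkVersions=2.4,3.0
```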

@aokolnychyi (Contributor) commented Oct 11, 2021:

  1. should we continue to keep the existing iceberg-spark module as a common module across all versions

While I am generally against sharing code between any versions of Spark, I am 50/50 here. We could probably have a common module for very basic things, but then we won't compile that code against the exact Spark version we will be running with. Since the main development will be done against the most recent version, I'd slightly prefer duplicating the classes, as then we won't have to reason about how public and stable the Spark APIs used in that common module are.

The definition is that build version is the actual version we build Spark, and source version is the smallest version that would be forward compatible until we create the next source version.

I'd vote for no compatibility between minor versions, only patch releases. We tried that in the past with 3.0 and 3.1, and it slows down development and makes life much harder. We are using Spark APIs that change frequently. I'd say having a module for each minor Spark version would be better. I think there will be major features in each upcoming minor Spark version that we will want to consume ASAP.

@aokolnychyi (Contributor):

@RussellSpitzer @szehon-ho @flyrain, any thoughts?

settings.gradle Outdated
project(':pig').name = 'iceberg-pig'
project(':hive-metastore').name = 'iceberg-hive-metastore'
project(':nessie').name = 'iceberg-nessie'

String defaultSparkVersions = "2.4,3.0"
Member:

It looks like we are defining this both as a local variable and as a project property? Also, should we store this as a list? It seems a little odd that we split it to turn it into a list immediately afterward.

Contributor (Author):

We need to look into how to deduplicate this. And we can avoid parsing it as well.
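One way to deduplicate, sketched under assumptions (names are illustrative): define and parse the list once in settings.gradle, then share it with the root build through the Gradle object's extra properties instead of re-declaring it.

```groovy
// Hypothetical settings.gradle fragment: parse the version list once
// and expose it to the root build via gradle.ext.
List<String> sparkVersions =
    (System.getProperty("sparkVersions") ?: "2.4,3.0").split(",") as List
gradle.ext.sparkVersions = sparkVersions

// build.gradle can then read gradle.sparkVersions without re-parsing.
```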

@rdblue (Contributor, Author) commented Oct 12, 2021:

My original thought in #3237 was that we can define the rule that the shared code can continue to live, and any code that cannot build against all versions should then be refactored to different version modules.

I think that compiling against different versions of Spark is becoming a significant risk due to changes across versions. That's only going to get worse when we start adding flags for the Scala version as well and continue to add more extensions that touch Spark internals. I don't think it's a good idea to try to refactor common classes out unless they don't have a compile dependency on Spark, and in that case, we can move them to core like the Writer classes.

… The intention is again to minimize duplicated code by minimizing the number of version folders.

I think we're actually better off the other way. What we have now does some pretty crazy things to deal with multiple Spark versions and I think we'll end up with cleaner code if we just duplicate it. This is one of the reasons why I think we want these build changes.

In #3237 I had another scalaVersion system property to allow setting a flexible Scala version. Do you think that is something worth adding?

Yes, but as a follow-up. This is just a PR to put a basic structure in place and we will continue to improve it in future commits.

What would be the naming scheme for modules after v3.0?

We don't need to decide this now, since this doesn't introduce new modules yet. I'd like to focus on getting this in with minimal changes since it's already so large!

@aokolnychyi (Contributor):

We should probably hold off on merging any other Spark-related PRs until this is in.

@rdblue (Contributor, Author) commented Oct 12, 2021:

Yeah, it looks like since we're moving files around, this hits merge conflicts really quickly. We should try to get consensus around the file moves and structure as soon as possible and then get this in. That's partly why I wanted to make this first commit do as little as possible. We can add more refactoring later, but it's easiest to do this as multiple focused commits with only one big file move!

@jackye1995 (Contributor) left a comment:

I don't have any additional questions. I agree that the risk of keeping a common Spark module looks high, especially with the Scala extensions. We can start 3.2 clean without depending on the common module and begin working on getting rid of it completely. And given Spark's current tendency toward backwards incompatibility across minor versions, it's good enough to have a direct 1:1 source-to-build correspondence. Since we need to merge this PR ASAP to avoid conflicts piling up, I will defer some discussions to subsequent PRs.

@rdblue force-pushed the refactor-spark-build branch from 3096147 to d2e3269 on October 15, 2021.
@aokolnychyi (Contributor) left a comment:

Looks great, @rdblue! Thanks!

I assume the next task would be to get rid of the common Spark module?

@rdblue (Contributor, Author) commented Oct 15, 2021:

Yeah, removing the common module is a good next step. Or just leaving it and moving on to a build for Spark 3.2.

We should also decide how we want to handle 3.1. I'm leaning toward creating a module for it so we can possibly backport support for metadata columns, remove some of the reflection code, and eventually support a newer version of Scala.

@aokolnychyi (Contributor):

A separate 3.1 sounds reasonable to me. It will follow the new naming, right? With the Spark and Scala versions in it?

@rdblue (Contributor, Author) commented Oct 15, 2021:

Yes, I think it should follow new naming conventions so that we can add Scala version to it.

@rdblue merged commit 811af43 into apache:master on Oct 15, 2021.
@rdblue (Contributor, Author) commented Oct 15, 2021:

Merged! Thanks for the reviews, @jackye1995 and @aokolnychyi!

ajantha-bhat added a commit to ajantha-bhat/iceberg that referenced this pull request Oct 20, 2021
rdblue pushed a commit that referenced this pull request Oct 20, 2021
4 participants