Skip to content

Commit

Permalink
[SPARK-30601][BUILD] Add a Google Maven Central as a primary repository
Browse files Browse the repository at this point in the history
### What changes were proposed in this pull request?

This PR proposes to address four things. Three issues and fixes were a bit mixed so this PR sorts it out. See also http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html for the discussion in the mailing list.

1. Add the Google Maven Central mirror (GCS) as a primary repository. This will not only help development more stable but also in order to make Github Actions build (where it is always required to download jars) stable. In case of Jenkins PR builder, it wouldn't be affected too much as it uses the pre-downloaded jars under `.m2`.

    - Google Maven Central seems stable for heavy workload but not synced very quickly (e.g., new release is missing)
    - Maven Central (default) seems less stable but synced quickly.

    We already added this GCS mirror as a default additional remote repository at SPARK-29175. So I don't see an issue to add it as a repo.
    https://github.com/apache/spark/blob/abf759a91e01497586b8bb6b7a314dd28fd6cff1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2111-L2118

2. Currently, we have the hard-corded repository in [`sbt-pom-reader`](https://github.com/JoshRosen/sbt-pom-reader/blob/v1.0.0-spark/src/main/scala/com/typesafe/sbt/pom/MavenPomResolver.scala#L32) and this seems overwriting Maven's existing resolver by the same ID `central` with `http://` when initially the pom file is ported into SBT instance. This uses `http://` which latently Maven Central disallowed (see #27242)

    My speculation is that we just need to be able to load plugin and let it convert POM to SBT instance with another fallback repo. After that, it _seems_ using `central` with `https` properly. See also #27307 (comment).

    I double checked that we use `https` properly from the SBT build as well:

    ```
    [debug] downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom ...
    [debug] 	public: downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom
    [debug] 	public: downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom.sha1
    ```

    This was fixed by adding the same repo (#27281), `central_without_mirror`, which is a bit awkward. Instead, this PR adds GCS as a main repo, and community Maven central as a fallback repo. So, presumably the community Maven central repo is used when the plugin is loaded as a fallback.

3. While I am here, I fix another issue. Github Action at #27279 is being failed. The reason seems to be scalafmt 1.0.3 is in Maven central but not in GCS.

    ```
    org.apache.maven.plugin.PluginResolutionException: Plugin org.antipathy:mvn-scalafmt_2.12:1.0.3 or one of its dependencies could not be resolved: Could not find artifact org.antipathy:mvn-scalafmt_2.12:jar:1.0.3 in google-maven-central (https://maven-central.storage-download.googleapis.com/repos/central/data/)
        at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolve     (DefaultPluginDependenciesResolver.java:131)
    ```

   `mvn-scalafmt` exists in Maven central:

    ```bash
    $ curl https://repo.maven.apache.org/maven2/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom
    ```

    ```xml
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        ...
    ```

    whereas not in GCS mirror:

    ```bash
    $ curl https://maven-central.storage-download.googleapis.com/repos/central/data/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom
    ```
    ```xml
    <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: maven-central/repos/central/data/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom</Details></Error>%
    ```

    In this PR, simply make both repos accessible by adding to `pluginRepositories`.

4. Remove the workarounds in Github Actions to switch mirrors because now we have same repos in the same order (Google Maven Central first, and Maven Central second)

### Why are the changes needed?

To make the build and Github Action more stable.

### Does this PR introduce any user-facing change?

No, dev only change.

### How was this patch tested?

I roughly checked local and PR against my fork (HyukjinKwon#2 and HyukjinKwon#3).

Closes #27307 from HyukjinKwon/SPARK-30572.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
  • Loading branch information
HyukjinKwon committed Jan 23, 2020
1 parent db528e4 commit cd9ccdc
Show file tree
Hide file tree
Showing 3 changed files with 27 additions and 16 deletions.
8 changes: 0 additions & 8 deletions .github/workflows/master.yml
Original file line number Diff line number Diff line change
Expand Up @@ -66,14 +66,6 @@ jobs:
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN"
export MAVEN_CLI_OPTS="--no-transfer-progress"
mkdir -p ~/.m2
# `Maven Central` is too flaky in terms of downloading artifacts in `GitHub Action` environment.
# `Google Maven Central Mirror` is too slow in terms of sycing upstream. To get the best combination,
# 1) we set `Google Maven Central` as a mirror of `central` in `GitHub Action` environment only.
# 2) we duplicates `Maven Central` in pom.xml with ID `central_without_mirror`.
# In other words, in GitHub Action environment, `central` is mirrored by `Google Maven Central` first.
# If `Google Maven Central` doesn't provide the artifact due to its slowness, `central_without_mirror` will be used.
# Note that we aim to achieve the above while keeping the existing behavior of non-`GitHub Action` environment unchanged.
echo "<settings><mirrors><mirror><id>google-maven-central</id><name>GCS Maven Central mirror</name><url>https://maven-central.storage-download.googleapis.com/repos/central/data/</url><mirrorOf>central</mirrorOf></mirror></mirrors></settings>" > ~/.m2/settings.xml
./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pmesos -Pkubernetes -Phive -P${{ matrix.hive }} -Phive-thriftserver -P${{ matrix.hadoop }} -Phadoop-cloud -Djava.version=${{ matrix.java }} install
rm -rf ~/.m2/repository/org/apache/spark
Expand Down
32 changes: 24 additions & 8 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -246,10 +246,13 @@
</properties>
<repositories>
<repository>
<id>central</id>
<!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution -->
<name>Maven Repository</name>
<url>https://repo.maven.apache.org/maven2</url>
<id>gcs-maven-central-mirror</id>
<!--
Google Mirror of Maven Central, placed first so that it's used instead of flaky Maven Central.
See https://storage-download.googleapis.com/maven-central/index.html
-->
<name>GCS Maven Central mirror</name>
<url>https://maven-central.storage-download.googleapis.com/repos/central/data/</url>
<releases>
<enabled>true</enabled>
</releases>
Expand All @@ -258,12 +261,10 @@
</snapshots>
</repository>
<repository>
<id>central_without_mirror</id>
<!--
This is used as a fallback when a mirror to `central` fail.
For example, when we use Google Maven Central in GitHub Action as a mirror of `central`,
this will be used when Google Maven Central is out of sync due to its late sync cycle.
This is used as a fallback when the first try fails.
-->
<id>central</id>
<name>Maven Repository</name>
<url>https://repo.maven.apache.org/maven2</url>
<releases>
Expand All @@ -275,6 +276,21 @@
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>gcs-maven-central-mirror</id>
<!--
Google Mirror of Maven Central, placed first so that it's used instead of flaky Maven Central.
See https://storage-download.googleapis.com/maven-central/index.html
-->
<name>GCS Maven Central mirror</name>
<url>https://maven-central.storage-download.googleapis.com/repos/central/data/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</pluginRepository>
<pluginRepository>
<id>central</id>
<url>https://repo.maven.apache.org/maven2</url>
Expand Down
3 changes: 3 additions & 0 deletions project/SparkBuild.scala
Original file line number Diff line number Diff line change
Expand Up @@ -224,6 +224,9 @@ object SparkBuild extends PomBuild {

// Override SBT's default resolvers:
resolvers := Seq(
// Google Mirror of Maven Central, placed first so that it's used instead of flaky Maven Central.
// See https://storage-download.googleapis.com/maven-central/index.html for more info.
"gcs-maven-central-mirror" at "https://maven-central.storage-download.googleapis.com/repos/central/data/",
DefaultMavenRepository,
Resolver.mavenLocal,
Resolver.file("local", file(Path.userHome.absolutePath + "/.ivy2/local"))(Resolver.ivyStylePatterns)
Expand Down

0 comments on commit cd9ccdc

Please sign in to comment.