[SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6 #22615
Conversation
Test build #96867 has finished for PR 22615 at commit
Sean, as a reminder, https://github.com/apache/spark/blob/master/dev/appveyor-install-dependencies.ps1#L98 should be kept in sync with https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1
Yep, already updated in the first commit actually. I'm not sure if that's why AppVeyor failed here. Let's see on another test.
Test build #96894 has finished for PR 22615 at commit
Test build #96930 has finished for PR 22615 at commit
Test build #96942 has finished for PR 22615 at commit
Test build #96961 has finished for PR 22615 at commit
dev/create-release/release-build.sh (Outdated)

```diff
@@ -288,7 +290,9 @@ if [[ "$1" == "package" ]]; then
   declare -A BINARY_PKGS_EXTRA
   BINARY_PKGS_EXTRA["hadoop2.7"]="withpip"
   if ! is_dry_run; then
-    BINARY_PKGS_EXTRA["hadoop2.6"]="withr"
+    if [[ $SPARK_VERSION < "3.0." ]]; then
+      BINARY_PKGS_EXTRA["hadoop2.6"]="withr"
```
Hmmm, now you're not building the R packaging on 3.0.
The old build was admittedly a bit odd. It only supported one "extra" arg, so it built 2.6+R and 2.7+PIP. It seems we need to change that now...
I see. Let me try to rewrite the script to support two args and attach both to 2.7.
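For illustration, a minimal sketch of that idea, assuming the "extra" value can simply carry both tokens (the separator and the loop below are illustrative, not the exact script):

```bash
# Sketch only: attach both extras to the hadoop2.7 binary package.
declare -A BINARY_PKGS_EXTRA
BINARY_PKGS_EXTRA["hadoop2.7"]="withpip,withr"

# The packaging step can then test for each token independently.
for key in "${!BINARY_PKGS_EXTRA[@]}"; do
  extras="${BINARY_PKGS_EXTRA[$key]}"
  [[ $extras == *"withpip"* ]] && echo "$key: also build the Python (pip) package"
  [[ $extras == *"withr"* ]] && echo "$key: also build the R package"
done
```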
What is the change in the app-... file?
That app-... file had classpath-like references to hadoop-...-2.6.5 jars. It may not matter, but I updated them. That should be the only difference.
It also has references to a bunch of other old stuff; I don't think there's a need to change it.
I reverted the spark-events test file too.
```bash
BUILD_PACKAGE=$3
SCALA_VERSION=$4

if [[ $BUILD_PACKAGE == *"withpip"* ]]; then
```
@vanzin what do you think of this approach? It simplifies the logic below too, avoiding repeating the main build step 3 times.
Looks fine. Using wildcards is a little weird but I guess that's the cleanest way in bash.
But shouldn't you initialize PIP_FLAG and R_FLAG to empty before these checks?
one caveat is I'm not sure we have tested building both Python and R in "one build".
this could be a good thing, but if I recall, the R build changes some of the binary files under R that get shipped in the "source release" (these are required R object files)
Test build #97054 has finished for PR 22615 at commit
Test build #97105 has finished for PR 22615 at commit
Test build #97116 has finished for PR 22615 at commit
Have you pinged @shaneknapp about removing the 2.6 jobs for the master branch before pushing this?
I guess we've just pinged @shaneknapp! But I figured the jobs would simply fail and could be removed at leisure. Yes, this mechanism is a little weird, but it may be the simplest thing here. I can add explicit init of those two flags to an empty string, but an unset var is just an empty string anyway.
consider me pinged. ;) i will need to do some refactoring in the jenkins job builder configs for this, so we'll definitely need to coordinate before this is merged. most likely i won't have much time until next week (risecamp will be taking all of my time wed-fri), but i'll see if i can't at least get an initial PR on this stuff by EOD tomorrow (oct 9th). @JoshRosen for a heads up on the forthcoming job config builder changes.
@srowen sure, manually removing the failing jobs is one option... but since we auto-generate the job configs, any time we add a new branch they'll come back. i'd much rather do this the right way. :)
The issue is that if you call that function multiple times the variables might be set by a previous call.
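A small illustration of that pitfall (the function and flag names below are hypothetical, not the actual script): without an explicit reset, the non-local flags keep whatever the previous call set, because the wildcard checks only ever add.

```bash
make_pkg() {
  local build_package="$1"
  # Reset before the checks; otherwise values leak across calls,
  # since these variables are not local to the function.
  PIP_FLAG=""
  R_FLAG=""
  [[ $build_package == *"withpip"* ]] && PIP_FLAG="--pip"
  [[ $build_package == *"withr"* ]] && R_FLAG="--r"
  echo "building with flags: '$PIP_FLAG' '$R_FLAG'"
}

make_pkg "withpip,withr"   # building with flags: '--pip' '--r'
make_pkg "plainbuild"      # '' '' -- only because of the explicit reset above
```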
Hmmm... just want to raise a possible issue that maybe, just maybe, we should be hosting those jenkins configs in a repository that is owned by the ASF and writable by all Spark committers. Or even as a directory under the Spark repo itself (and use them always from master). Just a thought.
@vanzin i'm not opposed to hosting these configs somewhere else. @JoshRosen did this a few years back just to "get shit done"... i'd be leery of putting this in to the main spark repo, however, as only a very, very, very small subset of people (consisting mostly of myself) should actually ever touch this stuff.
Yeah, this does need to be in a public repo. apache/spark-jenkins-configurations or something. We can ask INFRA to create them. But I'm not against just putting them in dev/ or something in the main repo. It's not much, right? And we already host all the release scripts there, which maybe 5 people are interested in right now.
@srowen fair 'nuf... i'll create a jira for this tomorrow and we can hash out final design shite there (rather than overloading this PR). :)
I want to see the configurations .. |
Test build #97137 has finished for PR 22615 at commit
I think we all know enough not to make changes (merge changes) to these configs; should be safe.
```diff
@@ -71,7 +71,7 @@ class HadoopTableReader(
     // Hadoop honors "mapreduce.job.maps" as hint,
     // but will ignore when mapreduce.jobtracker.address is "local".
-    // https://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/
+    // https://hadoop.apache.org/docs/r2.7.6/hadoop-mapreduce-client/hadoop-mapreduce-client-core/
```
how come this is 2.7.6 and not 2.7.3 like others?
We generally want to update to the latest Hadoop 2.7.x, but had an issue with 2.7.7 and reverted. Here I wanted to go ahead and link to the latest anyway, but for some reason this doc was only in the 2.7.6 docs, not 2.7.7. I doubt the doc will actually vary non-trivially from version to version, but wanted to at least point at a 2.7.x version.
```diff
@@ -30,9 +30,6 @@ Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{s
 uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version
 ({{site.SCALA_BINARY_VERSION}}.x).

-Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0.
-Support for Scala 2.10 was removed as of 2.3.0.
```
so we are not going to mention supported hadoop version?
Now that we are onto 3.0, I figured we didn't need to keep documenting how version 2.2 and 2.3 worked. I also felt that the particular Hadoop version was only an issue in the distant past, when we were trying to support the odd world of mutually incompatible 2.x releases before 2.2. Now, it's no more of a high level issue than anything else. Indeed we might even just build vs Hadoop 3.x in the end and de-emphasize dependence on a particular version of Hadoop. But for now I just removed this note.
@felixcheung regarding building PIP and R in one release, yeah, I was wondering that too. Ideally it would just be one. If the build changes only affect the source release, that's OK, as this is attached to a binary release, right? I suspected there wouldn't actually be any cross-over between the Python and R packaging in the binary release.
> I want to see the configurations ..

they're just some absolutely breath-taking bits of yaml to define all of the spark jenkins jobs.

aka: really nothing that exciting. ;)
I tried a release build that causes
Fine with me if the jenkins stuff is sorted out.
i haven't had a chance to do any of the jenkins stuff... after being sidetracked by the conversation to move the configs to the spark repo, plus planning for our big event that starts tomorrow, plus zomgmeetings all day today, work won't be able to start until early next week.
Merged to master. Note that the master hadoop 2.6 job will fail immediately now, so ignore it. On the upside ... this job already won't take much of any time from the Jenkins cluster.
ok just to revisit this: i'm going to push out the new jenkins jobs configs now, and not gate on moving these to the spark repo.
## What changes were proposed in this pull request?

Remove Hadoop 2.6 references and make 2.7 the default. Obviously, this is for master/3.0.0 only. After this we can also get rid of the separate test jobs for Hadoop 2.6.

## How was this patch tested?

Existing tests

Closes apache#22615 from srowen/SPARK-25016.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
….6 in Jenkins's test script

### What changes were proposed in this pull request?

This PR proposes to remove the leftover. After #22615, we don't have Hadoop 2.6 profile anymore in master.

### Why are the changes needed?

Using "test-hadoop2.6" against master branch in a PR wouldn't work.

### Does this PR introduce any user-facing change?

No (dev only).

### How was this patch tested?

Manually tested at #26707 and Jenkins build will test. Without this fix, and hadoop2.6 in the pr title, it shows as below:

```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.6 in the list. Valid options are dict_keys(['hadoop2.7', 'hadoop3.2'])
Attempting to post to Github...
```

Closes #26708 from HyukjinKwon/SPARK-25016.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Remove Hadoop 2.6 references and make 2.7 the default.
Obviously, this is for master/3.0.0 only.
After this we can also get rid of the separate test jobs for Hadoop 2.6.
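For reference, a build against the new default would then look roughly like this (the profile name and the 2.7.x version below are assumptions based on the Hadoop 2.7 line of the time, not the exact release tooling invocation):

```bash
# Hadoop 2.7 is now the default profile; Hadoop 2.6 is no longer supported.
./build/mvn -DskipTests -Phadoop-2.7 -Dhadoop.version=2.7.3 clean package
```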
## How was this patch tested?
Existing tests