[SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6 #22615


Closed
srowen wants to merge 3 commits into apache:master from srowen:SPARK-25016

Conversation

srowen
Member

@srowen srowen commented Oct 2, 2018

What changes were proposed in this pull request?

Remove Hadoop 2.6 references and make 2.7 the default.
Obviously, this is for master/3.0.0 only.
After this we can also get rid of the separate test jobs for Hadoop 2.6.

How was this patch tested?

Existing tests

@SparkQA

SparkQA commented Oct 2, 2018

Test build #96867 has finished for PR 22615 at commit 3b313bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

@srowen
Member Author

srowen commented Oct 3, 2018

Yep, already updated in the first commit actually. I'm not sure if that's why appveyor failed here. Let's see on another test.

@SparkQA

SparkQA commented Oct 3, 2018

Test build #96894 has finished for PR 22615 at commit 77a70a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 4, 2018

Test build #96930 has finished for PR 22615 at commit fbeb4df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 4, 2018

Test build #96942 has finished for PR 22615 at commit 4f368b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96961 has finished for PR 22615 at commit 1f16631.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -288,7 +290,9 @@ if [[ "$1" == "package" ]]; then
   declare -A BINARY_PKGS_EXTRA
   BINARY_PKGS_EXTRA["hadoop2.7"]="withpip"
   if ! is_dry_run; then
-    BINARY_PKGS_EXTRA["hadoop2.6"]="withr"
+    if [[ $SPARK_VERSION < "3.0." ]]; then
+      BINARY_PKGS_EXTRA["hadoop2.6"]="withr"
```
Contributor

Hmmm, now you're not building the R packaging on 3.0.

The old build was admittedly a bit odd. It only supported one "extra" arg, so it built 2.6+R and 2.7+PIP. It seems we need to change that now...

Member Author

I see. Let me try to rewrite the script to support two args and attach both of these to 2.7.
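Roughly what I have in mind (just a sketch with illustrative wiring, not the final script):

```
# Sketch: let the hadoop2.7 package carry both extras, then split the
# combined value with substring matches inside the build function.
BINARY_PKGS_EXTRA["hadoop2.7"]="withpip,withr"

# later, where the extras string is passed in as $3:
BUILD_PACKAGE=$3

if [[ $BUILD_PACKAGE == *"withpip"* ]]; then
  PIP_FLAG="--pip"
fi
if [[ $BUILD_PACKAGE == *"withr"* ]]; then
  R_FLAG="--r"
fi
```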

@vanzin
Contributor

vanzin commented Oct 5, 2018

What is the change in app-20180109111548-0000 about? It's hard to see in the diff and I'm a little surprised you needed to touch it.

@srowen
Member Author

srowen commented Oct 5, 2018

That app-... file had classpath-like references to hadoop-...-2.6.5 jars. It may not matter, but I updated them. That should be the only difference.

@vanzin
Contributor

vanzin commented Oct 5, 2018

had classpath-like references to hadoop-...-2.6.5

It also has references to a bunch of other old stuff; I don't think there's a need to change it.

Member Author

@srowen srowen left a comment

I reverted the spark-events test file too.

```
BUILD_PACKAGE=$3
SCALA_VERSION=$4

if [[ $BUILD_PACKAGE == *"withpip"* ]]; then
```
Member Author

@vanzin what do you think of this approach? It simplifies the logic below too, avoiding repeating the main build step 3 times.

Contributor

Looks fine. Using wildcards is a little weird but I guess that's the cleanest way in bash.

But shouldn't you initialize PIP_FLAG and R_FLAG to empty before these checks?
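i.e., something like this at the top of the checks (a sketch):

```
# Reset both flags so a value left over from a previous call can't leak in.
PIP_FLAG=""
R_FLAG=""
if [[ $BUILD_PACKAGE == *"withpip"* ]]; then
  PIP_FLAG="--pip"
fi
if [[ $BUILD_PACKAGE == *"withr"* ]]; then
  R_FLAG="--r"
fi
```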

Member

one caveat is I'm not sure we have tested building both python and R in "one build".

this could be a good thing, but if I recall the R build changes some of the binary files under R that get shipped in the "source release" (these are required R object files)

@SparkQA

SparkQA commented Oct 6, 2018

Test build #97054 has finished for PR 22615 at commit 9efb76c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2018

Test build #97105 has finished for PR 22615 at commit 9efb76c.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2018

Test build #97116 has finished for PR 22615 at commit fb2c90d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

Have you pinged @shaneknapp about removing the 2.6 jobs for the master branch before pushing this?


@srowen
Member Author

srowen commented Oct 8, 2018

I guess we've just pinged @shaneknapp! But I figured the jobs would simply fail and could be removed at leisure.

Yes, this mechanism is a little weird but may be the simplest thing here. I can add an explicit init of those two flags to an empty string, but an unset var is just an empty string anyway.

@shaneknapp
Contributor

consider me pinged. ;)

i will need to do some refactoring in the jenkins job builder configs for this, so we'll definitely need to coordinate before this is merged.

most likely i won't have much time until next week (risecamp will be taking all of my time wed-fri), but i'll see if i can't at least get an initial PR on this stuff by EOD tomorrow (oct 9th).

@JoshRosen for a heads up on the forthcoming job config builder changes.

@shaneknapp
Contributor

@srowen sure, manually removing the failing jobs is one option... but since we auto-generate the job configs, any time we add a new branch they'll come back.

i'd much rather do this the right way. :)

@vanzin
Contributor

vanzin commented Oct 8, 2018

an unset var is just an empty string anyway.

The issue is that if you call that function multiple times the variables might be set by a previous call.
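A toy illustration of the pitfall (set_flags is hypothetical, not from the script):

```
# Without reinitialization, the second call silently reuses the value
# set by the first call.
set_flags() {
  if [[ $1 == *"withpip"* ]]; then PIP_FLAG="--pip"; fi
}

set_flags "withpip"
echo "first:  '$PIP_FLAG'"    # prints: first:  '--pip'
set_flags "withr"
echo "second: '$PIP_FLAG'"    # prints: second: '--pip' (stale value)
```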

@vanzin
Contributor

vanzin commented Oct 8, 2018

Hmmm... just want to raise a possible issue that maybe, just maybe, we should be hosting those jenkins configs in a repository that is owned by the ASF and writable by all Spark committers.

Or even as a directory under the Spark repo itself (and use them always from master).

Just a thought.

@shaneknapp
Contributor

@vanzin i'm not opposed to hosting these configs somewhere else. @JoshRosen did this a few years back just to "get shit done"...

i'd be leery of putting this in to the main spark repo, however, as only a very, very, very small subset of people (consisting mostly of myself) should actually ever touch this stuff.

@srowen
Member Author

srowen commented Oct 8, 2018

Yeah, this does need to be in a public repo. apache/spark-jenkins-configurations or something. We can ask INFRA to create them. But I'm not against just putting them in dev/ or something in the main repo. It's not much, right? And we already host all the release scripts there, which maybe 5 people are interested in right now.

@shaneknapp
Contributor

@srowen fair 'nuf... i'll create a jira for this tomorrow and we can hash out final design shite there (rather than overloading this PR). :)

@HyukjinKwon
Member

I want to see the configurations...

@SparkQA

SparkQA commented Oct 9, 2018

Test build #97137 has finished for PR 22615 at commit 7392cf0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

I think we all know enough not to make changes (merge changes) to these configs, so it should be safe.


```
@@ -71,7 +71,7 @@ class HadoopTableReader(

   // Hadoop honors "mapreduce.job.maps" as hint,
   // but will ignore when mapreduce.jobtracker.address is "local".
-  // https://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/
+  // https://hadoop.apache.org/docs/r2.7.6/hadoop-mapreduce-client/hadoop-mapreduce-client-core/
```
Member

how come this is 2.7.6 and not 2.7.3 like others?

Member Author

We generally want to update to the latest Hadoop 2.7.x, but had an issue with 2.7.7 and reverted. Here I wanted to go ahead and link to the latest anyway, but for some reason this doc was only in the 2.7.6 docs, not 2.7.7. I doubt the doc will actually vary non-trivially from version to version, but wanted to at least point at a 2.7.x version.

```
@@ -30,9 +30,6 @@ Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{s
 uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version
 ({{site.SCALA_BINARY_VERSION}}.x).

-Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0.
-Support for Scala 2.10 was removed as of 2.3.0.
```
Member

so we are not going to mention the supported hadoop version?

Member Author

Now that we are onto 3.0, I figured we didn't need to keep documenting how versions 2.2 and 2.3 worked. I also felt that the particular Hadoop version was only an issue in the distant past, when we were trying to support the odd world of mutually incompatible 2.x releases before 2.2. Now, it's no more of a high-level issue than anything else. Indeed, we might even just build against Hadoop 3.x in the end and de-emphasize dependence on a particular version of Hadoop. But for now I just removed this note.

@srowen
Member Author

srowen commented Oct 9, 2018

@felixcheung regarding building PIP and R in one release, yeah I was wondering that too. Ideally it would just be one. If the build changes only affect the source release, that's OK, as this is attached to a binary release, right? I suspected there wouldn't actually be any cross-over between the Python and R packaging in the binary release.

@shaneknapp
Contributor

shaneknapp commented Oct 9, 2018 via email

@srowen
Member Author

srowen commented Oct 10, 2018

I tried a release build that causes --pip and --r to be set, and the result looked OK. Both pyspark and R packages built and seemed normal. The source build worked too and comes before binary builds, so I don't think it can be affected. I will go ahead and merge this, I think.
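Something like this exercises both packaging paths in one pass (a sketch; the real release script supplies its own name and profile list):

```
# dev/make-distribution.sh supports --pip and --r together; the
# profiles below are illustrative, not the release config's exact set.
./dev/make-distribution.sh --name hadoop2.7 --tgz --pip --r \
  -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
```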

@vanzin
Contributor

vanzin commented Oct 10, 2018

Fine with me if the jenkins stuff is sorted out.

@shaneknapp
Contributor

i haven't had a chance to do any of the jenkins stuff... after being sidetracked by the conversation to move the configs to the spark repo, plus planning for our big event that starts tomorrow, plus zomgmeetings all day today, work won't be able to start until early next week.

@srowen
Member Author

srowen commented Oct 10, 2018

Merged to master. Note that the master hadoop 2.6 job will fail immediately now, so ignore it. On the upside... this job already won't take much of any time from the Jenkins cluster.

@asfgit asfgit closed this in 80813e1 Oct 10, 2018
@srowen srowen deleted the SPARK-25016 branch October 10, 2018 22:29
@shaneknapp
Contributor

shaneknapp commented Oct 18, 2018

ok just to revisit this: i'm going to push out the new jenkins jobs configs now, and not gate on moving these to the spark repo.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#22615 from srowen/SPARK-25016.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
HyukjinKwon added a commit that referenced this pull request Nov 30, 2019
….6 in Jenkins's test script

### What changes were proposed in this pull request?

This PR proposes to remove the leftover. After #22615, we don't have Hadoop 2.6 profile anymore in master.

### Why are the changes needed?

Using "test-hadoop2.6" against master branch in a PR wouldn't work.

### Does this PR introduce any user-facing change?

No (dev only).

### How was this patch tested?

Manually tested at #26707 and Jenkins build will test.

Without this fix, and hadoop2.6 in the pr title, it shows as below:

```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.6 in the list. Valid options  are dict_keys(['hadoop2.7', 'hadoop3.2'])
Attempting to post to Github...
```

Closes #26708 from HyukjinKwon/SPARK-25016.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019