[SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6 #22615
Conversation
Test build #96867 has finished for PR 22615 at commit
Sean, as a reminder, https://github.com/apache/spark/blob/master/dev/appveyor-install-dependencies.ps1#L98 should be kept in sync with https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1
Yep, already updated in the first commit actually. I'm not sure if that's why AppVeyor failed here. Let's see on another test.
Test build #96894 has finished for PR 22615 at commit
Test build #96930 has finished for PR 22615 at commit
Test build #96942 has finished for PR 22615 at commit
Test build #96961 has finished for PR 22615 at commit
dev/create-release/release-build.sh (Outdated)

```diff
@@ -288,7 +290,9 @@ if [[ "$1" == "package" ]]; then
   declare -A BINARY_PKGS_EXTRA
   BINARY_PKGS_EXTRA["hadoop2.7"]="withpip"
   if ! is_dry_run; then
-    BINARY_PKGS_EXTRA["hadoop2.6"]="withr"
+    if [[ $SPARK_VERSION < "3.0." ]]; then
+      BINARY_PKGS_EXTRA["hadoop2.6"]="withr"
```
Hmmm, now you're not building the R packaging on 3.0.
The old build was admittedly a bit odd. It only supported one "extra" arg, so it built 2.6+R and 2.7+PIP. It seems we need to change that now...
I see. Let me try to rewrite the script to support two args and attach both to 2.7.
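For illustration, a minimal sketch of that idea, assuming the "extra" value can simply carry both tokens (the separator and the loop below are illustrative, not the exact script):

```bash
# Sketch only: attach both extras to the hadoop2.7 binary package.
declare -A BINARY_PKGS_EXTRA
BINARY_PKGS_EXTRA["hadoop2.7"]="withpip,withr"

# The packaging step can then test for each token independently.
for key in "${!BINARY_PKGS_EXTRA[@]}"; do
  extras="${BINARY_PKGS_EXTRA[$key]}"
  [[ $extras == *"withpip"* ]] && echo "$key: also build the Python (pip) package"
  [[ $extras == *"withr"* ]] && echo "$key: also build the R package"
done
```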
What is the change in the app-... file?
That app-... file had classpath-like references to hadoop-...-2.6.5 jars. It may not matter, but I updated them. That should be the only difference.
It also has references to a bunch of other old stuff; I don't think there's a need to change it.
I reverted the spark-events test file too.
```bash
BUILD_PACKAGE=$3
SCALA_VERSION=$4

if [[ $BUILD_PACKAGE == *"withpip"* ]]; then
```
@vanzin what do you think of this approach? It simplifies the logic below too, avoiding repeating the main build step 3 times.
Looks fine. Using wildcards is a little weird but I guess that's the cleanest way in bash.
But shouldn't you initialize PIP_FLAG and R_FLAG to empty before these checks?
one caveat is I'm not sure we have tested building both Python and R in "one build".
this could be a good thing, but if I recall, the R build changes some of the binary files under R that get shipped in the "source release" (these are required R object files)
Test build #97054 has finished for PR 22615 at commit
Test build #97105 has finished for PR 22615 at commit
Test build #97116 has finished for PR 22615 at commit
Have you pinged @shaneknapp about removing the 2.6 jobs for the master branch before pushing this?
I guess we've just pinged @shaneknapp! But I figured the jobs would simply fail and could be removed at leisure. Yes, this mechanism is a little weird, but it may be the simplest thing here. I can add explicit init of those two flags to an empty string, but an unset var is just an empty string anyway.
consider me pinged. ;) i will need to do some refactoring in the jenkins job builder configs for this, so we'll definitely need to coordinate before this is merged. most likely i won't have much time until next week (risecamp will be taking all of my time wed-fri), but i'll see if i can't at least get an initial PR on this stuff by EOD tomorrow (oct 9th). @JoshRosen for a heads up on the forthcoming job config builder changes.
@srowen sure, manually removing the failing jobs is one option... but since we auto-generate the job configs, any time we add a new branch they'll come back. i'd much rather do this the right way. :)
The issue is that if you call that function multiple times the variables might be set by a previous call.
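A small illustration of that pitfall (the function and flag names below are hypothetical, not the actual script): without an explicit reset, the non-local flags keep whatever the previous call set, because the wildcard checks only ever add.

```bash
make_pkg() {
  local build_package="$1"
  # Reset before the checks; otherwise values leak across calls,
  # since these variables are not local to the function.
  PIP_FLAG=""
  R_FLAG=""
  [[ $build_package == *"withpip"* ]] && PIP_FLAG="--pip"
  [[ $build_package == *"withr"* ]] && R_FLAG="--r"
  echo "building with flags: '$PIP_FLAG' '$R_FLAG'"
}

make_pkg "withpip,withr"   # building with flags: '--pip' '--r'
make_pkg "plainbuild"      # '' '' -- only because of the explicit reset above
```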
Hmmm... just want to raise a possible issue that maybe, just maybe, we should be hosting those jenkins configs in a repository that is owned by the ASF and writable by all Spark committers. Or even as a directory under the Spark repo itself (and use them always from master). Just a thought.
@vanzin i'm not opposed to hosting these configs somewhere else. @JoshRosen did this a few years back just to "get shit done"... i'd be leery of putting this in to the main spark repo, however, as only a very, very, very small subset of people (consisting mostly of myself) should actually ever touch this stuff.
Yeah, this does need to be in a public repo. apache/spark-jenkins-configurations or something. We can ask INFRA to create them. But I'm not against just putting them in dev/ or something in the main repo. It's not much, right? And we already host all the release scripts there, which maybe 5 people are interested in right now.
@srowen fair 'nuf... i'll create a jira for this tomorrow and we can hash out final design shite there (rather than overloading this PR). :)
I want to see the configurations .. |
Test build #97137 has finished for PR 22615 at commit
I think we all know enough not to make changes (merge changes) to these configs; should be safe.
```diff
@@ -71,7 +71,7 @@ class HadoopTableReader(
     // Hadoop honors "mapreduce.job.maps" as hint,
     // but will ignore when mapreduce.jobtracker.address is "local".
-    // https://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/
+    // https://hadoop.apache.org/docs/r2.7.6/hadoop-mapreduce-client/hadoop-mapreduce-client-core/
```
how come this is 2.7.6 and not 2.7.3 like others?
We generally want to update to the latest Hadoop 2.7.x, but had an issue with 2.7.7 and reverted. Here I wanted to go ahead and link to the latest anyway, but for some reason this doc was only in the 2.7.6 docs, not 2.7.7. I doubt the doc will actually vary non-trivially from version to version, but wanted to at least point at a 2.7.x version.
```diff
@@ -30,9 +30,6 @@ Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{s
 uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version
 ({{site.SCALA_BINARY_VERSION}}.x).

-Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0.
-Support for Scala 2.10 was removed as of 2.3.0.
```
so we are not going to mention supported hadoop version?
Now that we are onto 3.0, I figured we didn't need to keep documenting how version 2.2 and 2.3 worked. I also felt that the particular Hadoop version was only an issue in the distant past, when we were trying to support the odd world of mutually incompatible 2.x releases before 2.2. Now, it's no more of a high level issue than anything else. Indeed we might even just build vs Hadoop 3.x in the end and de-emphasize dependence on a particular version of Hadoop. But for now I just removed this note.
@felixcheung regarding building PIP and R in one release, yeah, I was wondering that too. Ideally it would just be one. If the build changes only affect the source release, that's OK, as this is attached to a binary release, right? I suspected there wouldn't actually be any cross-over between the Python and R packaging in the binary release.
> I want to see the configurations ..

they're just some absolutely breath-taking bits of yaml to define all of the spark jenkins jobs.

aka: really nothing that exciting. ;)
I tried a release build that causes
Fine with me if the jenkins stuff is sorted out.
i haven't had a chance to do any of the jenkins stuff... after being sidetracked by the conversation to move the configs to the spark repo, plus planning for our big event that starts tomorrow, plus zomgmeetings all day today, work won't be able to start until early next week.
Merged to master. Note that the master hadoop 2.6 job will fail immediately now, so ignore it. On the upside ... this job already won't take much of any time from the Jenkins cluster.
ok just to revisit this: i'm going to push out the new jenkins jobs configs now, and not gate on moving these to the spark repo.
## What changes were proposed in this pull request?

Remove Hadoop 2.6 references and make 2.7 the default. Obviously, this is for master/3.0.0 only. After this we can also get rid of the separate test jobs for Hadoop 2.6.

## How was this patch tested?

Existing tests

Closes apache#22615 from srowen/SPARK-25016.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
….6 in Jenkins's test script

### What changes were proposed in this pull request?

This PR proposes to remove the leftover. After #22615, we don't have Hadoop 2.6 profile anymore in master.

### Why are the changes needed?

Using "test-hadoop2.6" against master branch in a PR wouldn't work.

### Does this PR introduce any user-facing change?

No (dev only).

### How was this patch tested?

Manually tested at #26707 and Jenkins build will test. Without this fix, and hadoop2.6 in the pr title, it shows as below:

```
========================================================================
Building Spark
========================================================================
[error] Could not find hadoop2.6 in the list. Valid options are dict_keys(['hadoop2.7', 'hadoop3.2'])
Attempting to post to Github...
```

Closes #26708 from HyukjinKwon/SPARK-25016.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Remove Hadoop 2.6 references and make 2.7 the default.
Obviously, this is for master/3.0.0 only.
After this we can also get rid of the separate test jobs for Hadoop 2.6.
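For reference, a build against the new default would then look roughly like this (the profile name and the 2.7.x version below are assumptions based on the Hadoop 2.7 line of the time, not the exact release tooling invocation):

```bash
# Hadoop 2.7 is now the default profile; Hadoop 2.6 is no longer supported.
./build/mvn -DskipTests -Phadoop-2.7 -Dhadoop.version=2.7.3 clean package
```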
## How was this patch tested?
Existing tests