
Hadoop agnostic builds #838


Merged: 31 commits merged into mesos:master on Aug 20, 2013

Conversation

@jey
Contributor

jey commented Aug 15, 2013

This PR allows one Spark binary to target multiple Hadoop versions. It also moves YARN support into a separate artifact. This is the follow-up to PR #803.
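
To make the shape of the change concrete: instead of branching at compile time on HADOOP_MAJOR_VERSION, the build is parameterized on a single Hadoop version string. The sketch below shows the rough SparkBuild.scala wiring this PR moves toward. The names hadoopVersion and isYarnMode appear in the diff discussed later in this thread, but the environment-variable names (SPARK_HADOOP_VERSION, SPARK_WITH_YARN) and the 1.2.1 default are illustrative assumptions, not necessarily the exact merged values:

    import sbt._
    import Keys._

    // Illustrative sketch, not the exact merged code. One source tree builds
    // against whichever Hadoop release is requested at build time.
    val hadoopVersion = sys.env.getOrElse("SPARK_HADOOP_VERSION", "1.2.1")  // assumed env var name
    val isYarnMode = sys.env.get("SPARK_WITH_YARN").exists(_.equalsIgnoreCase("true"))  // assumed env var name

    val excludeJackson = ExclusionRule(organization = "org.codehaus.jackson")
    val excludeNetty   = ExclusionRule(organization = "org.jboss.netty")
    val excludeAsm     = ExclusionRule(organization = "asm")

    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-client" % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm)
    ) ++ (if (isYarnMode) Seq(
      // YARN support lives in a separate artifact; these come along only in YARN builds.
      "org.apache.hadoop" % "hadoop-yarn-api"    % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm),
      "org.apache.hadoop" % "hadoop-yarn-common" % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm),
      "org.apache.hadoop" % "hadoop-yarn-client" % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm)
    ) else Nil)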

CC: @mateiz, @mridulm, @tgravescs

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/609/

@AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/613/

@jey
Contributor Author

jey commented Aug 16, 2013

Jenkins, retest this please.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/623/

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/627/

@jey
Contributor Author

jey commented Aug 17, 2013

(I meant: the Maven build is having problems with 0.23.x)

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/643/

@AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/645/

@jey
Contributor Author

jey commented Aug 19, 2013

Jenkins, retest this please.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/653/

In project/SparkBuild.scala:

    Seq(
      "org.apache.hadoop" % "hadoop-yarn-api" % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm),
      "org.apache.hadoop" % "hadoop-yarn-common" % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm),
      "org.apache.hadoop" % "hadoop-yarn-client" % hadoopVersion excludeAll(excludeJackson, excludeNetty, excludeAsm)
    )
@mateiz
Member

Hey Jey, just to understand: this means that users who link to us when running on Hadoop 0.23.x have to also add these to their project, in addition to hadoop-client version 0.23.x?

@jey
Contributor Author

Nah, because the issue of explicitly linking against the Hadoop libs only applies to non-YARN builds. That does bring up another issue, though: right now the spark-core artifact will by default be built with a dependency on hadoop >= 1.2.1. I'll look into figuring out how to specify a more accurate set of constraints to the POM dependency mechanism.

@mateiz
Member

That dependency is fine for spark-core. The main thing is to document what else users should add to use a newer Hadoop (e.g., they'd add a newer hadoop-client, but they may also have to add this YARN stuff).

Matei
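
To make the documentation point above concrete: a downstream sbt project targeting Hadoop 0.23.x might end up with dependency declarations like the following. This is only a sketch of what the thread implies, not official guidance; the 0.23.9 version string is illustrative, and whether the YARN artifacts are truly required on 0.23.x was still being clarified above:

    libraryDependencies ++= Seq(
      // The Spark artifact itself (coordinates taken from the Maven output later in this thread):
      "org.spark-project" % "spark-core" % "0.8.0-SNAPSHOT",
      // Override Spark's default Hadoop dependency with the cluster's version:
      "org.apache.hadoop" % "hadoop-client" % "0.23.9",
      // Possibly also needed on 0.23.x, per the discussion above:
      "org.apache.hadoop" % "hadoop-yarn-api"    % "0.23.9",
      "org.apache.hadoop" % "hadoop-yarn-common" % "0.23.9",
      "org.apache.hadoop" % "hadoop-yarn-client" % "0.23.9"
    )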


@mateiz
Member

mateiz commented Aug 19, 2013

Hey Jey, I tested this and it looks good, though I had that question above.

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/658/

@mateiz merged commit 6f6944c into mesos:master on Aug 20, 2013
@mateiz
Member

mateiz commented Aug 20, 2013

Thanks for putting this together, Jey. I've merged it manually due to a small conflict.

@pwendell
Contributor

@mateiz @jey - It's very unfortunate that this got merged without any documentation or notification to developers. This will affect many downstream things (tests, anyone running off of master or building things on top of master, the ec2 scripts, etc.). Also, some of the existing documentation, such as docs/building-with-maven, is now invalid and tells users to do the wrong thing. Could one of you send an e-mail to the dev list explaining what this means for people who consume Spark master? Also, please fix the existing docs ASAP and ideally add new docs explaining how to use this.

@rxin
Member

rxin commented Aug 21, 2013

Have you tried running mvn package?

I am getting the following error:

*** RUN ABORTED ***
java.lang.NoSuchMethodError: spark.scheduler.cluster.ClusterTaskSetManager.&lt;init&gt;(Lspark/scheduler/cluster/ClusterScheduler;Lspark/scheduler/TaskSet;)V
at spark.scheduler.DummyTaskSetManager.&lt;init&gt;(ClusterSchedulerSuite.scala:30)
at spark.scheduler.ClusterSchedulerSuite.createDummyTaskSetManager(ClusterSchedulerSuite.scala:111)
at spark.scheduler.ClusterSchedulerSuite$$anonfun$1.apply$mcV$sp(ClusterSchedulerSuite.scala:146)
at spark.scheduler.ClusterSchedulerSuite$$anonfun$1.apply(ClusterSchedulerSuite.scala:134)
at spark.scheduler.ClusterSchedulerSuite$$anonfun$1.apply(ClusterSchedulerSuite.scala:134)
at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
at spark.scheduler.ClusterSchedulerSuite.withFixture(ClusterSchedulerSuite.scala:108)
at org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262)
at org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [2.028s]
[INFO] Spark Project Core ................................ FAILURE [7:24.910s]
[INFO] Spark Project Bagel ............................... SKIPPED
[INFO] Spark Project Streaming ........................... SKIPPED
[INFO] Spark Project ML Library .......................... SKIPPED
[INFO] Spark Project Examples ............................ SKIPPED
[INFO] Spark Project Tools ............................... SKIPPED
[INFO] Spark Project REPL ................................ SKIPPED
[INFO] Spark Project REPL binary packaging ............... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7:27.541s

@mateiz
Member

mateiz commented Aug 21, 2013

Did you do mvn clean and sbt clean? Sounds like an old build issue.

@mateiz
Member

mateiz commented Aug 21, 2013

But yes I agree with Patrick on the docs -- I shouldn't have merged this without looking at that and trying Shark as well, so we know what will break there. Sorry about that.

@jey
Contributor Author

jey commented Aug 21, 2013

@pwendell: Agreed, I'll send an email to the list ASAP and submit a patch for the docs shortly.

@rxin: As Matei said, that sounds like your classpath is contaminated with old build artifacts.

@rxin
Member

rxin commented Aug 21, 2013

I tried running mvn dependency:tree after I removed .m2 and .ivy2 and ran sbt clean and mvn clean. I got the following error:

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [0.706s]
[INFO] Spark Project Core ................................ SUCCESS [1.026s]
[INFO] Spark Project Bagel ............................... FAILURE [0.038s]
[INFO] Spark Project Streaming ........................... SKIPPED
[INFO] Spark Project ML Library .......................... SKIPPED
[INFO] Spark Project Examples ............................ SKIPPED
[INFO] Spark Project Tools ............................... SKIPPED
[INFO] Spark Project REPL ................................ SKIPPED
[INFO] Spark Project REPL binary packaging ............... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.574s
[INFO] Finished at: Tue Aug 20 19:03:59 PDT 2013
[INFO] Final Memory: 10M/81M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project spark-bagel: Could not resolve dependencies for project org.spark-project:spark-bagel:jar:0.8.0-SNAPSHOT: Could not find artifact org.spark-project:spark-core:jar:0.8.0-SNAPSHOT -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]

@jey
Contributor Author

jey commented Aug 21, 2013

@rxin, it's my understanding that this is "normal" for running dependency:tree on the Maven build before packaging. I think after performing mvn -DskipTests package you'll be able to run mvn dependency:tree, mvn package, mvn test, etc.

@jey
Contributor Author

jey commented Aug 21, 2013

@rxin: actually, apparently Maven in its infinite wisdom requires mvn install before mvn dependency:tree will work: http://stackoverflow.com/a/1905927
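
For anyone else hitting the resolution error above, the working sequence from a clean checkout is roughly the following, per Jey's suggestion and the linked Stack Overflow answer (-DskipTests is optional but faster):

    mvn -DskipTests install   # puts the 0.8.0-SNAPSHOT artifacts in the local repo
    mvn dependency:tree       # now inter-module dependencies (e.g. spark-bagel -> spark-core) resolve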

@rxin
Member

rxin commented Aug 21, 2013

Alright, thanks @jey. That worked (although it's a little bit convoluted...).

zhuguangbin pushed a commit to zhuguangbin/shark that referenced this pull request Oct 31, 2013
xiajunluan pushed a commit to xiajunluan/spark that referenced this pull request May 30, 2014
`lateral_view_outer` query sometimes returns a different set of 10 rows.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes mesos#838 from tdas/hive-test-fix2 and squashes the following commits:

9128a0d [Tathagata Das] Blacklisted flaky HiveCompatibility test.