SPARK-1064 #102

Closed · wants to merge 2 commits
6 changes: 6 additions & 0 deletions docs/building-with-maven.md
@@ -88,3 +88,9 @@ Running only java 8 tests and nothing else.
Java 8 tests are run when the -Pjava8-tests profile is enabled; they will run in spite of -DskipTests.
For these tests to run, your system must have a JDK 8 installation.
If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests.
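For example, an invocation along these lines runs them (illustrative; adjust the goals to your workflow):

    # JDK 8 must be installed; point JAVA_HOME at it first if it is not the default
    mvn -Pjava8-tests test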

## Packaging without Hadoop dependencies for deployment on YARN ##

The assembly jar produced by `mvn package` will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with `yarn.application.classpath`. The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects, such as ZooKeeper and Hadoop itself.
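For example, to produce such an assembly (the -DskipTests flag is illustrative, not required):

    mvn package -Phadoop-provided -DskipTests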


46 changes: 46 additions & 0 deletions pom.xml
@@ -800,5 +800,51 @@
</modules>

</profile>

    <!-- Build without Hadoop dependencies that are included in some runtime environments. -->
    <profile>
      <id>hadoop-provided</id>
      <activation>
        <activeByDefault>false</activeByDefault>
      </activation>
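      <!-- 'provided' scope keeps each artifact on the compile classpath while
           leaving it out of the assembly jar; the versions installed on the
           cluster are used at runtime. -->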
      <dependencies>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-api</artifactId>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-common</artifactId>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-client</artifactId>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro</artifactId>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-ipc</artifactId>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.zookeeper</groupId>
          <artifactId>zookeeper</artifactId>
          <scope>provided</scope>
        </dependency>
      </dependencies>
    </profile>

</profiles>
</project>
@@ -29,8 +29,10 @@ import org.apache.hadoop.fs._
import org.apache.hadoop.fs.permission.FsPermission
import org.apache.hadoop.io.DataOutputBuffer
import org.apache.hadoop.mapred.Master
import org.apache.hadoop.mapreduce.MRJobConfig
import org.apache.hadoop.net.NetUtils
import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.util.StringUtils
import org.apache.hadoop.yarn.api._
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment
import org.apache.hadoop.yarn.api.protocolrecords._
@@ -379,9 +381,48 @@ object ClientBase {

  // Based on code from org.apache.hadoop.mapreduce.v2.util.MRApps
  def populateHadoopClasspath(conf: Configuration, env: HashMap[String, String]) {
    val classpathEntries = Option(conf.getStrings(
      YarnConfiguration.YARN_APPLICATION_CLASSPATH)).getOrElse(
        getDefaultYarnApplicationClasspath())
    if (classpathEntries != null) {
      for (c <- classpathEntries) {
        Apps.addToEnvironment(env, Environment.CLASSPATH.name, c.trim)
      }
    }

    val mrClasspathEntries = Option(conf.getStrings(
      "mapreduce.application.classpath")).getOrElse(
        getDefaultMRApplicationClasspath())
    if (mrClasspathEntries != null) {
      for (c <- mrClasspathEntries) {
        Apps.addToEnvironment(env, Environment.CLASSPATH.name, c.trim)
      }
    }
  }

Contributor:
Hey, just so I understand here: what is the way that YARN_APPLICATION_CLASSPATH and MAPREDUCE_APPLICATION_CLASSPATH are used? Is this just designed to point to the locally installed YARN/MR code? Or do users ever go and include their own application code at these locations as well?

Contributor Author:
It points to the location on machines in the cluster of the HDFS/YARN/MR code. An admin might add a library like LZO to this, but users should instead be using the distributed cache if there's a jar specific to their application that they want.

Contributor:
Okay, sounds good. Just wanted to make sure this wasn't the main path for user application code.
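A minimal usage sketch (the setup here is assumed for illustration; in Spark this runs while building the YARN container launch environment):

    import org.apache.hadoop.conf.Configuration
    import scala.collection.mutable.HashMap

    val env = new HashMap[String, String]()
    // Reads yarn.application.classpath and mapreduce.application.classpath,
    // falling back to the compiled-in Hadoop defaults when they are unset.
    ClientBase.populateHadoopClasspath(new Configuration(), env)
    // env("CLASSPATH") now holds the combined classpath entries.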

  def getDefaultYarnApplicationClasspath(): Array[String] = {
    try {
      val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
      field.get(null).asInstanceOf[Array[String]]
    } catch {
      // getField throws the checked NoSuchFieldException; NoSuchFieldError can
      // also surface from linkage against an older Hadoop, so catch both.
      case e: NoSuchFieldException => null
      case e: NoSuchFieldError => null
    }
  }

Contributor:
Hey @sryza, rather than use reflection here, why not just modify the Client.scala classes in common and stable? I think the main reason for separating those two was so that we don't have to use reflection everywhere. Is that hard to do in this case?

Contributor Author:
Unfortunately, the alpha/stable distinction doesn't fully capture the differences here, because the APIs are different between the 0.23 Hadoop line and the 2.0 line, both of which fall under yarn-alpha. The comment above getDefaultMRApplicationClasspath explains the differences between 0.23, 2.0, and 2.2.

Contributor:
Ah, noted. Sorry I missed the comment.

  /**
   * In Hadoop 0.23, the MR application classpath comes with the YARN application
   * classpath. In Hadoop 2.0, it's an array of Strings, and in 2.2+ it's a String.
   * So we need to use reflection to retrieve it.
   */
  def getDefaultMRApplicationClasspath(): Array[String] = {
    try {
      val field = classOf[MRJobConfig].getField("DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH")
      if (field.getType == classOf[String]) {
        StringUtils.getStrings(field.get(null).asInstanceOf[String])
      } else {
        field.get(null).asInstanceOf[Array[String]]
      }
    } catch {
      // Catch both the reflective lookup failure and the linkage error.
      case e: NoSuchFieldException => null
      case e: NoSuchFieldError => null
    }
  }

  def populateClasspath(conf: Configuration, sparkConf: SparkConf, addLog4j: Boolean, env: HashMap[String, String]) {