Commit 0f39ae3

Merge in master

2 parents (4879c75 + 0ea0b1a), merge commit 0f39ae3

12,237 files changed: +286,258 / -9,536 lines


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -7,6 +7,7 @@
 sbt/*.jar
 .settings
 .cache
+.mima-excludes
 /build/
 work/
 out/
@@ -45,3 +46,5 @@ dist/
 spark-*-bin.tar.gz
 unit-tests.log
 /lib/
+rat-results.txt
+scalastyle.txt

.rat-excludes

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+target
+.gitignore
+.project
+.classpath
+.mima-excludes
+.rat-excludes
+.*md
+derby.log
+TAGS
+RELEASE
+control
+docs
+fairscheduler.xml.template
+spark-defaults.conf.template
+log4j.properties
+log4j.properties.template
+metrics.properties.template
+slaves
+spark-env.sh
+spark-env.sh.template
+log4j-defaults.properties
+sorttable.js
+.*txt
+.*data
+.*log
+cloudpickle.py
+join.py
+SparkExprTyper.scala
+SparkILoop.scala
+SparkILoopInit.scala
+SparkIMain.scala
+SparkImports.scala
+SparkJLineCompletion.scala
+SparkJLineReader.scala
+SparkMemberHandlers.scala
+sbt
+sbt-launch-lib.bash
+plugins.sbt
+work
+.*\.q
+golden
+test.out/*
+.*iml
+service.properties
+db.lck
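The new `.rat-excludes` file lists paths the Apache RAT license-audit tool should skip. How this commit wires the file into the build is not shown here, but as a rough sketch, the standalone RAT jar is usually pointed at such an exclude file like this (jar name and paths are illustrative only):

    java -jar apache-rat-0.10.jar -E .rat-excludes -d .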

.travis.yml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+language: scala
+scala:
+  - "2.10.3"
+jdk:
+  - oraclejdk7
+env:
+  matrix:
+    - TEST="scalastyle assembly/assembly"
+    - TEST="catalyst/test sql/test streaming/test mllib/test graphx/test bagel/test"
+    - TEST=hive/test
+cache:
+  directories:
+    - $HOME/.m2
+    - $HOME/.ivy2
+    - $HOME/.sbt
+script:
+  - "sbt ++$TRAVIS_SCALA_VERSION $TEST"

NOTICE

Lines changed: 10 additions & 1 deletion
@@ -1,5 +1,14 @@
 Apache Spark
-Copyright 2013 The Apache Software Foundation.
+Copyright 2014 The Apache Software Foundation.
 
 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
+
+In addition, this product includes:
+
+- JUnit (http://www.junit.org) is a testing framework for Java. We included it
+  under the terms of the Eclipse Public License v1.0.
+
+- JTransforms (https://sites.google.com/site/piotrwendykier/software/jtransforms)
+  provides fast transforms in Java. It is tri-licensed, and we included it under
+  the terms of the Mozilla Public License v1.1.

README.md

Lines changed: 24 additions & 11 deletions
@@ -10,20 +10,33 @@ guide, on the project webpage at <http://spark.apache.org/documentation.html>.
 This README file only contains basic setup instructions.
 
 
-## Building
+## Building Spark
 
-Spark requires Scala 2.10. The project is built using Simple Build Tool (SBT),
-which can be obtained [here](http://www.scala-sbt.org). If SBT is installed we
-will use the system version of sbt otherwise we will attempt to download it
-automatically. To build Spark and its example programs, run:
+Spark is built on Scala 2.10. To build Spark and its example programs, run:
 
     ./sbt/sbt assembly
 
-Once you've built Spark, the easiest way to start using it is the shell:
+## Interactive Scala Shell
+
+The easiest way to start using Spark is through the Scala shell:
 
     ./bin/spark-shell
 
-Or, for the Python API, the Python shell (`./bin/pyspark`).
+Try the following command, which should return 1000:
+
+    scala> sc.parallelize(1 to 1000).count()
+
+## Interactive Python Shell
+
+Alternatively, if you prefer Python, you can use the Python shell:
+
+    ./bin/pyspark
+
+And run the following command, which should also return 1000:
+
+    >>> sc.parallelize(range(1000)).count()
+
+## Example Programs
 
 Spark also comes with several sample programs in the `examples` directory.
 To run one of them, use `./bin/run-example <class> <params>`. For example:
@@ -38,13 +51,13 @@ All of the Spark samples take a `<master>` parameter that is the cluster URL
 to connect to. This can be a mesos:// or spark:// URL, or "local" to run
 locally with one thread, or "local[N]" to run locally with N threads.
 
-## Running tests
+## Running Tests
 
-Testing first requires [Building](#building) Spark. Once Spark is built, tests
+Testing first requires [building Spark](#building-spark). Once Spark is built, tests
 can be run using:
 
-`./sbt/sbt test`
-
+    ./sbt/sbt test
+
 ## A Note About Hadoop Versions
 
 Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported

assembly/pom.xml

Lines changed: 16 additions & 1 deletion
@@ -79,6 +79,11 @@
       <artifactId>spark-graphx_${scala.binary.version}</artifactId>
       <version>${project.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-sql_${scala.binary.version}</artifactId>
+      <version>${project.version}</version>
+    </dependency>
     <dependency>
       <groupId>net.sf.py4j</groupId>
       <artifactId>py4j</artifactId>
@@ -158,6 +163,16 @@
         </dependency>
       </dependencies>
     </profile>
+    <profile>
+      <id>hive</id>
+      <dependencies>
+        <dependency>
+          <groupId>org.apache.spark</groupId>
+          <artifactId>spark-hive_${scala.binary.version}</artifactId>
+          <version>${project.version}</version>
+        </dependency>
+      </dependencies>
+    </profile>
     <profile>
       <id>spark-ganglia-lgpl</id>
       <dependencies>
@@ -203,7 +218,7 @@
       <plugin>
         <groupId>org.codehaus.mojo</groupId>
         <artifactId>buildnumber-maven-plugin</artifactId>
-        <version>1.1</version>
+        <version>1.2</version>
        <executions>
          <execution>
            <phase>validate</phase>
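The new `hive` profile is an ordinary Maven profile, so it is activated with Maven's `-P` switch. As a hedged sketch (the exact goals and additional flags used for Spark builds may differ), a Hive-enabled build would be invoked along these lines:

    mvn -Phive -DskipTests clean package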

bagel/src/main/scala/org/apache/spark/bagel/Bagel.scala

Lines changed: 12 additions & 8 deletions
@@ -220,27 +220,31 @@ object Bagel extends Logging {
    */
   private def comp[K: Manifest, V <: Vertex, M <: Message[K], C](
       sc: SparkContext,
-      grouped: RDD[(K, (Seq[C], Seq[V]))],
+      grouped: RDD[(K, (Iterable[C], Iterable[V]))],
       compute: (V, Option[C]) => (V, Array[M]),
       storageLevel: StorageLevel
     ): (RDD[(K, (V, Array[M]))], Int, Int) = {
     var numMsgs = sc.accumulator(0)
     var numActiveVerts = sc.accumulator(0)
-    val processed = grouped.flatMapValues {
-      case (_, vs) if vs.size == 0 => None
-      case (c, vs) =>
+    val processed = grouped.mapValues(x => (x._1.iterator, x._2.iterator))
+      .flatMapValues {
+      case (_, vs) if !vs.hasNext => None
+      case (c, vs) => {
         val (newVert, newMsgs) =
-          compute(vs(0), c match {
-            case Seq(comb) => Some(comb)
-            case Seq() => None
-          })
+          compute(vs.next,
+            c.hasNext match {
+              case true => Some(c.next)
+              case false => None
+            }
+          )
 
         numMsgs += newMsgs.size
         if (newVert.active) {
           numActiveVerts += 1
         }
 
         Some((newVert, newMsgs))
+      }
     }.persist(storageLevel)
 
     // Force evaluation of processed RDD for accurate performance measurements

bagel/src/test/scala/org/apache/spark/bagel/BagelSuite.scala

Lines changed: 4 additions & 2 deletions
@@ -24,13 +24,15 @@ import org.scalatest.time.SpanSugar._
 import org.apache.spark._
 import org.apache.spark.storage.StorageLevel
 
+import scala.language.postfixOps
+
 class TestVertex(val active: Boolean, val age: Int) extends Vertex with Serializable
 class TestMessage(val targetId: String) extends Message[String] with Serializable
 
 class BagelSuite extends FunSuite with Assertions with BeforeAndAfter with Timeouts {
-
+
   var sc: SparkContext = _
-
+
   after {
     if (sc != null) {
       sc.stop()

bin/compute-classpath.sh

Lines changed: 31 additions & 8 deletions
@@ -25,35 +25,55 @@ SCALA_VERSION=2.10
 # Figure out where Spark is installed
 FWDIR="$(cd `dirname $0`/..; pwd)"
 
-# Load environment variables from conf/spark-env.sh, if it exists
-if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
-  . $FWDIR/conf/spark-env.sh
-fi
+. $FWDIR/bin/load-spark-env.sh
 
 # Build up classpath
 CLASSPATH="$SPARK_CLASSPATH:$FWDIR/conf"
 
+ASSEMBLY_DIR="$FWDIR/assembly/target/scala-$SCALA_VERSION"
+
 # First check if we have a dependencies jar. If so, include binary classes with the deps jar
-if [ -f "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar ]; then
+if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
   CLASSPATH="$CLASSPATH:$FWDIR/core/target/scala-$SCALA_VERSION/classes"
   CLASSPATH="$CLASSPATH:$FWDIR/repl/target/scala-$SCALA_VERSION/classes"
   CLASSPATH="$CLASSPATH:$FWDIR/mllib/target/scala-$SCALA_VERSION/classes"
   CLASSPATH="$CLASSPATH:$FWDIR/bagel/target/scala-$SCALA_VERSION/classes"
   CLASSPATH="$CLASSPATH:$FWDIR/graphx/target/scala-$SCALA_VERSION/classes"
   CLASSPATH="$CLASSPATH:$FWDIR/streaming/target/scala-$SCALA_VERSION/classes"
+  CLASSPATH="$CLASSPATH:$FWDIR/tools/target/scala-$SCALA_VERSION/classes"
+  CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/classes"
+  CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
+  CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"
 
-  DEPS_ASSEMBLY_JAR=`ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar`
+  DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar`
   CLASSPATH="$CLASSPATH:$DEPS_ASSEMBLY_JAR"
 else
   # Else use spark-assembly jar from either RELEASE or assembly directory
   if [ -f "$FWDIR/RELEASE" ]; then
-    ASSEMBLY_JAR=`ls "$FWDIR"/jars/spark-assembly*.jar`
+    ASSEMBLY_JAR=`ls "$FWDIR"/jars/spark*-assembly*.jar`
   else
-    ASSEMBLY_JAR=`ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*.jar`
+    ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*.jar`
   fi
   CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR"
 fi
 
+# When Hive support is needed, Datanucleus jars must be included on the classpath.
+# Datanucleus jars do not work if only included in the uber jar as plugin.xml metadata is lost.
+# Both sbt and maven will populate "lib_managed/jars/" with the datanucleus jars when Spark is
+# built with Hive, so first check if the datanucleus jars exist, and then ensure the current Spark
+# assembly is built for Hive, before actually populating the CLASSPATH with the jars.
+# Note that this check order is faster (by up to half a second) in the case where Hive is not used.
+num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ 2>/dev/null | grep "datanucleus-.*\\.jar" | wc -l)
+if [ $num_datanucleus_jars -gt 0 ]; then
+  AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
+  num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
+  if [ $num_hive_files -gt 0 ]; then
+    echo "Spark assembly has been built with Hive, including Datanucleus jars on classpath" 1>&2
+    DATANUCLEUSJARS=$(echo "$FWDIR/lib_managed/jars"/datanucleus-*.jar | tr " " :)
+    CLASSPATH=$CLASSPATH:$DATANUCLEUSJARS
+  fi
+fi
+
 # Add test classes if we're running from SBT or Maven with SPARK_TESTING set to 1
 if [[ $SPARK_TESTING == 1 ]]; then
   CLASSPATH="$CLASSPATH:$FWDIR/core/target/scala-$SCALA_VERSION/test-classes"
@@ -62,6 +82,9 @@ if [[ $SPARK_TESTING == 1 ]]; then
   CLASSPATH="$CLASSPATH:$FWDIR/bagel/target/scala-$SCALA_VERSION/test-classes"
   CLASSPATH="$CLASSPATH:$FWDIR/graphx/target/scala-$SCALA_VERSION/test-classes"
   CLASSPATH="$CLASSPATH:$FWDIR/streaming/target/scala-$SCALA_VERSION/test-classes"
+  CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/test-classes"
+  CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/test-classes"
+  CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/test-classes"
 fi
 
 # Add hadoop conf dir if given -- otherwise FileSystem.*, etc fail !

bin/load-spark-env.sh

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script loads spark-env.sh if it exists, and ensures it is only loaded once.
+# spark-env.sh is loaded from SPARK_CONF_DIR if set, or within the current directory's
+# conf/ subdirectory.
+
+if [ -z "$SPARK_ENV_LOADED" ]; then
+  export SPARK_ENV_LOADED=1
+
+  # Returns the parent of the directory this script lives in.
+  parent_dir="$(cd `dirname $0`/..; pwd)"
+
+  use_conf_dir=${SPARK_CONF_DIR:-"$parent_dir/conf"}
+
+  if [ -f "${use_conf_dir}/spark-env.sh" ]; then
+    # Promote all variable declarations to environment (exported) variables
+    set -a
+    . "${use_conf_dir}/spark-env.sh"
+    set +a
+  fi
+fi
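For readers unfamiliar with the `set -a` idiom the new script relies on, here is a minimal, self-contained illustration; the temp file and the value are made up for this sketch, and `SPARK_WORKER_MEMORY` is used only as a familiar variable name:

    # hypothetical demo, not part of the commit
    echo 'SPARK_WORKER_MEMORY=2g' > /tmp/example-env.sh  # plain assignment, no `export`
    set -a                        # auto-export every variable assigned from here on
    . /tmp/example-env.sh         # SPARK_WORKER_MEMORY is now visible to child processes
    set +a                        # stop auto-exporting
    env | grep SPARK_WORKER_MEMORY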
