[SPARK-1133] add small files input in MLlib #164


Closed
wants to merge 61 commits into from
61 commits
fd93e59
add small text files input API
yinxusen Mar 18, 2014
9bf87d4
Merge branch 'master' into small-files-input
yinxusen Mar 18, 2014
e3681f2
Spark 1246 add min max to stat counter
dwmclary Mar 18, 2014
e7423d4
Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
pwendell Mar 18, 2014
c27a7ab
fix errors and refine code
yinxusen Mar 18, 2014
2fa26ec
SPARK-1102: Create a saveAsNewAPIHadoopDataset method
CodingCat Mar 18, 2014
79e547f
Update copyright year in NOTICE to 2014
mateiz Mar 18, 2014
e108b9a
[SPARK-1260]: faster construction of features with intercept
mengxr Mar 18, 2014
f9d8a83
[SPARK-1266] persist factors in implicit ALS
mengxr Mar 19, 2014
cc2655a
Fix SPARK-1256: Master web UI and Worker web UI returns a 404 error
witgo Mar 19, 2014
a18ea00
Bundle tachyon: SPARK-1269
nicklan Mar 19, 2014
d55ec86
bugfix: Wrong "Duration" in "Active Stages" in stages page
BlackNiuza Mar 19, 2014
6112270
SPARK-1203 fix saving to hdfs from yarn
tgravescs Mar 19, 2014
ab747d3
Bugfixes/improvements to scheduler
mridulm Mar 19, 2014
79d07d6
[SPARK-1132] Persisting Web UI through refactoring the SparkListener …
andrewor14 Mar 19, 2014
67fa71c
Added doctest for map function in rdd.py
jyotiska Mar 19, 2014
1678931
SPARK-1099: Spark's local mode should probably respect spark.cores.max…
Mar 19, 2014
ffe272d
Revert "SPARK-1099:Spark's local mode should probably respect spark.c…
aarondav Mar 20, 2014
7d22941
remove merge process from smallTextFiles interface
yinxusen Mar 20, 2014
66a03e5
Principal Component Analysis
rezazadeh Mar 20, 2014
ca76423
[Hot Fix #42] Do not stop SparkUI if bind() is not called
andrewor14 Mar 20, 2014
9aadcff
SPARK-1251 Support for optimizing and executing structured queries
marmbrus Mar 21, 2014
e09139d
Fix maven jenkins: Add explicit init for required tables in SQLQueryS…
marmbrus Mar 21, 2014
78c0f25
remove useless code and consideration, neaten the code style
yinxusen Mar 21, 2014
7e17fe6
Add hive test files to repository. Remove download script.
marmbrus Mar 21, 2014
2c0aa22
SPARK-1279: Fix improper use of SimpleDateFormat
zsxwing Mar 21, 2014
dab5439
Make SQL keywords case-insensitive
mateiz Mar 21, 2014
d780983
Add asCode function for dumping raw tree representations.
marmbrus Mar 21, 2014
646e554
Fix to Stage UI to display numbers on progress bar
emtiazahmed Mar 22, 2014
d348362
fix code style problems and rewrite the test suite for simplicity.
yinxusen Mar 22, 2014
abf6714
SPARK-1254. Supplemental fix for HTTPS on Maven Central
srowen Mar 23, 2014
57a4379
[SPARK-1292] In-memory columnar representation for Spark SQL
liancheng Mar 23, 2014
8265dc7
Fixed coding style issues in Spark SQL
liancheng Mar 23, 2014
80c2968
[SPARK-1212] Adding sparse data support and update KMeans
mengxr Mar 24, 2014
eae90e4
refine code documentation
yinxusen Mar 24, 2014
839bd3f
remove the use of commons-io
yinxusen Mar 24, 2014
21109fb
SPARK-1144 Added license and RAT to check licenses.
ScrapCodes Mar 24, 2014
56db8a2
HOT FIX: Exclude test files from RAT
pwendell Mar 24, 2014
8043b7b
SPARK-1294 Fix resolution of uppercase field names using a HiveContext.
marmbrus Mar 25, 2014
dc126f2
SPARK-1094 Support MiMa for reporting binary compatibility across ve…
pwendell Mar 25, 2014
5140598
SPARK-1128: set hadoop task properties when constructing HadoopRDD
CodingCat Mar 25, 2014
b637f2d
Unify the logic for column pruning, projection, and filtering of tabl…
marmbrus Mar 25, 2014
007a733
SPARK-1286: Make usage of spark-env.sh idempotent
aarondav Mar 25, 2014
05ed628
move wholefile interface from MLUtils to MLContext
yinxusen Mar 25, 2014
134ace7
Add more hive compatibility tests to whitelist
marmbrus Mar 25, 2014
71d4ed2
SPARK-1316. Remove use of Commons IO
srowen Mar 25, 2014
f8111ea
SPARK-1319: Fix scheduler to account for tasks using > 1 CPUs.
shivaram Mar 25, 2014
8237df8
Avoid Option while generating call site
witgo Mar 25, 2014
f87dab8
fix logic error
yinxusen Mar 26, 2014
4f7d547
Initial experimentation with Travis CI configuration
marmbrus Mar 26, 2014
b859853
SPARK-1321 Use Guava's top k implementation rather than our BoundedPr…
rxin Mar 26, 2014
3b69987
change from Java code to Scala
yinxusen Mar 26, 2014
a0853a3
SPARK-1322, top in pyspark should sort result in descending order.
ScrapCodes Mar 26, 2014
345825d
Unified package definition format in Spark SQL
liancheng Mar 26, 2014
b0ea02a
modify scala doc, and add space after 'if'
yinxusen Mar 26, 2014
32cbdfd
[SQL] Un-ignore a test that is now passing.
marmbrus Mar 27, 2014
e15e574
[SQL] Add a custom serializer for maps since they do not have a no-ar…
marmbrus Mar 27, 2014
be6d96c
SPARK-1324: SparkUI Should Not Bind to SPARK_PUBLIC_DNS
pwendell Mar 27, 2014
3e63d98
Spark 1095: Adding explicit return types to all public methods
NirmalReddy Mar 27, 2014
1fa48d9
SPARK-1325. The maven build error for Spark Tools
srowen Mar 27, 2014
4ed60d1
rebase to the latest trunk to merge
yinxusen Mar 27, 2014
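
For context, the yinxusen commits above implement the PR's feature: an input API that reads a directory of many small text files as whole-file records instead of per-line records, first added to MLUtils and later moved to MLContext (commit 05ed628). A minimal usage sketch follows; the MLContext wrapper, the smallTextFiles name, and the RDD[(String, String)] result type are assumptions inferred from the commit messages, not confirmed by the truncated diff below.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.MLContext  // assumed package and class, per commit 05ed628

val sc  = new SparkContext("local", "small-files-demo")
val mlc = new MLContext(sc)  // assumed wrapper around SparkContext

// Each record is a (filePath, fileContent) pair, so a small file is never
// split into per-line records the way sc.textFile would split it.
val files = mlc.smallTextFiles("hdfs://namenode:9000/user/data/small-files")

// Example: report the content length of each file.
files.map { case (path, content) => (path, content.length) }
  .collect()
  .foreach(println)

Reading whole files as single records is the point of SPARK-1133; the same shape later shipped in Spark core as SparkContext.wholeTextFiles.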
Note: this diff is too large to display in full; only the first 3000 changed files are loaded.
3 changes: 3 additions & 0 deletions .gitignore
@@ -7,6 +7,7 @@
sbt/*.jar
.settings
.cache
.mima-excludes
/build/
work/
out/
@@ -45,3 +46,5 @@ dist/
spark-*-bin.tar.gz
unit-tests.log
/lib/
rat-results.txt
mllib/build/
41 changes: 41 additions & 0 deletions .rat-excludes
@@ -0,0 +1,41 @@
target
.gitignore
.project
.classpath
.mima-excludes
.rat-excludes
.*md
derby.log
TAGS
RELEASE
control
docs
fairscheduler.xml.template
log4j.properties
log4j.properties.template
metrics.properties.template
slaves
spark-env.sh
spark-env.sh.template
log4j-defaults.properties
sorttable.js
.*txt
.*data
.*log
cloudpickle.py
join.py
SparkExprTyper.scala
SparkILoop.scala
SparkILoopInit.scala
SparkIMain.scala
SparkImports.scala
SparkJLineCompletion.scala
SparkJLineReader.scala
SparkMemberHandlers.scala
sbt
sbt-launch-lib.bash
plugins.sbt
work
.*\.q
golden
test.out/*
37 changes: 37 additions & 0 deletions .travis.yml
@@ -0,0 +1,37 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

language: scala
scala:
  - "2.10.3"
jdk:
  - oraclejdk7
env:
  matrix:
    - TEST=sql/test
    - TEST=hive/test
    - TEST=catalyst/test
    - TEST=streaming/test
    - TEST=graphx/test
    - TEST=mllib/test
    - TEST=graphx/test
    - TEST=bagel/test
cache:
  directories:
    - $HOME/.m2
    - $HOME/.ivy2
    - $HOME/.sbt
script:
  - "sbt ++$TRAVIS_SCALA_VERSION scalastyle $TEST"
11 changes: 10 additions & 1 deletion NOTICE
@@ -1,5 +1,14 @@
Apache Spark
Copyright 2013 The Apache Software Foundation.
Copyright 2014 The Apache Software Foundation.

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

In addition, this product includes:

- JUnit (http://www.junit.org) is a testing framework for Java. We included it
under the terms of the Eclipse Public License v1.0.

- JTransforms (https://sites.google.com/site/piotrwendykier/software/jtransforms)
provides fast transforms in Java. It is tri-licensed, and we included it under
the terms of the Mozilla Public License v1.1.
5 changes: 5 additions & 0 deletions assembly/pom.xml
@@ -79,6 +79,11 @@
<artifactId>spark-graphx_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>net.sf.py4j</groupId>
<artifactId>py4j</artifactId>
37 changes: 29 additions & 8 deletions bin/compute-classpath.sh
@@ -25,31 +25,49 @@ SCALA_VERSION=2.10
# Figure out where Spark is installed
FWDIR="$(cd `dirname $0`/..; pwd)"

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
. $FWDIR/conf/spark-env.sh
fi
. $FWDIR/bin/load-spark-env.sh

# Build up classpath
CLASSPATH="$SPARK_CLASSPATH:$FWDIR/conf"

# Support for interacting with Hive. Since hive pulls in a lot of dependencies that might break
# existing Spark applications, it is not included in the standard spark assembly. Instead, we only
# include it in the classpath if the user has explicitly requested it by running "sbt hive/assembly"
# Hopefully we will find a way to avoid uber-jars entirely and deploy only the needed packages in
# the future.
if [ -f "$FWDIR"/sql/hive/target/scala-$SCALA_VERSION/spark-hive-assembly-*.jar ]; then
echo "Hive assembly found, including hive support. If this isn't desired run sbt hive/clean."

# Datanucleus jars do not work if only included in the uberjar as plugin.xml metadata is lost.
DATANUCLEUSJARS=$(JARS=("$FWDIR/lib_managed/jars"/datanucleus-*.jar); IFS=:; echo "${JARS[*]}")
CLASSPATH=$CLASSPATH:$DATANUCLEUSJARS

ASSEMBLY_DIR="$FWDIR/sql/hive/target/scala-$SCALA_VERSION/"
else
ASSEMBLY_DIR="$FWDIR/assembly/target/scala-$SCALA_VERSION/"
fi

# First check if we have a dependencies jar. If so, include binary classes with the deps jar
if [ -f "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar ]; then
if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/repl/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/mllib/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/bagel/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/graphx/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/streaming/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/tools/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"

DEPS_ASSEMBLY_JAR=`ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar`
DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
CLASSPATH="$CLASSPATH:$DEPS_ASSEMBLY_JAR"
else
# Else use spark-assembly jar from either RELEASE or assembly directory
if [ -f "$FWDIR/RELEASE" ]; then
ASSEMBLY_JAR=`ls "$FWDIR"/jars/spark-assembly*.jar`
ASSEMBLY_JAR=`ls "$FWDIR"/jars/spark*-assembly*.jar`
else
ASSEMBLY_JAR=`ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*.jar`
ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*.jar`
fi
CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR"
fi
@@ -62,6 +80,9 @@ if [[ $SPARK_TESTING == 1 ]]; then
CLASSPATH="$CLASSPATH:$FWDIR/bagel/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/graphx/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/streaming/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/test-classes"
fi

# Add hadoop conf dir if given -- otherwise FileSystem.*, etc fail !
35 changes: 35 additions & 0 deletions bin/load-spark-env.sh
@@ -0,0 +1,35 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This script loads spark-env.sh if it exists, and ensures it is only loaded once.
# spark-env.sh is loaded from SPARK_CONF_DIR if set, or within the current directory's
# conf/ subdirectory.

if [ -z "$SPARK_ENV_LOADED" ]; then
export SPARK_ENV_LOADED=1

# Returns the parent of the directory this script lives in.
parent_dir="$(cd `dirname $0`/..; pwd)"

use_conf_dir=${SPARK_CONF_DIR:-"$parent_dir/conf"}

if [ -f "${use_conf_dir}/spark-env.sh" ]; then
. "${use_conf_dir}/spark-env.sh"
fi
fi
5 changes: 1 addition & 4 deletions bin/pyspark
@@ -36,10 +36,7 @@ if [ ! -f "$FWDIR/RELEASE" ]; then
fi
fi

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
. $FWDIR/conf/spark-env.sh
fi
. $FWDIR/bin/load-spark-env.sh

# Figure out which Python executable to use
if [ -z "$PYSPARK_PYTHON" ] ; then
5 changes: 1 addition & 4 deletions bin/run-example
@@ -30,10 +30,7 @@ FWDIR="$(cd `dirname $0`/..; pwd)"
# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
. $FWDIR/conf/spark-env.sh
fi
. $FWDIR/bin/load-spark-env.sh

if [ -z "$1" ]; then
echo "Usage: run-example <example-class> [<args>]" >&2
8 changes: 2 additions & 6 deletions bin/spark-class
@@ -30,10 +30,7 @@ FWDIR="$(cd `dirname $0`/..; pwd)"
# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
. $FWDIR/conf/spark-env.sh
fi
. $FWDIR/bin/load-spark-env.sh

if [ -z "$1" ]; then
echo "Usage: spark-class <class> [<args>]" >&2
@@ -137,8 +134,7 @@ fi

# Compute classpath using external script
CLASSPATH=`$FWDIR/bin/compute-classpath.sh`

if [ "$1" == "org.apache.spark.tools.JavaAPICompletenessChecker" ]; then
if [[ "$1" =~ org.apache.spark.tools.* ]]; then
CLASSPATH="$CLASSPATH:$SPARK_TOOLS_JAR"
fi

4 changes: 1 addition & 3 deletions bin/spark-shell
@@ -81,9 +81,7 @@ done
# Set MASTER from spark-env if possible
DEFAULT_SPARK_MASTER_PORT=7077
if [ -z "$MASTER" ]; then
if [ -e "$FWDIR/conf/spark-env.sh" ]; then
. "$FWDIR/conf/spark-env.sh"
fi
. $FWDIR/bin/load-spark-env.sh
if [ "x" != "x$SPARK_MASTER_IP" ]; then
if [ "y" != "y$SPARK_MASTER_PORT" ]; then
SPARK_MASTER_PORT="${SPARK_MASTER_PORT}"
5 changes: 0 additions & 5 deletions core/pom.xml
@@ -200,11 +200,6 @@
<artifactId>derby</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.binary.version}</artifactId>
2 changes: 0 additions & 2 deletions core/src/main/scala/org/apache/spark/Aggregator.scala
@@ -17,8 +17,6 @@

package org.apache.spark

import scala.{Option, deprecated}

import org.apache.spark.util.collection.{AppendOnlyMap, ExternalAppendOnlyMap}

/**