
Commit 514ee93

srowen authored and pwendell committed
SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs
While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.

Author: Sean Owen <sowen@cloudera.com>

Closes #653 from srowen/SPARK-1727 and squashes the following commits:

6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
99966a9 [Sean Owen] Update issue tracker URL in docs
23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs

(cherry picked from commit 25ad8f9)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
1 parent 8cfebf5 commit 514ee93

17 files changed: +97 -68 lines changed
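One of the squashed commits above adds a Scala Naive Bayes example to the MLlib docs; that file is not among the hunks excerpted below. For orientation only, here is a minimal, hypothetical sketch of what a Naive Bayes call looks like against the MLlib API of this era — the inline data is made up and is not the example data file the commit message refers to, and an existing SparkContext named `sc` is assumed:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Made-up training points; the commit's real example reads an existing example data file.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0))))
    val model = NaiveBayes.train(training, 1.0)      // 1.0 is the smoothing parameter lambda
    println(model.predict(Vectors.dense(0.0, 1.0)))  // expected to predict label 1.0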

docs/README.md

Lines changed: 5 additions & 4 deletions
@@ -14,9 +14,10 @@ The markdown code can be compiled to HTML using the
 [Jekyll tool](http://jekyllrb.com).
 To use the `jekyll` command, you will need to have Jekyll installed.
 The easiest way to do this is via a Ruby Gem, see the
-[jekyll installation instructions](http://jekyllrb.com/docs/installation).
-Compiling the site with Jekyll will create a directory called
-_site containing index.html as well as the rest of the compiled files.
+[jekyll installation instructions](http://jekyllrb.com/docs/installation).
+If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
+Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
+`_site` containing index.html as well as the rest of the compiled files.
 
 You can modify the default Jekyll build as follows:
 
@@ -44,6 +45,6 @@ You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PR
 
 Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
 
-When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
 
 NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.

docs/_config.yml

Lines changed: 1 addition & 1 deletion
@@ -8,5 +8,5 @@ SPARK_VERSION_SHORT: 1.0.0
 SCALA_BINARY_VERSION: "2.10"
 SCALA_VERSION: "2.10.4"
 MESOS_VERSION: 0.13.0
-SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
+SPARK_ISSUE_TRACKER_URL: https://issues.apache.org/jira/browse/SPARK
 SPARK_GITHUB_URL: https://github.com/apache/spark

docs/bagel-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ import org.apache.spark.bagel.Bagel._
 Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.
 
 {% highlight scala %}
-val input = sc.textFile("pagerank_data.txt")
+val input = sc.textFile("data/pagerank_data.txt")
 
 val numVerts = input.count()
 
docs/cluster-overview.md

Lines changed: 1 addition & 1 deletion
@@ -181,7 +181,7 @@ The following table summarizes terms you'll see used to refer to cluster concept
 <td>Distinguishes where the driver process runs. In "cluster" mode, the framework launches
 the driver inside of the cluster. In "client" mode, the submitter launches the driver
 outside of the cluster.</td>
-<tr>
+</tr>
 <tr>
 <td>Worker node</td>
 <td>Any node that can run application code in the cluster</td>

docs/configuration.md

Lines changed: 5 additions & 5 deletions
@@ -26,10 +26,10 @@ application name), as well as arbitrary key-value pairs through the `set()` meth
 initialize an application as follows:
 
 {% highlight scala %}
-val conf = new SparkConf()
-  .setMaster("local")
-  .setAppName("My application")
-  .set("spark.executor.memory", "1g")
+val conf = new SparkConf().
+  setMaster("local").
+  setAppName("My application").
+  set("spark.executor.memory", "1g")
 val sc = new SparkContext(conf)
 {% endhighlight %}
 
@@ -318,7 +318,7 @@ Apart from these, the following properties are also available, and may be useful
 When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
 objects to prevent writing redundant data, however that stops garbage collection of those
 objects. By calling 'reset' you flush that info from the serializer, and allow old
-objects to be collected. To turn off this periodic reset set it to a value of <= 0.
+objects to be collected. To turn off this periodic reset set it to a value &lt;= 0.
 By default it will reset the serializer every 10,000 objects.
 </td>
 </tr>
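As a quick check on the first hunk above: in compiled code the chained `SparkConf` builds the same configuration whether the dots end a line or begin the next. A minimal sketch, assuming a plain local-mode program, of reading back a value from the resulting conf:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().
      setMaster("local").
      setAppName("My application").
      set("spark.executor.memory", "1g")
    println(conf.get("spark.executor.memory"))  // "1g"
    val sc = new SparkContext(conf)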

docs/java-programming-guide.md

Lines changed: 10 additions & 10 deletions
@@ -55,7 +55,7 @@ classes. RDD methods like `map` are overloaded by specialized `PairFunction`
 and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
 types. Common methods like `filter` and `sample` are implemented by
 each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
-etc (this acheives the "same-result-type" principle used by the [Scala collections
+etc (this achieves the "same-result-type" principle used by the [Scala collections
 framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
 
 ## Function Interfaces
 
@@ -102,7 +102,7 @@ the following changes:
 `Function` classes will need to use `implements` rather than `extends`.
 * Certain transformation functions now have multiple versions depending
 on the return type. In Spark core, the map functions (`map`, `flatMap`, and
-`mapPartitons`) have type-specific versions, e.g.
+`mapPartitions`) have type-specific versions, e.g.
 [`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
 and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
 Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
 
@@ -115,11 +115,11 @@ As an example, we will implement word count using the Java API.
 import org.apache.spark.api.java.*;
 import org.apache.spark.api.java.function.*;
 
-JavaSparkContext sc = new JavaSparkContext(...);
-JavaRDD<String> lines = ctx.textFile("hdfs://...");
+JavaSparkContext jsc = new JavaSparkContext(...);
+JavaRDD<String> lines = jsc.textFile("hdfs://...");
 JavaRDD<String> words = lines.flatMap(
   new FlatMapFunction<String, String>() {
-    public Iterable<String> call(String s) {
+    @Override public Iterable<String> call(String s) {
       return Arrays.asList(s.split(" "));
     }
   }
 
@@ -140,10 +140,10 @@ Here, the `FlatMapFunction` was created inline; another option is to subclass
 
 {% highlight java %}
 class Split extends FlatMapFunction<String, String> {
-  public Iterable<String> call(String s) {
+  @Override public Iterable<String> call(String s) {
     return Arrays.asList(s.split(" "));
   }
-);
+}
 JavaRDD<String> words = lines.flatMap(new Split());
 {% endhighlight %}
 
@@ -162,8 +162,8 @@ Continuing with the word count example, we map each word to a `(word, 1)` pair:
 import scala.Tuple2;
 JavaPairRDD<String, Integer> ones = words.mapToPair(
   new PairFunction<String, String, Integer>() {
-    public Tuple2<String, Integer> call(String s) {
-      return new Tuple2(s, 1);
+    @Override public Tuple2<String, Integer> call(String s) {
+      return new Tuple2<String, Integer>(s, 1);
     }
   }
 );
 
@@ -178,7 +178,7 @@ occurrences of each word:
 {% highlight java %}
 JavaPairRDD<String, Integer> counts = ones.reduceByKey(
   new Function2<Integer, Integer, Integer>() {
-    public Integer call(Integer i1, Integer i2) {
+    @Override public Integer call(Integer i1, Integer i2) {
       return i1 + i2;
     }
   }
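For comparison with the Java word count being corrected above, a minimal sketch of the same pipeline in Scala, assuming a SparkContext named `sc` (the `hdfs://...` path is kept as the same placeholder the guide uses):

    // Split lines into words, pair each word with 1, and sum the counts per word.
    val lines = sc.textFile("hdfs://...")
    val counts = lines.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.collect().foreach(println)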

docs/mllib-basics.md

Lines changed: 9 additions & 5 deletions
@@ -9,7 +9,7 @@ title: <a href="mllib-guide.html">MLlib</a> - Basics
 MLlib supports local vectors and matrices stored on a single machine,
 as well as distributed matrices backed by one or more RDDs.
 In the current implementation, local vectors and matrices are simple data models
-to serve public interfaces. The underly linear algebra operations are provided by
+to serve public interfaces. The underlying linear algebra operations are provided by
 [Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
 A training example used in supervised learning is called "labeled point" in MLlib.
 
@@ -205,7 +205,7 @@ import org.apache.spark.mllib.regression.LabeledPoint;
 import org.apache.spark.mllib.util.MLUtils;
 import org.apache.spark.rdd.RDDimport;
 
-RDD[LabeledPoint] training = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
+RDD<LabeledPoint> training = MLUtils.loadLibSVMData(jsc, "mllib/data/sample_libsvm_data.txt");
 {% endhighlight %}
 </div>
 </div>
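The Scala counterpart to the Java call fixed above, sketched with the same method name and data path that appear in these docs at this point in time, and assuming a SparkContext named `sc`:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    // Load labeled points in LIBSVM format into an RDD.
    val training: RDD[LabeledPoint] =
      MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
    println(training.count())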
@@ -307,6 +307,7 @@ A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.R
 created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
 
 {% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.Vector;
 import org.apache.spark.mllib.linalg.distributed.RowMatrix;
 
@@ -348,10 +349,10 @@ val mat: RowMatrix = ... // a RowMatrix
 val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
 println(summary.mean) // a dense vector containing the mean value for each column
 println(summary.variance) // column-wise variance
-println(summary.numNonzers) // number of nonzeros in each column
+println(summary.numNonzeros) // number of nonzeros in each column
 
 // Compute the covariance matrix.
-val Cov: Matrix = mat.computeCovariance()
+val cov: Matrix = mat.computeCovariance()
 {% endhighlight %}
 </div>
 </div>
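The Scala hunk above starts from `val mat: RowMatrix = ...`; a minimal sketch, with made-up rows and an assumed SparkContext `sc`, of how such a matrix can be built and summarized:

    import org.apache.spark.mllib.linalg.{Matrix, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

    // Two made-up rows; any RDD[Vector] works here.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 3.0),
      Vectors.dense(2.0, 5.0, 0.0)))
    val mat: RowMatrix = new RowMatrix(rows)

    val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    println(summary.mean)        // per-column means
    println(summary.numNonzeros) // per-column nonzero counts
    val cov: Matrix = mat.computeCovariance()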
@@ -397,11 +398,12 @@ wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `Row
 its row indices.
 
 {% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.distributed.IndexedRow;
 import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
 import org.apache.spark.mllib.linalg.distributed.RowMatrix;
 
-JavaRDD[IndexedRow] rows = ... // a JavaRDD of indexed rows
+JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
 // Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
 IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
 
@@ -458,7 +460,9 @@ wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a
 with sparse rows by calling `toIndexedRowMatrix`.
 
 {% highlight scala %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
 import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
 
 JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries

docs/mllib-clustering.md

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ models are trained for each cluster).
 MLlib supports
 [k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
 the most commonly used clustering algorithms that clusters the data points into
-predfined number of clusters. The MLlib implementation includes a parallelized
+predefined number of clusters. The MLlib implementation includes a parallelized
 variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
 called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
 The implementation in MLlib has the following parameters:
 
@@ -30,7 +30,7 @@ initialization via k-means\|\|.
 * *runs* is the number of times to run the k-means algorithm (k-means is not
 guaranteed to find a globally optimal solution, and when run multiple times on
 a given dataset, the algorithm returns the best clustering result).
-* *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
+* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
 * *epsilon* determines the distance threshold within which we consider k-means to have converged.
 
 ## Examples
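To tie the parameter list in the hunk above to the API, a minimal sketch with made-up two-dimensional points and an assumed SparkContext `sc`; *k*, *maxIterations* and *runs* map directly onto the arguments of `KMeans.train`:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Two obvious clusters of made-up points.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    val model = KMeans.train(points, 2, 20, 5)  // k = 2, maxIterations = 20, runs = 5
    println(model.clusterCenters.mkString(", "))
    println(model.computeCost(points))          // within-cluster sum of squared distances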

docs/mllib-collaborative-filtering.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ val ratesAndPreds = ratings.map{
 }.join(predictions)
 val MSE = ratesAndPreds.map{
   case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
-}.reduce(_ + _)/ratesAndPreds.count
+}.mean()
 println("Mean Squared Error = " + MSE)
 {% endhighlight %}
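The change above swaps a hand-rolled sum-divided-by-count for `mean()`, which an `RDD[Double]` picks up via `DoubleRDDFunctions`; both forms compute the average of the squared errors. A tiny sketch with made-up values, assuming a SparkContext `sc`:

    val squaredErrors = sc.parallelize(Seq(1.0, 4.0, 9.0))
    val viaMean     = squaredErrors.mean()                               // ~4.667
    val viaSumCount = squaredErrors.reduce(_ + _) / squaredErrors.count  // same average, written by hand
    println(viaMean + " vs " + viaSumCount)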

docs/mllib-decision-tree.md

Lines changed: 4 additions & 4 deletions
@@ -83,19 +83,19 @@ Section 9.2.4 in
 [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
 details). For example, for a binary classification problem with one categorical feature with three
 categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
-features are orded as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
+features are ordered as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
 and A , B \| C where \| denotes the split.
 
 ### Stopping rule
 
 The recursive tree construction is stopped at a node when one of the two conditions is met:
 
-1. The node depth is equal to the `maxDepth` training parammeter
+1. The node depth is equal to the `maxDepth` training parameter
 2. No split candidate leads to an information gain at the node.
 
 ### Practical limitations
 
-1. The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)*
+1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)*
 in memory for aggregating histograms over partitions. The current implementation might not scale
 to very deep trees since the memory requirement grows exponentially with tree depth.
 2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for
 
@@ -178,7 +178,7 @@ val valuesAndPreds = parsedData.map { point =>
   val prediction = model.predict(point.features)
   (point.label, prediction)
 }
-val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
+val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.mean()
 println("training Mean Squared Error = " + MSE)
 {% endhighlight %}
 </div>
