
Commit 514ee93

srowen authored and pwendell committed
SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs
While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.

Author: Sean Owen <sowen@cloudera.com>

Closes #653 from srowen/SPARK-1727 and squashes the following commits:

6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
99966a9 [Sean Owen] Update issue tracker URL in docs
23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs

(cherry picked from commit 25ad8f9)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
1 parent 8cfebf5 commit 514ee93

17 files changed: +97 -68 lines changed
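One of the squashed commits above adds a Scala Naive Bayes example to the MLlib docs; that file is not among the hunks excerpted below. For orientation only, here is a minimal, hypothetical sketch of what a Naive Bayes call looks like against the MLlib API of this era — the inline data is made up and is not the example data file the commit message refers to, and an existing SparkContext named `sc` is assumed:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Made-up training points; the commit's real example reads an existing example data file.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0))))
    val model = NaiveBayes.train(training, 1.0)      // 1.0 is the smoothing parameter lambda
    println(model.predict(Vectors.dense(0.0, 1.0)))  // expected to predict label 1.0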

docs/README.md

Lines changed: 5 additions & 4 deletions
@@ -14,9 +14,10 @@ The markdown code can be compiled to HTML using the
 [Jekyll tool](http://jekyllrb.com).
 To use the `jekyll` command, you will need to have Jekyll installed.
 The easiest way to do this is via a Ruby Gem, see the
-[jekyll installation instructions](http://jekyllrb.com/docs/installation).
-Compiling the site with Jekyll will create a directory called
-_site containing index.html as well as the rest of the compiled files.
+[jekyll installation instructions](http://jekyllrb.com/docs/installation).
+If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
+Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
+`_site` containing index.html as well as the rest of the compiled files.
 
 You can modify the default Jekyll build as follows:
 
@@ -44,6 +45,6 @@ You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PR
 
 Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
 
-When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
 
 NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.

docs/_config.yml

Lines changed: 1 addition & 1 deletion
@@ -8,5 +8,5 @@ SPARK_VERSION_SHORT: 1.0.0
 SCALA_BINARY_VERSION: "2.10"
 SCALA_VERSION: "2.10.4"
 MESOS_VERSION: 0.13.0
-SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
+SPARK_ISSUE_TRACKER_URL: https://issues.apache.org/jira/browse/SPARK
 SPARK_GITHUB_URL: https://github.com/apache/spark

docs/bagel-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ import org.apache.spark.bagel.Bagel._
 Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.
 
 {% highlight scala %}
-val input = sc.textFile("pagerank_data.txt")
+val input = sc.textFile("data/pagerank_data.txt")
 
 val numVerts = input.count()
 
docs/cluster-overview.md

Lines changed: 1 addition & 1 deletion
@@ -181,7 +181,7 @@ The following table summarizes terms you'll see used to refer to cluster concept
 <td>Distinguishes where the driver process runs. In "cluster" mode, the framework launches
 the driver inside of the cluster. In "client" mode, the submitter launches the driver
 outside of the cluster.</td>
-<tr>
+</tr>
 <tr>
 <td>Worker node</td>
 <td>Any node that can run application code in the cluster</td>

docs/configuration.md

Lines changed: 5 additions & 5 deletions
@@ -26,10 +26,10 @@ application name), as well as arbitrary key-value pairs through the `set()` meth
 initialize an application as follows:
 
 {% highlight scala %}
-val conf = new SparkConf()
-  .setMaster("local")
-  .setAppName("My application")
-  .set("spark.executor.memory", "1g")
+val conf = new SparkConf().
+  setMaster("local").
+  setAppName("My application").
+  set("spark.executor.memory", "1g")
 val sc = new SparkContext(conf)
 {% endhighlight %}
 
@@ -318,7 +318,7 @@ Apart from these, the following properties are also available, and may be useful
 When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
 objects to prevent writing redundant data, however that stops garbage collection of those
 objects. By calling 'reset' you flush that info from the serializer, and allow old
-objects to be collected. To turn off this periodic reset set it to a value of <= 0.
+objects to be collected. To turn off this periodic reset set it to a value &lt;= 0.
 By default it will reset the serializer every 10,000 objects.
 </td>
 </tr>
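As a quick check on the first hunk above: in compiled code the chained `SparkConf` builds the same configuration whether the dots end a line or begin the next. A minimal sketch, assuming a plain local-mode program, of reading back a value from the resulting conf:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().
      setMaster("local").
      setAppName("My application").
      set("spark.executor.memory", "1g")
    println(conf.get("spark.executor.memory"))  // "1g"
    val sc = new SparkContext(conf)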

docs/java-programming-guide.md

Lines changed: 10 additions & 10 deletions
@@ -55,7 +55,7 @@ classes. RDD methods like `map` are overloaded by specialized `PairFunction`
 and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
 types. Common methods like `filter` and `sample` are implemented by
 each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
-etc (this acheives the "same-result-type" principle used by the [Scala collections
+etc (this achieves the "same-result-type" principle used by the [Scala collections
 framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
 
 ## Function Interfaces
 
@@ -102,7 +102,7 @@ the following changes:
 `Function` classes will need to use `implements` rather than `extends`.
 * Certain transformation functions now have multiple versions depending
 on the return type. In Spark core, the map functions (`map`, `flatMap`, and
-`mapPartitons`) have type-specific versions, e.g.
+`mapPartitions`) have type-specific versions, e.g.
 [`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
 and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
 Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
 
@@ -115,11 +115,11 @@ As an example, we will implement word count using the Java API.
 import org.apache.spark.api.java.*;
 import org.apache.spark.api.java.function.*;
 
-JavaSparkContext sc = new JavaSparkContext(...);
-JavaRDD<String> lines = ctx.textFile("hdfs://...");
+JavaSparkContext jsc = new JavaSparkContext(...);
+JavaRDD<String> lines = jsc.textFile("hdfs://...");
 JavaRDD<String> words = lines.flatMap(
   new FlatMapFunction<String, String>() {
-    public Iterable<String> call(String s) {
+    @Override public Iterable<String> call(String s) {
       return Arrays.asList(s.split(" "));
     }
   }
 
@@ -140,10 +140,10 @@ Here, the `FlatMapFunction` was created inline; another option is to subclass
 
 {% highlight java %}
 class Split extends FlatMapFunction<String, String> {
-  public Iterable<String> call(String s) {
+  @Override public Iterable<String> call(String s) {
     return Arrays.asList(s.split(" "));
   }
-);
+}
 JavaRDD<String> words = lines.flatMap(new Split());
 {% endhighlight %}
 
@@ -162,8 +162,8 @@ Continuing with the word count example, we map each word to a `(word, 1)` pair:
 import scala.Tuple2;
 JavaPairRDD<String, Integer> ones = words.mapToPair(
   new PairFunction<String, String, Integer>() {
-    public Tuple2<String, Integer> call(String s) {
-      return new Tuple2(s, 1);
+    @Override public Tuple2<String, Integer> call(String s) {
+      return new Tuple2<String, Integer>(s, 1);
     }
   }
 );
 
@@ -178,7 +178,7 @@ occurrences of each word:
 {% highlight java %}
 JavaPairRDD<String, Integer> counts = ones.reduceByKey(
   new Function2<Integer, Integer, Integer>() {
-    public Integer call(Integer i1, Integer i2) {
+    @Override public Integer call(Integer i1, Integer i2) {
       return i1 + i2;
     }
   }
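For comparison with the Java word count being corrected above, a minimal sketch of the same pipeline in Scala, assuming a SparkContext named `sc` (the `hdfs://...` path is kept as the same placeholder the guide uses):

    // Split lines into words, pair each word with 1, and sum the counts per word.
    val lines = sc.textFile("hdfs://...")
    val counts = lines.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.collect().foreach(println)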

docs/mllib-basics.md

Lines changed: 9 additions & 5 deletions
@@ -9,7 +9,7 @@ title: <a href="mllib-guide.html">MLlib</a> - Basics
 MLlib supports local vectors and matrices stored on a single machine,
 as well as distributed matrices backed by one or more RDDs.
 In the current implementation, local vectors and matrices are simple data models
-to serve public interfaces. The underly linear algebra operations are provided by
+to serve public interfaces. The underlying linear algebra operations are provided by
 [Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
 A training example used in supervised learning is called "labeled point" in MLlib.
 
@@ -205,7 +205,7 @@ import org.apache.spark.mllib.regression.LabeledPoint;
 import org.apache.spark.mllib.util.MLUtils;
 import org.apache.spark.rdd.RDDimport;
 
-RDD[LabeledPoint] training = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
+RDD<LabeledPoint> training = MLUtils.loadLibSVMData(jsc, "mllib/data/sample_libsvm_data.txt");
 {% endhighlight %}
 </div>
 </div>
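The Scala counterpart to the Java call fixed above, sketched with the same method name and data path that appear in these docs at this point in time, and assuming a SparkContext named `sc`:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    // Load labeled points in LIBSVM format into an RDD.
    val training: RDD[LabeledPoint] =
      MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
    println(training.count())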
@@ -307,6 +307,7 @@ A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.R
 created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
 
 {% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.Vector;
 import org.apache.spark.mllib.linalg.distributed.RowMatrix;
 
@@ -348,10 +349,10 @@ val mat: RowMatrix = ... // a RowMatrix
 val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
 println(summary.mean) // a dense vector containing the mean value for each column
 println(summary.variance) // column-wise variance
-println(summary.numNonzers) // number of nonzeros in each column
+println(summary.numNonzeros) // number of nonzeros in each column
 
 // Compute the covariance matrix.
-val Cov: Matrix = mat.computeCovariance()
+val cov: Matrix = mat.computeCovariance()
 {% endhighlight %}
 </div>
 </div>
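The Scala hunk above starts from `val mat: RowMatrix = ...`; a minimal sketch, with made-up rows and an assumed SparkContext `sc`, of how such a matrix can be built and summarized:

    import org.apache.spark.mllib.linalg.{Matrix, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

    // Two made-up rows; any RDD[Vector] works here.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 3.0),
      Vectors.dense(2.0, 5.0, 0.0)))
    val mat: RowMatrix = new RowMatrix(rows)

    val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
    println(summary.mean)        // per-column means
    println(summary.numNonzeros) // per-column nonzero counts
    val cov: Matrix = mat.computeCovariance()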
@@ -397,11 +398,12 @@ wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `Row
 its row indices.
 
 {% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.distributed.IndexedRow;
 import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
 import org.apache.spark.mllib.linalg.distributed.RowMatrix;
 
-JavaRDD[IndexedRow] rows = ... // a JavaRDD of indexed rows
+JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
 // Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
 IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
 
@@ -458,7 +460,9 @@ wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a
 with sparse rows by calling `toIndexedRowMatrix`.
 
 {% highlight scala %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
 import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
 
 JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries

docs/mllib-clustering.md

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ models are trained for each cluster).
 MLlib supports
 [k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
 the most commonly used clustering algorithms that clusters the data points into
-predfined number of clusters. The MLlib implementation includes a parallelized
+predefined number of clusters. The MLlib implementation includes a parallelized
 variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
 called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
 The implementation in MLlib has the following parameters:
 
@@ -30,7 +30,7 @@ initialization via k-means\|\|.
 * *runs* is the number of times to run the k-means algorithm (k-means is not
 guaranteed to find a globally optimal solution, and when run multiple times on
 a given dataset, the algorithm returns the best clustering result).
-* *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
+* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
 * *epsilon* determines the distance threshold within which we consider k-means to have converged.
 
 ## Examples
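To tie the parameter list in the hunk above to the API, a minimal sketch with made-up two-dimensional points and an assumed SparkContext `sc`; *k*, *maxIterations* and *runs* map directly onto the arguments of `KMeans.train`:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Two obvious clusters of made-up points.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    val model = KMeans.train(points, 2, 20, 5)  // k = 2, maxIterations = 20, runs = 5
    println(model.clusterCenters.mkString(", "))
    println(model.computeCost(points))          // within-cluster sum of squared distances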

docs/mllib-collaborative-filtering.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ val ratesAndPreds = ratings.map{
 }.join(predictions)
 val MSE = ratesAndPreds.map{
   case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
-}.reduce(_ + _)/ratesAndPreds.count
+}.mean()
 println("Mean Squared Error = " + MSE)
 {% endhighlight %}
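The change above swaps a hand-rolled sum-divided-by-count for `mean()`, which an `RDD[Double]` picks up via `DoubleRDDFunctions`; both forms compute the average of the squared errors. A tiny sketch with made-up values, assuming a SparkContext `sc`:

    val squaredErrors = sc.parallelize(Seq(1.0, 4.0, 9.0))
    val viaMean     = squaredErrors.mean()                               // ~4.667
    val viaSumCount = squaredErrors.reduce(_ + _) / squaredErrors.count  // same average, written by hand
    println(viaMean + " vs " + viaSumCount)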

docs/mllib-decision-tree.md

Lines changed: 4 additions & 4 deletions
@@ -83,19 +83,19 @@ Section 9.2.4 in
 [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
 details). For example, for a binary classification problem with one categorical feature with three
 categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
-features are orded as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
+features are ordered as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
 and A , B \| C where \| denotes the split.
 
 ### Stopping rule
 
 The recursive tree construction is stopped at a node when one of the two conditions is met:
 
-1. The node depth is equal to the `maxDepth` training parammeter
+1. The node depth is equal to the `maxDepth` training parameter
 2. No split candidate leads to an information gain at the node.
 
 ### Practical limitations
 
-1. The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)*
+1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)*
 in memory for aggregating histograms over partitions. The current implementation might not scale
 to very deep trees since the memory requirement grows exponentially with tree depth.
 2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for
 
@@ -178,7 +178,7 @@ val valuesAndPreds = parsedData.map { point =>
   val prediction = model.predict(point.features)
   (point.label, prediction)
 }
-val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
+val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.mean()
 println("training Mean Squared Error = " + MSE)
 {% endhighlight %}
 </div>
