[MINOR][DOC] Fix some typos and grammar issues
## What changes were proposed in this pull request?

Easy fix in the documentation.

## How was this patch tested?

N/A

Closes apache#20948

Author: Daniel Sakuma <dsakuma@gmail.com>

Closes apache#20928 from dsakuma/fix_typo_configuration_docs.
dsakuma authored and HyukjinKwon committed Apr 6, 2018
1 parent 249007e commit 6ade5cb
Showing 43 changed files with 107 additions and 107 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -5,7 +5,7 @@ here with the Spark source code. You can also find documentation specific to rel
Spark at http://spark.apache.org/documentation.html.

Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the
-documentation yourself. Why build it yourself? So that you have the docs that corresponds to
+documentation yourself. Why build it yourself? So that you have the docs that correspond to
whichever version of Spark you currently have checked out of revision control.

## Prerequisites
2 changes: 1 addition & 1 deletion docs/_plugins/include_example.rb
@@ -48,7 +48,7 @@ def render(context)
begin
code = File.open(@file).read.encode("UTF-8")
rescue => e
-# We need to explicitly exit on execptions here because Jekyll will silently swallow
+# We need to explicitly exit on exceptions here because Jekyll will silently swallow
# them, leading to silent build failures (see https://github.com/jekyll/jekyll/issues/5104)
puts(e)
puts(e.backtrace)
2 changes: 1 addition & 1 deletion docs/building-spark.md
@@ -113,7 +113,7 @@ Note: Flume support is deprecated as of Spark 2.3.0.

## Building submodules individually

-It's possible to build Spark sub-modules using the `mvn -pl` option.
+It's possible to build Spark submodules using the `mvn -pl` option.

For instance, you can build the Spark Streaming module using:
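A sketch of that command (the `:spark-streaming_2.11` artifact ID is an assumption; it depends on the Scala profile in use):

```bash
# Build only the Spark Streaming submodule and the artifacts it produces
./build/mvn -pl :spark-streaming_2.11 clean install
```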

4 changes: 2 additions & 2 deletions docs/cloud-integration.md
@@ -27,13 +27,13 @@ description: Introduction to cloud storage support in Apache Spark SPARK_VERSION
All major cloud providers offer persistent data storage in *object stores*.
These are not classic "POSIX" file systems.
In order to store hundreds of petabytes of data without any single points of failure,
-object stores replace the classic filesystem directory tree
+object stores replace the classic file system directory tree
with a simpler model of `object-name => data`. To enable remote access, operations
on objects are usually offered as (slow) HTTP REST operations.

Spark can read and write data in object stores through filesystem connectors implemented
in Hadoop or provided by the infrastructure suppliers themselves.
-These connectors make the object stores look *almost* like filesystems, with directories and files
+These connectors make the object stores look *almost* like file systems, with directories and files
These connectors make the object stores look *almost* like file systems, with directories and files
and the classic operations on them such as list, delete and rename.
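For context, going through one of these connectors is just a matter of addressing data by the store's URI scheme; a minimal sketch (the `s3a://` bucket and paths are made-up, and the connector's credentials and classpath setup are assumed to be configured separately):

```scala
// Read from and write back to an object store via its Hadoop connector.
// `spark` is an existing SparkSession; bucket and paths are placeholders.
val events = spark.read.parquet("s3a://example-bucket/input/events")
events.filter("status = 'ok'").write.parquet("s3a://example-bucket/output/ok-events")
```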


20 changes: 10 additions & 10 deletions docs/configuration.md
@@ -558,7 +558,7 @@ Apart from these, the following properties are also available, and may be useful
<td>
This configuration limits the number of remote requests to fetch blocks at any given point.
When the number of hosts in the cluster increase, it might lead to very large number
-of in-bound connections to one or more nodes, causing the workers to fail under load.
+of inbound connections to one or more nodes, causing the workers to fail under load.
By allowing it to limit the number of fetch requests, this scenario can be mitigated.
</td>
</tr>
@@ -1288,7 +1288,7 @@ Apart from these, the following properties are also available, and may be useful
<td>4194304 (4 MB)</td>
<td>
The estimated cost to open a file, measured by the number of bytes could be scanned at the same
-time. This is used when putting multiple files into a partition. It is better to over estimate,
+time. This is used when putting multiple files into a partition. It is better to overestimate,
then the partitions with small files will be faster than partitions with bigger files.
</td>
</tr>
@@ -1513,7 +1513,7 @@ Apart from these, the following properties are also available, and may be useful
<td>0.8 for KUBERNETES mode; 0.8 for YARN mode; 0.0 for standalone mode and Mesos coarse-grained mode</td>
<td>
The minimum ratio of registered resources (registered resources / total expected resources)
-(resources are executors in yarn mode and Kubernetes mode, CPU cores in standalone mode and Mesos coarsed-grained
+(resources are executors in yarn mode and Kubernetes mode, CPU cores in standalone mode and Mesos coarse-grained
mode ['spark.cores.max' value is total expected resources for Mesos coarse-grained mode] )
to wait for before scheduling begins. Specified as a double between 0.0 and 1.0.
Regardless of whether the minimum ratio of resources has been reached,
@@ -1634,7 +1634,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
(Experimental) If set to "true", Spark will blacklist the executor immediately when a fetch
-failure happenes. If external shuffle service is enabled, then the whole node will be
+failure happens. If external shuffle service is enabled, then the whole node will be
blacklisted.
</td>
</tr>
@@ -1722,7 +1722,7 @@ Apart from these, the following properties are also available, and may be useful
When <code>spark.task.reaper.enabled = true</code>, this setting specifies a timeout after
which the executor JVM will kill itself if a killed task has not stopped running. The default
value, -1, disables this mechanism and prevents the executor from self-destructing. The purpose
-of this setting is to act as a safety-net to prevent runaway uncancellable tasks from rendering
+of this setting is to act as a safety-net to prevent runaway noncancellable tasks from rendering
an executor unusable.
</td>
</tr>
@@ -1915,8 +1915,8 @@ showDF(properties, numRows = 200, truncate = FALSE)
<td><code>spark.streaming.receiver.writeAheadLog.enable</code></td>
<td>false</td>
<td>
-Enable write ahead logs for receivers. All the input data received through receivers
-will be saved to write ahead logs that will allow it to be recovered after driver failures.
+Enable write-ahead logs for receivers. All the input data received through receivers
+will be saved to write-ahead logs that will allow it to be recovered after driver failures.
See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
in the Spark Streaming programing guide for more details.
</td>
@@ -1971,7 +1971,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
<td><code>spark.streaming.driver.writeAheadLog.closeFileAfterWrite</code></td>
<td>false</td>
<td>
-Whether to close the file after writing a write ahead log record on the driver. Set this to 'true'
+Whether to close the file after writing a write-ahead log record on the driver. Set this to 'true'
when you want to use S3 (or any file system that does not support flushing) for the metadata WAL
on the driver.
</td>
@@ -1980,7 +1980,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
<td><code>spark.streaming.receiver.writeAheadLog.closeFileAfterWrite</code></td>
<td>false</td>
<td>
-Whether to close the file after writing a write ahead log record on the receivers. Set this to 'true'
+Whether to close the file after writing a write-ahead log record on the receivers. Set this to 'true'
when you want to use S3 (or any file system that does not support flushing) for the data WAL
on the receivers.
</td>
@@ -2178,7 +2178,7 @@ Spark's classpath for each application. In a Spark cluster running on YARN, thes
files are set cluster-wide, and cannot safely be changed by the application.

The better choice is to use spark hadoop properties in the form of `spark.hadoop.*`.
-They can be considered as same as normal spark properties which can be set in `$SPARK_HOME/conf/spark-defalut.conf`
+They can be considered as same as normal spark properties which can be set in `$SPARK_HOME/conf/spark-default.conf`

In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
instance, Spark allows you to simply create an empty conf and set spark/spark hadoop properties.
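A minimal sketch of that pattern (the `abc.def` key is a placeholder, not a real Hadoop property):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Any spark.hadoop.* entry is stripped of the prefix and handed to the Hadoop configuration.
val conf = new SparkConf()
conf.set("spark.hadoop.abc.def", "xyz")
val sc = new SparkContext(conf)
```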
2 changes: 1 addition & 1 deletion docs/css/pygments-default.css
@@ -5,7 +5,7 @@ To generate this, I had to run
But first I had to install pygments via easy_install pygments
I had to override the conflicting bootstrap style rules by linking to
-this stylesheet lower in the html than the bootstap css.
+this stylesheet lower in the html than the bootstrap css.
Also, I was thrown off for a while at first when I was using markdown
code block inside my {% highlight scala %} ... {% endhighlight %} tags
4 changes: 2 additions & 2 deletions docs/graphx-programming-guide.md
@@ -491,7 +491,7 @@ val joinedGraph = graph.joinVertices(uniqueCosts)(
The more general [`outerJoinVertices`][Graph.outerJoinVertices] behaves similarly to `joinVertices`
except that the user defined `map` function is applied to all vertices and can change the vertex
property type. Because not all vertices may have a matching value in the input RDD the `map`
-function takes an `Option` type. For example, we can setup a graph for PageRank by initializing
+function takes an `Option` type. For example, we can set up a graph for PageRank by initializing
vertex properties with their `outDegree`.
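For context, a sketch of that `outerJoinVertices` pattern, assuming an existing `graph` (its vertex and edge types are illustrative):

```scala
import org.apache.spark.graphx._

// Replace each vertex property with its out-degree; vertices with no outgoing
// edges are absent from `outDegrees`, so the Option defaults them to 0.
val outDegrees: VertexRDD[Int] = graph.outDegrees
val degreeGraph = graph.outerJoinVertices(outDegrees) { (id, oldAttr, outDegOpt) =>
  outDegOpt.getOrElse(0)
}
```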


@@ -969,7 +969,7 @@ A vertex is part of a triangle when it has two adjacent vertices with an edge be
# Examples

Suppose I want to build a graph from some text files, restrict the graph
-to important relationships and users, run page-rank on the sub-graph, and
+to important relationships and users, run page-rank on the subgraph, and
then finally return attributes associated with the top users. I can do
all of this in just a few lines with GraphX:

4 changes: 2 additions & 2 deletions docs/job-scheduling.md
@@ -23,7 +23,7 @@ run tasks and store data for that application. If multiple users need to share y
different options to manage allocation, depending on the cluster manager.

The simplest option, available on all cluster managers, is _static partitioning_ of resources. With
-this approach, each application is given a maximum amount of resources it can use, and holds onto them
+this approach, each application is given a maximum amount of resources it can use and holds onto them
for its whole duration. This is the approach used in Spark's [standalone](spark-standalone.html)
and [YARN](running-on-yarn.html) modes, as well as the
[coarse-grained Mesos mode](running-on-mesos.html#mesos-run-modes).
@@ -230,7 +230,7 @@ properties:
* `minShare`: Apart from an overall weight, each pool can be given a _minimum shares_ (as a number of
CPU cores) that the administrator would like it to have. The fair scheduler always attempts to meet
all active pools' minimum shares before redistributing extra resources according to the weights.
-The `minShare` property can therefore be another way to ensure that a pool can always get up to a
+The `minShare` property can, therefore, be another way to ensure that a pool can always get up to a
certain number of resources (e.g. 10 cores) quickly without giving it a high priority for the rest
of the cluster. By default, each pool's `minShare` is 0.
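For context, jobs are routed to a pool per thread; a minimal sketch (the pool name is an example and must match one defined in the allocation file):

```scala
// Jobs submitted from this thread now go to the "production" pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
// ... submit jobs ...
sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool
```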

2 changes: 1 addition & 1 deletion docs/ml-advanced.md
@@ -77,7 +77,7 @@ Quasi-Newton methods in this case. This fallback is currently always enabled for
L1 regularization is applied (i.e. $\alpha = 0$), there exists an analytical solution and either Cholesky or Quasi-Newton solver may be used. When $\alpha > 0$ no analytical
solution exists and we instead use the Quasi-Newton solver to find the coefficients iteratively.

-In order to make the normal equation approach efficient, `WeightedLeastSquares` requires that the number of features be no more than 4096. For larger problems, use L-BFGS instead.
+In order to make the normal equation approach efficient, `WeightedLeastSquares` requires that the number of features is no more than 4096. For larger problems, use L-BFGS instead.
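For readers who want the formula, the normal-equation solution referred to here is the standard weighted least squares estimate, with `$W = \mathrm{diag}(w_1, \dots, w_n)$`:

`\begin{equation}
\hat{\boldsymbol\beta} = \left(X^\top W X\right)^{-1} X^\top W \mathbf{y}
\end{equation}`

Forming and factoring `$X^\top W X$` scales with the square (and cube) of the number of features, which is why the feature count is capped above.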

## Iteratively reweighted least squares (IRLS)

6 changes: 3 additions & 3 deletions docs/ml-classification-regression.md
@@ -420,7 +420,7 @@ Refer to the [R API docs](api/R/spark.svmLinear.html) for more details.

[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as "One-vs-All."

-`OneVsRest` is implemented as an `Estimator`. For the base classifier it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.
+`OneVsRest` is implemented as an `Estimator`. For the base classifier, it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.

Predictions are done by evaluating each binary classifier and the index of the most confident classifier is output as label.
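For context, a minimal sketch of the `OneVsRest` estimator described above (the choice of base classifier and the `training`/`test` DataFrames are assumptions):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// One binary logistic regression model is trained per class.
val base = new LogisticRegression().setMaxIter(10)
val ovr = new OneVsRest().setClassifier(base)
val ovrModel = ovr.fit(training)           // training: DataFrame with label/features columns
val predictions = ovrModel.transform(test)
```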

@@ -908,7 +908,7 @@ Refer to the [R API docs](api/R/spark.survreg.html) for more details.
belongs to the family of regression algorithms. Formally isotonic regression is a problem where
given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted
-finding a function that minimises
+finding a function that minimizes

`\begin{equation}
f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
@@ -927,7 +927,7 @@ We implement a
which uses an approach to
[parallelizing isotonic regression](http://doi.org/10.1007/978-3-642-99789-1_10).
The training input is a DataFrame which contains three columns
-label, features and weight. Additionally IsotonicRegression algorithm has one
+label, features and weight. Additionally, IsotonicRegression algorithm has one
optional parameter called $isotonic$ defaulting to true.
This argument specifies if the isotonic regression is
isotonic (monotonically increasing) or antitonic (monotonically decreasing).
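A minimal sketch of that API (the `dataset` DataFrame with label, features and weight columns is assumed):

```scala
import org.apache.spark.ml.regression.IsotonicRegression

val ir = new IsotonicRegression().setIsotonic(true)  // true = monotonically increasing fit
val model = ir.fit(dataset)
val fitted = model.transform(dataset)                // adds a prediction column
```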
2 changes: 1 addition & 1 deletion docs/ml-collaborative-filtering.md
@@ -35,7 +35,7 @@ but the ids must be within the integer value range.

### Explicit vs. implicit feedback

-The standard approach to matrix factorization based collaborative filtering treats
+The standard approach to matrix factorization-based collaborative filtering treats
the entries in the user-item matrix as *explicit* preferences given by the user to the item,
for example, users giving ratings to movies.

2 changes: 1 addition & 1 deletion docs/ml-features.md
@@ -1174,7 +1174,7 @@ for more details on the API.
## SQLTransformer

`SQLTransformer` implements the transformations which are defined by SQL statement.
-Currently we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
+Currently, we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
where `"__THIS__"` represents the underlying table of the input dataset.
The select clause specifies the fields, constants, and expressions to display in
the output, and can be any select clause that Spark SQL supports. Users can also
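For context, a minimal sketch of the `SQLTransformer` usage this hunk documents (the column names are illustrative):

```scala
import org.apache.spark.ml.feature.SQLTransformer

// __THIS__ is substituted with the input DataFrame at transform time.
val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
val transformed = sqlTrans.transform(df)  // df assumed to have numeric columns v1 and v2
```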
2 changes: 1 addition & 1 deletion docs/ml-migration-guides.md
@@ -347,7 +347,7 @@ rather than using the old parameter class `Strategy`. These new training method
separate classification and regression, and they replace specialized parameter types with
simple `String` types.

-Examples of the new, recommended `trainClassifier` and `trainRegressor` are given in the
+Examples of the new recommended `trainClassifier` and `trainRegressor` are given in the
[Decision Trees Guide](mllib-decision-tree.html#examples).
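For context, the newer `trainClassifier` style referenced above looks roughly like this (the `data` RDD of `LabeledPoint`s and the parameter values are assumptions):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.rdd.RDD

def trainExample(data: RDD[LabeledPoint]) =
  DecisionTree.trainClassifier(
    data,
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](),  // no categorical features
    impurity = "gini",
    maxDepth = 5,
    maxBins = 32)
```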

## From 0.9 to 1.0
2 changes: 1 addition & 1 deletion docs/ml-tuning.md
@@ -103,7 +103,7 @@ Refer to the [`CrossValidator` Python docs](api/python/pyspark.ml.html#pyspark.m

In addition to `CrossValidator` Spark also offers `TrainValidationSplit` for hyper-parameter tuning.
`TrainValidationSplit` only evaluates each combination of parameters once, as opposed to k times in
-the case of `CrossValidator`. It is therefore less expensive,
+the case of `CrossValidator`. It is, therefore, less expensive,
but will not produce as reliable results when the training dataset is not sufficiently large.

Unlike `CrossValidator`, `TrainValidationSplit` creates a single (training, test) dataset pair.
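For context, a minimal sketch of `TrainValidationSplit` (the estimator, evaluator, parameter grid and `training` DataFrame are placeholders):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(grid)
  .setTrainRatio(0.8)          // 80% of the data for training, 20% for validation

val model = tvs.fit(training)  // each parameter combination is evaluated exactly once
```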
2 changes: 1 addition & 1 deletion docs/mllib-clustering.md
@@ -42,7 +42,7 @@ The following code snippets can be executed in `spark-shell`.
In the following example after loading and parsing data, we use the
[`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within
-Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the
+Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact, the
optimal *k* is usually one where there is an "elbow" in the WSSSE graph.

Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`KMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel) for details on the API.
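A condensed sketch of that example (the data path and parameters are placeholders; `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

val clusters = KMeans.train(parsed, 2, 20)  // k = 2 clusters, 20 iterations

// Within Set Sum of Squared Errors: decreases as k grows; look for the "elbow".
val WSSSE = clusters.computeCost(parsed)
```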
4 changes: 2 additions & 2 deletions docs/mllib-collaborative-filtering.md
@@ -31,7 +31,7 @@ following parameters:

### Explicit vs. implicit feedback

-The standard approach to matrix factorization based collaborative filtering treats
+The standard approach to matrix factorization-based collaborative filtering treats
the entries in the user-item matrix as *explicit* preferences given by the user to the item,
for example, users giving ratings to movies.

@@ -60,7 +60,7 @@ best parameter learned from a sampled subset to the full dataset and expect simi
<div class="codetabs">

<div data-lang="scala" markdown="1">
-In the following example we load rating data. Each row consists of a user, a product and a rating.
+In the following example, we load rating data. Each row consists of a user, a product and a rating.
We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
method which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.
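A condensed sketch of that example (the file path and hyperparameters are placeholders; `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("data/mllib/als/test.data").map { line =>
  val Array(user, item, rate) = line.split(',')
  Rating(user.toInt, item.toInt, rate.toDouble)
}

val model = ALS.train(ratings, 10, 10, 0.01)  // rank = 10, 10 iterations, lambda = 0.01

// Re-predict the observed (user, product) pairs; joining these back against
// `ratings` gives the Mean Squared Error of the rating predictions.
val userProducts = ratings.map { case Rating(user, product, _) => (user, product) }
val predictions = model.predict(userProducts)
```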
2 changes: 1 addition & 1 deletion docs/mllib-data-types.md
@@ -350,7 +350,7 @@ which is a tuple of `(Int, Int, Matrix)`.
***Note***

The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
-In general the use of non-deterministic RDDs can lead to errors.
+In general, the use of non-deterministic RDDs can lead to errors.

### RowMatrix

2 changes: 1 addition & 1 deletion docs/mllib-dimensionality-reduction.md
@@ -91,7 +91,7 @@ The same code applies to `IndexedRowMatrix` if `U` is defined as an

[Principal component analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) is a
statistical method to find a rotation such that the first coordinate has the largest variance
-possible, and each succeeding coordinate in turn has the largest variance possible. The columns of
+possible, and each succeeding coordinate, in turn, has the largest variance possible. The columns of
the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.

`spark.mllib` supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
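For context, a minimal sketch of PCA on a `RowMatrix` (the `rows` RDD of feature vectors is assumed):

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)                // rows: RDD[Vector], assumed to exist
val pc = mat.computePrincipalComponents(4)   // top 4 principal components as a local matrix
val projected = mat.multiply(pc)             // rows projected into the 4-dimensional subspace
```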
2 changes: 1 addition & 1 deletion docs/mllib-evaluation-metrics.md
@@ -13,7 +13,7 @@ of the model on some criteria, which depends on the application and its requirem
suite of metrics for the purpose of evaluating the performance of machine learning models.

Specific machine learning algorithms fall under broader types of machine learning applications like classification,
-regression, clustering, etc. Each of these types have well established metrics for performance evaluation and those
+regression, clustering, etc. Each of these types have well-established metrics for performance evaluation and those
metrics that are currently available in `spark.mllib` are detailed in this section.

## Classification model evaluation
2 changes: 1 addition & 1 deletion docs/mllib-feature-extraction.md
@@ -105,7 +105,7 @@ p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top
\]`
where $V$ is the vocabulary size.

-The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
+The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
is proportional to $V$, which can be easily in order of millions. To speed up training of Word2Vec,
we used hierarchical softmax, which reduced the complexity of computing of $\log p(w_i | w_j)$ to
$O(\log(V))$
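For context, a minimal sketch of fitting this model (the `input` RDD of tokenized sentences and the query word are assumptions):

```scala
import org.apache.spark.mllib.feature.Word2Vec

val word2vec = new Word2Vec()
val model = word2vec.fit(input)                // input: RDD[Seq[String]] of tokenized sentences
val synonyms = model.findSynonyms("spark", 5)  // 5 nearest words by cosine similarity
```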