2 changes: 1 addition & 1 deletion docs/configuration.md
@@ -498,7 +498,7 @@ Apart from these, the following properties are also available, and may be useful
<td>
Reuse Python worker or not. If yes, it will use a fixed number of Python workers,
does not need to fork() a Python process for every task. It will be very useful
if there is large broadcast, then the broadcast will not be needed to transferred
if there is a large broadcast, then the broadcast will not need to be transferred
from JVM to Python worker for every task.
</td>
</tr>
4 changes: 2 additions & 2 deletions docs/graphx-programming-guide.md
@@ -522,7 +522,7 @@ val joinedGraph = graph.joinVertices(uniqueCosts,

A key step in many graph analytics tasks is aggregating information about the neighborhood of each
vertex.
For example, we might want to know the number of followers each user has or the average age of the
For example, we might want to know the number of followers each user has or the average age of
the followers of each user. Many iterative graph algorithms (e.g., PageRank, Shortest Path, and
connected components) repeatedly aggregate properties of neighboring vertices (e.g., current
PageRank Value, shortest path to the source, and smallest reachable vertex id).
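For context, a minimal Scala sketch of the neighborhood-aggregation pattern this hunk refers to, using GraphX's `aggregateMessages`; the `graph: Graph[Double, Int]` value (vertex attribute = user age) is assumed for illustration and is not part of the patch.

{% highlight scala %}
import org.apache.spark.graphx._

// A minimal sketch, assuming `graph: Graph[Double, Int]` whose vertex attribute
// is the user's age. Each follower sends (1, age); messages are summed per
// destination vertex, giving (follower count, total follower age).
val followerStats: VertexRDD[(Int, Double)] =
  graph.aggregateMessages[(Int, Double)](
    ctx => ctx.sendToDst((1, ctx.srcAttr)),    // send: each follower contributes (1, its age)
    (a, b) => (a._1 + b._1, a._2 + b._2)       // merge: sum counts and sum ages
  )

// Divide the summed ages by the counts to get the average follower age per user.
val avgFollowerAge: VertexRDD[Double] =
  followerStats.mapValues { case (count, totalAge) => totalAge / count }
{% endhighlight %}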
@@ -700,7 +700,7 @@ a new value for the vertex property, and then send messages to neighboring verti
super step. Unlike Pregel, messages are computed in parallel as a
function of the edge triplet and the message computation has access to both the source and
destination vertex attributes. Vertices that do not receive a message are skipped within a super
step. The Pregel operators terminates iteration and returns the final graph when there are no
step. The Pregel operator terminates iteration and returns the final graph when there are no
messages remaining.
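To make the Pregel description above concrete, here is a hedged single-source shortest-path sketch; `initialGraph` and its attribute types are assumed inputs, not part of this change.

{% highlight scala %}
import org.apache.spark.graphx._

// A minimal sketch of the Pregel operator, assuming `initialGraph: Graph[Double, Double]`
// whose vertex attribute is the current distance from a chosen source vertex
// (0.0 at the source, Double.PositiveInfinity elsewhere) and whose edge attribute
// is the edge length.
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the smaller of the current and the received distance.
  (id, dist, newDist) => math.min(dist, newDist),
  // Send messages: offer the destination a shorter path if one exists.
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  // Merge messages: a vertex only needs the minimum candidate distance.
  (a, b) => math.min(a, b)
)
{% endhighlight %}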

> Note, unlike more standard Pregel implementations, vertices in GraphX can only send messages to
4 changes: 2 additions & 2 deletions docs/index.md
@@ -66,8 +66,8 @@ Example applications are also provided in Python. For example,

./bin/spark-submit examples/src/main/python/pi.py 10

Spark also provides an experimental [R API](sparkr.html) since 1.4 (only DataFrames APIs included).
To run Spark interactively in a R interpreter, use `bin/sparkR`:
Spark also provides an [R API](sparkr.html) since 1.4 (only DataFrames APIs included).
To run Spark interactively in an R interpreter, use `bin/sparkR`:

./bin/sparkR --master local[2]

2 changes: 1 addition & 1 deletion docs/ml-datasource.md
@@ -5,7 +5,7 @@ displayTitle: Data sources
---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML.
Besides some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML.
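One such ML-specific source is the image data source; the sketch below is illustrative only (the path is a placeholder) and is not part of the patch.

{% highlight scala %}
// A minimal sketch, assuming a directory of image files at the placeholder path.
// The image data source reads each file into a struct column named "image".
val imagesDF = spark.read
  .format("image")
  .option("dropInvalid", true)   // skip files that cannot be decoded
  .load("data/mllib/images/origin/kittens")

imagesDF.select("image.origin", "image.width", "image.height").show(truncate = false)
{% endhighlight %}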

**Table of Contents**

8 changes: 4 additions & 4 deletions docs/ml-features.md
@@ -359,7 +359,7 @@ Assume that we have the following DataFrame with columns `id` and `raw`:
~~~~
id | raw
----|----------
0 | [I, saw, the, red, baloon]
0 | [I, saw, the, red, balloon]
1 | [Mary, had, a, little, lamb]
~~~~

@@ -369,7 +369,7 @@ column, we should get the following:
~~~~
id | raw | filtered
----|-----------------------------|--------------------
0 | [I, saw, the, red, baloon] | [saw, red, baloon]
0 | [I, saw, the, red, balloon] | [saw, red, balloon]
1 | [Mary, had, a, little, lamb]|[Mary, little, lamb]
~~~~
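The corrected tables can be reproduced with a short `StopWordsRemover` sketch; the DataFrame construction below is illustrative and not taken from the patch.

{% highlight scala %}
import org.apache.spark.ml.feature.StopWordsRemover

// A minimal sketch reproducing the tables above: filter English stop words
// from the `raw` column into a new `filtered` column.
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")

val dataSet = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
)).toDF("id", "raw")

remover.transform(dataSet).show(false)
{% endhighlight %}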

@@ -1308,15 +1308,15 @@ need to know vector size, can use that column as an input.
To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this
transformer to a dataframe produces a new dataframe with updated metadata for `inputCol` specifying
the vector size. Downstream operations on the resulting dataframe can get this size using the
meatadata.
metadata.

`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
behaviour when the vector column contains nulls or vectors of the wrong size. By default
`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
also be set to "skip", indicating that rows containing invalid values should be filtered out from
the resulting dataframe, or "optimistic", indicating that the column should not be checked for
invalid values and all rows should be kept. Note that the use of "optimistic" can cause the
resulting dataframe to be in an inconsistent state, me:aning the metadata for the column
resulting dataframe to be in an inconsistent state, meaning the metadata for the column
`VectorSizeHint` was applied to does not match the contents of that column. Users should take care
to avoid this kind of inconsistent state.
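A hedged sketch of the transformer described above; the input DataFrame `dataset` and its `features` vector column are assumed.

{% highlight scala %}
import org.apache.spark.ml.feature.VectorSizeHint

// A minimal sketch, assuming a DataFrame `dataset` with a vector column "features".
// With handleInvalid = "skip", rows whose vectors are null or not of size 3 are
// dropped; the output column metadata records the declared size.
val sizeHint = new VectorSizeHint()
  .setInputCol("features")
  .setHandleInvalid("skip")
  .setSize(3)

val datasetWithSize = sizeHint.transform(dataset)
{% endhighlight %}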

2 changes: 1 addition & 1 deletion docs/ml-pipeline.md
@@ -62,7 +62,7 @@ In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [

A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.

Columns in a `DataFrame` are named. The code examples below use names such as "text," "features," and "label."
Columns in a `DataFrame` are named. The code examples below use names such as "text", "features", and "label".
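For readers skimming the diff, a minimal, illustrative construction of a DataFrame carrying such named columns (the data and column names are placeholders):

{% highlight scala %}
// A minimal sketch: an explicitly constructed DataFrame whose columns carry
// the names used throughout the pipeline examples ("id", "text", "label").
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "spark f g h", 0.0)
)).toDF("id", "text", "label")
{% endhighlight %}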

## Pipeline components

4 changes: 2 additions & 2 deletions docs/mllib-linear-methods.md
@@ -272,7 +272,7 @@ In `spark.mllib`, the first class $0$ is chosen as the "pivot" class.
See Section 4.4 of
[The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
references.
Here is an
Here is a
[detailed mathematical derivation](http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297).

For multiclass classification problems, the algorithm will output a multinomial logistic regression
@@ -350,7 +350,7 @@ known as the [mean squared error](http://en.wikipedia.org/wiki/Mean_squared_erro
<div class="codetabs">

<div data-lang="scala" markdown="1">
The following example demonstrate how to load training data, parse it as an RDD of LabeledPoint.
The following example demonstrates how to load training data, parse it as an RDD of LabeledPoint.
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
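Since the included example is not visible in the diff, here is a hedged sketch of the flow the paragraph describes; the file path and "label,feature1 feature2 ..." record format are illustrative.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// A minimal sketch: parse text lines into LabeledPoints, fit a linear model
// with SGD, and compute the mean squared error over the training data.
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

val valuesAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
{% endhighlight %}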
8 changes: 4 additions & 4 deletions docs/rdd-programming-guide.md
@@ -332,7 +332,7 @@ One important parameter for parallel collections is the number of *partitions* t

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:

{% highlight scala %}
scala> val distFile = sc.textFile("data.txt")
@@ -365,7 +365,7 @@ Apart from text files, Spark's Scala API also supports several other data format

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:

{% highlight java %}
JavaRDD<String> distFile = sc.textFile("data.txt");
@@ -397,7 +397,7 @@ Apart from text files, Spark's Java API also supports several other data formats

PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:

{% highlight python %}
>>> distFile = sc.textFile("data.txt")
@@ -1122,7 +1122,7 @@ costly operation.

#### Background

To understand what happens during the shuffle we can consider the example of the
To understand what happens during the shuffle, we can consider the example of the
[`reduceByKey`](#ReduceByLink) operation. The `reduceByKey` operation generates a new RDD where all
values for a single key are combined into a tuple - the key and the result of executing a reduce
function against all values associated with that key. The challenge is that not all values for a
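As a small, hedged illustration of the `reduceByKey` shuffle being described (the input file name is a placeholder):

{% highlight scala %}
// A minimal sketch: reduceByKey must bring every value for a given key together
// (the shuffle) before the reduce function `_ + _` can produce the per-key count.
val wordCounts = sc.textFile("data.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.take(10).foreach(println)
{% endhighlight %}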
2 changes: 1 addition & 1 deletion docs/running-on-mesos.md
@@ -687,7 +687,7 @@ See the [configuration page](configuration.html) for information on Spark config
<td><code>0</code></td>
<td>
Set the maximum number GPU resources to acquire for this job. Note that executors will still launch when no GPU resources are found
since this configuration is just a upper limit and not a guaranteed amount.
since this configuration is just an upper limit and not a guaranteed amount.
</td>
</tr>
<tr>
2 changes: 1 addition & 1 deletion docs/security.md
@@ -337,7 +337,7 @@ Configuration for SSL is organized hierarchically. The user can configure the de
which will be used for all the supported communication protocols unless they are overwritten by
protocol-specific settings. This way the user can easily provide the common settings for all the
protocols without disabling the ability to configure each one individually. The following table
describes the the SSL configuration namespaces:
describes the SSL configuration namespaces:

<table class="table">
<tr>
2 changes: 1 addition & 1 deletion docs/sparkr.md
@@ -296,7 +296,7 @@ head(agg(rollup(df, "cyl", "disp", "gear"), avg(df$mpg)))

### Operating on Columns

SparkR also provides a number of functions that can directly applied to columns for data processing and during aggregation. The example below shows the use of basic arithmetic functions.
SparkR also provides a number of functions that can be directly applied to columns for data processing and during aggregation. The example below shows the use of basic arithmetic functions.

<div data-lang="r" markdown="1">
{% highlight r %}
6 changes: 3 additions & 3 deletions docs/sql-data-sources-avro.md
@@ -66,9 +66,9 @@ write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
## to_avro() and from_avro()
The Avro package provides function `to_avro` to encode a column as binary in Avro
format, and `from_avro()` to decode Avro binary data into a column. Both functions transform one column to
another column, and the input/output SQL data type can be complex type or primitive type.
another column, and the input/output SQL data type can be a complex type or a primitive type.

Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each
Using Avro record as columns is useful when reading from or writing to a streaming source like Kafka. Each
Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
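A hedged sketch of the Kafka round trip described above. Topic name, broker address, and schema file are placeholders, and the import location assumes the Spark 2.4 package layout (`org.apache.spark.sql.avro`).

{% highlight scala %}
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions.col

// A minimal sketch: read Avro-encoded Kafka records in batch, decode the binary
// "value" column into a struct, then re-encode a struct as a single Avro column.
val jsonFormatSchema = new String(java.nio.file.Files.readAllBytes(
  java.nio.file.Paths.get("./examples/src/main/resources/user.avsc")))

val input = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "users")
  .load()

val decoded = input.select(from_avro(col("value"), jsonFormatSchema) as "user")
val reencoded = decoded.select(to_avro(col("user")) as "value")
{% endhighlight %}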
@@ -151,7 +151,7 @@ Data source options of Avro can be set via:
<tr>
<td><code>avroSchema</code></td>
<td>None</td>
<td>Optional Avro schema provided by an user in JSON format. The date type and naming of record fields
<td>Optional Avro schema provided by a user in JSON format. The date type and naming of record fields
should match the input Avro data or Catalyst data, otherwise the read/write action will fail.</td>
<td>read and write</td>
</tr>
2 changes: 1 addition & 1 deletion docs/sql-data-sources-hive-tables.md
@@ -74,7 +74,7 @@ creating table, you can create a table using storage handler at Hive side, and u
<td><code>inputFormat, outputFormat</code></td>
<td>
These 2 options specify the name of a corresponding `InputFormat` and `OutputFormat` class as a string literal,
e.g. `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat`. These 2 options must be appeared in pair, and you can not
e.g. `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat`. These 2 options must be appeared in a pair, and you can not
specify them if you already specified the `fileFormat` option.
</td>
</tr>
2 changes: 1 addition & 1 deletion docs/sql-data-sources-jdbc.md
@@ -55,7 +55,7 @@ the following case-insensitive options:
as a subquery in the <code>FROM</code> clause. Spark will also assign an alias to the subquery clause.
As an example, spark will issue a query of the following form to the JDBC Source.<br><br>
<code> SELECT &lt;columns&gt; FROM (&lt;user_specified_query&gt;) spark_gen_alias</code><br><br>
Below are couple of restrictions while using this option.<br>
Below are a couple of restrictions while using this option.<br>
<ol>
<li> It is not allowed to specify `dbtable` and `query` options at the same time. </li>
<li> It is not allowed to specify `query` and `partitionColumn` options at the same time. When specifying
2 changes: 1 addition & 1 deletion docs/sql-data-sources-load-save-functions.md
@@ -324,4 +324,4 @@ CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
`partitionBy` creates a directory structure as described in the [Partition Discovery](sql-data-sources-parquet.html#partition-discovery) section.
Thus, it has limited applicability to columns with high cardinality. In contrast
`bucketBy` distributes
data across a fixed number of buckets and can be used when a number of unique values is unbounded.
data across a fixed number of buckets and can be used when the number of unique values is unbounded.
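A hedged sketch contrasting the two writers; the DataFrame `peopleDF` and its column names are illustrative.

{% highlight scala %}
// A minimal sketch: partition output by a low-cardinality column and bucket a
// high-cardinality one into a fixed number of buckets.
peopleDF.write
  .partitionBy("favorite_color")   // few distinct values -> few directories
  .bucketBy(42, "name")            // unbounded distinct values -> 42 buckets
  .sortBy("favorite_numbers")
  .saveAsTable("people_partitioned_bucketed")
{% endhighlight %}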
2 changes: 1 addition & 1 deletion docs/sql-getting-started.md
@@ -99,7 +99,7 @@ Here we include some basic examples of structured data processing using Datasets
<div data-lang="scala" markdown="1">
{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}

For a complete list of the types of operations that can be performed on a Dataset refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset).
For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset).

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
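A couple of hedged one-liners in the spirit of those operations; a DataFrame `df` with "name" and "age" columns is assumed.

{% highlight scala %}
import org.apache.spark.sql.functions._

// A minimal sketch: a column expression, a built-in function from the functions
// library, and a simple untyped aggregation.
df.select(col("name"), col("age") + 1, upper(col("name"))).show()
df.groupBy("age").count().show()
{% endhighlight %}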
</div>
2 changes: 1 addition & 1 deletion docs/sql-programming-guide.md
@@ -7,7 +7,7 @@ title: Spark SQL and DataFrames
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
Spark SQL uses this extra information to perform extra optimizations. There are several ways to
interact with Spark SQL including SQL and the Dataset API. When computing a result
interact with Spark SQL including SQL and the Dataset API. When computing a result,
the same execution engine is used, independent of which API/language you are using to express the
computation. This unification means that developers can easily switch back and forth between
different APIs based on which provides the most natural way to express a given transformation.
2 changes: 1 addition & 1 deletion docs/sql-pyspark-pandas-with-arrow.md
@@ -129,7 +129,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p

Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
`ArrayType` of `TimestampType`, and nested `StructType`. `BinaryType` is supported only when
installed PyArrow is equal to or higher then 0.10.0.
installed PyArrow is equal to or higher than 0.10.0.

### Setting Arrow Batch Size

6 changes: 3 additions & 3 deletions docs/sql-reference.md
@@ -38,15 +38,15 @@ Spark SQL and DataFrames support the following data types:
elements with the type of `elementType`. `containsNull` is used to indicate if
elements in a `ArrayType` value can have `null` values.
- `MapType(keyType, valueType, valueContainsNull)`:
Represents values comprising a set of key-value pairs. The data type of keys are
described by `keyType` and the data type of values are described by `valueType`.
Represents values comprising a set of key-value pairs. The data type of keys is
described by `keyType` and the data type of values is described by `valueType`.
For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull`
is used to indicate if values of a `MapType` value can have `null` values.
- `StructType(fields)`: Represents values with the structure described by
a sequence of `StructField`s (`fields`).
* `StructField(name, dataType, nullable)`: Represents a field in a `StructType`.
The name of a field is indicated by `name`. The data type of a field is indicated
by `dataType`. `nullable` is used to indicate if values of this fields can have
by `dataType`. `nullable` is used to indicate if values of these fields can have
`null` values.
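A hedged Scala sketch assembling these types into one schema; the field names are illustrative.

{% highlight scala %}
import org.apache.spark.sql.types._

// A minimal sketch combining the types described above: an array whose elements
// may be null, a map whose values may be null, and a nullable scalar field.
val schema = StructType(Seq(
  StructField("tags", ArrayType(StringType, containsNull = true), nullable = false),
  StructField("scores", MapType(StringType, DoubleType, valueContainsNull = true), nullable = true),
  StructField("name", StringType, nullable = true)
))
{% endhighlight %}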

<div class="codetabs">
2 changes: 1 addition & 1 deletion docs/streaming-programming-guide.md
@@ -733,7 +733,7 @@ for Java, and [StreamingContext](api/python/pyspark.streaming.html#pyspark.strea
<span class="badge" style="background-color: grey">Python API</span> As of Spark {{site.SPARK_VERSION_SHORT}},
out of these sources, Kafka and Kinesis are available in the Python API.

This category of sources require interfacing with external non-Spark libraries, some of them with
This category of sources requires interfacing with external non-Spark libraries, some of them with
complex dependencies (e.g., Kafka). Hence, to minimize issues related to version conflicts
of dependencies, the functionality to create DStreams from these sources has been moved to separate
libraries that can be [linked](#linking) to explicitly when necessary.
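A hedged sketch of linking to and using one such separate library (the Kafka 0.10 integration); the dependency version, topic, broker address, and the existing `ssc` StreamingContext are assumptions, not part of this change.

{% highlight scala %}
// build.sbt (illustrative):
//   libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// A minimal sketch: create a direct Kafka DStream from an existing StreamingContext `ssc`.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "host1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topicA"), kafkaParams)
)
{% endhighlight %}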