## What changes were proposed in this pull request?
Fix Typos.
## How was this patch tested?
N/A

Closes #23145 from kjmrknsn/docUpdate.
Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
docs/rdd-programming-guide.md (4 additions & 4 deletions)
@@ -332,7 +332,7 @@ One important parameter for parallel collections is the number of *partitions* t

 Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

-Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
+Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:

 {% highlight scala %}
 scala> val distFile = sc.textFile("data.txt")
@@ -365,7 +365,7 @@ Apart from text files, Spark's Scala API also supports several other data format

 Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

-Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
+Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
@@ -397,7 +397,7 @@ Apart from text files, Spark's Java API also supports several other data formats

 PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop [InputFormat](http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html).

-Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
+Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or a `hdfs://`, `s3a://`, etc URI) and reads it as a collection of lines. Here is an example invocation:

 {% highlight python %}
 >>> distFile = sc.textFile("data.txt")
@@ -1122,7 +1122,7 @@ costly operation.

 #### Background

-To understand what happens during the shuffle we can consider the example of the
+To understand what happens during the shuffle, we can consider the example of the
 [`reduceByKey`](#ReduceByLink) operation. The `reduceByKey` operation generates a new RDD where all
 values for a single key are combined into a tuple - the key and the result of executing a reduce
 function against all values associated with that key. The challenge is that not all values for a
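To make the `reduceByKey` shuffle discussion above concrete, here is a minimal word-count-style sketch (illustrative only; the input data and variable names are made up, and `sc` is assumed to be an existing `SparkContext`):

```scala
// Illustrative sketch of reduceByKey: all values for a key are combined,
// which requires a shuffle so that matching keys meet on the same partition.
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hdfs", "spark"))

val counts = words
  .map(word => (word, 1))      // emit (key, value) pairs
  .reduceByKey(_ + _)          // shuffle, then reduce all values per key

counts.collect().foreach(println)   // e.g. (spark,3), (hadoop,1), (hdfs,1)
```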
 The Avro package provides function `to_avro` to encode a column as binary in Avro
 format, and `from_avro()` to decode Avro binary data into a column. Both functions transform one column to
-another column, and the input/output SQL data type can be complex type or primitive type.
+another column, and the input/output SQL data type can be a complex type or a primitive type.

-Using Avro record as columns are useful when reading from or writing to a streaming source like Kafka. Each
+Using Avro record as columns is useful when reading from or writing to a streaming source like Kafka. Each
 Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
 * If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
 * `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
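As a rough illustration of the `from_avro()`/`to_avro()` round trip with Kafka described above, a pipeline could look roughly like this (a sketch only: the broker address, topic names, schema, and checkpoint path are assumptions, not part of this PR, and `spark` is an existing `SparkSession`):

```scala
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions.col

// Hypothetical Avro schema (JSON format) for the Kafka "value" payload.
val jsonFormatSchema = """{"type":"record","name":"User","fields":[
  {"name":"name","type":"string"},{"name":"age","type":"int"}]}"""

// Decode the Avro-encoded Kafka value into a struct column...
val users = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")   // assumed broker
  .option("subscribe", "users")                       // assumed topic
  .load()
  .select(from_avro(col("value"), jsonFormatSchema).as("user"))

// ...and re-encode the struct back to Avro before writing downstream to Kafka.
val query = users
  .select(to_avro(col("user")).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "users_cleaned")                   // assumed topic
  .option("checkpointLocation", "/tmp/avro-checkpoint")  // assumed path
  .start()
```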
@@ -151,7 +151,7 @@ Data source options of Avro can be set via:
   <tr>
     <td><code>avroSchema</code></td>
     <td>None</td>
-    <td>Optional Avro schema provided by an user in JSON format. The date type and naming of record fields
+    <td>Optional Avro schema provided by a user in JSON format. The date type and naming of record fields
     should match the input Avro data or Catalyst data, otherwise the read/write action will fail.</td>
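For reference, the schema named by this option is passed as a JSON string; a minimal sketch of supplying it on read might look like the following (the schema file and input path are placeholders, and `spark` is an existing `SparkSession`):

```scala
import java.nio.file.{Files, Paths}

// Placeholder schema file written by the user in Avro JSON format.
val avroSchemaJson = new String(Files.readAllBytes(Paths.get("user.avsc")))

// Pass the user-provided schema through the avroSchema option described above.
val usersDF = spark.read
  .format("avro")
  .option("avroSchema", avroSchemaJson)
  .load("examples/src/main/resources/users.avro")  // placeholder input path
```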
-For a complete list of the types of operations that can be performed on a Dataset refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset).
+For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset).

 In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
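A brief sketch of the kind of built-in functions that paragraph refers to (the DataFrame `df` and its column names are assumed for illustration):

```scala
import org.apache.spark.sql.functions._

// Illustrative only: string manipulation, date arithmetic, and math functions
// from the built-in function library, applied to an assumed DataFrame `df`
// with "name", "signup_date" and "score" columns.
val enriched = df.select(
  upper(col("name")).as("name_upper"),                              // string manipulation
  datediff(current_date(), col("signup_date")).as("days_active"),   // date arithmetic
  round(sqrt(col("score")), 2).as("score_sqrt")                     // common math
)
```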