@@ -428,9 +428,9 @@ KafkaUtils.createStream(javaStreamingContext, kafkaParams, ...);
</div>
</div>

- For more details on these additional sources, see the corresponding [API documentation]
- (#where-to-go-from-here). Furthermore, you can also implement your own custom receiver
- for your sources. See the [Custom Receiver Guide](streaming-custom-receivers.html).
+ For more details on these additional sources, see the corresponding [API documentation](#where-to-go-from-here).
+ Furthermore, you can also implement your own custom receiver for your sources. See the
+ [Custom Receiver Guide](streaming-custom-receivers.html).

## Operations
There are two kinds of DStream operations - _transformations_ and _output operations_. Similar to
@@ -520,9 +520,8 @@ The last two transformations are worth highlighting again.

<h4>UpdateStateByKey Operation</h4>

- The `updateStateByKey` operation allows
- you to main arbitrary stateful computation, where you want to maintain some state data and
- continuously update it with new information. To use this, you will have to do two steps.
+ The `updateStateByKey` operation allows you to maintain arbitrary state while continuously updating
+ it with new information. To use this, you will have to do two steps.

1. Define the state - The state can be of arbitrary data type.
1. Define the state update function - Specify with a function how to update the state using the
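
As a minimal sketch of these two steps (illustrative only; the running word count, the `updateFunction` name, and the `pairs` DStream are assumptions, not part of this change):

```scala
// Illustrative update function: the state is a running count per key.
// newValues holds the values for a key in the current batch; runningCount is
// the previous state (None if the key has not been seen before).
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(newValues.sum + runningCount.getOrElse(0))
}

// pairs is assumed to be an existing DStream[(String, Int)], e.g. (word, 1) records.
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
```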
@@ -925,7 +924,7 @@ exception saying so.
## Monitoring
Besides Spark's in-built [monitoring capabilities](monitoring.html),
the progress of a Spark Streaming program can also be monitored using the [StreamingListener]
- (streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
+ (api/streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
which allows you to get statistics of batch processing times, queueing delays,
and total end-to-end delays. Note that this is still an experimental API and it is likely to be
improved upon (i.e., more information reported) in the future.
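
For illustration, a listener that reports each batch's processing time might look roughly like the sketch below; since the API is experimental, the exact package and the field names on `BatchInfo` are assumptions and may differ across versions:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Illustrative listener: print how long each completed batch took to process.
class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    val info = batchCompleted.batchInfo
    // processingDelay is an Option[Long] in milliseconds (assumed field name).
    println("Batch processed in " + info.processingDelay.getOrElse(-1L) + " ms")
  }
}

// ssc is an existing StreamingContext.
ssc.addStreamingListener(new BatchStatsListener)
```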
@@ -1000,11 +999,11 @@ Since all data is modeled as RDDs with their lineage of deterministic operations
for output operations.

## Failure of the Driver Node
- To allows a streaming application to operate 24/7, Spark Streaming allows a streaming computation
+ For a streaming application to operate 24/7, Spark Streaming allows a streaming computation
to be resumed even after the failure of the driver node. Spark Streaming periodically writes the
metadata information of the DStreams setup through the `StreamingContext` to a
HDFS directory (can be any Hadoop-compatible filesystem). This periodic
- *checkpointing* can be enabled by setting a the checkpoint
+ *checkpointing* can be enabled by setting the checkpoint
directory using `ssc.checkpoint(<checkpoint directory>)` as described
[earlier](#rdd-checkpointing). On failure of the driver node,
the lost `StreamingContext` can be recovered from this information, and restarted.
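
A rough sketch of this recovery pattern, assuming a `StreamingContext.getOrCreate` style driver (the checkpoint path, app name, and `createContext` helper are illustrative assumptions):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed checkpoint location; any Hadoop-compatible filesystem path works.
val checkpointDir = "hdfs://namenode:8020/checkpoints/my-app"

// Build a brand-new context, define the DStreams, and enable checkpointing.
def createContext(): StreamingContext = {
  val ssc = new StreamingContext("local[2]", "RecoverableApp", Seconds(1))
  // ... set up input DStreams and transformations here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// On a clean start this calls createContext(); after a driver failure it
// rebuilds the lost context from the checkpointed metadata instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```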
@@ -1105,8 +1104,8 @@ classes. So, if you are using `getOrCreate`, then make sure that the checkpoint
explicitly deleted every time recompiled code needs to be launched.

This failure recovery can be done automatically using Spark's
- [standalone cluster mode](spark-standalone.html), which allows any Spark
- application's driver to be as well as, ensures automatic restart of the driver on failure (see
+ [standalone cluster mode](spark-standalone.html), which allows the driver of any Spark application
+ to be launched within the cluster and be restarted on failure (see
[supervise mode](spark-standalone.html#launching-applications-inside-the-cluster)). This can be
tested locally by launching the above example using the supervise mode in a
local standalone cluster and killing the java process running the driver (will be shown as
@@ -1123,7 +1122,7 @@ There are two different failure behaviors based on which input sources are used.
1. _Using HDFS files as input source_ - Since the data is reliably stored on HDFS, all data can be
re-computed and therefore no data will be lost due to any failure.
1. _Using any input source that receives data through a network_ - The received input data is
- replicated in memory to multiple nodes. Since, all the data in the Spark worker's memory is lost
+ replicated in memory to multiple nodes. Since all the data in the Spark worker's memory is lost
when the Spark driver fails, the past input data will not be accessible when the driver recovers.
Hence, if stateful and window-based operations are used
(like `updateStateByKey`, `window`, `countByValueAndWindow`, etc.), then the intermediate state
@@ -1133,11 +1132,11 @@ In future releases, we will support full recoverability for all input sources. N
non-stateful transformations like `map`, `count`, and `reduceByKey`, with _all_ input streams,
the system, upon restarting, will continue to receive and process new data.

- To better understand the behavior of the system under driver failure with a HDFS source, lets
+ To better understand the behavior of the system under driver failure with a HDFS source, let's
consider what will happen with a file input stream. Specifically, in the case of the file input
stream, it will correctly identify new files that were created while the driver was down and
process them in the same way as it would have if the driver had not failed. To explain further
- in the case of file input stream, we shall use an example. Lets say, files are being generated
+ in the case of file input stream, we shall use an example. Let's say, files are being generated
every second, and a Spark Streaming program reads every new file and outputs the number of lines
in the file. This is what the sequence of outputs would be with and without a driver failure.