Commit ff80970

More updates to streaming guide.
1 parent 4dc42e9

2 files changed: +18 -14 lines changed

docs/configuration.md

Lines changed: 3 additions & 3 deletions
@@ -463,7 +463,7 @@ Apart from these, the following properties are also available, and may be useful
 <td>(infinite)</td>
 <td>
 Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
-Periodic cleanups will ensure that metadata older than this duration will be forgetten. This is
+Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is
 useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming
 applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
 </td>
@@ -472,8 +472,8 @@ Apart from these, the following properties are also available, and may be useful
 <td>spark.streaming.blockInterval</td>
 <td>200</td>
 <td>
-Duration (milliseconds) of how long to batch new objects coming from receivers used
-in Spark Streaming.
+Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced
+into blocks of data before storing them in Spark.
 </td>
 </tr>
 <tr>
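
Both properties in the hunks above are ordinary Spark configuration entries. A minimal Scala sketch of setting them on a `SparkConf` (the app name and values are illustrative choices, not recommendations):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("LongRunningStreamingApp")
  // spark.cleaner.ttl is in seconds: forget metadata older than one hour.
  .set("spark.cleaner.ttl", "3600")
  // spark.streaming.blockInterval is in milliseconds: coalesce received
  // data into blocks every 100 ms instead of the default 200 ms.
  .set("spark.streaming.blockInterval", "100")
```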

docs/streaming-programming-guide.md

Lines changed: 15 additions & 11 deletions
@@ -891,7 +891,8 @@ improve the performance of you application. At a high level, you need to conside
 Reducing the processing time of each batch of data by efficiently using cluster resources.
 </li>
 <li>
-Setting the right batch size such that the data processing can keep up with the data ingestion.
+Setting the right batch size such that the batches of data can be processed as fast as they
+are received (that is, data processing keeps up with the data ingestion).
 </li>
 </ol>
 
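
The batch size referred to above is the batch duration given when the `StreamingContext` is created. A minimal sketch (the app name and the 2-second interval are arbitrary choices):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingTuning")
// Each batch covers 2 seconds of received data; to keep up, the
// processing of each batch must also finish within 2 seconds.
val ssc = new StreamingContext(conf, Seconds(2))
```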
@@ -901,13 +902,15 @@ each batch. These have been discussed in detail in [Tuning Guide](tuning.html).
 highlights some of the most important ones.
 
 ### Level of Parallelism in Data Receiving
-Since the receiver of each input stream (other than file stream) runs on a single worker, often
-that proves to be the bottleneck in increasing the throughput. Consider receiving the data
-in parallel through multiple receivers. This can be done by creating two input streams and
-configuring them receive different partitions of the data stream from the data source(s).
-For example, a single Kafka stream receiving two topics of data can split into two
-Kafka streams receiving one topic each. This would run two receivers on two workers, thus allowing
-data to received in parallel, and increasing overall throughput.
+Receiving data over the network (like Kafka, Flume, socket, etc.) requires the data to be deserialized
+and stored in Spark. If the data receiving becomes a bottleneck in the system, then consider
+parallelizing the data receiving. Note that each input DStream
+creates a single receiver (running on a worker machine) that receives a single stream of data.
+Receiving multiple data streams can therefore be achieved by creating multiple input DStreams
+and configuring them to receive different partitions of the data stream from the source(s).
+For example, a single Kafka input stream receiving two topics of data can be split into two
+Kafka input streams, each receiving only one topic. This would run two receivers on two workers,
+thus allowing data to be received in parallel, and increasing overall throughput.
 
 Another parameter that should be considered is the receiver's blocking interval. For most receivers,
 the received data is coalesced together into large blocks of data before storing inside Spark's memory.
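
A minimal Scala sketch of the two-receiver Kafka setup described above, assuming the `KafkaUtils.createStream` helper from the spark-streaming-kafka module, an existing `StreamingContext` named `ssc`, and placeholder ZooKeeper/consumer-group settings:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Two input DStreams => two receivers, each on its own worker,
// each consuming one topic (one receiver thread per topic).
val stream1 = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("topic1" -> 1))
val stream2 = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("topic2" -> 1))

// Union the two streams so downstream processing sees a single DStream.
val unified = stream1.union(stream2)
```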
@@ -916,9 +919,10 @@ the received data in a map-like transformation. This blocking interval is determ
 [configuration parameter](configuration.html) `spark.streaming.blockInterval` and the default value
 is 200 milliseconds.
 
-If it is infeasible to parallelize the receiving using multiple input streams / receivers, it is sometimes beneficial to explicitly repartition the input data stream
-(using `inputStream.repartition(<number of partitions>)`) to distribute the received
-data across all the machines in the cluster before further processing.
+An alternative to receiving data with multiple input streams / receivers is to explicitly repartition
+the input data stream (using `inputStream.repartition(<number of partitions>)`).
+This distributes the received batches of data across all the machines in the cluster
+before further processing.
 
 ### Level of Parallelism in Data Processing
 Cluster resources may be under-utilized if the number of parallel tasks used in any stage of the
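
As an illustration of the repartitioning alternative, a minimal sketch (the socket source, host/port, and the partition count of 10 are placeholders; `ssc` is an existing `StreamingContext`):

```scala
// A single receiver ingests the stream on one worker...
val inputStream = ssc.socketTextStream("localhost", 9999)

// ...and repartitioning spreads each received batch across the cluster
// before further transformations run on it.
val repartitioned = inputStream.repartition(10)
```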
