docs/streaming-programming-guide.md (15 additions, 11 deletions)
@@ -891,7 +891,8 @@ improve the performance of you application. At a high level, you need to conside
 Reducing the processing time of each batch of data by efficiently using cluster resources.
 </li>
 <li>
-Setting the right batch size such that the data processing can keep up with the data ingestion.
+Setting the right batch size such that the batches of data can be processed as fast as they
+are received (that is, data processing keeps up with the data ingestion).
 </li>
 </ol>
 
@@ -901,13 +902,15 @@ each batch. These have been discussed in detail in [Tuning Guide](tuning.html).
 highlights some of the most important ones.
 
 ### Level of Parallelism in Data Receiving
-Since the receiver of each input stream (other than file stream) runs on a single worker, often
-that proves to be the bottleneck in increasing the throughput. Consider receiving the data
-in parallel through multiple receivers. This can be done by creating two input streams and
-configuring them receive different partitions of the data stream from the data source(s).
-For example, a single Kafka stream receiving two topics of data can split into two
-Kafka streams receiving one topic each. This would run two receivers on two workers, thus allowing
-data to received in parallel, and increasing overall throughput.
+Receiving data over the network (like Kafka, Flume, socket, etc.) requires the data to be deserialized
+and stored in Spark. If the data receiving becomes a bottleneck in the system, then consider
+parallelizing the data receiving. Note that each input DStream
+creates a single receiver (running on a worker machine) that receives a single stream of data.
+Receiving multiple data streams can therefore be achieved by creating multiple input DStreams
+and configuring them to receive different partitions of the data stream from the source(s).
+For example, a single Kafka input stream receiving two topics of data can be split into two
+Kafka input streams, each receiving only one topic. This would run two receivers on two workers,
+thus allowing data to be received in parallel, and increasing overall throughput.
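As a rough Scala sketch of this pattern, assuming the receiver-based `KafkaUtils.createStream` API from the spark-streaming-kafka module, with placeholder topic names, ZooKeeper quorum, and consumer group:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ParallelReceivers")
val ssc = new StreamingContext(conf, Seconds(2))

// One input DStream (and hence one receiver) per topic; names are placeholders.
val topics = Seq("topicA", "topicB")
val kafkaStreams = topics.map { topic =>
  KafkaUtils.createStream(ssc, "zkhost:2181", "consumer-group", Map(topic -> 1))
}

// Union the per-topic streams so the rest of the pipeline sees a single DStream.
val unified = ssc.union(kafkaStreams)
unified.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()
```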
 
 Another parameter that should be considered is the receiver's blocking interval. For most receivers,
 the received data is coalesced together into large blocks of data before storing inside Spark's memory.
@@ -916,9 +919,10 @@ the received data in a map-like transformation. This blocking interval is determ
 [configuration parameter](configuration.html) `spark.streaming.blockInterval` and the default value
 is 200 milliseconds.
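As a rough sketch of tuning this parameter, assuming (per the description above) that the value is given in milliseconds, it can be set on the SparkConf used to create the StreamingContext:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative value only: a smaller block interval yields more blocks per batch,
// and hence more map tasks to process each batch of received data.
val conf = new SparkConf()
  .setAppName("BlockIntervalTuning")
  .set("spark.streaming.blockInterval", "100")  // milliseconds, per this guide

val ssc = new StreamingContext(conf, Seconds(2))
```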
 
-If it is infeasible to parallelize the receiving using multiple input streams / receivers, it is sometimes beneficial to explicitly repartition the input data stream
-(using `inputStream.repartition(<number of partitions>)`) to distribute the received
-data across all the machines in the cluster before further processing.
+An alternative to receiving data with multiple input streams / receivers is to explicitly repartition
+the input data stream (using `inputStream.repartition(<number of partitions>)`).
+This distributes the received batches of data across all the machines in the cluster
+before further processing.
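As a rough Scala sketch of this alternative, assuming a single socket text stream and an illustrative partition count:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RepartitionExample")
val ssc = new StreamingContext(conf, Seconds(2))

// A single receiver delivers all data through one worker...
val inputStream = ssc.socketTextStream("localhost", 9999)

// ...so explicitly spread each received batch across 10 partitions
// (illustrative count) before the heavier processing begins.
val repartitioned = inputStream.repartition(10)
repartitioned.flatMap(_.split(" ")).count().print()

ssc.start()
ssc.awaitTermination()
```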
 
 ### Level of Parallelism in Data Processing
 Cluster resources may be under-utilized if the number of parallel tasks used in any stage of the