[SPARK-19584][SS][DOCS] update structured streaming documentation around batch mode
## What changes were proposed in this pull request?
Revision to structured-streaming-kafka-integration.md to reflect new Batch query specification and options.
zsxwing tdas
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Tyson Condie <tcondie@gmail.com>
Closes #16918 from tcondie/kafka-docs.
<td>"latest" for streaming, "earliest" for batch</td>
316
+
<td>streaming and batch</td>
197
317
<td>The start point when a query is started, either "earliest" which is from the earliest offsets,
198
318
"latest" which is just from the latest offsets, or a json string specifying a starting offset for
199
319
each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest.
200
-
Note: This only applies when a new Streaming query is started, and that resuming will always pick
201
-
up from where the query left off. Newly discovered partitions during a query will start at
320
+
Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed.
321
+
For streaming queries, this only applies when a new query is started, and that resuming will
322
+
always pick up from where the query left off. Newly discovered partitions during a query will start at
202
323
earliest.</td>
203
324
</tr>
325
+
<tr>
326
+
<td>endingOffsets</td>
327
+
<td>latest or json string
328
+
{"topicA":{"0":23,"1":-1},"topicB":{"0":-1}}
329
+
</td>
330
+
<td>latest</td>
331
+
<td>batch query</td>
332
+
<td>The end point when a batch query is ended, either "latest" which is just referred to the
333
+
latest, or a json string specifying an ending offset for each TopicPartition. In the json, -1
334
+
as an offset can be used to refer to latest, and -2 (earliest) as an offset is not allowed.</td>
335
+
</tr>
204
336
<tr>
205
337
<td>failOnDataLoss</td>
206
338
<td>true or false</td>
207
339
<td>true</td>
208
-
<td>Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or
340
+
<td>streaming query</td>
341
+
<td>Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or
209
342
offsets are out of range). This may be a false alarm. You can disable it when it doesn't work
210
-
as you expected.</td>
343
+
as you expected. Batch queries will always fail if it fails to read any data from the provided
344
+
offsets due to lost data.</td>
211
345
</tr>
212
346
<tr>
213
347
<td>kafkaConsumer.pollTimeoutMs</td>
214
348
<td>long</td>
215
349
<td>512</td>
350
+
<td>streaming and batch</td>
216
351
<td>The timeout in milliseconds to poll data from Kafka in executors.</td>
217
352
</tr>
218
353
<tr>
219
354
<td>fetchOffset.numRetries</td>
220
355
<td>int</td>
221
356
<td>3</td>
222
-
<td>Number of times to retry before giving up fatch Kafka latest offsets.</td>
357
+
<td>streaming and batch</td>
358
+
<td>Number of times to retry before giving up fetching Kafka offsets.</td>
223
359
</tr>
224
360
<tr>
225
361
<td>fetchOffset.retryIntervalMs</td>
226
362
<td>long</td>
227
363
<td>10</td>
364
+
<td>streaming and batch</td>
228
365
<td>milliseconds to wait before retrying to fetch Kafka offsets</td>
229
366
</tr>
230
367
<tr>
231
368
<td>maxOffsetsPerTrigger</td>
232
369
<td>long</td>
233
370
<td>none</td>
371
+
<td>streaming and batch</td>
234
372
<td>Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.</td>
235
373
</tr>
236
374
</table>
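To make the batch-mode options above concrete, here is a minimal Scala sketch of how they might be passed to the Kafka source; the bootstrap servers, topic names, and offsets are placeholder assumptions rather than values from the guide.

```scala
// Minimal sketch with placeholder servers, topics, and offsets (assumptions, not from the guide).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("KafkaOffsetsExample").getOrCreate()

// Batch query over an explicit offset range: -2 means earliest, -1 means latest.
// Note that "latest" (or -1) is not allowed in startingOffsets for batch queries.
val batchDF = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topicA,topicB")
  .option("startingOffsets", """{"topicA":{"0":23,"1":-2},"topicB":{"0":-2}}""")
  .option("endingOffsets", """{"topicA":{"0":50,"1":-1},"topicB":{"0":-1}}""")
  .load()

// Streaming query using the rate-limiting and data-loss options from the table above.
val streamDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topicA")
  .option("maxOffsetsPerTrigger", "10000")
  .option("failOnDataLoss", "false")
  .load()
```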
Note that the following Kafka params cannot be set and the Kafka source will throw an exception:

- **auto.offset.reset**: Set the source option `startingOffsets` to specify
where to start instead. Structured Streaming manages which offsets are consumed internally, rather
than relying on the Kafka consumer to do it. This will ensure that no data is missed when new
topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
streaming query is started, and resuming will always pick up from where the query left off.
- **key.deserializer**: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use
DataFrame operations to explicitly deserialize the keys.
- **value.deserializer**: Values are always deserialized as byte arrays with ByteArrayDeserializer.
Use DataFrame operations to explicitly deserialize the values.
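As an illustration of the last two points, a query can deserialize the binary key and value columns itself with ordinary DataFrame operations, for example by casting them to strings; the server and topic names below are placeholder assumptions.

```scala
// Sketch: key and value arrive as binary columns, so deserialize them explicitly
// (here via a simple CAST to STRING); server and topic names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("KafkaDeserializeExample").getOrCreate()

val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

// Cast the raw byte-array columns to strings instead of configuring Kafka deserializers.
val parsed = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```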