Conversation

@chadlagore (Contributor) commented Aug 26, 2020

Fixes bug #87.

This condition causes an infinite loop and throttling from AWS when the shard is empty. In my example on the ticket, I show it hitting the Kinesis API repeatedly for minutes before giving up on the shard. We should handle empty shards more gracefully: they admit no lastReadSequenceNumber no matter how many times you hit them sequentially.

Moreover, I believe the condition is unnecessary, because you can simply increase maxFetchTimeInMs if you want to spend a longer time reading from the shard.
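
For readers following along, here is a minimal sketch of the kind of fetch loop at issue. The identifiers (fetchOnce, readShard, the loop shape) are illustrative stand-ins, not the connector's actual code:

object FetchLoopSketch {
  val maxFetchTimeInMs = 1000L

  // Stand-in for one GetRecords call; returns the last sequence number read,
  // if any. On an empty shard this never returns one.
  def fetchOnce(): Option[String] = None

  def readShard(): Option[String] = {
    val startTs = System.currentTimeMillis()
    var lastReadSequenceNumber: Option[String] = None
    // Before this change, the loop condition also included something like
    // `|| lastReadSequenceNumber.isEmpty`, so an empty shard kept the loop
    // spinning past the time budget and drew throttling from AWS.
    // With only the time budget, an empty shard simply yields nothing
    // for this micro-batch.
    while (System.currentTimeMillis() - startTs < maxFetchTimeInMs) {
      fetchOnce().foreach(seq => lastReadSequenceNumber = Some(seq))
    }
    lastReadSequenceNumber
  }
}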

@elainearbaugh left a comment

Makes sense to me. I don't quite understand the comment about getting stuck in the loop when we have data near the tip of the stream but aren't spending enough time to read it -- why would reading data at the beginning be any different than reading it anywhere else?

@chadlagore (Contributor, Author) commented Sep 2, 2020

Agreed. Here is an example of a consumer running this code against a set of empty shards (these are the persisted checkpoint offsets):

# batch 3 (no data)
v1
{"batchWatermarkMs":0,"batchTimestampMs":1599006537913,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"4"}}
{"metadata":{"streamName":"stream_name","batchId":"3"},"shardId-000000000000":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1599005556576"}}
{"metadata":{"streamName":"stream_name","batchId":"3"},"shardId-000000000000":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1599005556576"}}

# batch 4 (data appears)
v1
{"batchWatermarkMs":1599005056342,"batchTimestampMs":1599006671256,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"4"}}
{"metadata":{"streamName":"stream_name","batchId":"4"},"shardId-000000000000":{"iteratorType":"AFTER_SEQUENCE_NUMBER","iteratorPosition":"49609690170106349127857093631690192845261414844357148674"}}
{"metadata":{"streamName":"stream_name","batchId":"4"},"shardId-000000000000":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1599005556576"}}

# batch 5
v1
{"batchWatermarkMs":1599005056342,"batchTimestampMs":1599006691777,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"4"}}
{"metadata":{"streamName":"stream_name","batchId":"5"},"shardId-000000000000":{"iteratorType":"AFTER_SEQUENCE_NUMBER","iteratorPosition":"49609690170106349127857093631690192845261414844357148674"}}
{"metadata":{"streamName":"stream_name","batchId":"5"},"shardId-000000000000":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1599005556576"}}

Notice that at batch 4, data appears and we move to the sequence number of the incoming data.
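
For context, here is a rough sketch of how offsets like these would map onto Kinesis shard-iterator requests (AWS SDK v1 from Scala; the ShardOffset case class is hypothetical and mirrors the JSON fields above, not the connector's internal types):

import com.amazonaws.services.kinesis.model.GetShardIteratorRequest

object OffsetSketch {
  // Hypothetical type for illustration, mirroring the checkpoint JSON fields.
  case class ShardOffset(iteratorType: String, iteratorPosition: String)

  def iteratorRequest(streamName: String, shardId: String, o: ShardOffset): GetShardIteratorRequest = {
    val req = new GetShardIteratorRequest()
      .withStreamName(streamName)
      .withShardId(shardId)
      .withShardIteratorType(o.iteratorType)
    o.iteratorType match {
      case "AT_TIMESTAMP" =>
        // iteratorPosition holds epoch millis, e.g. "1599005556576"
        req.withTimestamp(new java.util.Date(o.iteratorPosition.toLong))
      case "AFTER_SEQUENCE_NUMBER" | "AT_SEQUENCE_NUMBER" =>
        req.withStartingSequenceNumber(o.iteratorPosition)
      case _ =>
        req // TRIM_HORIZON and LATEST carry no position
    }
  }
}

An AT_TIMESTAMP offset re-seeks by time on every batch, while AFTER_SEQUENCE_NUMBER pins the position to the last record read -- the transition visible at batch 4.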

@itsvikramagr (Contributor) commented Sep 2, 2020

@chadlagore @elainearbaugh

In the code, I am handling the following scenario: say we have started reading from trim_horizon. We then need to make multiple get-records API calls to reach the point where Kinesis has data (unfortunately, unlike other data sources, Kinesis streams won't give you the first available record in one API call). If we don't reach the point where Kinesis has data within the specified maxFetchTimeInMs, we will have the same problem in the next micro-batch, which will again start reading from trim_horizon. This loop will continue and we will not process any data from that particular stream.

I agree that the current approach violates the meaning of maxFetchTimeInMs and will lead to AWS throttling when we are already at the tip of the stream and there is no new data to read.

Do you have any good ideas for handling the above-mentioned scenario?
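
To make that scenario concrete, here is a hedged sketch of the behavior against a raw Kinesis client (AWS SDK v1 from Scala; the stream name, shard id, and the cap of 20 calls are assumptions for illustration):

import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}

object TrimHorizonProbe extends App {
  val client = AmazonKinesisClientBuilder.defaultClient()
  var iterator = client.getShardIterator(
    new GetShardIteratorRequest()
      .withStreamName("stream_name")       // assumed name
      .withShardId("shardId-000000000000") // assumed shard
      .withShardIteratorType("TRIM_HORIZON")
  ).getShardIterator

  var calls = 0
  var gotData = false
  // Kinesis can return empty batches (with a non-null next iterator) several
  // times before the iterator reaches the first available record.
  while (!gotData && iterator != null && calls < 20) {
    val res = client.getRecords(new GetRecordsRequest().withShardIterator(iterator))
    gotData = !res.getRecords.isEmpty
    iterator = res.getNextShardIterator
    calls += 1
  }
  println(s"records appeared after $calls GetRecords call(s)")
}

If the fetch budget expires while the iterator is still traversing the empty region, the next micro-batch restarts from trim_horizon and repeats the same traversal, which is the loop described above.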

@itsvikramagr merged commit d64907a into qubole:2.4.0 on Sep 7, 2020
@chadlagore (Contributor, Author)

Thanks for merging this @itsvikramagr. I believe the answer to your question is that on a call at TRIM_HORIZON (or at any timestamp prior to the retention period), even with an otherwise empty stream, Kinesis is still able to return a new record at the tip of the stream, and this library then moves the sequence number to that point. The fact that the GetRecords API can take more than one call to return data is a bit odd, but I think that is a characteristic of Kinesis that is addressed by the maxFetchTimeInMs param.
