KCL continues to hold leases after unexpected shutdown due to Error #1207

dbottini · 2023-09-12T23:12:04Z

Hello, we identified an edge case in how KCL responds to a Throwable of type Error (as opposed to Exception). It leaves KCL in a broken state where it holds its leases indefinitely but doesn't process records from shards or do any other work.

KCL version: 2.4.8

In our scenario, we had an issue where we could not load a class at runtime in our RecordProcessor.processRecords implementation.

java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
	at service.processor.primary.PrimaryRecordProcessor.getBaseEvent(PrimaryRecordProcessor.java:463)
	at service.processor.primary.PrimaryRecordProcessor.lambda$processRecords$0(PrimaryRecordProcessor.java:108)
	at java.base/java.util.ArrayList.forEach(Unknown Source)
	at service.processor.primary.PrimaryRecordProcessor.processRecords(PrimaryRecordProcessor.java:105)
	at software.amazon.kinesis.lifecycle.ProcessTask.callProcessRecords(ProcessTask.java:224)
	at software.amazon.kinesis.lifecycle.ProcessTask.call(ProcessTask.java:162)
	at software.amazon.kinesis.lifecycle.ShardConsumer.executeTask(ShardConsumer.java:336)
	at software.amazon.kinesis.lifecycle.ShardConsumer.processData(ShardConsumer.java:322)
	at software.amazon.kinesis.lifecycle.ShardConsumer.handleInput(ShardConsumer.java:156)
	at software.amazon.kinesis.lifecycle.ShardConsumerSubscriber.onNext(ShardConsumerSubscriber.java:158)
	at software.amazon.kinesis.lifecycle.ShardConsumerSubscriber.onNext(ShardConsumerSubscriber.java:36)
	at software.amazon.kinesis.lifecycle.NotifyingSubscriber.onNext(NotifyingSubscriber.java:56)
	at software.amazon.kinesis.lifecycle.NotifyingSubscriber.onNext(NotifyingSubscriber.java:27)
	at io.reactivex.rxjava3.internal.util.HalfSerializer.onNext(HalfSerializer.java:46)
	at io.reactivex.rxjava3.internal.subscribers.StrictSubscriber.onNext(StrictSubscriber.java:97)
	at io.reactivex.rxjava3.internal.operators.flowable.FlowableObserveOn$ObserveOnSubscriber.runAsync(FlowableObserveOn.java:403)
	at io.reactivex.rxjava3.internal.operators.flowable.FlowableObserveOn$BaseObserveOnSubscriber.run(FlowableObserveOn.java:178)
	at io.reactivex.rxjava3.internal.schedulers.ExecutorScheduler$ExecutorWorker$BooleanRunnable.run(ExecutorScheduler.java:324)
	at io.reactivex.rxjava3.internal.schedulers.ExecutorScheduler$ExecutorWorker.runEager(ExecutorScheduler.java:289)
	at io.reactivex.rxjava3.internal.schedulers.ExecutorScheduler$ExecutorWorker.run(ExecutorScheduler.java:250)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib [in thread "ShardRecordProcessor-0001"]
	at java.base/java.lang.ClassLoader.loadLibrary(Unknown Source)
	at java.base/java.lang.Runtime.loadLibrary0(Unknown Source)
	at java.base/java.lang.System.loadLibrary(Unknown Source)
	at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:185)
	at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:157)
	at org.xerial.snappy.Snappy.init(Snappy.java:70)
	at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
	... 23 common frames omitted

While we clearly have issues of our own, the consequence of this error is that it gets caught as a Throwable t and propagated as a dispatchFailure: https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/lifecycle/ShardConsumerSubscriber.java#L162-L166
The Throwable t is then propagated upwards from ShardConsumer (https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/lifecycle/ShardConsumer.java#L202-L205) and passes uncaught through runProcessLoop, because an Error is not an Exception.
It then escapes Scheduler.run without triggering any kind of shutdown beyond ending the scheduler thread, and without even an error log saying that KCL is ending abruptly: https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/coordinator/Scheduler.java#L323-L328
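To illustrate the language rule behind this propagation, here is a minimal standalone sketch. It is not KCL's code: `ErrorPropagationDemo` and `runGuarded` are hypothetical names, and the guard stands in for any loop that only has a `catch (Exception e)` clause.

```java
// Illustrative only: a stand-in for a processing loop guarded by catch (Exception).
// The class and method names are hypothetical and are not part of KCL.
final class ErrorPropagationDemo {

    /** Runs the task the way a loop guarded only by catch (Exception) would. */
    static String runGuarded(Runnable task) {
        try {
            task.run();
            return "completed";
        } catch (Exception e) {
            // RuntimeException and other Exceptions land here, so the caller keeps running.
            return "caught Exception: " + e.getClass().getSimpleName();
        }
        // An Error such as NoClassDefFoundError extends Throwable but not Exception,
        // so it skips the catch clause above and propagates to the caller.
    }

    public static void main(String[] args) {
        System.out.println(runGuarded(() -> {
            throw new IllegalStateException("boom");
        }));
        try {
            runGuarded(() -> {
                throw new NoClassDefFoundError("simulated");
            });
        } catch (Throwable t) {
            System.out.println("escaped the guard: " + t.getClass().getSimpleName());
        }
    }
}
```

The same mechanism would apply to any Error subclass, including the ExceptionInInitializerError in the trace above.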

It may be worth noting that we start our KCL using the following:

ExecutorService executor = Executors.newCachedThreadPool();
executor.submit(scheduler);

We think this is a pretty standard way to run a Runnable, but it is fundamentally different from the model in the KCL documentation, and it may affect error propagation in a case like this. It's also hard to tell whether the KCL docs are set up their way because it's a one-file example application or because it is an implicit requirement. The documentation's model is:

Thread schedulerThread = new Thread(scheduler);
schedulerThread.setDaemon(true);
schedulerThread.start();
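The two start-up styles above do differ in how standard Java reports a Throwable that escapes run(), which may be part of why no error was logged in our case. The sketch below uses only java.util.concurrent and a plain Runnable in place of the scheduler; `SubmitVersusThreadDemo` and its methods are hypothetical names. With `submit`, the Throwable is stored in the returned Future and stays invisible unless something calls `get()`; with a plain Thread, it reaches the thread's uncaught-exception handler.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: compares where an escaping Throwable ends up in each start-up style.
final class SubmitVersusThreadDemo {

    /** Submits the task and returns the Throwable stored in its Future, or null if none. */
    static Throwable failureFromSubmit(Runnable task) {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            Future<?> future = executor.submit(task);
            future.get(); // without this call the failure is never surfaced
            return null;
        } catch (ExecutionException e) {
            return e.getCause(); // the original Error or Exception thrown by run()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    /** Runs the task on a plain Thread and returns what its uncaught-exception handler saw. */
    static Throwable failureFromThread(Runnable task) {
        Throwable[] seen = new Throwable[1];
        Thread thread = new Thread(task);
        thread.setUncaughtExceptionHandler((t, e) -> seen[0] = e);
        thread.start();
        try {
            thread.join(); // join makes the handler's write visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        Runnable failing = () -> {
            throw new NoClassDefFoundError("simulated");
        };
        System.out.println("held by Future: " + failureFromSubmit(failing));
        System.out.println("seen by handler: " + failureFromThread(failing));
    }
}
```

Under this assumption, keeping the Future returned by `executor.submit(scheduler)` and checking it, or using `execute` instead of `submit`, would at least make the failure visible. It would not by itself release any leases.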

What I think KCL should handle differently is that the Scheduler halting, even for a reason as anomalous as an Error, should trigger a proper cleanup of its resources and executor service.

Ideally I'd love to see something like a health check on KCL that we can call to identify an unhealthy KCL process, and that we can poll occasionally as part of normal service health-check operations.
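As a stopgap on the application side, something like the following wrapper could approximate such a health check. This is only a sketch under assumptions: `MonitoredRunnable`, its `State` enum, and `isHealthy()` are hypothetical names and not a KCL API, and it only detects that the wrapped run() has exited. It does not release leases or shut anything down.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical wrapper, not part of KCL: records whether the wrapped Runnable is still running.
final class MonitoredRunnable implements Runnable {

    enum State { NOT_STARTED, RUNNING, STOPPED, FAILED }

    private final Runnable delegate;
    private final AtomicReference<State> state = new AtomicReference<>(State.NOT_STARTED);
    private volatile Throwable failure;

    MonitoredRunnable(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        state.set(State.RUNNING);
        try {
            delegate.run();
            state.set(State.STOPPED); // run() returned normally
        } catch (Throwable t) {
            // Catches Error as well as Exception, records it, then rethrows unchanged.
            failure = t;
            state.set(State.FAILED);
            throw t;
        }
    }

    /** True only while the wrapped run() is still executing. */
    boolean isHealthy() {
        return state.get() == State.RUNNING;
    }

    State state() {
        return state.get();
    }

    /** The Throwable that ended run(), or null if there was none. */
    Throwable failure() {
        return failure;
    }

    public static void main(String[] args) {
        MonitoredRunnable monitored = new MonitoredRunnable(() -> {
            throw new NoClassDefFoundError("simulated");
        });
        try {
            monitored.run();
        } catch (Throwable t) {
            // The rethrown Error arrives here in this demo.
        }
        System.out.println(monitored.state() + ", healthy=" + monitored.isHealthy());
    }
}
```

With our start-up code this would be used as `executor.submit(new MonitoredRunnable(scheduler))`, with the service health check polling `isHealthy()`. What to do once it reports unhealthy (for example, restarting the process) is left to the service.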


stair-aws added the v2.x label (Issues related to the 2.x version) on Oct 12, 2023
