Hello, we identified an edge condition in how KCL responds to a Throwable of type Error (as opposed to Exception): it leaves a broken KCL that holds leases indefinitely but doesn't process records from shards or do any other work.
KCL version: 2.4.8
In our scenario, we had an issue where we could not load a class at runtime in our RecordProcessor.processRecords implementation.
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
at service.processor.primary.PrimaryRecordProcessor.getBaseEvent(PrimaryRecordProcessor.java:463)
at service.processor.primary.PrimaryRecordProcessor.lambda$processRecords$0(PrimaryRecordProcessor.java:108)
at java.base/java.util.ArrayList.forEach(Unknown Source)
at service.processor.primary.PrimaryRecordProcessor.processRecords(PrimaryRecordProcessor.java:105)
at software.amazon.kinesis.lifecycle.ProcessTask.callProcessRecords(ProcessTask.java:224)
at software.amazon.kinesis.lifecycle.ProcessTask.call(ProcessTask.java:162)
at software.amazon.kinesis.lifecycle.ShardConsumer.executeTask(ShardConsumer.java:336)
at software.amazon.kinesis.lifecycle.ShardConsumer.processData(ShardConsumer.java:322)
at software.amazon.kinesis.lifecycle.ShardConsumer.handleInput(ShardConsumer.java:156)
at software.amazon.kinesis.lifecycle.ShardConsumerSubscriber.onNext(ShardConsumerSubscriber.java:158)
at software.amazon.kinesis.lifecycle.ShardConsumerSubscriber.onNext(ShardConsumerSubscriber.java:36)
at software.amazon.kinesis.lifecycle.NotifyingSubscriber.onNext(NotifyingSubscriber.java:56)
at software.amazon.kinesis.lifecycle.NotifyingSubscriber.onNext(NotifyingSubscriber.java:27)
at io.reactivex.rxjava3.internal.util.HalfSerializer.onNext(HalfSerializer.java:46)
at io.reactivex.rxjava3.internal.subscribers.StrictSubscriber.onNext(StrictSubscriber.java:97)
at io.reactivex.rxjava3.internal.operators.flowable.FlowableObserveOn$ObserveOnSubscriber.runAsync(FlowableObserveOn.java:403)
at io.reactivex.rxjava3.internal.operators.flowable.FlowableObserveOn$BaseObserveOnSubscriber.run(FlowableObserveOn.java:178)
at io.reactivex.rxjava3.internal.schedulers.ExecutorScheduler$ExecutorWorker$BooleanRunnable.run(ExecutorScheduler.java:324)
at io.reactivex.rxjava3.internal.schedulers.ExecutorScheduler$ExecutorWorker.runEager(ExecutorScheduler.java:289)
at io.reactivex.rxjava3.internal.schedulers.ExecutorScheduler$ExecutorWorker.run(ExecutorScheduler.java:250)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib [in thread "ShardRecordProcessor-0001"]
at java.base/java.lang.ClassLoader.loadLibrary(Unknown Source)
at java.base/java.lang.Runtime.loadLibrary0(Unknown Source)
at java.base/java.lang.System.loadLibrary(Unknown Source)
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:185)
at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:157)
at org.xerial.snappy.Snappy.init(Snappy.java:70)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
... 23 common frames omitted
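To illustrate why a trace like this bypasses handlers written for Exception, here is a minimal, self-contained sketch. InitFailureDemo, BrokenInit and touch are hypothetical names used only for illustration; BrokenInit stands in for a class such as Snappy whose static initializer fails because a native library is missing. Per the Java Language Specification, the first access surfaces the original Error, and every later access fails with NoClassDefFoundError. Both are subclasses of Error, so a catch (Exception e) block never sees them.

```java
public class InitFailureDemo {

    /** Hypothetical stand-in for a class whose static initializer cannot complete. */
    static class BrokenInit {
        // Not a compile-time constant, so reading it triggers class initialization.
        static final int VALUE = load();

        private static int load() {
            throw new UnsatisfiedLinkError("simulated: native library not found");
        }
    }

    /** Touches BrokenInit and returns whatever Throwable that raises, or null if none. */
    public static Throwable touch() {
        try {
            int ignored = BrokenInit.VALUE;
            return null;
        } catch (Exception e) {
            // Not reached for this failure: the thrown types extend Error, not Exception.
            return e;
        } catch (Throwable t) {
            return t;
        }
    }

    public static void main(String[] args) {
        System.out.println(touch().getClass().getName()); // first access
        System.out.println(touch().getClass().getName()); // any later access
    }
}
```

In a record processor this means that a try/catch around the per-record work needs to catch Throwable (or at least LinkageError) if it is meant to contain this kind of failure.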
The Error is caught as Throwable t and propagated as a dispatchFailure. https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/lifecycle/ShardConsumerSubscriber.java#L162-L166
Throwable t is then propagated upwards from ShardConsumer https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/lifecycle/ShardConsumer.java#L202-L205 and then through the runProcessLoop (as Error is not an Exception) and Scheduler.run, without triggering any kind of shutdown aside from ending the scheduler thread, or even an error log that KCL is ending abruptly. https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/coordinator/Scheduler.java#L323-L328
It may be worth noting that we start our KCL using the following:
Thread schedulerThread = new Thread(scheduler);
schedulerThread.setDaemon(true);
schedulerThread.start();
This is a pretty standard model, but it is fundamentally different from the KCL documentation's model, and it may affect error propagation in a case like this. It's also hard to tell whether the KCL docs are set up this way because it's a one-file example application or because this is an implicit requirement.
What I think KCL should handle differently: when the Scheduler halts, even for something as anomalous as an Error, it should trigger a proper cleanup of resources and the executor service.
Ideally I'd love to see something like a health check on KCL that we can call to identify an unhealthy KCL process, which we could poll occasionally as part of normal service health check operations.
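Until KCL offers something like this, one possible stopgap is to wrap the scheduler thread so that any Throwable ending it is recorded and can be polled. This is only a sketch under stated assumptions: SchedulerWatchdog and all of its methods are hypothetical names, not part of KCL, and it assumes (as described in this issue) that the Error actually terminates the thread running the Scheduler. It does not release leases or shut down KCL's executor service; it only makes a dead scheduler visible to a service health check.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper, not a KCL class: runs a Runnable and records what ended it. */
public class SchedulerWatchdog {
    private final Thread thread;
    private final AtomicReference<Throwable> failure = new AtomicReference<>();

    public SchedulerWatchdog(Runnable scheduler) {
        this.thread = new Thread(() -> {
            try {
                scheduler.run();
            } catch (Throwable t) {
                // Throwable, not Exception, so Errors such as NoClassDefFoundError are recorded too.
                failure.set(t);
            }
        }, "scheduler-watchdog");
        this.thread.setDaemon(true);
    }

    public void start() {
        thread.start();
    }

    /** True only while the wrapped Runnable is still running and no failure was recorded. */
    public boolean isHealthy() {
        return thread.isAlive() && failure.get() == null;
    }

    /** The Throwable that ended the wrapped Runnable, or null if there was none. */
    public Throwable failure() {
        return failure.get();
    }

    /** Waits up to the given time for the thread to end; returns true if it has ended. */
    public boolean awaitTermination(long millis) {
        try {
            thread.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !thread.isAlive();
    }
}
```

With this in place, the three-line startup shown earlier would become new SchedulerWatchdog(scheduler).start(), and the service's existing health endpoint could call isHealthy() on each poll and log failure() when it turns false.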