-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nakadi clients resilience to partial outage and partial success #181
Comments
Thanks for the issue (I was already starting to write my own, but you were faster). Just to make it clear, Nakadi-Producer is not a Nakadi-client by itself, but implementing the outbox pattern based on Fahrschein for the submission, following the "eventual submission" paradigm. Current behavior
So short failures (< 10 minutes) of one partition will just repair themselves (and successful events will not be retried), but way longer failures will create similar patterns of "retrying increasingly growing batches of events which could not be submitted before, plus a few new ones". Expected Behavior I think we need these additional measures:
We also got related #172 for handling of validation errors – there the behavior is even worse, as all the events in one batch will be blocked when some of them have validation failures. |
Great to hear you are aware of the problem and have ways to address it! Thank you for the details!
Yes, it is either you change the queuing of the events(1) or you change the way how you grab them from the queue(2). I would prefer the second option if possible because then you can mix them up into the batch once they reach the retry time or completely retry only failed events. probably implementation will be easier (I do not know the details though). Anyways it has to be done to not blocking publishing to available partitions.
if my understand correct, that means after some time Nakadi will receive unlimited batch of events, could we change the default here to some reasonable value from your experience? |
"Queue" here was more figurative – it's actually more like an unordered set, I now realize looking at the code. There is the table of all events which need sending out, and the scheduler will repeatedly lock some events (events which are already locked are skipped by this), then try to send these events out, then delete the successful ones. The lock some events part is actually (other than I remembered) not specifying any ordering – I guess I mixed this up with some other service of one of my teams where this is implemented manually, and it had some I guess we could add something like
If the events are produced continuously, they will be first locked at different times, so at every time only a small number of these locks will expire. Only these ones will be retried at every time, so I would normally not expect a really large number. Though if the producer is switched off for a while, and then switched on again, it could send out this huge batch at this point. We got this problem in the past (leading to an out-of-memory error in the producer), which is why we introduced this option at all. Back then, we kept the default at "unlimited" to stay compatible with the previous behavior. Unfortunately a reasonable limit here depends quite strongly on the size of events, so it's difficult to define a generic default in the library. Instead, we planned to make the configuration option a required one with some major version bump. I now created #182 to keep track of this. |
@ePaul any updates on the ticket ? |
Current status:
|
Nakadi publishing API accepts events in batches. It can fail to publish some events from the batch to underlying storage (Apache Kafka). In that case Nakadi publishing API will return error that batch was partially successful.
It can create problems the following problems, depending on how the Nakadi client and the publishing application deals with this partial success response:
The following should be done to decrease the possibility of mentioned problems:
Nakadi client should contain a note to developers that publishing can experience partial success. This should be in the client documentation and ideally also within the self contained code documentation, raising awareness for the users, e.g. via docstrings.
An optional retry method on batch level can be provided for the whole batch, but the default strategy must contain a backoff solution in case of continued errors to publish to Nakadi.
An optional retry method can be provided that only re-publishes unsuccessful events to Nakadi. This retry must also support a backoff strategy by default.
Clients must expose the result of a publishing request in a way that developers can understand that there is the possibility of a partial success for batch publishing.
The text was updated successfully, but these errors were encountered: