-
Notifications
You must be signed in to change notification settings - Fork 648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dispatching timeouts on non DTC transports + storages can result in message loss #2885
Comments
@andreasohlund this https://groups.google.com/forum/#!topic/particularsoftware/ZMs8Q_gD2dw could be related |
@andreasohlund - is this a recent behavior change - do you know since what version we are behaving this way? |
By the way, Also, it is very unclear what the behavior is of the NHibernate TryRemove implementation: |
Assuming that This means there is a possibility of more than once delivery of the message only if no DTC is used because when its not a shared connection but a new one then this TX will be enlisted in DTC. In case of not using DTC, then the customer will probably already have the outbox enabled or its own idempotent processing to prevent more than once processing. The fix would be to simply reverse the items:
|
In regards with the NHibernate implementation there is some magic happening: The behavior is different depending on if:
When there is a shared transaction no actual commit will occur. The DELETE will be performed when the shared connection is committed. (I think when all message have been send, the behavior will commit?) When no shared connection and no ambient transaction then a connection and new transaction is created. So DELETE will be committed. When no shared connection and an ambient transaction then a new connection is created and enlisted in DTC. DELETE will be committed if ambient/DTC transaction is committed. |
Batching sends would only be useful if you can commit the DELETE after sending the messages or storing them in the outbox. If you do not have control over TM commit or do not have a transaction then, indeed, the message MUST be send or stored in the oubox immediately .
No, outbox / batching has no benefit as it does not change the order but we should send it via the outbox immediately if enabled.
My interpretation: The idea that we would only need at least once guarantees was wrong and we should use DTC when available. I don't understand this. When there is no DTC, then all processing should result in at least once processing. What do you mean? |
Its been this way at least since v4 |
The idea was the the timout infrastructure never would need to use the DTC even thought it is available to get better perf. This seems wrong to me now (it was just an idea , cc @SzymonPobiega ) |
I've reread this already 3-4 times and I still don't see the message loss scenario. It's probably the decaying engineering cortext in my now predominantly CEO brain.
That sounds like the original message is not lost. But the context was:
So what exactly is lost here? |
@udidahan if TM message that is based off a TM table record is not dispatched and TM table record is gone, the TM message will never happen again, hence the loss. |
On the topic of breaking changes and semver. We can fix this without breaking changes by not changing the current As with the encryption service possibility of message corruption we also continue operation and try to decrypt with old behavior but each time log a warning in the log that possible corruption can occur. The same can be applied here, log a warning if the persister does not have the new interface and mention that there is the possibility of data loss. Also, at startup we can log an ERROR mentioning that the current
The only problem remaining is that now timeouts are potentially processed more than once. A customer needs to be aware of that when DTC is not enabled. Maybe not a warning if DTC is not enabled that more then only delivery is possible? The customer probably already has idempotent handling or is using the outbox but image a case where ONE endpoint is not using DTC with outbox and another is relying on DTC. The DTC endpoint will potentially receive the same message twice and still requires idempotent processing or the outbox. |
Yup - @andreasohlund explained it to me on a call. |
@ramonsmits according to SemVer, adding a new interface would actually require a minor version bump. But I think we basically all agreed to be an acceptable solution for the issue since it won't break anyone. about the warning when an user uses an older transport not providing an interface: there were different ideas from logging an warning up to throwing an exception and essentially shutting down the endpoint if an user uses an outdated version and meeting conditions which would expose him to the bug (and providing an "suppress-potential-message-loss-warning" switch. |
@timbussmann I'll work on updating azure persistence. Updated the list to have the 7.0 (core v6). |
@SeanFeldman I think so, yes. We already have the same issue text on all NSB Core releases (on the PR). But let's have that text short and link to this issue. |
Hotfixed versions on NSB were on myget. Now they are gone. What feed can I use to continue work on NSB.Azure? Thanks. |
@SeanFeldman sorry I removed them in order to push the newest package. Should be there again. |
List of the azure-related packages that need to be released from MyGet (pushed into NuGet):
Release notes created manually (support- branches are not working with Octopus (failing). |
Taskforce: @weralabaj @timbussmann @SeanFeldman So if we can't find a timeout to remove from storage we just assume it was already disapatched and complete the receive. This works fine when both storage and transport can use DTC (or share tx like NH + SQLTransport) since the delete will rollback. All good so far. When DTC or TX sharing isn't possibleIn this mode we'll remove the TM and commit the storage TX. If we then go boom on the send or on the commit of the queueing TX we'll rollback. On the retry we'll not find the timeout => message loss I think we're better off to reverse the ops here, ie to send the message and then delete the TM. This means we won't loose the message but instead open up for duplicate timeouts. This is IMO acceptable because for those transports/modes of operation duplicate messages is expected and needs to be handled be the developer (or our outbox) Questions
Plan of Attack
|
@andreasohlund @weralabaj I think we can close the issue now, since implementations of IPersistTimeouts need to adjust to the changed interface anyway when upgrading to v6? |
Go for it. |
👍 On Fri, Oct 16, 2015 at 1:29 PM, Weronika Łabaj notifications@github.com
|
Notice of a Critical Bug regarding Timeouts in NServiceBus
Recently we have discovered an edge case scenario with timeout management that might result in a message loss. The problem occurs when the transport send operation fails during the timeout message dispatch which causes the timeout to be lost.
The issue affects all versions of NServiceBus (starting from version 3.0.0 up to version 6.0.0 exclusive) when distributed transactions are disabled or when the chosen transport and persistence combination doesn't support distributed transactions.
Here you can find detailed information on what the problem is, whether you may be affected and how to proceed if you are under risk.
How to know if you might be affected
You might be affected if:
and:
What to do if you are under risk
Upgrade packages
First of all, update NServiceBus nuget package to the version containing the fix. We've released patches for all minor versions starting from 4.4.0. If you are using an earlier version then we recommend upgrading to at least 4.4.8.
Depending on the combination of the NServiceBus version, persistence and transport, you might also need to additionally update persistence and transport packages.
If you are using any of the packages below, please make sure you are using one of the listed versions:
Disable exception
With the new releases of NServiceBus, endpoints will stop and warn you if you're using an outdated timeout persister or transport which could lead to message loss. The warning can contain one of the following messages:
You are using an outdated timeout persistence which can lead to message loss! Please update the configured timeout persistence (...).
You are using an outdated transport which can lead to message loss! Please update the configured transport (...).
In case you're unable to update the affected package you can suppress the warning and continue to run in an unsafe mode. The warnings can be suppressed using a code API or via the configuration file:
To disable exception via code include the following lines in the endpoint configuration:
For NServiceBus v4.x:
For NServiceBus v5.x:
To disable exception via configuration add the following section in app.config file:
For NServiceBus v4.x:
For NServiceBus v5.x:
Please, double check that you are not under risk before disabling the exception. Even if you are not affected, we recommend to upgrade transport and persistence packages as soon as possible.
Disabling exception should be considered a temporary workaround that gives you more time for upgrading all dependencies, but it doesn't solve the actual problem.
Custom persistence provider
If you rely on a custom persistence implementation or support one of the community persistence projects, you can find more details in the documentation (http://docs.particular.net/nservicebus/persistence/authoring-custom).
If you have any questions or need assistance in upgrade, do not hesitate to contact our support team.
Regards,
The Particular Engineering Team
The text was updated successfully, but these errors were encountered: