
Dispatching timeouts on non DTC transports + storages can result in message loss #2885

Closed
andreasohlund opened this issue Sep 11, 2015 · 23 comments

@andreasohlund
Member

Notice of a Critical Bug regarding Timeouts in NServiceBus

Recently we discovered an edge case in timeout management that can result in message loss. The problem occurs when the transport send operation fails during timeout message dispatch, causing the timeout to be lost.

The issue affects all versions of NServiceBus from 3.0.0 up to, but not including, 6.0.0 when distributed transactions are disabled or when the chosen transport and persistence combination doesn't support distributed transactions.

Below you can find detailed information on what the problem is, whether you may be affected, and how to proceed if you are at risk.

How to know if you might be affected

You might be affected if:

  • You are using the RabbitMQ or Azure transport, or you are using the MSMQ or SQL transport with distributed transactions (DTC) disabled.

and:

  • You are using saga timeouts, second-level retries, or you defer message dispatch using lower-level methods directly (e.g. bus.Defer()).

What to do if you are at risk

Upgrade packages

First of all, update the NServiceBus NuGet package to a version containing the fix. We've released patches for all minor versions starting from 4.4.0. If you are using an earlier version, we recommend upgrading to at least 4.4.8.

Depending on the combination of NServiceBus version, persistence, and transport, you might also need to update the persistence and transport packages.

If you are using any of the packages below, please make sure you are using one of the listed versions:

  • NServiceBus.Core: 4.4.8, 4.5.7, 4.6.10, 4.7.7, 5.0.6, 5.1.4, 5.2.8 or higher
  • NServiceBus.NHibernate: 4.5.5, 5.0.3, 6.0.3, 6.1.3, 6.2.3 or higher
  • NServiceBus.RavenDB: 1.0.2, 2.0.3, 2.1.2, 2.2.2, 3.0.4 or higher
  • NServiceBus.SqlServer: 1.2.4 or higher
  • NServiceBus.Azure: 5.3.12, 6.0.7, 6.1.5, 6.2.3 or higher

Disable exception

With the new releases of NServiceBus, endpoints will stop and warn you if you're using an outdated timeout persister or transport which could lead to message loss. The warning can contain one of the following messages:

You are using an outdated timeout persistence which can lead to message loss! Please update the configured timeout persistence (...).

You are using an outdated transport which can lead to message loss! Please update the configured transport (...).

In case you're unable to update the affected package, you can suppress the warning and continue to run in an unsafe mode. The warnings can be suppressed using a code API or via the configuration file.

To disable the exception via code, include the following lines in the endpoint configuration:
For NServiceBus v4.x:

Configure configure = Configure.With();
configure.SuppressOutdatedTimeoutPersistenceWarning();
configure.SuppressOutdatedTransportWarning();

For NServiceBus v5.x:

BusConfiguration busConfiguration = new BusConfiguration();
busConfiguration.SuppressOutdatedTimeoutPersistenceWarning();

To disable the exception via configuration, add the following section to the app.config file:
For NServiceBus v4.x:

<appSettings>
  <add key="NServiceBus/suppress-outdated-timeout-persistence-warning" value="true" />
  <add key="NServiceBus/suppress-outdated-transport-warning" value="true" />
</appSettings>

For NServiceBus v5.x:

<appSettings>
  <add key="NServiceBus/suppress-outdated-timeout-persistence-warning" value="true" />
</appSettings>

Please double-check that you are not at risk before disabling the exception. Even if you are not affected, we recommend upgrading the transport and persistence packages as soon as possible.
Disabling the exception should be considered a temporary workaround that gives you more time to upgrade all dependencies; it doesn't solve the actual problem.

Custom persistence provider

If you rely on a custom persistence implementation or support one of the community persistence projects, you can find more details in the documentation (http://docs.particular.net/nservicebus/persistence/authoring-custom).

If you have any questions or need assistance with the upgrade, do not hesitate to contact our support team.

Regards,
The Particular Engineering Team

@SeanFeldman
Contributor

@indualagarsamy
Contributor

@andreasohlund - is this a recent behavior change - do you know since what version we are behaving this way?

@ramonsmits
Member

@ramonsmits
Member

Assuming that TryRemove can run as an isolated transaction, the message send should be performed before the removal.

This means there is a possibility of more-than-once delivery of the message, but only if no DTC is used, because when it's not a shared connection but a new one, the transaction will be enlisted in the DTC.

When not using the DTC, the customer will probably already have the outbox enabled or their own idempotent processing to prevent more-than-once processing.

The fix would be to simply reverse the items:

MessageSender.Send(timeoutData.ToTransportMessage(), timeoutData.ToSendOptions(Configure.LocalAddress));
TimeoutsPersister.TryRemove(timeoutId, out timeoutData);
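A minimal sketch of why the ordering matters, modeled in Python for brevity (the store/queue names and the failure injection are illustrative, not NServiceBus APIs): remove-then-send loses the timeout if the send fails, while send-then-remove at worst re-sends it on retry.

```python
# Sketch of the two dispatch orderings in the timeout manager.
# All names here are illustrative, not actual NServiceBus APIs.

class SendFailure(Exception):
    pass

def dispatch_remove_first(store, queue, timeout_id, send_ok=True):
    """Buggy order: delete the timeout record, then send."""
    data = store.pop(timeout_id, None)
    if data is None:
        return  # assume it was already dispatched
    if not send_ok:
        raise SendFailure()  # record is already gone -> timeout lost
    queue.append(data)

def dispatch_send_first(store, queue, timeout_id, send_ok=True):
    """Fixed order: send first, then delete. A failure after the send
    leaves the record, so a retry re-sends (duplicate, not loss)."""
    data = store.get(timeout_id)
    if data is None:
        return
    if not send_ok:
        raise SendFailure()
    queue.append(data)
    store.pop(timeout_id, None)

# Failed send followed by a retry, buggy order: message is lost.
store, queue = {"t1": "msg"}, []
try:
    dispatch_remove_first(store, queue, "t1", send_ok=False)
except SendFailure:
    pass
dispatch_remove_first(store, queue, "t1")  # retry finds nothing
assert queue == []  # lost

# Same scenario, fixed order: at-least-once delivery.
store, queue = {"t1": "msg"}, []
try:
    dispatch_send_first(store, queue, "t1", send_ok=False)
except SendFailure:
    pass
dispatch_send_first(store, queue, "t1")  # retry succeeds
assert queue == ["msg"]
```

The trade-off is exactly the one discussed above: the fixed ordering converts potential loss into potential duplication, which the outbox or idempotent handlers must absorb.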

@ramonsmits
Member

Regarding the NHibernate implementation, there is some magic happening:

The behavior differs depending on whether:

  • there is a shared IDbConnection
  • there is an ambient transaction

When there is a shared connection, no actual commit will occur. The DELETE will be performed when the shared connection is committed. (I think the behavior will commit once all messages have been sent?)

When there is no shared connection and no ambient transaction, a new connection and a new transaction are created, so the DELETE will be committed.

When there is no shared connection but there is an ambient transaction, a new connection is created and enlisted in the DTC. The DELETE will be committed if the ambient/DTC transaction is committed.
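The three cases above can be condensed into a small decision function (an illustrative Python model of the described behavior, not the actual NHibernate persister code):

```python
# Model of when the timeout DELETE becomes durable in the NHibernate
# persister, per the three cases described above. Illustrative only.

def delete_commit_behavior(shared_connection: bool, ambient_transaction: bool) -> str:
    """Return when the DELETE actually commits."""
    if shared_connection:
        # DELETE rides on the shared connection's transaction and only
        # becomes durable when that outer transaction commits.
        return "with the shared connection's transaction"
    if ambient_transaction:
        # A new connection is created and enlisted in the ambient (DTC)
        # transaction; the DELETE commits with the DTC transaction.
        return "when the ambient/DTC transaction commits"
    # A new connection and local transaction: the DELETE commits at once.
    return "immediately"
```

Only the last case (no sharing, no ambient transaction) commits the DELETE independently of the send, which is the window where the loss described in this issue can occur.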

@ramonsmits
Member

Could our plans to "delay" and batch all outgoing ops after the handlers are done help here? (Or would it actually hurt, meaning we need to request an "immediate" dispatch?)

Batching sends would only be useful if you can commit the DELETE after sending the messages or storing them in the outbox. If you do not have control over the TM commit, or do not have a transaction, then indeed the message MUST be sent or stored in the outbox immediately.

Can we leverage the outbox if enabled?

No, the outbox / batching has no benefit as it does not change the order, but we should send via the outbox immediately if it is enabled.

Seems like (at least my) idea that we would only need AtLeastOnce guarantees even for transports like MSMQ together with SQLServer was wrong, and that we should use the DTC if available?

My interpretation: The idea that we would only need at least once guarantees was wrong and we should use DTC when available.

I don't understand this. When there is no DTC, then all processing should result in at least once processing. What do you mean?

@andreasohlund
Member Author

is this a recent behavior change - do you know since what version we are behaving this way?

It's been this way at least since v4.

@andreasohlund
Member Author

I don't understand this. When there is no DTC, then all processing should result in at least once processing. What do you mean?

The idea was that the timeout infrastructure would never need to use the DTC, even though it is available, in order to get better performance. This seems wrong to me now (it was just an idea, cc @SzymonPobiega).

@udidahan
Member

I've reread this 3-4 times already and I still don't see the message loss scenario. It's probably the decaying engineering cortex in my now predominantly CEO brain.

In this mode we'll remove the TM and commit the storage TX. If we then go boom on the send or on the commit of the queueing TX we'll rollback. On the retry we'll not find the timeout => message loss

That sounds like the original message is not lost.
What might be lost here is the timeout message / data.

But the context was:

So if we can't find a timeout to remove from storage

So what exactly is lost here?

@SeanFeldman
Contributor

@udidahan if the TM message that is based off a TM table record is not dispatched and the TM table record is gone, the TM message will never be sent, hence the loss.

@ramonsmits
Member

On the topic of breaking changes and SemVer: we can fix this without breaking changes by not changing the current IPersistTimeouts interface. We add a new interface that has the ability to read timeout data without deleting it, implement it in the persisters, and then release.

As with the encryption service's possibility of message corruption, we continue operation and try to decrypt with the old behavior, but each time log a warning that corruption is possible. The same can be applied here: log a warning if the persister does not implement the new interface and mention the possibility of data loss. Also, at startup we can log an ERROR mentioning that the current IPersistTimeouts implementation needs to be upgraded to prevent loss of timeouts.

  • If a customer updates their packages, everything works without message loss and without warnings.
  • If the customer only updates NServiceBus, they will receive loads of errors/warnings indicating an issue.
  • If the customer updates all packages but uses a custom persistence implementation, they get errors and warnings indicating they should update their implementation.

The only remaining problem is that timeouts are now potentially processed more than once. A customer needs to be aware of that when the DTC is not enabled. Maybe add a warning, when the DTC is not enabled, that more-than-once delivery is possible? The customer probably already has idempotent handling or is using the outbox, but imagine a case where ONE endpoint is not using the DTC, with the outbox, and another is relying on the DTC. The DTC endpoint will potentially receive the same message twice and still requires idempotent processing or the outbox.
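The additive-interface proposal above can be sketched as follows (a Python model; the interface name IQueryTimeouts, the method names, and the startup check are hypothetical illustrations, not the actual NServiceBus types):

```python
# Model of the backward-compatible interface upgrade: persisters that
# implement the new (hypothetical) IQueryTimeouts interface can read a
# timeout without deleting it; old persisters trigger a startup error
# unless the user explicitly suppresses it. Names are illustrative.

import logging

class IPersistTimeouts:
    """Existing interface: reading a timeout removes it."""
    def try_remove(self, timeout_id):
        raise NotImplementedError

class IQueryTimeouts(IPersistTimeouts):
    """New interface: can read timeout data without deleting it."""
    def peek(self, timeout_id):
        raise NotImplementedError

def check_persister(persister, suppress_warning=False):
    """Startup check: require the new interface unless suppressed."""
    if isinstance(persister, IQueryTimeouts):
        return True  # safe: send can happen before the delete
    if suppress_warning:
        logging.warning(
            "Outdated timeout persistence; message loss is possible.")
        return False  # unsafe mode, user opted in
    raise RuntimeError(
        "You are using an outdated timeout persistence which can lead to "
        "message loss! Please update the configured timeout persistence.")
```

This mirrors the behavior described in the advisory above: updated packages pass silently, an outdated persister stops the endpoint, and the suppress switch downgrades the error to a warning.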

@udidahan
Member

Yup - @andreasohlund explained it to me on a call.

@timbussmann
Contributor

@ramonsmits according to SemVer, adding a new interface would actually require a minor version bump. But I think we basically all agreed it's an acceptable solution for the issue since it won't break anyone.

About the warning when a user uses an older transport not providing the interface: there were different ideas, from logging a warning up to throwing an exception and essentially shutting down the endpoint if a user uses an outdated version and meets the conditions that would expose them to the bug (while providing a "suppress-potential-message-loss-warning" switch).
Again, following SemVer, the exception would clearly be a breaking change. But discussions have shown that most prioritize preventing message loss over SemVer, so both approaches seem valid.

@SeanFeldman
Contributor

@timbussmann I'll work on updating the Azure persistence. Updated the list to include 7.0 (core v6).

@timbussmann
Contributor

@SeanFeldman I think so, yes. We already have the same issue text on all NSB Core releases (on the PR). But let's keep that text short and link to this issue.

@SeanFeldman
Contributor

Hotfixed versions of NSB were on MyGet. Now they are gone. What feed can I use to continue work on NSB.Azure? Thanks.
/cc @timbussmann

@timbussmann
Contributor

@SeanFeldman sorry, I removed them in order to push the newest packages. They should be there again.

@SeanFeldman
Contributor

List of the Azure-related packages that need to be released from MyGet (pushed into NuGet):

  • 5.3.12
    • NServiceBus.Azure 5.3.12
    • NServiceBus.Azure.Transports.WindowsAzureServiceBus 5.3.12
    • NServiceBus.Azure.Transports.WindowsAzureStorageQueues 5.3.12
    • NServiceBus.Hosting.Azure 5.3.12
    • NServiceBus.Hosting.Azure.HostProcess 5.3.12
  • 6.0.7
    • NServiceBus.Azure 6.0.7
    • NServiceBus.Azure.Transports.WindowsAzureServiceBus 6.0.7
    • NServiceBus.Azure.Transports.WindowsAzureStorageQueues 6.0.7
    • NServiceBus.Hosting.Azure 6.0.7
    • NServiceBus.Hosting.Azure.HostProcess 6.0.7
  • 6.1.5
    • NServiceBus.Azure 6.1.5
    • NServiceBus.Azure.Transports.WindowsAzureServiceBus 6.1.5
    • NServiceBus.Azure.Transports.WindowsAzureStorageQueues 6.1.5
    • NServiceBus.Hosting.Azure 6.1.5
    • NServiceBus.Hosting.Azure.HostProcess 6.1.5
  • 6.2.3
    • NServiceBus.Azure 6.2.3

Release notes were created manually (support- branches are failing in Octopus). Should be able to deploy from NuGet and publish the release notes from GitHub.

@weralabaj
Contributor

Taskforce: @weralabaj @timbussmann @SeanFeldman

So if we can't find a timeout to remove from storage

https://github.com/Particular/NServiceBus/blob/develop/src/NServiceBus.Core/DelayedDelivery/TimeoutManager/DispatchTimeoutBehavior.cs#L26

we just assume it was already dispatched and complete the receive. This works fine when both storage and transport can use the DTC (or share a transaction, like NH + SQL transport), since the delete will roll back. All good so far.

When DTC or TX sharing isn't possible

In this mode we'll remove the TM and commit the storage TX. If we then go boom on the send or on the commit of the queueing TX, we'll roll back. On the retry we'll not find the timeout => message loss.

I think we're better off reversing the ops here, i.e. sending the message and then deleting the TM. This means we won't lose the message but instead open up for duplicate timeouts. This is IMO acceptable because for those transports/modes of operation duplicate messages are expected and need to be handled by the developer (or our outbox).

Questions

  • Could our plans to "delay" and batch all outgoing ops after the handlers are done help here? (Or would it actually hurt, meaning we need to request an "immediate" dispatch?)
  • Can we leverage the outbox if enabled?
  • Seems like the (at least my) idea that we would only need AtLeastOnce guarantees, even for transports like MSMQ together with SQL Server, was wrong, and that we should use the DTC if available?

Plan of Attack

@timbussmann
Contributor

@andreasohlund @weralabaj I think we can close the issue now, since implementations of IPersistTimeouts need to adjust to the changed interface anyway when upgrading to v6?

@weralabaj
Contributor

Go for it.

@andreasohlund
Member Author

👍

