
Dispatching timeouts on non DTC transports + storages can result in message loss #2885

Closed
andreasohlund opened this issue Sep 11, 2015 · 23 comments

@andreasohlund
Member

Notice of a Critical Bug regarding Timeouts in NServiceBus

Recently we discovered an edge case in timeout management that can result in message loss. The problem occurs when the transport send operation fails during timeout message dispatch, causing the timeout to be lost.

The issue affects all versions of NServiceBus from 3.0.0 up to, but not including, 6.0.0 when distributed transactions are disabled or when the chosen transport and persistence combination doesn't support distributed transactions.

Below you can find detailed information on what the problem is, whether you may be affected, and how to proceed if you are at risk.

How to know if you might be affected

You might be affected if:

  • You are using the RabbitMQ or Azure transport, or you are using the MSMQ or SQL transport with distributed transactions (DTC) disabled.

and:

  • You are using saga timeouts, second-level retries, or you defer message dispatch using lower-level methods directly (e.g. bus.Defer()).

What to do if you are at risk

Upgrade packages

First of all, update the NServiceBus NuGet package to a version containing the fix. We've released patches for all minor versions starting from 4.4.0. If you are using an earlier version, we recommend upgrading to at least 4.4.8.

Depending on the combination of NServiceBus version, persistence, and transport, you might also need to update the persistence and transport packages.

If you are using any of the packages below, please make sure you are using one of the listed versions:

  • NServiceBus.Core: 4.4.8, 4.5.7, 4.6.10, 4.7.7, 5.0.6, 5.1.4, 5.2.8 or higher
  • NServiceBus.NHibernate: 4.5.5, 5.0.3, 6.0.3, 6.1.3, 6.2.3 or higher
  • NServiceBus.RavenDB: 1.0.2, 2.0.3, 2.1.2, 2.2.2, 3.0.4 or higher
  • NServiceBus.SqlServer: 1.2.4 or higher
  • NServiceBus.Azure: 5.3.12, 6.0.7, 6.1.5, 6.2.3 or higher

Disable exception

With the new releases of NServiceBus, endpoints will stop and warn you if you're using an outdated timeout persister or transport which could lead to message loss. The warning can contain one of the following messages:

You are using an outdated timeout persistence which can lead to message loss! Please update the configured timeout persistence (...).

You are using an outdated transport which can lead to message loss! Please update the configured transport (...).

In case you're unable to update the affected package, you can suppress the warning and continue to run in an unsafe mode. The warnings can be suppressed using a code API or via the configuration file.

To disable the exception via code, include the following lines in the endpoint configuration:
For NServiceBus v4.x:

Configure configure = Configure.With();
configure.SuppressOutdatedTimeoutPersistenceWarning();
configure.SuppressOutdatedTransportWarning();

For NServiceBus v5.x:

BusConfiguration busConfiguration = new BusConfiguration();
busConfiguration.SuppressOutdatedTimeoutPersistenceWarning();

To disable the exception via configuration, add the following section to the app.config file:
For NServiceBus v4.x:

<appSettings>
  <add key="NServiceBus/suppress-outdated-timeout-persistence-warning" value="true" />
  <add key="NServiceBus/suppress-outdated-transport-warning" value="true" />
</appSettings>

For NServiceBus v5.x:

<appSettings>
  <add key="NServiceBus/suppress-outdated-timeout-persistence-warning" value="true" />
</appSettings>

Please double-check that you are not at risk before disabling the exception. Even if you are not affected, we recommend upgrading the transport and persistence packages as soon as possible.
Disabling the exception should be considered a temporary workaround that gives you more time to upgrade all dependencies; it doesn't solve the actual problem.

Custom persistence provider

If you rely on a custom persistence implementation or support one of the community persistence projects, you can find more details in the documentation (http://docs.particular.net/nservicebus/persistence/authoring-custom).

If you have any questions or need assistance with the upgrade, do not hesitate to contact our support team.

Regards,
The Particular Engineering Team

@SeanFeldman
Contributor

@indualagarsamy
Contributor

@andreasohlund - is this a recent behavior change - do you know since what version we are behaving this way?

@ramonsmits
Member

@ramonsmits
Member

Assuming that TryRemove can run as an isolated transaction, the message send should be performed before the removal.

This means there is a possibility of more-than-once delivery of the message, but only if no DTC is used, because when it's not a shared connection but a new one, the transaction will be enlisted in the DTC.

When not using the DTC, the customer will probably already have the outbox enabled or their own idempotent processing to prevent more-than-once processing.

The fix would be to simply reverse the items:

MessageSender.Send(timeoutData.ToTransportMessage(), timeoutData.ToSendOptions(Configure.LocalAddress));
TimeoutsPersister.TryRemove(timeoutId, out timeoutData);
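A minimal sketch of why the ordering matters, modeled in Python for brevity (the store/queue names and the failure injection are illustrative, not NServiceBus APIs): remove-then-send loses the timeout if the send fails, while send-then-remove at worst re-sends it on retry.

```python
# Sketch of the two dispatch orderings in the timeout manager.
# All names here are illustrative, not actual NServiceBus APIs.

class SendFailure(Exception):
    pass

def dispatch_remove_first(store, queue, timeout_id, send_ok=True):
    """Buggy order: delete the timeout record, then send."""
    data = store.pop(timeout_id, None)
    if data is None:
        return  # assume it was already dispatched
    if not send_ok:
        raise SendFailure()  # record is already gone -> timeout lost
    queue.append(data)

def dispatch_send_first(store, queue, timeout_id, send_ok=True):
    """Fixed order: send first, then delete. A failure after the send
    leaves the record, so a retry re-sends (duplicate, not loss)."""
    data = store.get(timeout_id)
    if data is None:
        return
    if not send_ok:
        raise SendFailure()
    queue.append(data)
    store.pop(timeout_id, None)

# Failed send followed by a retry, buggy order: message is lost.
store, queue = {"t1": "msg"}, []
try:
    dispatch_remove_first(store, queue, "t1", send_ok=False)
except SendFailure:
    pass
dispatch_remove_first(store, queue, "t1")  # retry finds nothing
assert queue == []  # lost

# Same scenario, fixed order: at-least-once delivery.
store, queue = {"t1": "msg"}, []
try:
    dispatch_send_first(store, queue, "t1", send_ok=False)
except SendFailure:
    pass
dispatch_send_first(store, queue, "t1")  # retry succeeds
assert queue == ["msg"]
```

The trade-off is exactly the one discussed above: the fixed ordering converts potential loss into potential duplication, which the outbox or idempotent handlers must absorb.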

@ramonsmits
Member

Regarding the NHibernate implementation, there is some magic happening:

The behavior differs depending on whether:

  • there is a shared IDbConnection
  • there is an ambient transaction

When there is a shared connection, no actual commit will occur. The DELETE will be performed when the shared connection is committed. (I think the behavior will commit once all messages have been sent?)

When there is no shared connection and no ambient transaction, a new connection and a new transaction are created, so the DELETE will be committed.

When there is no shared connection but there is an ambient transaction, a new connection is created and enlisted in the DTC. The DELETE will be committed if the ambient/DTC transaction is committed.
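The three cases above can be condensed into a small decision function (an illustrative Python model of the described behavior, not the actual NHibernate persister code):

```python
# Model of when the timeout DELETE becomes durable in the NHibernate
# persister, per the three cases described above. Illustrative only.

def delete_commit_behavior(shared_connection: bool, ambient_transaction: bool) -> str:
    """Return when the DELETE actually commits."""
    if shared_connection:
        # DELETE rides on the shared connection's transaction and only
        # becomes durable when that outer transaction commits.
        return "with the shared connection's transaction"
    if ambient_transaction:
        # A new connection is created and enlisted in the ambient (DTC)
        # transaction; the DELETE commits with the DTC transaction.
        return "when the ambient/DTC transaction commits"
    # A new connection and local transaction: the DELETE commits at once.
    return "immediately"
```

Only the last case (no sharing, no ambient transaction) commits the DELETE independently of the send, which is the window where the loss described in this issue can occur.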

@ramonsmits
Member

Could our plans to "delay" and batch all outgoing ops after the handlers are done help here? (Or would it actually hurt, meaning we need to request an "immediate" dispatch?)

Batching sends would only be useful if you can commit the DELETE after sending the messages or storing them in the outbox. If you do not have control over the TM commit, or do not have a transaction, then indeed the message MUST be sent or stored in the outbox immediately.

Can we leverage the outbox if enabled?

No, the outbox / batching has no benefit as it does not change the order, but we should send via the outbox immediately if it is enabled.

Seems like (at least my) idea that we would only need AtLeastOnce guarantees even for transports like MSMQ together with SQLServer was wrong, and that we should use the DTC if available?

My interpretation: The idea that we would only need at least once guarantees was wrong and we should use DTC when available.

I don't understand this. When there is no DTC, then all processing should result in at least once processing. What do you mean?

@andreasohlund
Member Author

is this a recent behavior change - do you know since what version we are behaving this way?

It's been this way at least since v4.

@andreasohlund
Member Author

I don't understand this. When there is no DTC, then all processing should result in at least once processing. What do you mean?

The idea was that the timeout infrastructure would never need to use the DTC, even though it is available, in order to get better performance. This seems wrong to me now (it was just an idea, cc @SzymonPobiega).

@udidahan
Member

I've reread this 3-4 times already and I still don't see the message loss scenario. It's probably the decaying engineering cortex in my now predominantly CEO brain.

In this mode we'll remove the TM and commit the storage TX. If we then go boom on the send or on the commit of the queueing TX we'll rollback. On the retry we'll not find the timeout => message loss

That sounds like the original message is not lost.
What might be lost here is the timeout message / data.

But the context was:

So if we can't find a timeout to remove from storage

So what exactly is lost here?

@SeanFeldman
Contributor

@udidahan if the TM message that is based off a TM table record is not dispatched and the TM table record is gone, the TM message will never be sent, hence the loss.

@ramonsmits
Member

On the topic of breaking changes and SemVer: we can fix this without breaking changes by not changing the current IPersistTimeouts interface. We add a new interface that has the ability to read timeout data without deleting it, implement it in the persisters, and then release.

As with the encryption service's possibility of message corruption, we continue operation and try to decrypt with the old behavior, but each time log a warning that corruption is possible. The same can be applied here: log a warning if the persister does not implement the new interface and mention the possibility of data loss. Also, at startup we can log an ERROR mentioning that the current IPersistTimeouts implementation needs to be upgraded to prevent loss of timeouts.

  • If a customer updates their packages, everything works without message loss and without warnings.
  • If the customer only updates NServiceBus, they will receive loads of errors/warnings indicating an issue.
  • If the customer updates all packages but uses a custom persistence implementation, they get errors and warnings indicating they should update their implementation.

The only remaining problem is that timeouts are now potentially processed more than once. A customer needs to be aware of that when the DTC is not enabled. Maybe add a warning, when the DTC is not enabled, that more-than-once delivery is possible? The customer probably already has idempotent handling or is using the outbox, but imagine a case where ONE endpoint is not using the DTC, with the outbox, and another is relying on the DTC. The DTC endpoint will potentially receive the same message twice and still requires idempotent processing or the outbox.
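The additive-interface proposal above can be sketched as follows (a Python model; the interface name IQueryTimeouts, the method names, and the startup check are hypothetical illustrations, not the actual NServiceBus types):

```python
# Model of the backward-compatible interface upgrade: persisters that
# implement the new (hypothetical) IQueryTimeouts interface can read a
# timeout without deleting it; old persisters trigger a startup error
# unless the user explicitly suppresses it. Names are illustrative.

import logging

class IPersistTimeouts:
    """Existing interface: reading a timeout removes it."""
    def try_remove(self, timeout_id):
        raise NotImplementedError

class IQueryTimeouts(IPersistTimeouts):
    """New interface: can read timeout data without deleting it."""
    def peek(self, timeout_id):
        raise NotImplementedError

def check_persister(persister, suppress_warning=False):
    """Startup check: require the new interface unless suppressed."""
    if isinstance(persister, IQueryTimeouts):
        return True  # safe: send can happen before the delete
    if suppress_warning:
        logging.warning(
            "Outdated timeout persistence; message loss is possible.")
        return False  # unsafe mode, user opted in
    raise RuntimeError(
        "You are using an outdated timeout persistence which can lead to "
        "message loss! Please update the configured timeout persistence.")
```

This mirrors the behavior described in the advisory above: updated packages pass silently, an outdated persister stops the endpoint, and the suppress switch downgrades the error to a warning.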

@udidahan
Member

Yup - @andreasohlund explained it to me on a call.

@timbussmann
Contributor

@ramonsmits according to SemVer, adding a new interface would actually require a minor version bump. But I think we basically all agreed it's an acceptable solution for the issue since it won't break anyone.

About the warning when a user uses an older transport not providing the interface: there were different ideas, from logging a warning up to throwing an exception and essentially shutting down the endpoint if a user uses an outdated version and meets the conditions that would expose them to the bug (while providing a "suppress-potential-message-loss-warning" switch).
Again, following SemVer, the exception would clearly be a breaking change. But discussions have shown that most prioritize preventing message loss over SemVer, so both approaches seem valid.

@SeanFeldman
Contributor

@timbussmann I'll work on updating the Azure persistence. Updated the list to include 7.0 (core v6).

@timbussmann
Contributor

@SeanFeldman I think so, yes. We already have the same issue text on all NSB Core releases (on the PR). But let's keep that text short and link to this issue.

@SeanFeldman
Contributor

Hotfixed versions of NSB were on MyGet. Now they are gone. What feed can I use to continue work on NSB.Azure? Thanks.
/cc @timbussmann

@timbussmann
Contributor

@SeanFeldman sorry, I removed them in order to push the newest packages. They should be there again.

@SeanFeldman
Contributor

List of the Azure-related packages that need to be released from MyGet (pushed into NuGet):

  • 5.3.12
    • NServiceBus.Azure 5.3.12
    • NServiceBus.Azure.Transports.WindowsAzureServiceBus 5.3.12
    • NServiceBus.Azure.Transports.WindowsAzureStorageQueues 5.3.12
    • NServiceBus.Hosting.Azure 5.3.12
    • NServiceBus.Hosting.Azure.HostProcess 5.3.12
  • 6.0.7
    • NServiceBus.Azure 6.0.7
    • NServiceBus.Azure.Transports.WindowsAzureServiceBus 6.0.7
    • NServiceBus.Azure.Transports.WindowsAzureStorageQueues 6.0.7
    • NServiceBus.Hosting.Azure 6.0.7
    • NServiceBus.Hosting.Azure.HostProcess 6.0.7
  • 6.1.5
    • NServiceBus.Azure 6.1.5
    • NServiceBus.Azure.Transports.WindowsAzureServiceBus 6.1.5
    • NServiceBus.Azure.Transports.WindowsAzureStorageQueues 6.1.5
    • NServiceBus.Hosting.Azure 6.1.5
    • NServiceBus.Hosting.Azure.HostProcess 6.1.5
  • 6.2.3
    • NServiceBus.Azure 6.2.3

Release notes were created manually (support- branches are failing in Octopus). Should be able to deploy from NuGet and publish the release notes from GitHub.

@weralabaj
Contributor

Taskforce: @weralabaj @timbussmann @SeanFeldman

So if we can't find a timeout to remove from storage

https://github.com/Particular/NServiceBus/blob/develop/src/NServiceBus.Core/DelayedDelivery/TimeoutManager/DispatchTimeoutBehavior.cs#L26

we just assume it was already dispatched and complete the receive. This works fine when both storage and transport can use the DTC (or share a transaction, like NH + SQL transport), since the delete will roll back. All good so far.

When DTC or TX sharing isn't possible

In this mode we'll remove the TM and commit the storage TX. If we then go boom on the send or on the commit of the queueing TX, we'll roll back. On the retry we'll not find the timeout => message loss.

I think we're better off reversing the ops here, i.e. sending the message and then deleting the TM. This means we won't lose the message but instead open up for duplicate timeouts. This is IMO acceptable because for those transports/modes of operation duplicate messages are expected and need to be handled by the developer (or our outbox).

Questions

  • Could our plans to "delay" and batch all outgoing ops after the handlers are done help here? (Or would it actually hurt, meaning we need to request an "immediate" dispatch?)
  • Can we leverage the outbox if enabled?
  • Seems like the (at least my) idea that we would only need AtLeastOnce guarantees, even for transports like MSMQ together with SQL Server, was wrong, and that we should use the DTC if available?

Plan of Attack

@timbussmann
Contributor

@andreasohlund @weralabaj I think we can close the issue now, since implementations of IPersistTimeouts need to adjust to the changed interface anyway when upgrading to v6?

@weralabaj
Contributor

Go for it.

@andreasohlund
Member Author

👍

