
[service-bus] Fix message loss issues with peekLock and receiveAndDelete #15989

Merged
merged 25 commits on Jul 1, 2021

Conversation

richardpark-msft
Member

@richardpark-msft richardpark-msft commented Jun 25, 2021

Fixing an issue where we could lose messages or provoke an alarming message from rhea (Received transfer when credit was 0)

The message loss issue is related to how we trigger 'drain' using 'addCredit(1)'. Our 'receiver.drain; receiver.addCredit(1)' pattern actually does add a credit, which shows up in the flow frame that gets sent for our drain. This has led to occasionally receiving more messages than we intended.
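
To make the pattern concrete, here's a minimal sketch (illustrative only, using the rhea-promise `Receiver` surface quoted above, not the actual batchingReceiver.ts code):

```ts
import type { Receiver } from "rhea-promise";

// Illustrative only: the drain request as described above. Setting `drain` and then
// calling addCredit(1) bumps the link credit by one, and that extra credit rides
// along in the flow frame that asks the service to drain, so the service is free
// to deliver one more message than the caller asked for.
function requestDrain(receiver: Receiver): void {
  receiver.drain = true;
  receiver.addCredit(1); // the "+1" here is the source of the occasional extra delivery
}
```

The change described here amounts to not granting that extra credit when requesting the drain, so the drain flow frame only flushes credit that is already outstanding.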

The second part of this was that we were masking this error because we had code that specifically threw out messages if more arrived than were requested. If the message was being auto-renewed, it's possible for the message to appear to be missing, and if we were in receiveAndDelete the message is effectively lost at that point. That code is now removed (we defer to just allowing the extra message through, should a bug arise that causes one) and we log an error indicating it did happen.
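
Roughly, the removed guard behaved like this sketch (names are invented and `console.error` stands in for the SDK's logger; this is not the actual code):

```ts
// Hypothetical sketch of handling an "excess" incoming message.
// Old behavior (removed): `if (received.length >= requested) return;` silently
// dropped the message; in receiveAndDelete mode it was already settled, so it was lost.
// New behavior: keep the message and log an error so the underlying credit bug stays visible.
function handleIncoming(received: unknown[], requested: number, message: unknown): void {
  if (received.length >= requested) {
    console.error("Received more messages than requested; returning the extra message anyway.");
  }
  received.push(message);
}
```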

The rhea error message appeared to be triggered by our accidentally allowing multiple overlapping 'drain's to occur (finalAction did not check to see if we were already draining and would allow it to happen multiple times). Removing the concurrent drains fixed this issue but I didn't fully investigate why.
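
The concurrent-drain guard can be pictured like this (a sketch with invented names; the real check is in finalAction() in batchingReceiver.ts):

```ts
// Hypothetical sketch: don't start a second drain while one is already in flight.
class DrainTracker {
  private isDraining = false;

  finalAction(sendDrainFlow: () => void): void {
    if (this.isDraining) {
      // Previously an overlapping drain could be requested here, which appeared to
      // provoke rhea's "Received transfer when credit was 0" warning.
      return;
    }
    this.isDraining = true;
    sendDrainFlow();
  }

  handleDrained(): void {
    // called when the drain completes, allowing the next one to proceed
    this.isDraining = false;
  }
}
```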

Fixes #15606, #15115

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@chradek chradek left a comment


I see you added a stress test scenario. Were you ever able to find a way to reliably reproduce the issue in a controlled environment? I didn't see any non-stress tests added.

If you emitted an additional message (amqp onMessage) after you call addCredit(1) with drain set to true, would that trigger it as well? Or does it have to go through rhea?

(Two resolved review comments on sdk/servicebus/service-bus/src/core/batchingReceiver.ts, now outdated)
@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

… the event listener was registered, rather than when the message handler was called. Fixing this revealed a couple of spots that were using the wrong timeout.

Also, storing off the finalAction in an internal member variable to make some testing easier.
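
In spirit, the timing change looks like this (hypothetical names; the only point is when the time reference is taken):

```ts
import { EventEmitter } from "events";

// Illustrative sketch only: the reference for the remaining wait time should be
// established when the message handler runs, not frozen when the listener is registered.
function waitForMessages(emitter: EventEmitter, maxWaitTimeInMs: number): void {
  // Buggy shape: the deadline is fixed here, at listener-registration time.
  let deadline = Date.now() + maxWaitTimeInMs;

  emitter.once("message", () => {
    // Fixed shape: (re)establish the reference when the handler is actually called,
    // so follow-up timeouts are measured from the right moment.
    deadline = Date.now() + maxWaitTimeInMs;
    console.log(`will stop waiting at ${new Date(deadline).toISOString()}`);
  });
}
```
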
…ode that was trying to be a bad eventemitter. Replaced it with an actual EventEmitter instead.

This had a nice byproduct of removing more code.

I also did some renames of variables because `receiver` is no longer a descriptive word (BatchingReceiver, BatchingReceiverLite, ServiceBusReceiver, etc.).
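
For context, the shape of that change (a sketch; the class and event name are invented, not the actual diff):

```ts
import { EventEmitter } from "events";

// Illustrative sketch only: a hand-rolled listener list / stored callback is
// replaced by Node's built-in EventEmitter.
class BatchingReceiverEvents extends EventEmitter {
  reportDrained(): void {
    this.emit("receiver_drained");
  }
}

const events = new BatchingReceiverEvents();
events.once("receiver_drained", () => {
  // e.g. resolve the pending receiveMessages() promise
});
events.reportDrained();
```
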
@richardpark-msft
Member Author

> I see you added a stress test scenario. Were you ever able to find a way to reliably reproduce the issue in a controlled environment? I didn't see any non-stress tests added.

Prior to the fixes I could definitely reproduce this. The stress test went the extra mile of hooking into the internal onMessage() call (before any interpretation by our code), which flagged the error much earlier.

> If you emitted an additional message (amqp onMessage) after you call addCredit(1) with drain set to true, would that trigger it as well? Or does it have to go through rhea?

For the real world failure you'd need to go through rhea, and even then it's possible that you don't get an extra message. Even when I was reproducing it, it was sporadic. The only guarantee was that, somewhere in the 200 receives (of 5 messages apiece) to AUS, I would see it at least a couple of times.

The reality is that this could have been happening for a long while and it would be hard to notice - the code that was previously throwing away 'excess' messages was basically masking the problem.

For a unit test (which is what I've added in the push I did just now) I basically just simulate calling finalAction() and check the before/after credit values (and ensure I don't try to do concurrent drain requests). That test is pretty simple but, combined with the stress test, gives me good confidence we've hit the key area.
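
In spirit, that unit test does something like the following (a sketch with invented names; the real test drives the finalAction stored on the batching receiver):

```ts
import { strict as assert } from "assert";

// Hypothetical fake of the underlying AMQP receiver, recording what a drain does.
class FakeReceiver {
  credit = 0;
  drainRequests = 0;
  addCredit(credit: number): void {
    this.credit += credit;
  }
  requestDrain(): void {
    this.drainRequests++;
  }
}

// Hypothetical stand-in for the batching receiver's finalAction (stored on a member
// in the real code precisely so a test can call it directly).
let isDraining = false;
function finalAction(receiver: FakeReceiver): void {
  if (isDraining) return; // no overlapping drains
  isDraining = true;
  receiver.requestDrain(); // drain without granting any new credit
}

const receiver = new FakeReceiver();
receiver.addCredit(5); // initial credit for a receiveMessages(5) call

finalAction(receiver);
finalAction(receiver); // second call is a no-op while draining

assert.equal(receiver.credit, 5); // draining did not add credit
assert.equal(receiver.drainRequests, 1); // only one drain flow was requested
```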

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft
Member Author

(previous live test run passed, just want to be sure so I'll be launching it a few times)

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…ange so I'm just rolling that back.

We'll release core-amqp at some point in the future, no need to do it now.
@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
