
[service-bus] Fix message loss issues with peekLock and receiveAndDelete #15989

Merged
merged 25 commits on Jul 1, 2021

Conversation

richardpark-msft
Member

@richardpark-msft richardpark-msft commented Jun 25, 2021

Fixing an issue where we could lose messages or provoke an alarming message from rhea (Received transfer when credit was 0)

The message loss issue is related to how we trigger 'drain' using 'addCredit(1)'. Our 'receiver.drain; receiver.addCredit(1)' pattern actually does add a credit, which shows up in the flow frame that gets sent for our drain. This has led to occasionally receiving more messages than we intended.
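
To make the pattern concrete, here's a minimal sketch (illustrative only, using the rhea-promise `Receiver` surface quoted above, not the actual batchingReceiver.ts code):

```ts
import type { Receiver } from "rhea-promise";

// Illustrative only: the drain request as described above. Setting `drain` and then
// calling addCredit(1) bumps the link credit by one, and that extra credit rides
// along in the flow frame that asks the service to drain, so the service is free
// to deliver one more message than the caller asked for.
function requestDrain(receiver: Receiver): void {
  receiver.drain = true;
  receiver.addCredit(1); // the "+1" here is the source of the occasional extra delivery
}
```

The change described here amounts to not granting that extra credit when requesting the drain, so the drain flow frame only flushes credit that is already outstanding.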

The second part of this was that we were masking this error because we had code that specifically threw out messages if more arrived than were requested. If the message was being auto-renewed, it's possible for the message to appear to be missing, and if we were in receiveAndDelete the message is effectively lost at that point. That code is now removed (we defer to just allowing the extra message through, should a bug arise that causes one) and we log an error indicating it did happen.
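
Roughly, the removed guard behaved like this sketch (names are invented and `console.error` stands in for the SDK's logger; this is not the actual code):

```ts
// Hypothetical sketch of handling an "excess" incoming message.
// Old behavior (removed): `if (received.length >= requested) return;` silently
// dropped the message; in receiveAndDelete mode it was already settled, so it was lost.
// New behavior: keep the message and log an error so the underlying credit bug stays visible.
function handleIncoming(received: unknown[], requested: number, message: unknown): void {
  if (received.length >= requested) {
    console.error("Received more messages than requested; returning the extra message anyway.");
  }
  received.push(message);
}
```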

The rhea error message appeared to be triggered by our accidentally allowing multiple overlapping 'drain's to occur (finalAction did not check to see if we were already draining and would allow it to happen multiple times). Removing the concurrent drains fixed this issue but I didn't fully investigate why.
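
The concurrent-drain guard can be pictured like this (a sketch with invented names; the real check is in finalAction() in batchingReceiver.ts):

```ts
// Hypothetical sketch: don't start a second drain while one is already in flight.
class DrainTracker {
  private isDraining = false;

  finalAction(sendDrainFlow: () => void): void {
    if (this.isDraining) {
      // Previously an overlapping drain could be requested here, which appeared to
      // provoke rhea's "Received transfer when credit was 0" warning.
      return;
    }
    this.isDraining = true;
    sendDrainFlow();
  }

  handleDrained(): void {
    // called when the drain completes, allowing the next one to proceed
    this.isDraining = false;
  }
}
```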

Fixes #15606, #15115

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@chradek chradek left a comment


I see you added a stress test scenario. Were you ever able to find a way to reliably reproduce the issue in a controlled environment? I didn't see any non-stress tests added.

If you emitted an additional message (amqp onMessage) after you call addCredit(1) with drain set to true, would that trigger it as well? Or does it have to go through rhea?

(Two resolved review comments on sdk/servicebus/service-bus/src/core/batchingReceiver.ts, now outdated)
@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

… the event listener was registered, rather than when the message handler was called. Fixing this revealed a couple of spots that were using the wrong timeout.

Also, storing off the finalAction in an internal member variable to make some testing easier.
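
In spirit, the timing change looks like this (hypothetical names; the only point is when the time reference is taken):

```ts
import { EventEmitter } from "events";

// Illustrative sketch only: the reference for the remaining wait time should be
// established when the message handler runs, not frozen when the listener is registered.
function waitForMessages(emitter: EventEmitter, maxWaitTimeInMs: number): void {
  // Buggy shape: the deadline is fixed here, at listener-registration time.
  let deadline = Date.now() + maxWaitTimeInMs;

  emitter.once("message", () => {
    // Fixed shape: (re)establish the reference when the handler is actually called,
    // so follow-up timeouts are measured from the right moment.
    deadline = Date.now() + maxWaitTimeInMs;
    console.log(`will stop waiting at ${new Date(deadline).toISOString()}`);
  });
}
```
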
…ode that was trying to be a bad eventemitter. Replaced it with an actual EventEmitter instead.

This had a nice byproduct of removing more code.

I also did some renames of variables because `receiver` is no longer a descriptive word (BatchingReceiver, BatchingReceiverLite, ServiceBusReceiver, etc.).
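
For context, the shape of that change (a sketch; the class and event name are invented, not the actual diff):

```ts
import { EventEmitter } from "events";

// Illustrative sketch only: a hand-rolled listener list / stored callback is
// replaced by Node's built-in EventEmitter.
class BatchingReceiverEvents extends EventEmitter {
  reportDrained(): void {
    this.emit("receiver_drained");
  }
}

const events = new BatchingReceiverEvents();
events.once("receiver_drained", () => {
  // e.g. resolve the pending receiveMessages() promise
});
events.reportDrained();
```
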
@richardpark-msft
Member Author

> I see you added a stress test scenario. Were you ever able to find a way to reliably reproduce the issue in a controlled environment? I didn't see any non-stress tests added.

Prior to the fixes I could definitely reproduce this. The stress test went the extra mile of hooking into the internal onMessage() call (before any interpretation by our code), which flagged the error much earlier.

> If you emitted an additional message (amqp onMessage) after you call addCredit(1) with drain set to true, would that trigger it as well? Or does it have to go through rhea?

For the real world failure you'd need to go through rhea, and even then it's possible that you don't get an extra message. Even when I was reproducing it, it was sporadic. The only guarantee was that, somewhere in the 200 receives (of 5 messages apiece) to AUS, I would see it at least a couple of times.

The reality is that this could have been happening for a long while and it would be hard to notice - the code that was previously throwing away 'excess' messages was basically masking the problem.

For a unit test (which is what I've added in the push I did just now) I basically just simulate calling finalAction() and check the before/after credit values (and ensure I don't try to do concurrent drain requests). That test is pretty simple but, combined with the stress test, gives me good confidence we've hit the key area.
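
In spirit, that unit test does something like the following (a sketch with invented names; the real test drives the finalAction stored on the batching receiver):

```ts
import { strict as assert } from "assert";

// Hypothetical fake of the underlying AMQP receiver, recording what a drain does.
class FakeReceiver {
  credit = 0;
  drainRequests = 0;
  addCredit(credit: number): void {
    this.credit += credit;
  }
  requestDrain(): void {
    this.drainRequests++;
  }
}

// Hypothetical stand-in for the batching receiver's finalAction (stored on a member
// in the real code precisely so a test can call it directly).
let isDraining = false;
function finalAction(receiver: FakeReceiver): void {
  if (isDraining) return; // no overlapping drains
  isDraining = true;
  receiver.requestDrain(); // drain without granting any new credit
}

const receiver = new FakeReceiver();
receiver.addCredit(5); // initial credit for a receiveMessages(5) call

finalAction(receiver);
finalAction(receiver); // second call is a no-op while draining

assert.equal(receiver.credit, 5); // draining did not add credit
assert.equal(receiver.drainRequests, 1); // only one drain flow was requested
```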

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft
Member Author

(previous live test run passed, just want to be sure so I'll be launching it a few times)

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…ange so I'm just rolling that back.

We'll release core-amqp at some point in the future, no need to do it now.
@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@richardpark-msft
Member Author

/azp run js - service-bus - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
