Description
Library name and version
Azure.Messaging.ServiceBus 7.18.2
Describe the bug
We are seeing an issue where transient connection errors to Service Bus (`ServiceCommunicationProblem`) result in huge memory spikes (50%+) on our session receiver. This is a problem because we're uncertain how much we need to over-provision our k8s pods so they aren't restarted when reaching limits.
Expected behavior
Memory usage should return to its previous baseline once connectivity is restored. More generally, we need to understand the resource requirements so we know how much to over-provision.
Actual behavior
We currently have a k8s cluster with a few hundred devices/session processors per instance, and this works perfectly well until these transient `ServiceCommunicationProblem` errors occur every few days, which more than doubles the memory usage, exceeding the quota and causing k8s to restart the instance (the errors usually occur in only one or two instances at a time, not all of them).
Background
We are using code similar to the repro below on the gateway side of a system handling connections to thousands of IoT devices (I previously outlined our service side here: #42022 (comment)). Whenever a device connects to our gateway, we need a session processor with a session lock for that specific deviceId (to receive messages from our service to the device). We use infinite renewal because we never want the lock to expire; we only ever want to lose it when the device disconnects and we call `processor.CloseAsync()`. We want all the library's in-built backoff/retry mechanisms to keep the session lock, and to re-obtain it when necessary. We never want it to give up, because the device itself is still connected to us.
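For context, the per-device lifecycle on disconnect looks roughly like the sketch below (`_processors` and `OnDeviceDisconnectedAsync` are illustrative names, not our exact code):

```csharp
// One ServiceBusSessionProcessor per connected device, torn down when that device
// disconnects. (Uses System.Collections.Concurrent for the dictionary.)
private readonly ConcurrentDictionary<string, ServiceBusSessionProcessor> _processors = new();

private async Task OnDeviceDisconnectedAsync(string deviceId)
{
    if (_processors.TryRemove(deviceId, out ServiceBusSessionProcessor processor))
    {
        // Stops the pump and releases the session lock for this deviceId;
        // this is the only point where we ever want to give the lock up.
        await processor.CloseAsync();
    }
}
```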
Questions
- Are these transient SB errors expected? They seem to cause error logs for up to 10 minutes; does that mean SB was inaccessible to that instance for that period, or is that a side effect? If it's an issue with SB itself, why aren't all our instances affected at once? Would upgrading from Standard to Premium tier eliminate these issues?
- Are transient SB errors expected to cause a doubling of memory usage? Should the usage go down once connectivity is restored?
- I wonder whether the library could detect that several hundred processors are all failing for the same transient SB reason affecting a single SB queue, and 'pool' the retries so the memory footprint doesn't double?
- We also see `ErrorSource: RenewLock ex: Azure.Messaging.ServiceBus.ServiceBusException: The session lock was lost. Request a new session receiver. (SessionLockLost)` in `ProcessErrorAsync`. Is "Request a new session receiver" an instruction, i.e. should I dispose of this session and request a new one? Or should it keep retrying to get the lock back? (Again, we only want the lock to be released when the device disconnects, and we want to use the in-built retry.) A simplified sketch of our error handler follows this list.
I've read troubleshooting.md and will continue experimenting with `TryTimeout`, but it's not clear whether that guidance applies given our `MaxConcurrentSessions = 1`, and given that everything works fine outside of these transient SB connection errors. Our message processing time is also very short (all we do is forward the SB message to the device, which takes a few milliseconds).
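For reference, we're adjusting `TryTimeout` through the client's retry options, roughly like this (the values shown are placeholders we're experimenting with, not settings we've settled on):

```csharp
var client = new ServiceBusClient(connectionString, new ServiceBusClientOptions
{
    RetryOptions = new ServiceBusRetryOptions
    {
        Mode = ServiceBusRetryMode.Exponential,
        // Default is 60 seconds; we're trying shorter values to see whether it changes
        // how the processors behave during these transient outages.
        TryTimeout = TimeSpan.FromSeconds(30),
        MaxRetries = 3
    }
});
```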
Reproduction Steps
Session receiver setup:
```csharp
var options = new ServiceBusSessionProcessorOptions
{
    AutoCompleteMessages = true,
    MaxConcurrentSessions = 1,
    MaxConcurrentCallsPerSession = 1,
    MaxAutoLockRenewalDuration = Timeout.InfiniteTimeSpan,
    SessionIdleTimeout = TimeSpan.FromMinutes(5),
    SessionIds = { deviceId },
    Identifier = connectionId
};

ServiceBusSessionProcessor processor = client.CreateSessionProcessor("GatewayRx", options);
processor.ProcessMessageAsync += SessionMessageHandler;
processor.ProcessErrorAsync += SessionErrorHandler;
await processor.StartProcessingAsync();
```
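The message handler is, in simplified form, just a forward to the device (`SendToDeviceAsync` stands in for our device transport; the error handler is as sketched in the Questions section above):

```csharp
private async Task SessionMessageHandler(ProcessSessionMessageEventArgs args)
{
    // Forward the Service Bus message body to the connected device.
    // With AutoCompleteMessages = true, the processor completes the message for us.
    await SendToDeviceAsync(args.SessionId, args.Message.Body.ToArray());
}
```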
After the receiver acquires a session lock (for `deviceId`) and starts receiving messages, temporarily block access to the Service Bus namespace from your IP via Azure Portal -> Service Bus resource -> Settings -> Networking. You should then see a significant spike in the application's memory usage, which does not seem to come down once access is restored (even after several hours).
This seems to be a reasonable way to trigger the issue, but I'm not sure how closely it matches what occurs naturally every few hours to days.
Environment
```
.NET SDK:
 Version:           8.0.205
 Commit:            3e1383b780
 Workload version:  8.0.200-manifests.818b3449
```

Repro'd using:

```
Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.19045
 OS Platform: Windows
 RID:         win-x64
 Base Path:   C:\Program Files\dotnet\sdk\8.0.205\
```

Prod k8s is Ubuntu.