
Outbox cleanup in scaled out environments requires advanced configuration. Cleanup should be adaptive and elastic scaleout friendly by default #987

Open
ramonsmits opened this issue Aug 8, 2022 · 1 comment

ramonsmits commented Aug 8, 2022

By default, instances compete for the outbox cleanup task. If, for instance, the endpoint is scaled out to 5 instances, the cleanup task will run 5 times per minute.

Cleanup active/passive via leader election

It would be nice if, via some form of leader election, the cleanup ran on only a single node, and/or if cleanup were adaptive/dynamic, meaning that on very low volume endpoints it would run less frequently.

For example, have a table that contains a lease for that endpoint's outbox cleanup. The lease could be, say, 10 minutes, and the lease owner would extend it every 5 minutes. Other instances should try to take over the lease at the end of its term, but will fail if the active instance has already renewed it. If the active instance shuts down gracefully it can DELETE the lease record; if it dies ungracefully, one of the passive instances will obtain the lease once the expiry makes its take-over query succeed.

Install native cleanup job

An alternative would be to schedule a native cleanup job during installation, so that an endpoint instance can detect whether this job runs frequently.

bbrandt commented Jun 25, 2024

This explains a lot. I ran a load test experiment today, turning on Outbox for the first time, using between 20 and 40 nodes for 3 different services (Azure Service Bus transport and SQL persistence). I noticed Outbox seemed to add about 5 seconds per handler and hadn't had time to dig into why (or call sp_BlitzWho and all that). If the cleanup is running 20-40 times every 5 minutes and the DispatchedAt column is not indexed (#1343), that could be a contributor to the poor throughput I saw.
