
Improved stampede prevention with empty config cache under high loads #3530

Merged: 11 commits, Dec 19, 2023

Conversation

colinmollenhour
Member

Description (*)

In a situation where generating the config takes a long time and the store is receiving heavy traffic, the existing basic stampede-prevention mechanism is not completely effective. There are three aggravating factors that this PR aims to fix:

  1. When using read replicas, sending the reads that check whether config saving is locked to the replicas makes the mechanism susceptible to race conditions caused by replication lag. These reads should always go to the master.
  2. The locking mechanism does a read, then a write, then assumes it is safe to proceed. But if many processes perform the read at the same time, before the first one writes, multiple processes can believe they "have" the lock. A second read now ensures that only one process holds the lock; the others skip saving the config cache, which reduces load on the backend and on the frontends (which have to serialize the XML, and it can be quite large).
  3. There is a 5-second "hold" time during which processes wait for the cache to be warmed. I've increased this to 10 seconds: if saving the cache takes significantly longer than 5 seconds in a high-load environment, a large number of other processes end up regenerating the cache, when it would be faster to just wait for it to be saved and then read it.
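The "second read" fix in point 2 can be sketched as follows. This is a minimal Python simulation, not the actual PHP code: a plain dict stands in for the shared cache backend, the lock key name and worker ids are made up, and the sleeps just widen the race window.

```python
# Simulation of point 2: read -> write -> second read.
# Many workers race to save the config cache; without the second read,
# every worker that saw the lock unset would save. With it, only the
# worker whose id survived the write race proceeds.
import threading
import time
import random

cache = {}    # stands in for the shared cache backend
savers = []   # ids of workers that actually saved the config

def worker(my_id):
    if cache.get("config_lock") is None:          # first read
        time.sleep(random.uniform(0, 0.05))       # widen the race window
        cache["config_lock"] = my_id              # write (last writer wins)
        time.sleep(0.3)                           # let all racing writes land
        if cache.get("config_lock") == my_id:     # SECOND read: am I the owner?
            savers.append(my_id)                  # only the owner saves

threads = [threading.Thread(target=worker, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(savers))  # exactly one worker passes the second read and saves
```

Several workers may still pass the first read and write the lock key, but only the one whose id is still stored afterwards goes on to save; the rest skip the expensive serialization, which is the load reduction described above.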

One idea I did not implement is to respond with a 503 error when the hold time is exceeded, to avoid having a large number of processes trying to generate a fresh cache all at once. That would be a more drastic change, so I left it out for now.

Manual testing scenarios (*)

I tested point 2 using 20 concurrent requests and some logging lines sprinkled in the code:
[screenshot: log output from the 20 concurrent requests]

This proves that the mechanism caused only one process to save the cache, where previously four processes would have.

Questions or comments

If anyone has a production environment that experiences cache stampedes, I'd be grateful if you gave it a try and reported your findings.

NOTE: For the fix for point 1 to work, the Redis cache library needs the as-yet-unmerged update below. I plan to merge it soon; it is fully backwards-compatible, and this patch still works without it, just with no effect on the issue in point 1.
colinmollenhour/Cm_Cache_Backend_Redis#177

@github-actions github-actions bot added the Component: Core Relates to Mage_Core label Sep 20, 2023
@fballiano
Contributor

what do you think about https://github.com/fballiano/magento-cache-regeneration-lock?

@colinmollenhour
Member Author

what do you think about https://github.com/fballiano/magento-cache-regeneration-lock?

@fballiano Ahh, very nice! I thought about using GET_LOCK() shortly after pushing this; it's a good strategy.

Pros:

  • Less load on the cache backend
  • Fully atomic so there are no race conditions
  • Simple and fast

Cons:

  • Breaks the database abstraction. We only support MySQL variants anyway, so this probably shouldn't be a deal-breaker; it's also already used by Mage_Index, so it is not exactly new, although there is at least the appearance of abstraction there.
  • Will not work perfectly on MySQL multi-master setups unless all writes go to the same node. I don't know of anyone using multiple write nodes successfully anyway; do you? If someone were using a multi-master cluster it would probably still work OK; there would just be as many processes writing the config cache as there are DB masters, which is at least guaranteed to be a small number.

Notes:

  • Your patch implements the locking in a different location, so I think it also needs to be added to Mage_Core_Model_Config::init to cover all methods of initialization, either in a similar manner or with some refactoring to prevent code duplication.
  • Minor improvement: I recommend using try { ... } finally { ... } to release locks so that unhandled throwables cannot cause the lock release to be skipped.
  • The 504 error is not exactly accurate; 500 or 503 would be more appropriate. Since this page could be visible to end users, it might warrant using require_once Mage::getBaseDir() . DS . 'errors' . DS . '503.php'; die(); instead, consistent with what happens when session initialization fails.

@fballiano
Contributor

@colinmollenhour at the time of developing that patch (around 2015 or a bit before; I was using it for many customers before releasing it) I tried many different ways, but GET_LOCK seems the easiest and still the most effective :-)

cons:

  • exactly, it's already used in Mage_Index; also, IMHO, we'll never support anything other than MySQL (and derivatives)
  • nope, I don't even know of anybody really using slave DBs, let alone multi-master

pros:

  • it doesn't depend on Redis, which not everybody uses (those who don't probably have small stores and don't need the lock anyway)
  • since it's managed by MySQL itself, it should be pretty reliable

when I developed it I tried to keep it as simple as possible so it would work across multiple versions of Magento 1, but if we implement it in the core we can make it a bit better (e.g. with try/catch/finally)

do you think I should make a PR out of it? is it possible that GET_LOCK could create problems on some systems? could it be "disabled" on some databases?

@colinmollenhour
Member Author

@fballiano I updated to use GET_LOCK (effectively the same thing your patch does), removed the old cache-based mechanism, and reorganized a bit.

It uses a 3-second timeout for HTTP and a 60-second timeout for CLI. The reasoning: for HTTP you don't want to risk overloading your web server in a high-load scenario, so it's better to essentially act like maintenance mode; for CLI there shouldn't be hundreds or thousands of processes, so it is OK to wait longer.
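The timeout policy can be sketched as follows (a hypothetical Python outline, not the actual PHP code; the lock name and function names are made up, and the real implementation uses GET_LOCK on the DB connection and the error page discussed earlier):

```python
# Sketch of the HTTP-vs-CLI lock timeout policy described above.

def lock_timeout(is_cli):
    # CLI: few processes, waiting longer is cheap.
    # HTTP: fail fast rather than tie up hundreds of web server workers.
    return 60 if is_cli else 3

def handle_cold_cache(is_cli, try_get_lock):
    # try_get_lock(name, timeout) stands in for SELECT GET_LOCK(name, timeout)
    if try_get_lock("core_config_cache", lock_timeout(is_cli)):
        return "regenerate config"
    # Lock not acquired within the timeout: behave like maintenance mode
    # instead of piling on another expensive regeneration.
    return "serve 503 error page"

# An HTTP worker whose lock attempt times out serves the error page:
print(handle_cold_cache(False, lambda name, timeout: False))
print(lock_timeout(True))   # CLI waits up to 60 seconds
```

The key point is that a timed-out HTTP request never falls through to regenerating the config itself; it backs off, which is what prevents the stampede.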

Here is a screenshot showing what happens when I added some Mage::log calls during a flood of requests with a cold cache:
[screenshot: Mage::log output during the flood of requests]

@colinmollenhour
Member Author

@fballiano What do you think of the new PR code? I think it is a great improvement. I hope you don't mind me updating the PR; credit is yours, I'm just trying to move it along. :)

Co-authored-by: Colin Mollenhour <colin@mollenhour.com>
@fballiano
Contributor

apart from my last comments it looks fine to me. I ran basic tests and found no problems

Co-authored-by: Fabrizio Balliano <fabrizio.balliano@gmail.com>
@fballiano
Contributor

@colinmollenhour I can't find your comment about the timeout anymore, but regarding "if the timeout is 60 but after 3 the cache is generated and requests still keep piling up": I don't think it works that way; if the lock is released after 3 seconds, all the piled-up requests get executed right away, I think.

@fballiano
Contributor

actually I think it's good to have a slightly bigger timeout, because it prevents other requests from restarting the cache generation when the timeout expires; that's what crashes servers, much more than requests piling up, at least in my experience

(Three review threads on app/code/core/Mage/Core/Model/Config.php, all resolved; two marked outdated.)
@github-actions
Contributor

github-actions bot commented Oct 2, 2023

Test Results

7 tests: 7 passed, 0 failed, 0 skipped (3 suites, 1 file, 0s)

Results for commit 0cc1f80.

♻️ This comment has been updated with latest results.

@fballiano
Contributor

@colinmollenhour there's now a conflict

@colinmollenhour
Member Author

@fballiano This one should have been merged before #3533, because that one extends this one, so most if not all of this is already merged now. I will take a closer look.

@colinmollenhour
Member Author

The changes in this PR are now very minor, since the bulk of the code was merged with #3533. It now just contains a README update, some syntax fixes, and the commits and discussion addressing the concerns of @midlan and @fballiano.

@fballiano
Contributor

we need another review in order to merge

@fballiano fballiano changed the title Improve stampede prevention with empty config cache under high loads. Improved stampede prevention with empty config cache under high loads Dec 19, 2023
@fballiano fballiano merged commit e5ddf69 into main Dec 19, 2023
35 checks passed
@fballiano fballiano deleted the improve-stampede-prevention branch December 19, 2023 21:15