Opened on Aug 27, 2024
Prerequisites
- I have written a descriptive issue title
- I have searched existing issues to ensure the bug has not already been reported
Mongoose version
8.3.2
Node.js version
20.x
MongoDB server version
7.x (Atlas)
Typescript version (if applicable)
5.3.3
Description
We are getting errors whenever our MongoDB cluster changes its primary during upgrades:

```json
{
  "errorType": "Runtime.UnhandledPromiseRejection",
  "errorMessage": "MongoServerError: The server is in quiesce mode and will shut down"
}
```
We’ve tried running the same code under load testing in a staging environment, once with mongoose and once with just the base mongo driver. With just the base mongo driver we see this error in roughly one in a million requests; with mongoose, nearly 10% of those million requests fail with it. It looks like this has [been fixed before](#11661), but perhaps there is a regression?
Using Mongo v7 on Atlas
- [Node] mongoose version: 8.3.2
- [Node] mongo driver version (for testing): 6.6.2
Our system uses Node.js AWS Lambdas (without callbacks; we use async/await) for our APIs. We are using all of the default settings except for `{ autoIndex: false, bufferCommands: false }`. After investigating, we believe mongoose is somehow throwing this error outside of anything we can catch: we wrapped all of the mongoose code in a try/catch statement and it still surfaces as an unhandled rejection.
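Our working hypothesis can be illustrated with a plain Node sketch (no mongoose involved; `failingQuery` is a made-up stand-in): a promise that rejects without being awaited escapes the surrounding try/catch and surfaces as an unhandled rejection, which matches the failure shape we see in our lambdas.

```javascript
// Stand-in for a query that fails during quiesce mode (illustrative only).
async function failingQuery() {
  throw new Error('MongoServerError: The server is in quiesce mode and will shut down');
}

let caught = false;

// Without this handler the process would crash, as our lambdas do.
process.on('unhandledRejection', (err) => {
  console.log('unhandled rejection:', err.message);
});

try {
  // The promise is created but never awaited, so its rejection
  // settles after the try/catch has already exited.
  failingQuery();
} catch {
  caught = true;
}

console.log('caught by try/catch:', caught); // caught by try/catch: false
```

The catch block never runs; the error only shows up later via the `unhandledRejection` event, which is why wrapping our own code in try/catch does not help.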
We have an extra layer in our lambdas that creates an express app so we can take advantage of a tool called Fern, which validates API request and response shapes. The layer creates an express server and wraps it in a function that should still use the async/await pattern via the serverless-http package. We thought that might be the issue, so we added an express error-handling middleware, but the errors weren’t caught there either. And we are seeing these errors on APIs that don’t have that express wrapper at all.
We were only able to replicate the issue once we scaled our Atlas cluster up to M60/M80 and hit it with heavy load. We’re not sure exactly why heavy load is required to reproduce it, but perhaps under heavier load quiesce mode lasts longer, trending up toward its maximum of 15 seconds.
We also ran tests with a longer `serverSelectionTimeoutMS`, and we completely removed the timeout on our lambdas so they would have extra time to catch up, in case it was just very long timing and we needed to improve query performance. But no matter how much time we gave it in testing (well beyond a reasonable runtime for an API), the problem remained.
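For reference, our connection setup looks roughly like the sketch below. The URI environment variable and the 60-second timeout are placeholders from our experiments; everything else is mongoose’s default.

```javascript
const mongoose = require('mongoose');

async function connect() {
  // All defaults except autoIndex/bufferCommands. For the timeout
  // experiments we also raised serverSelectionTimeoutMS well past the
  // ~15s quiesce window (the value here is a placeholder; the default
  // is 30000).
  await mongoose.connect(process.env.MONGODB_URI, {
    autoIndex: false,
    bufferCommands: false,
    serverSelectionTimeoutMS: 60000,
  });
}
```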
We do not see this issue on the few APIs we run on AWS containers; it occurs only in Lambdas.
Steps to Reproduce
While load testing an endpoint, run a resilience test (primary failover) in MongoDB Atlas. More details of our system are above.
Expected Behavior
Mongoose and the MongoDB driver together should handle quiesce mode without surfacing errors to the application.