Mongoose findOneAndUpdate hangs under heavy load #14877

Open
andreialecu opened this issue Sep 10, 2024 · 25 comments
Labels
has repro script There is a repro script, the Mongoose devs need to confirm that it reproduces the issue

Comments

@andreialecu
Contributor

Prerequisites

  • I have written a descriptive issue title
  • I have searched existing issues to ensure the bug has not already been reported

Mongoose version

8.5.1

Node.js version

20.17.0

MongoDB server version

7.0.12

Typescript version (if applicable)

No response

Description

We have a batch job that performs a large number of updates. It used to work fine, but recently it started hanging on calls to findOneAndUpdate: after a while, the returned promise simply never resolves.

I have been able to create a repro for this, but it only seems to reproduce reliably against MongoDB Atlas (or probably other remote servers); it doesn't seem to reproduce on localhost.

Also, if I bypass mongoose entirely and use the mongodb driver directly, it seems to work.

Steps to Reproduce

const mongoose = require('mongoose');
const Parallel = require('async-parallel');
const { setTimeout } = require('timers/promises');

const ObjectId = mongoose.Types.ObjectId;
const Schema = mongoose.Schema;

const testSchema = new Schema({
  fieldA: {
    type: Schema.Types.ObjectId,
    required: true,
  },
  fieldB: {
    type: String,
    required: true,
  },
  fieldC: {
    type: Number,
    default: 0,
  },
  fieldD: {
    subFieldA: [String],
    subFieldB: [String],
    subFieldC: [String],
  },
  fieldK: {
    type: Date,
    default: Date.now,
  },
});

const Test = mongoose.model('Test', testSchema);

async function createObject() {
  await setTimeout(100 + Math.random() * 100);
  return {
    fieldA: new ObjectId(),
    // random hex
    fieldB: new ObjectId().toHexString(),
    fieldC: 0,
    fieldD: {
      // make this big:
      subFieldA: Array.from({ length: 2000 }, () =>
        new ObjectId().toHexString(),
      ),
      subFieldB: Array.from({ length: 2000 }, () =>
        new ObjectId().toHexString(),
      ),
      subFieldC: Array.from({ length: 2000 }, () =>
        new ObjectId().toHexString(),
      ),
    },
  };
}

async function run() {
  await mongoose.connect(
    // 'mongodb://localhost:27017/testmongoose', // could not reproduce on localhost
    'mongodb+srv://....mongodb.net/testmongoose?ssl=true&authSource=admin', // replace with a mongodb atlas connection string
  );

  let tidx = 0;
  let total = 10000;
  await Parallel.each(
    Array.from({ length: total }),
    async (_, i) => {
      const obj = await createObject();
      const result = await Promise.race([
        Test.findOneAndUpdate(
          { fieldB: obj.fieldB, fieldK: { $lt: new Date() } },
          obj,
          {
            upsert: true,
          },
        ),
        // following is faster and doesn't have the issue
        // Test.db.collection('test123').findOneAndUpdate(
        //     { fieldB: obj.fieldB, fieldK: { $lt: new Date() } },
        //     { $set: obj },
        //     {
        //       upsert: true,
        //     },
        //   ),
        setTimeout(30000, new Error('timeout')),
      ]);
      if (result instanceof Error) {
        console.log(`Timeout after ${tidx}/${total}`);
        process.exit(1);
      }
      tidx++;
      console.log(`${tidx}/${total}`);
    },
    { concurrency: 20 },
  );

  console.log('done');
  await mongoose.disconnect();
}

run();

Expected Behavior

This must have started happening after some recent update to either mongoose, the mongodb driver, or the MongoDB server version. It's unclear which; we hadn't run this job in a while and only just noticed it occurring.

@andreialecu
Contributor Author

Output of script:

....
1223/10000
1224/10000
Timeout after 1224/10000

If mongoose is bypassed and the commented-out direct-driver code is swapped in, the issue doesn't seem to occur.

@andreialecu
Contributor Author

It seems that if you make the schema bigger/more bloated with additional fields, even if they are not populated, the issue occurs earlier, after about 400 iterations. But this could be a fluke.

@hunkydoryrepair

hunkydoryrepair commented Sep 11, 2024

I have been fighting this issue and have not been able to go above version 8.2.2.
With any higher version I will eventually get hangs when doing around 50+ updates/second. And we do have a rather large, complex schema defined.
I have also had it hang on simple findOne calls, but I have never reproduced it outside of production.

What I have found in production is that a database connection is checked out from the pool and remains checked out.
The connection does not show up on the db side when using db.currentOp. My count of "checked out" connections eventually grows to the limit, while the number of connections (in use or idle) on the db server remains very low.

I assume what I'm experiencing must be a mongodb driver issue, since everything from where the connection is checked out to where it is checked back in happens inside the mongodb driver.
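For anyone who wants to make the same comparison, here is a rough sketch (an assumption, not part of the original comment) of checking the server-side view from inside the app by running the currentOp command against the admin database; it requires sufficient privileges and an open mongoose connection, and should be run inside an async function.

// Assumed sketch: compare what the server reports via currentOp
// with the client-side pool counters discussed above.
const admin = mongoose.connection.getClient().db('admin');
const { inprog } = await admin.command({ currentOp: true });
console.log(`server-side in-progress operations: ${inprog.length}`);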

@hunkydoryrepair

Here is a function I'm using to look at the mongodb internals and see whether connections have been checked out from the connection pool. I suspected it was a connection pool issue, but that proved to be wrong. Could you check whether you are leaking connections from the connection pool in your tests?

const connectionStats = () => {
  const client = mongoose.connection?.getClient();
  if (!client) return `Mongo Client Not Loaded`;
  try {
    const topology = (client as any).topology?.s;
    let s = `Mongo: ${topology?.state}`;
    topology?.servers?.forEach((server: any, key: string) => {
      s += `<br>${key}: `;
      // parentheses are needed so the '??' fallback applies to the count,
      // not to the already-concatenated string
      s += ' CO: ' + (server.pool?.currentCheckedOutCount ?? '?');
      s += ' P: ' + (server.pool?.pendingConnectionCount ?? '?');
      s += ' Q: ' + (server.pool?.waitQueueSize ?? '?');
      s += ' A: ' + (server.pool?.availableConnectionCount ?? '?');
      if (server.pool?.loadBalanced) {
        s += '[LOAD BALANCED]';
      }
    });
    return s;
  } catch (e) {
    console.log(`mongo pool hack didn't work`, e);
    return `Mongo: Error`;
  }
};
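For reference, the driver also exposes documented CMAP pool-monitoring events on the MongoClient (available since driver 4.x), so a similar count can be tracked without reading private pool internals. The sketch below is illustrative and not from the original comment.

// Assumed sketch using the driver's CMAP monitoring events instead of internals.
const client = mongoose.connection.getClient();
let checkedOut = 0;
// the client emits these events for every pool check-out/check-in
client.on('connectionCheckedOut', () => { checkedOut++; });
client.on('connectionCheckedIn', () => { checkedOut--; });
setInterval(() => console.log(`checked-out connections: ${checkedOut}`), 5000);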

@andreialecu
Contributor Author

I have been fighting this issue and have not been able to go above version 8.2.2.

Thank you for this!

I've tried running my repro on several mongoose versions, and here are my findings:

8.3.0 repros the issue
8.2.3 seems to work fine

The issue starts in 8.3.0 /cc @vkarpov15

@hunkydoryrepair

Here's what I found:
The issue is still present in 8.6.1.
8.3.0 indeed fails, which is when the mongodb driver was updated to 6.5; 8.6.1 uses driver 6.8.

If we are experiencing the same thing, setting socketTimeoutMS will cause it to eventually throw a MongoNetworkTimeoutError.
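For context, that option is passed at connection time; a minimal sketch looks like the following (the uri placeholder and the 30-second value are just examples).

await mongoose.connect(uri, {
  // if a socket with an in-flight operation stays silent this long, the driver
  // errors out with MongoNetworkTimeoutError instead of hanging indefinitely
  socketTimeoutMS: 30000,
});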

@andreialecu
Contributor Author

andreialecu commented Sep 11, 2024

Update: I used yarn resolutions to pin mongodb to 6.3.0 and it appears that the problem no longer reproduces with my test script.

Indeed this may be an upstream issue in the mongodb nodejs driver.
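For reference, the workaround looks roughly like this in the application's package.json (Yarn's resolutions field; npm users would use the overrides field instead):

{
  "resolutions": {
    "mongodb": "6.3.0"
  }
}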

@andreialecu
Contributor Author

It appears that the problem started with mongodb 6.4.0 - I've been trying various resolutions. 6.3.0 works, 6.4.0 exhibits the problem.

@andreialecu
Contributor Author

I can confirm that we resolved our issue with the workaround of pinning mongodb to 6.3.0 using yarn resolutions.

This is not a mongoose issue. mongoose 8.5.1 with mongodb 6.3.0 seems to work.

I'll keep this issue open since the mongodb node driver doesn't have a GitHub issue tracker, and perhaps someone from the mongodb team will see it here.

A temporary fix for mongoose might be to revert the mongodb driver bump back to 6.3.0.

@hunkydoryrepair

I think it's worth keeping this open because Mongoose should at least have some option for using the working 6.3.0.

@andreialecu
Contributor Author

It seems like it might be a good idea to have mongoose define a peer dependency on mongodb instead of a direct dependency. In that case the version of the driver would be user-controllable.

Not sure why it's not already that way.
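For illustration only, the kind of manifest change being suggested would look roughly like this in mongoose's own package.json (the version range here is just an example, not a proposal from the thread):

{
  "peerDependencies": {
    "mongodb": "^6.0.0"
  }
}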

@vkarpov15 added the has repro script label Sep 13, 2024
@vkarpov15
Collaborator

I'm unable to repro this on my end with Mongoose 8.6.2, although I had to set upsert to false to avoid maxing out my Atlas storage limits.

Mongoose has always pinned exact versions of the MongoDB Node driver because in the past Mongoose was more tightly coupled with the MongoDB Node driver and patch releases would cause breaking issues. However, that hasn't been the case in some time, so we've considered relaxing Mongoose's Node driver version requirements.

@andreialecu if we changed Mongoose's dependencies to "mongodb": "^6.3.0" would that help? Mongoose 8.6 tests do pass with mongodb@6.3, 6.4, 6.5, 6.6, and 6.7.

@andreialecu
Contributor Author

That would probably work, but I think making mongodb a peer dependency might still be a good idea.

Or a peer dependency with a default. See: yarnpkg/berry#3710 (comment)

@andreialecu
Contributor Author

You can probably get it to repro if you use upsert and update the same document repeatedly.

I haven't tried optimizing the repro.

I noticed it happens more easily with bigger documents. The linked issues in the mongodb issue tracker also mention this.

It may still happen with smaller documents, just at a lower frequency. We've had some weird issues for a while which I think we can attribute to this problem.
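A hypothetical tweak to the original repro along those lines (an assumption, not something verified in this thread) is to reuse a single fieldB value so every iteration upserts and then repeatedly updates the same document, which also keeps storage flat:

const sharedFieldB = new ObjectId().toHexString(); // reuse one key across all iterations

async function createObject() {
  await setTimeout(100 + Math.random() * 100);
  return {
    fieldA: new ObjectId(),
    fieldB: sharedFieldB, // same value every time -> one document, updated repeatedly
    fieldC: 0,
    fieldD: {
      // keep the payload large, as in the original repro
      subFieldA: Array.from({ length: 2000 }, () => new ObjectId().toHexString()),
      subFieldB: Array.from({ length: 2000 }, () => new ObjectId().toHexString()),
      subFieldC: Array.from({ length: 2000 }, () => new ObjectId().toHexString()),
    },
  };
}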

@hunkydoryrepair

hunkydoryrepair commented Sep 15, 2024

I have a test case that seems pretty reliable, at least with Atlas. It might hit storage limits, but it is a simple Model.create([]) with a large array of data. I haven't cleaned out some confidential data (not personal information of any kind, and not too big a deal if it gets out). I could provide it to you @vkarpov15.

I'm not certain whether it requires the schema or not, but it's about 600 fairly large objects, and from what I can tell it would be kicking off 600 insertOne calls simultaneously, which should pretty much max out the connection pool.
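The rough shape of that call, as described, would be something like the sketch below (the field names reuse the schema from the earlier repro and are assumptions, not the confidential data set; run inside an async function with the Test model defined above):

// ~600 large documents created in one Model.create() call; per the comment
// above, this fans out into many concurrent insertOne calls and can saturate
// the connection pool
const docs = Array.from({ length: 600 }, () => ({
  fieldA: new ObjectId(),
  fieldB: new ObjectId().toHexString(),
  fieldD: {
    subFieldA: Array.from({ length: 2000 }, () => new ObjectId().toHexString()),
  },
}));
await Test.create(docs);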

@vkarpov15
Collaborator

npm doesn't install peer dependencies by default AFAIK, which makes making mongodb a peer dep likely a non-starter.

@hunkydoryrepair can you please send me the script? You can just email me val at karpov dot io, if that's preferred to posting the script on GitHub.

@andreialecu
Contributor Author

npm doesn't install peer dependencies by default AFAIK, which makes making mongodb a peer dep likely a non-starter.

It has since 2020: https://github.blog/news-insights/product-news/presenting-v7-0-0-of-the-npm-cli/

This has actually been quite controversial. The Yarn maintainers were vehemently opposed to this default.

Yarn and I think pnpm will not install peer dependencies automatically, but they will install peer dependencies with defaults (see my previous comment).

@hunkydoryrepair

The case I thought was the same is a different bug, I think. I have a solid repro case, but it works reliably in 8.4.0 and consistently hangs in 8.4.1 or higher. I will open a separate bug.

@hunkydoryrepair

@andreialecu
I just discovered a reproducible case that is a little more direct.
I believe it has to do with the idle connection timeout on the server. On Atlas it is 10 minutes, but running locally it can be much, much longer. I added a 10-minute pause in the middle of my test cases and reproduced it much more easily.
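A hypothetical way to fold that observation into the earlier repro (an assumption, not the commenter's actual test case) is to pause mid-run long enough for the pooled connections to cross the server-side idle timeout before resuming the updates:

// inside the Parallel.each callback from the repro, before the findOneAndUpdate
if (i === Math.floor(total / 2)) {
  console.log('pausing 10 minutes so pooled connections go idle...');
  await setTimeout(10 * 60 * 1000); // timers/promises setTimeout imported in the repro
}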

@andreialecu
Contributor Author

andreialecu commented Sep 20, 2024

The test case I shared in my initial post was 100% reproducible to me.

You weren't able to reproduce it?

It's possible it could be related to latency to Atlas as I was running it locally and my ping to the cluster is 30-60ms.

@hunkydoryrepair

I wasn't able to consistently reproduce with it, even when connecting to Atlas. It timed out a few times for me, but it was pretty slow other times, so it wasn't super convincing. I go through a VPN to connect to Atlas, so it is pretty slow, but I'm not sure it was reaching the idle timeouts. The complicated-schema cases seem to be a different bug from what I can tell, as it does some N! operations where N = document depth, so even getting 10 levels deep would seem to hang.

You also said you could not reproduce it locally, and I think this is why.

@andreialecu
Contributor Author

I was only able to reproduce it with certain mongodb driver versions, and it was consistent.

In our production app, pinning to the driver version I mentioned resolved the issue.

@hunkydoryrepair

@andreialecu Thank you. You inspired me to dig into this again using your use case. The issue I ran into is that your repro stops producing the problem if it is run multiple times, but I added a Test.deleteMany({}) at the beginning, and then it was reproducible again.

I spent the day tracking it down in the debugger and found a solution, which I added to the issue below. The mongodb team didn't attach the listener for the socket's 'data' event in time to catch all of the 'data' events, so it misses the returned value. This happens only with larger commands that exceed the kernel socket buffer: socket.write returns a value indicating the socket needs to be drained, and the mongodb driver was awaiting the 'drain' event before listening for 'data', so before the task awaiting 'drain' is processed, the 'data' event comes back, is emitted, and is missed. Anyway... here's hoping they accept my solution!

https://jira.mongodb.org/projects/NODE/issues/NODE-6370
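A schematic sketch of the race being described, using a plain EventEmitter as a stand-in for the socket (this is not the driver's actual code):

const { EventEmitter, once } = require('events');

const socket = new EventEmitter();

async function sendLargeCommand() {
  // Large command: pretend socket.write() returned false, so we must wait for 'drain'.
  const drained = once(socket, 'drain');
  // Fast server: the reply arrives before anyone is listening for 'data'...
  socket.emit('data', Buffer.from('reply')); // emitted with no listener: lost
  await drained;
  // ...so this listener is attached too late and its promise never resolves,
  // which mirrors the hung findOneAndUpdate from this issue.
  return once(socket, 'data');
}

sendLargeCommand().then(() => console.log('got reply')); // never prints
setImmediate(() => socket.emit('drain'));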

@hunkydoryrepair

Can we keep this open until a fix from mongodb is available and incorporated?
