Transactions cause NodeJS to lock up, locks out all other transactions (reproducer included) #1834

Closed
tcaputi opened this issue Mar 22, 2023 · 27 comments


tcaputi commented Mar 22, 2023

Thanks for stopping by to let us know something could be better!

PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.

  1. Is this a client library issue or a product issue?
    I believe it is a problem with both. There is certainly a problem with the library.

  2. Did someone already solve this?
    Not as far as I have been able to tell. It is not fixed in the latest version of the firebase-admin package (11.5.0).

  3. Do you have a support contract?
    No

If the support paths suggested above still do not result in a resolution, please provide the following details.

Environment details

  • OS: macOS 13.2.1 (M1) (can confirm it also occurs on Ubuntu and in Lambda functions)
  • Node.js version: v19.7.0
  • npm version: 9.5.0
  • @google-cloud/firestore version: 6.4.1 (I installed it as part of firebase-admin which is version 11.5.0)

Steps to reproduce

Follow the instructions at the bottom of this script. It is a large file, but most of that is test data from my app (it seems easier to reproduce with large objects). GitHub doesn't want me uploading .mjs files, so rename it to crash.mjs. You will need a Firebase service account.

crash.txt

Making sure to follow these steps will guarantee the quickest resolution possible.

Thanks!

@tcaputi tcaputi added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Mar 22, 2023
@product-auto-label product-auto-label bot added the api: firestore Issues related to the googleapis/nodejs-firestore API. label Mar 22, 2023

tcaputi commented Mar 22, 2023

I just wanted to point out that this issue is particularly problematic because it doesn't just lock up this script. Because the process dies in the middle of a transaction, it pessimistically locks out any other instances that try to start their own transactions, so the document is effectively left read-only for several minutes.

In addition, the Admin SDK is meant to run on servers, and this issue causes the server to become unresponsive as it pegs the CPU.

Here is a flamegraph of the issue. It appears to be calling malloc over and over again, as best I can tell (you need to click the button to view C++ code). This time, GitHub doesn't like me uploading HTML files, so I've renamed it to flamegraph.txt.
flamegraph.txt


tcaputi commented Mar 22, 2023

Please let me know if there is anything I can do to help debug this. This is enough of a dealbreaker that I will need to convert my app to another service if I can't resolve it.

Thanks.

@dconeybe dconeybe self-assigned this Mar 23, 2023
@dconeybe

@tcaputi Thank you for the detailed bug report and the reproduction app. Could you please reproduce it with debug logging enabled, and attach the debug logs?

To enable debug logging, add this import:

import { setLogFunction } from "firebase-admin/firestore";

Then call

setLogFunction(console.log);


tcaputi commented Mar 23, 2023

@dconeybe
crash.log

Let me know if there is anything else I can provide. For what it's worth (and for reasons I don't really understand), this issue is easier to reproduce in my production code than with this test, but it does work.

@dconeybe

Thank you for uploading the log. I'll take a closer look next week.


tcaputi commented Mar 25, 2023

OK. Let me know if you need any help or additional debugging. Like I said, this is kind of the top issue for my app, so I'm available if you need help debugging. I'm working on a workaround, but it's not going to be easy.

@dconeybe

So your logs look similar to mine when I run the reproduction script. That is, a bunch of lines like this:

Received error: Error: 10 ABORTED: Aborted due to cross-transaction contention. This occurs when multiple transactions attempt to access the same data, requiring Firestore to abort at least one in order to enforce serializability.

which indicate that a transaction was aborted due to contention on the locked documents. The transactions will retry 5 times (by default) with exponential backoff, in hopes of eventually succeeding. This really isn't a surprise, since the transactions in the test script run in parallel. All of this is expected behavior.

The real problem seems to be that this is consuming 100% of your CPU, bringing down your server, and then failing to unlock the locked documents for 270 seconds. This part of your experience I cannot reproduce. Do you have any ideas I could try to reproduce this?

One possible workaround would be to instead use "optimistic locking", like the Firestore client SDKs do. This wouldn't put any locks on your documents, but the drawback is that you'd need to handle the exponential backoff yourself. Here is how you could do this (a sketch follows the list):

  1. Get a document snapshot via DocumentReference.get()
  2. Store the resulting DocumentSnapshot into a local variable named snapshot.
  3. Call DocumentReference.update() with the UpdateData and { lastUpdateTime: snapshot.updateTime } as the precondition.
  4. If the "update" operation fails with code FAILED_PRECONDITION, then wait a few seconds and try again.
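
For illustration, here is a minimal sketch of this pattern. The helper name, retry count, and backoff delays are placeholders rather than a drop-in implementation:

import { Firestore } from "@google-cloud/firestore";

const db = new Firestore();

// Sketch: optimistic-locking update with manual exponential backoff.
// computeUpdate maps the current snapshot to the fields to write.
async function updateOptimistically(docRef, computeUpdate, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snapshot = await docRef.get(); // steps 1 and 2
    const updateData = computeUpdate(snapshot);
    try {
      // step 3: the precondition makes the update fail if anyone else
      // wrote the document after our read
      await docRef.update(updateData, { lastUpdateTime: snapshot.updateTime });
      return;
    } catch (err) {
      if (err.code !== 9 /* gRPC FAILED_PRECONDITION */) throw err;
      // step 4: back off exponentially, then retry
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
    }
  }
  throw new Error(`update still contended after ${maxAttempts} attempts`);
}

// e.g. updateOptimistically(db.doc("games/game1"),
//   (snap) => ({ clicks: (snap.get("clicks") ?? 0) + 1 }));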

Is this something you could try?


tcaputi commented Mar 29, 2023

There are a few problems with using the client SDKs (in my application):

  • I can't use a service account. I could possibly make one 'god' user, but it would make my permissions significantly more complicated. I can work around this if needed.
  • I'm using the new client SDK for the web clients, which has completely different function signatures for tree-shakability. Again, I can work around this, but it is very annoying to change all of my code (especially if this bug can be fixed and I can move back to the Admin SDK).
  • As you mentioned, I would need to implement exponential backoff myself.
  • Since I would be moving to optimistic locking, I can get into scenarios where it's really hard for updates to eventually make it through. The one compromise I'm making with my data schema right now is that contention can happen. In aggregate this should be well within the bounds of what Cloud Firestore recommends, but I can't prevent 3 different users from clicking buttons at once every once in a while.

If at all possible, I would like to not work around the issue in this way. If the issue can't be resolved, I would probably move the data to PostgreSQL and just use Cloud Firestore to push "go get the newest version" messages to the client. This is also a big lift, which I would like to avoid.

In terms of reproduction, I can make it happen pretty easily, although it usually takes a few attempts, so I'm surprised you can't reproduce it. In some runs I see the errors you mentioned, and those make sense (I can deal with errors). However, I can also lock up the process within about 5 attempts very consistently. This problem also seems to happen on my Vercel serverless API, which is running Node v16.x and (probably) Linux. Can I ask what OS, Node.js version, and firebase-admin version you're on?

I'm also available to demonstrate it to you on Slack, Discord, phone, Google Meet, or anything else if that's more efficient.

Thanks for looking into it.

@dconeybe

Just to be clear, I wasn't suggesting switching to the client SDKs! I was only suggesting emulating the optimistic locking that they perform, if it is appropriate for your use case, which it sounds like it is not. I'll try some more to reproduce and maybe I'll get lucky. My OS is Linux (a Google-internal distro based on Debian) with node v18.14.2 and @google-cloud/firestore version 6.5.0. I'm not using the firebase-admin SDK but rather this SDK directly. If you're able, could you provide a full project that I could clone off GitHub, with package.json, etc.? That way I could clone it and minimize the differences between your environment and mine.


tcaputi commented Mar 30, 2023

Sure. I'll try to come up with that now.


tcaputi commented Mar 30, 2023

https://github.com/tcaputi/firestore-crash

See the README for steps. For me this triggered the problem on the first try. Let me know if there's anything else I can do to help. I used the firebase-admin package for this reproducer, but if you need me to, I could probably modify it to use this repo directly. I'll also try to ask some friends of mine to run it on different platforms.

@dconeybe

OK, good news. I was finally able to reproduce the "CPU pegging" after running the test app you provided about 10 times. I'll now try to reproduce it in a way where I can attach a debugger and dig in.


tcaputi commented Mar 30, 2023

Awesome. Not sure why it takes longer on your machine, but I'm glad it's not just me. While trying to figure this out I was able to attach a debugger, but as soon as it hit the issue the pause feature refused to work anymore. I did some more debugging myself with console.log(), but I kind of got lost trying to follow some of the callbacks.

Let me know if there's anything else I can help with.

@dconeybe

I've hit a roadblock in my investigation. I'm going to take a break and will resume early next week.

What I've found so far is that when the 100% CPU usage is observed, the calls to this.innerApiCalls.commit(request, options, callback) are not invoking the given callback; however, when I try to step into this code with the IntelliJ debugger it just skips over it.

return this.innerApiCalls.commit(request, options, callback);

So far I've written an integration test that intermittently reproduces the bug: #1836. This is how I've been able to attach the debugger and investigate.


tcaputi commented Mar 30, 2023

This is what I have observed as well. Thanks again.


dconeybe commented Apr 3, 2023

@tcaputi I'm going to continue investigating this week. But in case the investigation drags on, it could be worthwhile to try out the "optimistic locking" style described in #1834 (comment). To reiterate, the suggestion is not to use the client JavaScript SDK but rather to implement its style of transactions using the APIs in this nodejs-firestore SDK.

Something else to try would be to reduce the size of the document in question. It is a very large document and its sheer size likely contributes to this problem. Redesigning this "monolithic" document could also help your app scale as it grows.


tcaputi commented Apr 3, 2023

But in case the investigation drags on, it could be worthwhile to try out the "optimistic locking" style described in #1834 (comment)

I understand. Thank you for looking into it. I will probably try this first.

Something else to try would be to reduce the size of the document in question. It is a very large document and its sheer size likely contributes to this problem.

Got it. I assumed that because I am ~10x under the maximum size it would be OK. It's a shame because my app is a bit hard to split up. The app is designed for running tournaments, which consist of a set of games. When a game updates, that update may need to cascade changes to a bunch of other games. There isn't a good way to know in advance exactly which games will need to be updated until I do the calculations, so either I would need to (1) eat the cost of multiple round trips to Firestore each time I need to fetch a new game or (2) just fetch all of them every time I want to do an update.

The first option would increase my minimum latency significantly and make my maximum latency virtually O(log(n)), where n is the number of games, which is not acceptable. The code would also get significantly more complicated, which would be fine if it had a more obvious benefit.

The second option would multiply my read cost by n, which would mean the cost of running a tournament is inflated hundreds of times over.

But it seems to me, based on your comments, that this kind of workload may not be ideal for Firestore. If the optimistic locking option doesn't work out, I think I'm going to look into switching my project over to PostgreSQL + a cache to store and serve the data, and then I can just use Firestore to push out notifications that the data has changed.

Please let me know if you come up with a solution / workaround.
Thanks.


dconeybe commented Apr 4, 2023

I haven't made much progress finding a root cause, but here is something you could try. In order to figure out where the code is "spinning", I tried compiling node from source. Lo and behold, the issue did not reproduce. I tried v18.4.0 and HEAD (v20.0.0-pre), both "release" and "debug" builds, and none of them reproduced this issue. The point is: do you have any flexibility to try different Node.js runtimes? Different versions? Different modes of compilation? If so, as a workaround, you could try some different Node.js runtimes and see if any of them work for you.


tcaputi commented Apr 4, 2023

I can try some things locally. Vercel only really gives me 14.x, 16.x, and 18.x as options. My code is currently running 16.x. I will give 18.x a shot in a minute.


tcaputi commented Apr 4, 2023

Switching to 18.x seems to have made the problem much harder to reproduce in my app. I'm going to need to do some scale testing, but this looks promising. I have no idea what could have changed in node to cause this problem, but it would explain why it was so hard to debug.


tcaputi commented Apr 4, 2023

I just tried to reproduce the issue on my live app by having a friend and me both click buttons in the UI as fast as we could at the same time. We were still able to reproduce the "document locked for several minutes" problem, but it required us both to mash transactions as hard as we could for about 30s (probably about 4 tx/second). I understand that this is not in spec for how Firestore is meant to be used, but it was meant as a stress test.

That said, it's not 100% clear whether the document ended up locked because we hit this CPU-spinning bug or simply because the serverless function timed out and was killed in the middle of a transaction. I was wondering if you could help me understand a couple of things about how Firestore handles errors and timeouts.

  1. Does this library have any kind of SIGTERM handling to clean up transactions in the event that a serverless function gets timed out and killed (or even just in the event that the process crashes)?
  2. If the library doesn't have SIGTERM handling, is it possible to set up a Promise.race() so that if I'm getting close to the serverless timeout, I can make sure things get cleaned up before the document ends up stuck?
  3. If a document is locked, is there any way for me to cancel pending transactions to it?


dconeybe commented Apr 4, 2023

Does this library have any kind of SIGTERM handling?

No, it does not.

If the library doesn't have SIGTERM handling is it possible to set up a Promise.race()...?

I don't quite follow what you're suggesting. At a high level this sounds reasonable though, except there is no way to cancel an in-flight transaction.

If a document is locked, is there any way for me to cancel pending transactions to it?

No. But one option would be to set maxAttempts in the transaction options (an optional argument to runTransaction()) to 1. Then if there is contention, the transaction will not be automatically retried.
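
For example, something like this (sketch; the callback body is elided):

await db.runTransaction(async (tx) => {
  // reads and writes as usual
}, { maxAttempts: 1 }); // fail fast instead of retrying on ABORTED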

My gut instinct still tells me that using "optimistic concurrency" will likely yield good results for your use case. Have you had a chance to give that a try?


tcaputi commented Apr 4, 2023

I don't quite follow what you're suggesting. At a high level this sounds reasonable though, except there is no way to cancel an in-flight transaction.

I assumed that the way transactions worked was:

  1. client tells the server that it is starting a transaction
  2. client reads required documents from the server; the server locks these documents as it goes. If a document it needs is already locked, the response is held until it is unlocked.
  3. client calculates and assembles something equivalent to a 'batch write' and sends it to the server
  4. server applies batched writes and unlocks the documents

Assuming this is right (and please tell me if it's not), my thought is that the transaction must be spending most of its time waiting at step 2 when there is contention. I also know that (again, correct me if I'm wrong) I can cancel a transaction by throwing an error inside the callback. So my thought was that I could Promise.race() the tx.get() and a simple timeout promise (let's say 5 seconds). If the tx.get() finishes first, we continue with the transaction as normal. If it doesn't return before the timeout, I throw an error within the transaction callback, which I think would cancel the transaction. Let me know if this sounds like it would work.
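
To make that concrete, here is roughly what I have in mind (the helper name and timeout value are made up):

// Sketch: race tx.get() against a timeout inside the transaction callback.
async function getWithTimeout(tx, docRef, ms = 5000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("tx.get() timed out")), ms);
  });
  try {
    // If the timeout wins, the rejection propagates out of the transaction
    // callback, which (I think) would cancel the transaction.
    return await Promise.race([tx.get(docRef), timeout]);
  } finally {
    clearTimeout(timer);
  }
}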

My gut instinct still tells me that using "optimistic concurrency" will likely yield good results for your use case. Have you had a chance to give that a try?

Not yet. I might have time to try it tomorrow or the next day, but my other job is taking up a good chunk of my time right now.


dconeybe commented Apr 4, 2023

The way Firestore transactions work is a little different from your understanding. Also, the "race" idea won't work.

Here is how Firestore's transactions work:

  1. Firestore.runTransaction() makes a server request to begin a transaction, and records the resulting "transaction ID".
  2. The callback specified to Firestore.runTransaction() is invoked.
  3. If the callback invokes Transaction.get() then a read lock is acquired on that document.
  4. If the callback invokes Transaction.set() (or update) then the "write" is saved for later execution.
  5. Firestore.runTransaction() awaits the Promise returned from the callback.
  6. If the Promise was fulfilled, then another server request is made to "commit" the transaction, which includes the "transaction ID" and the "writes" that were requested in the callback. As part of the commit, the server attempts to upgrade any "read locks" to "read/write locks", waiting for up to 60 seconds for all other locks to be released. If waiting times out, then ABORTED is returned from the server and the client backs off and retries. Otherwise, the writes go through and the locks are released.

So the "racing" won't work because the call to Transaction.get() is not blocked but rather the "commit" request sent to the backend is blocked waiting for the locks to be released, and there is no way to abort this request.


tcaputi commented Apr 5, 2023

Thank you very much for your help. I'm going to think about this for a little bit and try some solutions when work calms down (hopefully by Friday).


tcaputi commented Apr 30, 2023

I'm sorry for leaving this for so long. Using v18 of node seemed to make things much more stable, but I am still able to reproduce the problem in my production app (I just need more people pressing buttons at once).

I appreciate the help you provided, but I have decided to go a slightly different route. I am a bit worried about the optimistic locking idea you mentioned because it seems like it won't be ideal performance-wise in exactly the case that causes this problem. If users are rapidly clicking and making changes in the UI, I will still need to exponentially back off these requests, which I'm afraid will result in fairly unpredictable latencies and a bad UX.

In addition, I have been adding more and more features, and it's becoming apparent that I will probably want SQL in the near future (at least for some of my queries). So I did some testing, and I think I'm going to store the 'single point of truth' for the document in a Postgres jsonb column. From there I can use Postgres transactions exactly as I am currently using Firestore transactions. Once everything is resolved, I can simply write a copy to Cloud Firestore without any transactions to push the changes to my clients in real time. In my testing this seems to work no matter how fast I press the buttons, and it's actually lower latency for the user who initiated the transaction since they don't have to wait for Firestore.
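
Here is the rough shape of it (sketch only; the table, column, and collection names are illustrative, and I'm assuming the pg driver):

import pg from "pg";
import { Firestore } from "@google-cloud/firestore";

const pool = new pg.Pool();
const db = new Firestore();

async function updateTournament(id, mutate) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // The Postgres row lock plays the role of the Firestore transaction
    const { rows } = await client.query(
      "SELECT doc FROM tournaments WHERE id = $1 FOR UPDATE", [id]);
    const doc = mutate(rows[0].doc);
    await client.query("UPDATE tournaments SET doc = $2 WHERE id = $1", [id, doc]);
    await client.query("COMMIT");
    // Plain (non-transactional) write to Firestore, purely as realtime fan-out
    await db.doc(`tournaments/${id}`).set(doc);
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}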

I think if I had to ask for one feature that would have saved me a lot of time, it would be the ability to configure how long a transaction can wait for my app. The hobby tier of Vercel only gives me 10s before timing out my serverless function, and it would be great if I could tell Firestore "Hey, if you can't finish this in 5 seconds, just drop it, because the serverless function won't be around in 60s".

I'm happy to talk about this more if you need, but I think I have a solution for now.


dconeybe commented May 5, 2023

Thank you for the explanation and for bearing with me as I investigated. Your proposed solution sounds pretty reasonable (using Postgres and pushing the final result to Firestore). I wish things could have worked out better for a Firestore-only solution. It still baffles me how things were getting stuck in node.

I'll bring up your suggestion about transaction timeouts with the team and see what their thoughts are. Given our current priorities, we likely won't be able to address it soon. Nonetheless, thank you for your time investigating this issue, and all the best.

@dconeybe dconeybe closed this as completed May 5, 2023