-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transactions cause NodeJS to lock up, locks out all other transactions (reproducer included) #1834
Comments
I just wanted to point out that this issue is particularly problematic because it not only locks up this script. Because it dies in the middle of a transaction, it actually ends up pessimistically locking out any other instances that are trying to start their own transactions. Therefore the document is basically left read-only for several minutes. In addition, the admin sdk is meant to be run on servers and this issue literally causes the server to become unresponsive as it pegs the CPU. Here is a flamegraph of the issue. It appears to be running mallloc over and over again as best I can tell (you need to click the button to view cpp code). This time, github doesn't like me uploading html files, so ive renamed it to flamegraph.txt. |
Please let me know if there is anything I can do to help debug this. This is enough of a dealbreaker that I will need to convert my app to another service if I can' resolve it. Thanks. |
@tcaputi Thank you for the detailed bug report and the reproduction app. Could you please reproduce it with debug logging enabled, and attach the debug logs? To enable debug logging, add this import: import { setLogFunction } from "firebase-admin/firestore"; Then call setLogFunction(console.log); |
Thank you for updating the log. I'll take a closer look next week. |
OK. Let me know if you need any help or additional debugging. Like I said this is kinda the top issue for my app so I'm available if you need help debugging. I'm working on a workaround but its not going to be easy. |
So your logs look similar to mine when I run with the reproduction script. That is, a bunch of lines like this:
which indicate that a transaction was aborted due to contention on the locked documents. The transactions will retry 5 times (by default) with exponential backoff in hopes of eventually succeeding. This really isn't a surprise since the transactions in the test script are running in parallel. All of this is expected behavior. The real problem seems to be that this is consuming 100% of your CPU, bringing down your server, then failing to unlock the locked documents for 270 seconds. This part of your experience I cannot reproduce. Do you have any ideas that I could try to reproduce this? One possible workaround would be to instead use "optimistic locking" like the Firestore client SDKs do. This wouldn't put any locks on your documents, but the drawback is that you'd need to handle the exponential backoff yourself. Here is how you could do this:
Is this something you could try? |
There are a few problems with using the client sdks (in my application)
If at all possible I would like to not workaround the issue in this way. If the issue can't be resolved I would probably move the data to postgresql and just use cloud firestore to push "go get the newest version" messages to the client. This is a big also a big lift which I would like to avoid. In terms of reproduction I can make it happe pretty easily, albeit it usually takes a few attempts so I'm surprised. In some runs I see the errors you mentioned and those make sense (I can deal with errors). However, I can also lock up the process within about 5 attempts very consistently. This problem also seems to happen on my Vercel serverless api which is running Node v16.x and (probably) Linux. Can I ask what OS, nodejs version, and firebase-admin version youre on? I'm also available to demonstrate it to you on slack, discord, phone, google meet, or anything else if its more efficient. Thanks for looking into it. |
Just to be clear, I wasn't suggesting switching to the client sdks! I was only suggesting emulating the optimistic locking that it performs if it is appropriate for your use case, which it sounds like it is not. I'll try some more to reproduce and maybe I'll get lucky. My OS is Linux (a Google internal distro based on debian) with node v18.14.2 and |
Sure. I'll try to come up with that now. |
https://github.com/tcaputi/firestore-crash See the readme for steps. For me this triggered the problem on the first try. Let me know if theres anything else I can do to help. I used the |
Ok good news. I was finally able to reproduce the "cpu pegging" after running the test app you provided about 10 times. I'll now try to reproduce it in a way where I can attach a debugger and dig in. |
Awesome. Not sure why it takes longer on your machine but im glad that its not just me. In trying to figure this out I was able to attach a debugger, but as soon as it hit the issue the pause feature refused to work anymore. I did some more debugging myself with console.log(), but I kinda got lost trying to follow some of the callbacks. Let me know if theres anything else I can help with. |
I've hit a road block in my investigation. I'm going to take a break from investigating and will resume early next week. What I've found so far is that when the 100% CPU usage is observed, the calls to
So far I've written an integration test that intermittently reproduces the bug: #1836. This is how I've been able to attach the debugger and investigate so far. |
This is what I have observed as well. Thanks again. |
@tcaputi I'm going to continue investigating this week. But in case the investigation drags on, it could be worthwhile to try out the "optimistic locking" style described in #1834 (comment). To reiterate, the suggestion is not to use the client JavaScript SDK but rather to implement its style of transactions using the APIs in this nodejs-firestore SDK. Something else to try would be to reduce the size of the document in question. It is a very large document and its sheer size likely contributes to this problem. Redesigning this "monolithic" document could also help your app scale as it grows. |
I understand. Thank you for looking into it. I will probably try this first.
Got it. I assumed because I am ~10x under the maximum size that it would be ok. Its a shame because my app is a bit hard to split up. The app is designed for running tournaments, which consist of a set of games. When a game updates that update may need to cascade changes to a bunch of other games. There isnt a good way to know in advance exactly which games would need to be updated until I do the calculations so either I would need to (1) eat the cost of multiple roundtrips to firestore each time I need to fetch a new game or (2) just fetch all of them every time I want to do an update. The first option will increase my minimum latency significantly and make my maximum latency virtually O(log(n)) where n is the number of games, which is not acceptable. The code would also get significantly more complicated, which would fine if it had a more obvious benefit. The second option would multiply my read cost by n, which would literally mean the cost of running a tournament is inflated hundreds of times over. But it seems to me, based on your comments, that this kind of workload may not be ideal for firestore. If the optimistic locking option doesn't work out I think I'm going to look into switching my project over to use postgresSQL + a cache to store / serve the data, and then I can just use firestore to push out notifications that the data has changed. Please let me know if you come up with a solution / workaround. |
I haven't made much progress finding a root cause, but here is something you could try. In order to try and figure out where the code is "spinning" I tried compiling node from source. Lo and behold, the issue did not reproduce. I tried v18.4.0 and HEAD (v20.0.0-pre), both "release" and "debug" builds, and none of them would reproduce this issue. The point is, do you have any flexibility to try different node.js runtimes? Different versions? Different modes of compilation? If you do, as a workaround, you could try some different node.js runtimes and see if any of them work for you. |
I can try some things locally. Vercel only really gives me 14.x 16.x and 18.x as options. My code is currently running 16.x. I will give 18.x a shot in a minute |
Switching to 18.x seems to have made the problem much harder to reproduce in my app. I'm going to need to do some scale testing but this looks promising. I have no idea what could have happened in node to cause this problem, but it would explain why it was so hard to debug. |
I just tried to reproduce the issue on my live app by having a friend and i both click buttons as fast as we could in the ui at the same time. We were still able to reproduce the "document locked for several minutes" problem, but it required us to both mash transactions as hard as we could for about 30s (probably about 4 tx/second). I understand that this is not in spec for how firestore is meant to be used, but it was meant as a stress test. That said, its not 100% clear if the document ended up locked because we hit this CPU spinning bug or simply because the serverless function was timed out and killed in the middle of a transaction. I was wondering if you could help me understand a couple things about how firestore handles errors / timeouts.
|
No, it does not.
I don't quite follow what you're suggesting. At a high level this sounds reasonable though, except there is no way to cancel an in-flight transaction.
No. But one option would be to set the My gut instinct still tells me that using the "optimistic concurrency" will likely yield good results for your use case. Have you had a chance to give that a try? |
I assumed that the way transactions worked was:
Assuming this is right (and please tell me if its not), my thought is that the transaction must be spending most of its time waiting at step 2 when there is contention. I also know that (again, correct me if I'm wrong) I can cancel a transaction by throwing an error inside the callback. So my thought was that I could
Not yet, I might have time to try that tomorrow / the next day but my other job is taking up a good chunk of my time right now. |
The way that Firestore transactions work is a little different that your understanding. Also, the "race" idea won't work. Here is how Firestore's transactions work:
So the "racing" won't work because the call to Transaction.get() is not blocked but rather the "commit" request sent to the backend is blocked waiting for the locks to be released, and there is no way to abort this request. |
Thank you very much for your help. I'm going to think about this for a little bit and try some solutions when work calms down (hopefully by friday). |
I'm sorry for leaving this for so long. Using v18 of node seemed to make things much more stable, but I am still able to reproduce the problem in my production app (I just need more people pressing buttons at once). I appreciate the help you provided, but I have decided to go a slightly different route. I am a bit worried about the optimistic locking idea you mentioned because it seems like it won't be ideal performance-wise in the case that causes this problem. If users are rapidly clicking / making changes in the UI then I will still need to exponentially back these requests off which I'm afraid will result in fairly unpredictable latencies and a bad UX. In addition, I have been adding more and more features and its becoming apparent that I will probably want SQL in the near future (at least for some of my queries). So I did some testing and I think I'm going to store the 'single point of truth' of the document in a postgres jsonb column. From there I can use postgres transactions exactly as I am currently using firestore transactions. Once everything is resolved I can simply write a copy to cloud firestore without any transactions to push the changes to my clients in real time. In my testing this seems to work no matter how fast I press the buttons and its actually lower latency for the user who initiated the transaction since they don't have to wait for firestore. I think if I had to ask for one feature that would have saved me a lot of time, I would want the ability to configure how long a transaction can wait for me app. I know the hobby tier of vercel only gives me 10s before timing out my serverless function and it would be great if I could tell firestore "Hey if you can't finish this in 5 seconds just drop it because the serverless function won't be around in 60s". I'm happy to talk about this more if you need, but I think I have a solution for now. |
Thank you for the explanation and bearing with me as I investigated. Your proposed solution sounds pretty reasonable (using postgres and pushing the final result to firestore). I wish things could have worked out better for using a Firestore-only solution. It still baffles me how things were getting stuck in node. I'll bring up your suggestion on transaction timeouts with the team and see what their thoughts are. Given our current priorities, it wouldn't be soon that we'd be able to address this. Nonetheless, thank you for your time in investigating this issue and all the best. |
Thanks for stopping by to let us know something could be better!
PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.
Is this a client library issue or a product issue?
I believe it is a problem with both. There is certainly a problem with the library
Did someone already solve this?
Not as far as I have been able to tell. It is not fixed on the latest version of the
firebase-admin
package (11.5.0)Do you have a support contract?
No
If the support paths suggested above still do not result in a resolution, please provide the following details.
Environment details
@google-cloud/firestore
version: 6.4.1 (I installed it as part of firebase-admin which is version 11.5.0)Steps to reproduce
Follow the instructions at the bottom of this script. It is a large file but most of that is test data from my app (it seems easier to reproduce with large objects). Github doesnt want me uploading mjs files so rename it to crash.mjs. You will need a firebase service account.
crash.txt
Making sure to follow these steps will guarantee the quickest resolution possible.
Thanks!
The text was updated successfully, but these errors were encountered: