Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud NDB - suppressed failures to set and delete in memcache cause cache inconsistency #656

Closed
justinkwaugh opened this issue May 18, 2021 · 3 comments · Fixed by #665
Closed
Labels
api: datastore Issues related to the googleapis/python-ndb API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@justinkwaugh
Copy link

justinkwaugh commented May 18, 2021

Cloud NDB v 1.8.0, app engine standard python 3

Ok, after much wailing and gnashing of teeth, I finally got to the root of a cache inconsistency issue I was experiencing. It's transient, but happens regularly, and I finally pinned it down.

At a high level, the problem is that both the setting and the deleting of the lock value can fail silently (and will do so for the same key in the same put call more than you would think!). This completely invalidates any attempts that _datastore_api.put() is doing to maintain cache consistency, as the stale value is just left in memcache.

pymemcache by default adds noreply to many commands -- in this case the problematic ones were set_many / delete_many. It is apparently not at all unlikely for a set or delete to fail, in which case pymemcache just silently suppresses the failure. If noreply=True is added to the call, then it will return a list of keys that failed from set_many, and I believe for delete_many it will raise an exception if any fail. Failing the delete silently is not a critical issue I suppose, though it will then prevent all readers from being able to set and thus subsequently get from memcache for that key until the lock expires, which takes 32s.

My suggestion would be to add noreply=False to the call to set_many() and raise a typed exception in the Futures that failed to set. The exception type would be added to the tuple of transient errors in MemcacheCache, and as such, everything will just sort of work itself out. I would also add noreply=False to the call to delete_many(), though I'm more unclear about its behavior in that case. I've not yet tried it so I do not know what exception it will throw on failure, but it could also be added to the transient errors tuple. The downside is that it does not tell you which key(s) failed so the exception will be thrown for all Futures.

@product-auto-label product-auto-label bot added the api: datastore Issues related to the googleapis/python-ndb API. label May 18, 2021
@crwilcox crwilcox added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels May 18, 2021
chrisrossi pushed a commit to chrisrossi/python-ndb that referenced this issue May 28, 2021
@chrisrossi
Copy link
Contributor

@justinkwaugh The reasoning here makes total sense, but I haven't been able to reproduce the issue IRL. Maybe your test case puts a much larger strain in memcache that what I'm doing to test with a live application. If you wanted to deploy the branch in #665 and test with your application, that would be superb. Thanks!

@justinkwaugh
Copy link
Author

Thanks for the update @chrisrossi. I will try to find time to do so, but I can't guarantee reproduceability. Like most transient issues, it's hard to reproduce, though it can obviously be simulated in a test case. Basically I have a set of regression tests (~500) that run in parallel using task queues in app engine and they update a single record in the datastore on completion which stores the number of pass/fail for that run, and very conveniently it would end with cache inconsistency for that record every so often (maybe 1 in 5 times). I've monkey patched in my own fix, but regardless I will definitely take a close look at your pull request as well.

@justinkwaugh
Copy link
Author

justinkwaugh commented May 28, 2021

@chrisrossi I should also add that it's a little hard to test an individual change like this because there are so many issues which combine to cause inconsistency. #651 / #652 also contribute strongly to the problem, as they allow a reader thread to just do whatever it likes when combined with a failure to delete by the writer thread (which happens due to connection reset, or possibly a silent failure to delete, which I don't think can be detected even with noreply=False)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: datastore Issues related to the googleapis/python-ndb API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants