Cloud NDB - suppressed failures to set and delete in memcache cause cache inconsistency #656

justinkwaugh · 2021-05-18T20:05:56Z

Cloud NDB v 1.8.0, app engine standard python 3

Ok, after much wailing and gnashing of teeth, I finally got to the root of a cache inconsistency issue I was experiencing. It's transient, but happens regularly, and I finally pinned it down.

At a high level, the problem is that both the setting and the deleting of the lock value can fail silently (and will do so for the same key in the same put call more than you would think!). This completely invalidates any attempts that _datastore_api.put() is doing to maintain cache consistency, as the stale value is just left in memcache.

pymemcache by default adds noreply to many commands -- in this case the problematic ones were set_many / delete_many. It is apparently not at all unlikely for a set or delete to fail, in which case pymemcache just silently suppresses the failure. If noreply=True is added to the call, then it will return a list of keys that failed from set_many, and I believe for delete_many it will raise an exception if any fail. Failing the delete silently is not a critical issue I suppose, though it will then prevent all readers from being able to set and thus subsequently get from memcache for that key until the lock expires, which takes 32s.

My suggestion would be to add noreply=False to the call to set_many() and raise a typed exception in the Futures that failed to set. The exception type would be added to the tuple of transient errors in MemcacheCache, and as such, everything will just sort of work itself out. I would also add noreply=False to the call to delete_many(), though I'm more unclear about its behavior in that case. I've not yet tried it so I do not know what exception it will throw on failure, but it could also be added to the transient errors tuple. The downside is that it does not tell you which key(s) failed so the exception will be thrown for all Futures.

The text was updated successfully, but these errors were encountered:

Fixes googleapis#656

chrisrossi · 2021-05-28T19:40:46Z

@justinkwaugh The reasoning here makes total sense, but I haven't been able to reproduce the issue IRL. Maybe your test case puts a much larger strain in memcache that what I'm doing to test with a live application. If you wanted to deploy the branch in #665 and test with your application, that would be superb. Thanks!

justinkwaugh · 2021-05-28T20:03:08Z

Thanks for the update @chrisrossi. I will try to find time to do so, but I can't guarantee reproduceability. Like most transient issues, it's hard to reproduce, though it can obviously be simulated in a test case. Basically I have a set of regression tests (~500) that run in parallel using task queues in app engine and they update a single record in the datastore on completion which stores the number of pass/fail for that run, and very conveniently it would end with cache inconsistency for that record every so often (maybe 1 in 5 times). I've monkey patched in my own fix, but regardless I will definitely take a close look at your pull request as well.

justinkwaugh · 2021-05-28T20:25:38Z

@chrisrossi I should also add that it's a little hard to test an individual change like this because there are so many issues which combine to cause inconsistency. #651 / #652 also contribute strongly to the problem, as they allow a reader thread to just do whatever it likes when combined with a failure to delete by the writer thread (which happens due to connection reset, or possibly a silent failure to delete, which I don't think can be detected even with noreply=False)

Fixes #656

product-auto-label bot added the api: datastore Issues related to the googleapis/python-ndb API. label May 18, 2021

crwilcox added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels May 18, 2021

justinkwaugh mentioned this issue May 20, 2021

Clearing the global cache at the end of a transaction is unnecessary #657

Closed

chrisrossi pushed a commit to chrisrossi/python-ndb that referenced this issue May 28, 2021

fix: detect cache write failure for MemcacheCache

5255f53

Fixes googleapis#656

chrisrossi mentioned this issue May 28, 2021

fix: detect cache write failure for MemcacheCache #665

Merged

chrisrossi closed this as completed in #665 Jun 7, 2021

chrisrossi pushed a commit that referenced this issue Jun 7, 2021

fix: detect cache write failure for MemcacheCache (#665)

5d7f163

Fixes #656

chrisrossi mentioned this issue Jul 22, 2021

Test suite for complicated concurrency/fault tolerance caching scenarios #691

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cloud NDB - suppressed failures to set and delete in memcache cause cache inconsistency #656

Cloud NDB - suppressed failures to set and delete in memcache cause cache inconsistency #656

justinkwaugh commented May 18, 2021 •

edited

Loading

chrisrossi commented May 28, 2021

justinkwaugh commented May 28, 2021

justinkwaugh commented May 28, 2021 •

edited

Loading

Cloud NDB - suppressed failures to set and delete in memcache cause cache inconsistency #656

Cloud NDB - suppressed failures to set and delete in memcache cause cache inconsistency #656

Comments

justinkwaugh commented May 18, 2021 • edited Loading

chrisrossi commented May 28, 2021

justinkwaugh commented May 28, 2021

justinkwaugh commented May 28, 2021 • edited Loading

justinkwaugh commented May 18, 2021 •

edited

Loading

justinkwaugh commented May 28, 2021 •

edited

Loading