
Clearing the global cache at the end of a transaction is unnecessary #657

Closed
chrisrossi opened this issue May 19, 2021 · 6 comments

@chrisrossi (Contributor)

There is code which, after committing a transaction, clears keys from the global_cache if they were used during that transaction. After looking at this code a little more closely, I've concluded this is unnecessary.

The code in question is the method google.cloud.ndb.context._Context._clear_global_cache, which is called from google.cloud.ndb._transaction._transaction_async. What it does is look at which keys are present in the local, in-memory cache, which is used only for the transaction, and then delete all of those keys from the global cache.
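
To make the behavior concrete, here is a minimal sketch of what that cleanup amounts to, based only on the description above; the attribute and method names are illustrative, not python-ndb's actual internals.

```python
# Illustrative only -- not the actual python-ndb source.
# After the transaction commits, every key present in the transaction-local,
# in-memory cache is also deleted from the global cache.
def _clear_global_cache(context):
    keys = list(context.cache)              # keys seen during the transaction (hypothetical attribute)
    if keys:
        context.global_cache.delete(keys)   # hypothetical batch-delete call
```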

Running through the possible scenarios, I'm not seeing a good reason to do this:

Scenario 1: Entity lookup. If an entity lookup is done during a transaction, the global cache is not used at all. It is neither queried nor written to. So there is no impact on the global cache and clearing these keys is unnecessary.

Scenario 2: Entity put. When an entity is written to Datastore, whether or not in a transaction, that entity's key is deleted from the global cache. Deleting the same key again in the call to _clear_global_cache is redundant.

Scenario 3: Entity delete. This is essentially the same as Scenario 2. The key has already been deleted before the call to _clear_global_cache, so deleting it again is redundant.

I believe these are all the scenarios.

What's the harm in calling _clear_global_cache anyway?

Recently a spate of issues has had me looking at fault tolerance and how we handle transient errors. Specifically, in the work leading up to #654, I had to look at everywhere the cache is used and reason about what should happen if a particular call to the cache fails. In the case of _clear_global_cache, I think that if this weren't redundant code, we'd be in a pickle if those key deletions failed, as it would leave stale data in the cache and we'd have to figure out how to mitigate the damage from that.

The finding that this code is unnecessary in the first place is fortunate, because now we don't have to reason about fault tolerance for it. If we left it in, we'd keep the burden of maintaining it and reasoning about what should happen in the event of a communication failure with the cache. Luckily, it can simply be removed, which is a much better outcome.

@chrisrossi added the priority: p2 and type: bug labels May 19, 2021
@chrisrossi self-assigned this May 19, 2021
@product-auto-label bot added the api: datastore label May 19, 2021
chrisrossi pushed a commit to chrisrossi/python-ndb that referenced this issue May 19, 2021
Code that cleared the global cache at the end of a transaction was found
to be unnecessary.

Fixes googleapis#657

@justinkwaugh commented May 20, 2021

@chrisrossi @crwilcox I do not believe this is correct at all. This is the only correct way to clear the locked keys when in a transaction, because you want the unlocking to happen along the transaction boundary. For Scenario 1, yes, it's not necessary and in fact is, in theory, harmful. For Scenario 2, your reasoning is that the key is already being deleted regardless of transactionality, but that is itself a bug: it is incorrect to unlock the keys mid-transaction. That is a regression from legacy NDB and can lead to cache inconsistencies. See #650.

There is nothing wrong with these delete calls failing on put, because there can never be stale values in the cache, only lock values (once #656 is fixed). Those are fine to leave in the cache, as they will expire after 32s if the delete fails. I definitely agree that keys which were only looked up do not need to be deleted here, and deleting them can in fact lead to cache inconsistencies. The correct set of keys to delete is the keys whose database records were mutated during the transaction.

I have concerns that these changes keep being made which seem to indicate an incorrect understanding of how the library is trying to maintain cache consistency.

@justinkwaugh commented May 20, 2021

Maybe it would help to lay out the correct flow of memcache operations during a transaction in which multiple records are modified:

  1. Transaction starts
  2. Key A.get() is called
  3. Key A gets looked up directly from the db, memcache is not used because we are in a transaction, but the value read from the db will be stored in the context cache
  4. Model A.put() is called
  5. Memcache set is called with a lock value for key A. The result of this memcache operation needs to be checked and, on failure, retried like other transient failures; ultimately, if it does not succeed, an exception needs to be raised and the transaction failed (assuming strict write mode). On success, there is now a lock value in memcache with an expiration of 32s. This will prevent any other read threads from attempting to modify the cache with stale values.
  6. The datastore put operation happens
  7. Now repeat 2-6 for Key/Model B
  8. Key C.get() is called, and like step 3 no memcache is used, but the value goes in the context cache
  9. The datastore commit operation happens
  10. Now Key A and Key B lock values need to be deleted from memcache. Key C was only looked up and should not be deleted. If the delete fails, the downside is that memcache cannot be updated by a read thread for 32s, which is ok. The delete should happen as quickly as possible at this point to allow the read threads to update the value.
  11. The transaction context ends, updating the parent context cache

This is how things should work. The most critically important piece is step 5, which is not correct right now due to bug #656. There are caveats to steps 5 and 9 that mean this is currently not enough for perfect cache consistency (see #653), but it's close. It is also very important that the lock be held for the duration of the transaction, which is currently bug #650 and which is being broken by the change in this ticket. The tasklets for step 10 are not being flushed, so the deletes may not happen quickly, which is bug #655. Additionally, as mentioned, in step 10 only the keys of mutated objects should be deleted, which currently has no bug ticket.
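
For concreteness, here is a minimal sketch of that write-side sequence against a bare pymemcache client. The `_LOCK` sentinel, the `datastore_commit` callable, and the key handling are placeholders for illustration only; this is not python-ndb's actual implementation.

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211), default_noreply=False)
_LOCK = b"__locked__"   # placeholder lock sentinel, not the library's real value
_LOCK_EXPIRY = 32       # seconds, matching the 32s expiration described above

def transactional_write(mutated_keys, datastore_commit):
    # Step 5: lock each mutated key so concurrent readers cannot repopulate
    # the cache with stale data while the transaction is in flight. A real
    # implementation would retry transient failures before giving up.
    for key in mutated_keys:
        if not cache.set(key, _LOCK, expire=_LOCK_EXPIRY):
            raise RuntimeError("could not lock %r; fail the transaction" % key)

    # Steps 6-9: the datastore mutations and commit happen here.
    datastore_commit()

    # Step 10: only after the commit, delete the lock values for the mutated
    # keys. Keys that were merely read are left alone. If a delete fails, the
    # lock simply expires after 32s, so no stale value can be written.
    for key in mutated_keys:
        cache.delete(key)
```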

@justinkwaugh commented May 20, 2021

Now here is the correct flow for a read thread:

  1. Key A.get() is called
  2. Key A is looked up in the context cache, if a value is present it is returned, otherwise continue
  3. Key A is looked up in memcache, if a value is present it is returned, otherwise continue
  4. memcache.add() is called to put a lock value in memcache. This will fail if any value is already present in the cache (i.e. a lock value or model value written from another thread) or on a transient failure (no actual need to retry in this case).
  5. If step 4 fails memcache should not be interacted with for the rest of the operation
  6. Now Key A is watched. In memcache terms this means the key is looked up again but with cas id returned using gets()
  7. The datastore get operation happens and the context cache is updated with the read value
  8. Assuming step 4 succeeded and a model was read from the db, the value is now placed into the cache using cas. This will fail if any other thread (a write thread) has modified that key in memcache. If a value was not read, the keys are unwatched (which just clears the cas id from the thread-local in MemcacheCache)

In this flow, step 4 is very important, and it is currently not correct due to bugs #651 and #652. It is the combination of the memcache add in step 4 with the memcache cas call in step 8 that allows for cache consistency.

Obviously this is all memcache-centric, but the same applies to redis; step 4 would be something like SETNX in redis terms.
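
To make the read path concrete as well, here is a hedged pymemcache sketch of steps 1-8 (the context cache is omitted, and `_LOCK` and `load_from_datastore` are made-up placeholders, not python-ndb code):

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211), default_noreply=False)
_LOCK = b"__locked__"   # placeholder lock sentinel

def cached_read(key, load_from_datastore):
    # Steps 2-3: check the cache first (context cache omitted here).
    cached = cache.get(key)
    if cached is not None and cached != _LOCK:
        return cached

    # Step 4: try to claim the key with add(); this fails if any value,
    # including another thread's lock, is already present.
    # Redis equivalent: SET key <lock> NX EX 32.
    claimed = cache.add(key, _LOCK, expire=32)

    # Steps 5-6: if the claim failed, leave memcache alone for the rest of the
    # operation; otherwise re-read with gets() to obtain a cas token.
    _, cas_token = cache.gets(key) if claimed else (None, None)

    # Step 7: read the authoritative value from the datastore.
    value = load_from_datastore(key)   # assumed to return bytes or None

    # Step 8: write back only if we claimed the key; cas() fails if any
    # other (write) thread has modified the key in the meantime.
    if claimed and cas_token is not None and value is not None:
        cache.cas(key, value, cas_token, expire=0)
    return value
```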

@chrisrossi (Contributor, Author)

Hi @justinkwaugh ,

I think the point still stands that, with the library in its current state, this code doesn't really do anything. It looks like we may have to do something similar to this for #650, but it wouldn't be exactly this, because this code deletes keys that were only read during the transaction. So we'll deal with that in the scope of #650.

This will be an iterative process.

Thank you for bearing with us.

@justinkwaugh commented May 20, 2021

@chrisrossi I hear what you are saying; however, deleting immediately after put rather than at the end of the transaction is, in my opinion, a worse incorrect state to be in than deleting the read-only keys. The title of this ticket, and the argument within it, is that clearing the global cache after the transaction is unnecessary because it is already being done. But that's exactly backwards from reality: clearing the global cache after the transaction is the only time it is necessary in a transactional flow; it just needs to be more restrictive. So you are planning to remove a needed feature because a bug elsewhere allows it to sort of work, so that later you can fix that bug and then add this feature back in.

#650 is meant to be a narrowly focused one line bug fix to address a regression from legacy NDB.

Deleting the read-only keys is how legacy NDB works, so it is a bug there as well. It makes more sense to fix #650 to bring this library in line with legacy NDB, and then to open a new ticket to make this clear_global_cache call, which still has to happen where it currently does, more restrictive about which keys it deletes.

I'm all for iterative, but there is an order of criticality and likelihood of causing issues that needs to be considered. For instance, #656 is actually causing cache inconsistencies for me on a regular basis in real usage, causing breakages, and the library with these bugs as-is is a showstopper for me.

Cache consistency is also a holistic thing. I have filed several bug tickets to point out specific bugs, but the solution needs to be looked at more broadly when thinking about how to prioritize and address the issues. An example is your argument that "we'd be in a pickle if those key deletions failed, as it would leave stale data in the cache." If bugs #656 and #651 are fixed, the likelihood of that is limited to the edge cases listed in some of the other, lower-priority cache-inconsistency bugs, and it goes to zero if all the rest are fixed. Your arguments have also stated that the cache clearing is leading to problematic cache inconsistencies, when in reality it is one of the things least likely to do so.

A more reasonable order of bug fixes which would actually iteratively improve the cache consistency would be:
#656
#651
#652
#650
(fix the issue of the clear cache deleting read only keys here)
#653
#655

chrisrossi pushed a commit to chrisrossi/python-ndb that referenced this issue May 20, 2021
When in a transaction, keys should only be cleared from the global cache
after the transaction has been committed.

Fixes googleapis#650 googleapis#657
chrisrossi pushed a commit that referenced this issue Jun 7, 2021
* fix: defer clearing global cache when in transaction

When in a transaction, keys should only be cleared from the global cache
after the transaction has been committed.

Fixes #650 #657

@chrisrossi (Contributor, Author)

Closed by #660
