fs: synchronize updates to cache #289

Closed · wants to merge 1 commit

Conversation

@skshetry (Contributor) commented Jun 23, 2023

https://github.com/iterative/PyDrive2/pull/286/files/b9fc857613ee18cd08e32c691933a82f5f371e79#r1239272908

Looks like we need to synchronize cache updates, if we expect dirs and ids to be in sync.

@skshetry requested a review from shcheklein June 23, 2023
@skshetry temporarily deployed to internal June 23, 2023 with GitHub Actions
@skshetry mentioned this pull request Jun 23, 2023
@shcheklein (Member)

Thanks @skshetry. I don't think this solves the race condition issue, though (unless RLock guarantees atomicity, which would surprise me, and which would make me concerned about performance). Yes, the update becomes thread safe (and, as we agreed before, that might not even be an issue: we don't need to isolate and serialize "writers", or at least doing so is not enough). You still have an unprotected reader in find, which can get different sets from two subsequent calls.

We can expand the cache init to include necessary roots there (file/md5/00 .... etc) the same way it initializes 00 etc now.

@skshetry (Contributor, Author)

In this case you still have an unprotected reader in the find, which still can get different sets from two subsequent calls.

I think that is okay. I don't think we want to optimize this too much (the worst case is parallel calls when the cache is empty). From dvc's point of view, this should be fine.

We can expand the cache init to include necessary roots there (file/md5/00 .... etc) the same way it initializes 00 etc now.

I'd say this should be fixed on the dvc side by using a separate filesystem for the files/md5 remote. Then those files will automatically get cached.


I would be concerned about the performance then

If you look at _ids_cache, it is behind an RLock, so the intention seems to have been to protect reads/writes all along; we are already locking every access to the cache. It's just that that isn't enough here, because two writes are involved.

@shcheklein (Member)

Could you clarify what this RLock does for you, then? We might be thinking about different scenarios.

In my case I mean this:

        cached = base in self._ids_cache["dirs"]
...
        dir_ids = [self._ids_cache["ids"].copy()]

It can return cached = True while dir_ids is still empty, AFAIU.

That means the result of find might be empty in a case where it should not be.

Are we both talking about the same case?

@skshetry (Contributor, Author) commented Jun 24, 2023

@shcheklein, yes, that is what RLock ensures.

cache["dirs"][path].append(item_id)
cache["ids"][item_id] = path

If base is in self._ids_cache['dirs'], there must be a corresponding entry in self._ids_cache['ids'] for that base too.
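For illustration, here is a minimal sketch of the synchronized two-map update being discussed (class and method names are hypothetical, not PyDrive2's actual API; only the two-writes-under-one-lock shape is taken from the thread):

```python
import threading


class ToyCache:
    """Toy version of the path/id cache with a synchronized writer."""

    def __init__(self):
        self._lock = threading.RLock()
        self._ids_cache = {"dirs": {}, "ids": {}}

    def _cache_path_id(self, path, item_id):
        # Both maps are updated while holding the lock, so no other
        # *writer* can interleave between the two assignments.
        with self._lock:
            self._ids_cache["dirs"].setdefault(path, []).append(item_id)
            self._ids_cache["ids"][item_id] = path
```

With respect to other writers taking the same lock, the two maps can never diverge; the open question in the thread is what happens to readers that do not take the lock.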

@shcheklein (Member)

If there is base in self._ids_cache['dirs'], there must be self._ids_cache['ids'] for that base too.

Could you elaborate on this? As far as I can tell, that holds only if Python never yields control to another thread while it's inside the RLock, but AFAIK that is not the case.

@skshetry (Contributor, Author) commented Jun 24, 2023

Could you elaborate on this? As far as I can tell, that holds only if Python never yields control to another thread while it's inside the RLock, but AFAIK that is not the case.

The other thread will be blocked and won't enter this function until the owning thread releases the lock.

From docs on RLock/Lock:

... if this thread already owns the lock, increment the recursion level by one, and return immediately. Otherwise, if another thread owns the lock, block until the lock is unlocked. Once the lock is unlocked (not owned by any thread), then grab ownership, set the recursion level to one, and return. If more than one thread is blocked waiting until the lock is unlocked, only one at a time will be able to grab ownership of the lock. There is no return value in this case.
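The quoted reentrancy semantics can be shown with a tiny, purely illustrative snippet:

```python
import threading

rlock = threading.RLock()

# The same thread may acquire an RLock it already holds; each acquire
# increments the recursion level and each release decrements it.
with rlock:
    with rlock:  # a plain threading.Lock would deadlock here
        reentered = True
```

Note that this only serializes threads that contend on the lock; it says nothing about code that touches the shared state without acquiring it.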

@shcheklein (Member)

The other thread will be blocked and won't enter this function, until the owning thread releases the lock.

In the snippet that I've shared, the other thread is not using that function:

        cached = base in self._ids_cache["dirs"]
...
        dir_ids = [self._ids_cache["ids"].copy()]

So, even if there is a thread A that is doing an update under the RLock, and a thread B that is waiting to do that update, there could be a thread C that is executing the very top of find. When thread A is in the middle of the update (having just updated dirs), it can be forced to yield control to thread C, which can then read self._ids_cache["dirs"] while self._ids_cache["ids"] is still empty.
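The scenario can be reproduced with a toy sketch (all names hypothetical; the sleep only widens a window that exists anyway, since the GIL can switch threads between the two dictionary writes):

```python
import threading
import time

lock = threading.RLock()
cache = {"dirs": {}, "ids": {}}
observed_partial = []


def writer():
    # Writer holds the lock across both writes, like the PR's update path.
    with lock:
        cache["dirs"]["base"] = ["id1"]
        time.sleep(0.05)  # widen the gap between the two writes
        cache["ids"]["id1"] = "base"


def unlocked_reader():
    # Mirrors the top of find(): no lock is taken, so this thread can
    # observe "dirs" populated while "ids" is still empty.
    while "base" not in cache["dirs"]:
        time.sleep(0.001)
    observed_partial.append("id1" not in cache["ids"])


r = threading.Thread(target=unlocked_reader)
w = threading.Thread(target=writer)
r.start()
w.start()
w.join()
r.join()
```

Here observed_partial ends up [True]: the unlocked reader sees the half-finished update even though the writer never released the lock in between.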

@skshetry (Contributor, Author)

That can read self._ids_cache["dirs"] while self._ids_cache["ids"] is still empty.

But if we synchronize the updates, there is no possibility of having dirs but not ids; they are updated at once. Or maybe I am missing something here?

@shcheklein (Member)

But if we synchronized updates, there is not going to be a possibility of having dirs but not ids. They are updated at once

That's my point / concern. I don't think there is a guarantee for that. There are a lot of operations involved in updating all these maps, reading them back, etc., and none of them is atomic. You have to put both readers and writers under the lock, or go partially lockless (and there should be a good reason for that); in the latter case extra care is required.
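A sketch of what putting the reader under the same lock would look like (hypothetical names, not the actual PyDrive2 code):

```python
import threading


class SafeCache:
    """Toy cache where both readers and writers take the same lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self._ids_cache = {"dirs": {}, "ids": {}}

    def update(self, path, item_id):
        with self._lock:
            self._ids_cache["dirs"].setdefault(path, []).append(item_id)
            self._ids_cache["ids"][item_id] = path

    def lookup(self, base):
        # The reader holds the same lock as the writer, so it can never
        # observe "dirs" populated while "ids" lacks the matching entry.
        with self._lock:
            if base not in self._ids_cache["dirs"]:
                return None
            return [self._ids_cache["ids"][i]
                    for i in self._ids_cache["dirs"][base]]
```

The trade-off raised later in the thread applies: every lookup now contends on the lock, which is the performance concern with fully locked access.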

@shcheklein (Member)

This is guaranteed by the lock in this function,

Sorry, we are going a bit in circles, but could you elaborate on / confirm this? (I guess that's the main thing to discuss here.) The way I see it (granted, I'm not deeply familiar with CPython and haven't looked at it yet), there is no guarantee for that.

There is a guarantee that no two modifications to the cache happen simultaneously. There is no guarantee that, while someone is updating it, a reader cannot read and get a partial update.

@skshetry (Contributor, Author)

Okay, I now see what you are saying. I’ll have to think about it more.

@skshetry closed this Jun 26, 2023
@skshetry deleted the rlock-cche-path-id branch June 26, 2023 08:59
@skshetry (Contributor, Author)

Closing for now.

@shcheklein (Member)

Hey @skshetry, do we need at least a ticket for this, or do you plan to address it? I think I can do a PR to hard-code certain roots to make DVC stable, and we can get back to a proper general implementation later. Wdyt?

@shcheklein (Member)

@skshetry friendly ping.

@skshetry (Contributor, Author) commented Jul 9, 2023

@shcheklein, I have to wrap all reads and writes with the RLock. I'll try to take a look this week.

I think I can do a PR to hard code certain roots to make DVC stable and we can get back to a proper general implementation later, wdyt

I don't think it's worth it to do this. It's 2 API calls that we are saving.

@shcheklein (Member)

@skshetry sounds good.

I don't think it's worth it to do this. It's 2 API calls that we are saving.

I'm not worried about 2 API calls per session. I'm more worried about lock performance and general complexity of code.

As a first step, please create a ticket for this, for visibility and so that we don't forget about it.
