fs: synchronize updates to cache #289

Closed · wants to merge 1 commit

Conversation

@skshetry (Contributor) commented Jun 23, 2023

https://github.com/iterative/PyDrive2/pull/286/files/b9fc857613ee18cd08e32c691933a82f5f371e79#r1239272908

Looks like we need to synchronize cache updates, if we expect dirs and ids to be in sync.

@skshetry requested a review from shcheklein June 23, 2023
@skshetry temporarily deployed to internal June 23, 2023 with GitHub Actions
@skshetry mentioned this pull request Jun 23, 2023
@shcheklein (Member)

Thanks @skshetry. I don't think this solves the race condition issue, though (unless RLock guarantees atomicity, which would surprise me, and which would make me concerned about performance). Yes, the update becomes thread safe (and, as we agreed before, that might not even be an issue: we don't need to isolate and serialize "writers", or at least doing so is not enough). You still have an unprotected reader in find, which can get different sets from two subsequent calls.

We can expand the cache init to include necessary roots there (file/md5/00 .... etc) the same way it initializes 00 etc now.

@skshetry (Contributor, Author)

In this case you still have an unprotected reader in the find, which still can get different sets from two subsequent calls.

I think that is okay. I don't think we want to optimize this too much (the worst case is parallel calls when the cache is empty). From dvc's point of view, this should be fine.

We can expand the cache init to include necessary roots there (file/md5/00 .... etc) the same way it initializes 00 etc now.

I'd say this should be fixed on the dvc side by using a separate filesystem for the files/md5 remote. Then those files will automatically get cached.


I would be concerned about the performance then

If you look at _ids_cache, it is behind an RLock, so the intention seems to have been to protect reads/writes all along; we are already locking every access to the cache. It's just that that isn't enough here, because two writes are involved.

@shcheklein (Member)

Could you clarify what this RLock does for you, then? We might be thinking about different scenarios.

In my case I mean this:

        cached = base in self._ids_cache["dirs"]
...
        dir_ids = [self._ids_cache["ids"].copy()]

It can return cached = True while dir_ids is still empty, AFAIU.

That means the result of find might be empty in a case where it should not be.

Are we both talking about the same case?

@skshetry (Contributor, Author) commented Jun 24, 2023

@shcheklein, yes, that is what RLock ensures.

cache["dirs"][path].append(item_id)
cache["ids"][item_id] = path

If base is in self._ids_cache['dirs'], there must be a corresponding entry in self._ids_cache['ids'] for that base too.
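For illustration, here is a minimal sketch of the synchronized two-map update being discussed (class and method names are hypothetical, not PyDrive2's actual API; only the two-writes-under-one-lock shape is taken from the thread):

```python
import threading


class ToyCache:
    """Toy version of the path/id cache with a synchronized writer."""

    def __init__(self):
        self._lock = threading.RLock()
        self._ids_cache = {"dirs": {}, "ids": {}}

    def _cache_path_id(self, path, item_id):
        # Both maps are updated while holding the lock, so no other
        # *writer* can interleave between the two assignments.
        with self._lock:
            self._ids_cache["dirs"].setdefault(path, []).append(item_id)
            self._ids_cache["ids"][item_id] = path
```

With respect to other writers taking the same lock, the two maps can never diverge; the open question in the thread is what happens to readers that do not take the lock.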

@shcheklein (Member)

If there is base in self._ids_cache['dirs'], there must be self._ids_cache['ids'] for that base too.

Could you elaborate on this? As far as I can tell, that holds only if Python never yields control to another thread while it's inside the RLock, but AFAIK that is not the case.

@skshetry (Contributor, Author) commented Jun 24, 2023

Could you elaborate on this? As far as I can tell, that holds only if Python never yields control to another thread while it's inside the RLock, but AFAIK that is not the case.

The other thread will be blocked and won't enter this function until the owning thread releases the lock.

From docs on RLock/Lock:

... if this thread already owns the lock, increment the recursion level by one, and return immediately. Otherwise, if another thread owns the lock, block until the lock is unlocked. Once the lock is unlocked (not owned by any thread), then grab ownership, set the recursion level to one, and return. If more than one thread is blocked waiting until the lock is unlocked, only one at a time will be able to grab ownership of the lock. There is no return value in this case.
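The quoted reentrancy semantics can be shown with a tiny, purely illustrative snippet:

```python
import threading

rlock = threading.RLock()

# The same thread may acquire an RLock it already holds; each acquire
# increments the recursion level and each release decrements it.
with rlock:
    with rlock:  # a plain threading.Lock would deadlock here
        reentered = True
```

Note that this only serializes threads that contend on the lock; it says nothing about code that touches the shared state without acquiring it.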

@shcheklein (Member)

The other thread will be blocked and won't enter this function, until the owning thread releases the lock.

In the snippet that I've shared, the other thread is not using that function:

        cached = base in self._ids_cache["dirs"]
...
        dir_ids = [self._ids_cache["ids"].copy()]

So, even if there is a thread A that is doing an update under the RLock, and a thread B that is waiting to do that update, there could be a thread C that is executing the very top of find. When thread A is in the middle of the update (having just updated dirs), it can be forced to yield control to thread C, which can then read self._ids_cache["dirs"] while self._ids_cache["ids"] is still empty.
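The scenario can be reproduced with a toy sketch (all names hypothetical; the sleep only widens a window that exists anyway, since the GIL can switch threads between the two dictionary writes):

```python
import threading
import time

lock = threading.RLock()
cache = {"dirs": {}, "ids": {}}
observed_partial = []


def writer():
    # Writer holds the lock across both writes, like the PR's update path.
    with lock:
        cache["dirs"]["base"] = ["id1"]
        time.sleep(0.05)  # widen the gap between the two writes
        cache["ids"]["id1"] = "base"


def unlocked_reader():
    # Mirrors the top of find(): no lock is taken, so this thread can
    # observe "dirs" populated while "ids" is still empty.
    while "base" not in cache["dirs"]:
        time.sleep(0.001)
    observed_partial.append("id1" not in cache["ids"])


r = threading.Thread(target=unlocked_reader)
w = threading.Thread(target=writer)
r.start()
w.start()
w.join()
r.join()
```

Here observed_partial ends up [True]: the unlocked reader sees the half-finished update even though the writer never released the lock in between.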

@skshetry (Contributor, Author)

That can read self._ids_cache["dirs"] while self._ids_cache["ids"] is still empty.

But if we synchronize the updates, there is no possibility of having dirs but not ids; they are updated at once. Or maybe I am missing something here?

@shcheklein (Member)

But if we synchronized updates, there is not going to be a possibility of having dirs but not ids. They are updated at once

That's my point / concern. I don't think there is a guarantee for that. There are a lot of operations involved in updating all these maps, reading them back, etc., and none of them is atomic. You have to put both readers and writers under the lock, or go partially lockless (and there should be a good reason for that); in the latter case extra care is required.
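A sketch of what putting the reader under the same lock would look like (hypothetical names, not the actual PyDrive2 code):

```python
import threading


class SafeCache:
    """Toy cache where both readers and writers take the same lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self._ids_cache = {"dirs": {}, "ids": {}}

    def update(self, path, item_id):
        with self._lock:
            self._ids_cache["dirs"].setdefault(path, []).append(item_id)
            self._ids_cache["ids"][item_id] = path

    def lookup(self, base):
        # The reader holds the same lock as the writer, so it can never
        # observe "dirs" populated while "ids" lacks the matching entry.
        with self._lock:
            if base not in self._ids_cache["dirs"]:
                return None
            return [self._ids_cache["ids"][i]
                    for i in self._ids_cache["dirs"][base]]
```

The trade-off raised later in the thread applies: every lookup now contends on the lock, which is the performance concern with fully locked access.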

@shcheklein (Member)

This is guaranteed by the lock in this function,

Sorry, we are going a bit in circles, but could you elaborate on / confirm this? (I guess that's the main thing to discuss here.) The way I see it (granted, I'm not deeply familiar with CPython and haven't looked at it yet), there is no guarantee for that.

There is a guarantee that no two modifications to the cache happen simultaneously. There is no guarantee that, while someone is updating it, a reader cannot read and get a partial update.

@skshetry (Contributor, Author)

Okay, I now see what you are saying. I’ll have to think about it more.

@skshetry closed this Jun 26, 2023
@skshetry deleted the rlock-cche-path-id branch June 26, 2023 08:59
@skshetry (Contributor, Author)

Closing for now.

@shcheklein (Member)

Hey @skshetry, do we need at least a ticket for this, or do you plan to address it? I think I can do a PR to hard-code certain roots to make DVC stable, and we can get back to a proper general implementation later. Wdyt?

@shcheklein (Member)

@skshetry friendly ping.

@skshetry (Contributor, Author) commented Jul 9, 2023

@shcheklein, I have to wrap all reads and writes with the RLock. I'll try to take a look this week.

I think I can do a PR to hard code certain roots to make DVC stable and we can get back to a proper general implementation later, wdyt

I don't think it's worth it to do this. It's 2 API calls that we are saving.

@shcheklein (Member)

@skshetry sounds good.

I don't think it's worth it to do this. It's 2 API calls that we are saving.

I'm not worried about 2 API calls per session. I'm more worried about lock performance and general complexity of code.

As a first step, please create a ticket for this, for visibility and so that we don't forget about it.
