fs: implement a fsspec-based filesystem backend #119


Merged

merged 6 commits into master from fsspec on Aug 9, 2021

Conversation

@isidentical (Contributor) commented Aug 3, 2021

This basically moves the existing functionality from DVC into PyDrive2. There weren't any major changes, besides:

  • Opening files in w mode is now supported, but we first collect everything in a buffer and only then write, so the functionality should be used carefully. It was mainly needed for the tests.
  • Implementation of cp_file(), which basically does an upload_fobj() to copy. The move() method is now implemented through cp + rm.
  • A couple of places used to pass size as a string; those values are now cast to integers.
  • info() results now include a checksum field.
  • get_file/put_file APIs now support fsspec.Callbacks.

Other changes (those I can remember):

  • Paths are now strings
  • ls()/find() now return lists instead of generators (see the usage sketch below)

Fixes #113
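
A rough usage sketch of the new backend (hedged: the auth flow, base path form, and file names below are illustrative assumptions, not taken from this PR):

from pydrive2.auth import GoogleAuth
from pydrive2.fs import GDriveFileSystem

auth = GoogleAuth()
auth.LocalWebserverAuth()  # any completed PyDrive2 auth flow works here

fs = GDriveFileSystem("root", auth)  # "root" as the base path is an assumption

print(fs.ls("root"))  # now returns a list of string paths

with fs.open("root/hello.txt", "wb") as f:  # "w" modes buffer and upload on close
    f.write(b"hello!")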

@isidentical isidentical changed the title fs: implement a fsspec-based filesystem backend [WIP] fs: implement a fsspec-based filesystem backend Aug 3, 2021
@efiop (Contributor) commented Aug 3, 2021

Oh, I totally forgot that PyDrive2 is still using Travis 🙁 No wonder tests are not running in this PR.

@efiop (Contributor) commented Aug 3, 2021

@isidentical Would you be willing to convert CI to GitHub Actions? Maybe as a separate prerequisite PR. It seems like the effort would be equal to or lower than resurrecting Travis.

@isidentical (Contributor, Author) commented Aug 3, 2021 via email

pydrive2/fs.py Outdated
        else:
            return parts[0], ""

    @wrap_prop(threading.RLock())
Member:

Not sure if we have that case here, but if multiple threads are reading this, it would be beneficial to use a write lock that still allows multiple readers (a readers-writer lock) once the cache is already populated. Not sure though if Python has one.
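
For reference, a minimal readers-writer lock can be sketched in pure Python (the stdlib has no built-in one; the class and method names here are made up):

import threading

class RWLock:
    """Many concurrent readers, one exclusive writer."""

    def __init__(self):
        self._readers = 0
        self._counter_lock = threading.Lock()  # guards the reader count
        self._write_lock = threading.Lock()    # held while writing

    def acquire_read(self):
        with self._counter_lock:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()  # first reader blocks writers

    def release_read(self):
        with self._counter_lock:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()  # last reader unblocks writers

    def acquire_write(self):
        self._write_lock.acquire()

    def release_write(self):
        self._write_lock.release()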

pydrive2/fs.py Outdated
    def flush(self):
        self.buffer.flush()
        try:
            self.fs.upload_fobj(self.buffer, self.path)
Member:

Just to double check: we need this to be append semantics, right? Does it work this way?

Contributor (Author):

we need this to be append semantics, right? does it work this way?

No. This is just a simple wrapper around upload_fobj() that buffers all your write() calls and dispatches them at once. Even after flush() we forcefully close the file, so you cannot make any further changes that might lead you to expect append()-like semantics.
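
In other words, the wrapper behaves roughly like this sketch (simplified and hypothetical; only the flush() body above is from the actual diff, the rest is an assumption):

import io

class UploadOnCloseFile:
    """Buffer all write() calls in memory; upload once, on flush()."""

    def __init__(self, fs, path):
        self.fs = fs          # filesystem providing upload_fobj()
        self.path = path
        self.buffer = io.BytesIO()

    def write(self, data):
        # raises ValueError once the buffer has been closed
        return self.buffer.write(data)

    def flush(self):
        self.buffer.flush()   # raises "I/O operation on closed file" on reuse
        self.buffer.seek(0)   # rewind so upload_fobj() reads from the start
        try:
            self.fs.upload_fobj(self.buffer, self.path)
        finally:
            self.buffer.close()  # one-shot: further write()/flush() raise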

Member:

I see. A few questions here:

  • should we test self._closed in write() and raise if it's True? It's not very common behavior to close on flush, right?
  • self.buffer.flush(): do we need this?
  • what happens with gdrive_retry? Do we handle this at all in the place where we depend on it? (retries are usually important for tests to be stable)

Contributor (Author):

should we test self._closed in write() and raise if it's True? It's not very common behavior to close on flush, right?

buffer.flush() actually checks for that, so that we don't have to manage individual states

self.buffer.flush(): do we need this?

answered above

what happens with gdrive_retry? Do we handle this at all in the place where we depend on it? (retries are usually important for tests to be stable)

upload_fobj() is retried.

Member:

buffer.flush() actually checks for that,

Could you clarify please? (I'm not sure we are on the same page :) )

upload_fobj() is retried.

yep, I see, gdrive_upload_fobj is retried inside the upload_fobj

Contributor (Author):

Could you clarify please? (I'm not sure we are on the same page :) )

When we close our wrapper file (via .close()), we also close the buffer (we call buffer.close()). And if we try to call buffer.flush() after we close the file (and, due to that, the buffer itself), it will raise the proper `I/O operation on closed file` error.

Member:

Hmm ... it looks like I am still missing something, sorry :)

My concern was about the following workflow:

writer.flush() # it closes itself which is not expected (usually)
writer.write("more stuff") # how does it behave now? should we signal that we are trying to write to a closed object?

Contributor (Author):

You can't do that with the current approach. We only allow flush()ing once, and after the flush() we close the file. That is a bit unorthodox, but it was the easiest route to emulate the behavior: generally you only flush once, during close() itself, so you don't actually need the file to stay open for the rest of the flow.

Member:

Yep, so going back to the initial question: "should we signal that we are trying to write to a closed object?"

Contributor (Author):

I think the answer is that we already do, at the first line of this function:

>>> import io
>>> buf = io.BytesIO()
>>> buf.write(b"hey") # first write data
3
>>> buf.flush(); buf.close() # then flush + close the file
>>> buf.flush() # try flushing again
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.

When you write stuff and then call flush() on our wrapper (which also calls close(), as I stated), the next flush is automatically going to signal that, since its first line is the buffer's flush(), which raises the appropriate error.

@efiop (Contributor) commented Aug 6, 2021

@Mergifyio rebase

mergify bot commented Aug 6, 2021

Command rebase: failure

Base branch update has failed
Git reported the following error:

Rebasing (1/4)

error: could not apply 3be2a30... fs: implement a fsspec-based filesystem backend

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 3be2a30... fs: implement a fsspec-based filesystem backend
CONFLICT (modify/delete): scripts/ci/install.sh deleted in HEAD and modified in 3be2a30 (fs: implement a fsspec-based filesystem backend). Version 3be2a30 (fs: implement a fsspec-based filesystem backend) of scripts/ci/install.sh left in tree.


err-code: DD704

@efiop (Contributor) commented Aug 6, 2021

@Mergifyio rebase

mergify bot commented Aug 6, 2021

Command rebase: failure

Pull request can't be updated with latest base branch changes
GitHub App like Mergify are not allowed to rebase pull request where .github/workflows is changed.
This pull request must be rebased manually.
err-code: 83DB3

@isidentical isidentical changed the title [WIP] fs: implement a fsspec-based filesystem backend fs: implement a fsspec-based filesystem backend Aug 9, 2021
@isidentical isidentical requested a review from shcheklein August 9, 2021 11:17
@isidentical isidentical requested a review from efiop August 9, 2021 11:17
@shcheklein (Member) left a comment:

One minor discussion that is still happening, not a blocker

@efiop efiop merged commit 0d6d6d7 into master Aug 9, 2021
@efiop efiop deleted the fsspec branch August 9, 2021 21:18
Comment on lines +360 to +365
with self.open(lpath) as stream:
    # IterStream objects don't support full-length
    # seek() calls, so we have to wrap the data with
    # an external buffer.
    buffer = io.BytesIO(stream.read())
    self.upload_fobj(buffer, rpath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this. This will fail if a file is bigger than the system RAM.

Suggested change

-with self.open(lpath) as stream:
-    # IterStream objects don't support full-length
-    # seek() calls, so we have to wrap the data with
-    # an external buffer.
-    buffer = io.BytesIO(stream.read())
-    self.upload_fobj(buffer, rpath)
+with self.open(lpath) as stream:
+    self.upload_fobj(stream, rpath)

and fix self.open to return a proper stream.
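
Alternatively, if the upload machinery genuinely needs a seekable object, the data could be spooled to disk past a size threshold instead of being held in RAM. A sketch of that idea (not a concrete proposal for this PR; the function name and the 50 MB threshold are arbitrary):

import shutil
import tempfile

def copy_via_spool(fs, lpath, rpath, max_memory=50 * 1024 * 1024):
    """Copy lpath to rpath without holding the whole file in RAM:
    data beyond max_memory is transparently spilled to a temporary
    file on disk, and the result is still seekable for the upload."""
    with fs.open(lpath) as stream:
        with tempfile.SpooledTemporaryFile(max_size=max_memory) as buffer:
            shutil.copyfileobj(stream, buffer)
            buffer.seek(0)
            fs.upload_fobj(buffer, rpath)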
