
TarVFile refactor #446

Closed
wants to merge 8 commits

Conversation

@dberenbaum (Collaborator)

Extracted from #441.

Refactored the TarVFile and webdataset functionality to work with the high-level File API and dropped the old functionality. This has a lot of overlap with #440 (Add datachain.lib.tar.process_tar() generator) and #438 (remove legacy udf decorator).

cloudflare-workers-and-pages bot commented Sep 13, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 7498e6b
Status: ✅ Deploy successful!
Preview URL: https://0abf92b5.datachain-documentation.pages.dev
Branch Preview URL: https://tarvfile.datachain-documentation.pages.dev


codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.04%. Comparing base (a925653) to head (7498e6b).
Report is 30 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #446      +/-   ##
==========================================
+ Coverage   86.80%   87.04%   +0.24%     
==========================================
  Files          93       92       -1     
  Lines        9800     9681     -119     
  Branches     2027     1995      -32     
==========================================
- Hits         8507     8427      -80     
+ Misses        936      906      -30     
+ Partials      357      348       -9     
Flag        Coverage Δ
datachain   86.97% <96.87%> (+0.22%) ⬆️


@dberenbaum dberenbaum marked this pull request as ready for review September 13, 2024 20:17
@dberenbaum dberenbaum closed this Sep 16, 2024
@dberenbaum dberenbaum reopened this Sep 16, 2024
@mattseddon (Member) left a comment:
looks fine to me, but I am an idiot

@rlamy (Contributor) left a comment:
I see several issues with this PR:

  • Removing FileSlice means that we lose the ability to perform random reads inside a tar archive. To read a single image inside a webdataset file, we now need to get the whole file.
  • We lose the ability to merge files and v-files in the same dataset - in fact, we're effectively doing away with the whole idea of v-files.
  • (less important) test_dir_expansion() has been quite useful in the past to debug issues with the "Unix commands".

I'm not necessarily against these changes, but they're quite significant and we need to explore the trade-offs. @dberenbaum, what issues do you see with the current TarVFile and how do you think we should implement new kinds of v-objects (e.g. image patches/bounding boxes)?

@dberenbaum (Collaborator, Author) commented Sep 23, 2024

@rlamy Yes, it is up to you if you think it's worth merging. You could also consider taking parts of it now or in the future.

In response to your specific concerns:

  • Removing FileSlice means that we lose the ability to perform random reads inside a tar archive. To read a single image inside a webdataset file, we now need to get the whole file.

I wonder how much performance actually suffers here; it doesn't appear much worse in the examples. I did try to optimize this a bit so that it streams the file and skips over each entry until hitting the requested image. The advantage is that it is simpler and relies on tarfile, which also means it will support things like compression. Supporting other archive types could probably be achieved in a similar way.
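The skip-ahead streaming approach described above can be sketched roughly like this with the stdlib tarfile module (this is an illustration of the idea, not the PR's actual implementation; `read_member_streaming` is a hypothetical name):

```python
import io
import tarfile

def read_member_streaming(fileobj, member_name):
    """Stream through a tar archive and return the bytes of one member.

    Mode "r|*" reads the archive sequentially (no seeking backwards) and
    transparently handles compression; entries before the requested one
    are skipped without being extracted.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.name == member_name:
                extracted = tar.extractfile(member)
                return extracted.read() if extracted else None
    return None

# Build a small in-memory archive to demonstrate.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"pretend this is a PNG"
    info = tarfile.TarInfo(name="images/cat.png")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf.seek(0)
print(read_member_streaming(buf, "images/cat.png"))
```

Because the stream mode never seeks backwards, this also works over non-seekable sources, at the cost of reading past every earlier entry's header.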

However, it is possible to drop this part of the PR and bring back FileSlice along with a TarVFile.offset attribute. The rest of the PR could remain the same.
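For contrast, the FileSlice alternative mentioned here amounts to a read-only file-like view over a byte range of the parent file, located by a stored offset. A minimal sketch (illustrative only, not the datachain implementation):

```python
import io

class FileSlice(io.RawIOBase):
    """Read-only view over a region of a parent file object.

    Sketch of how a FileSlice plus a stored offset enables random reads
    inside an archive without scanning it from the start.
    """

    def __init__(self, fileobj, offset, size):
        self.fileobj = fileobj  # seekable parent file
        self.offset = offset    # where the member's bytes begin
        self.size = size        # length of the member
        self.pos = 0            # position within the slice

    def read(self, n=-1):
        # Clamp the request to the slice boundary, then read from the parent.
        if n < 0 or n > self.size - self.pos:
            n = self.size - self.pos
        self.fileobj.seek(self.offset + self.pos)
        data = self.fileobj.read(n)
        self.pos += len(data)
        return data

parent = io.BytesIO(b"HEADER<image bytes>FOOTER")
sl = FileSlice(parent, offset=6, size=13)
print(sl.read())  # the slice sees only its own region
```

The trade-off versus the streaming approach is that the offset must be recorded at indexing time, and it does not compose with compressed archives, where member byte offsets are not directly seekable.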

  • We lose the ability to merge files and v-files in the same dataset - in fact, we're effectively doing away with the whole idea of v-files.

Yes, I guess that's the primary benefit and drawback of the PR besides dropping a lot of code. Now that there is mature support for different File and DataModel classes, is VFile needed? IMHO the approach in this PR is more straightforward, and it will be easier to implement new v-objects and document how to do this since it only requires a new File-like class.
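The "new File-like class" pattern can be illustrated with plain dataclasses (all names and fields here are hypothetical stand-ins, not datachain's actual models):

```python
from dataclasses import dataclass

@dataclass
class File:
    """Stand-in for a high-level File model (hypothetical fields)."""
    source: str
    path: str
    size: int = 0

@dataclass
class TarVFile(File):
    """A v-object as a File subclass: it only adds the fields needed to
    locate the member inside its parent archive."""
    parent_path: str = ""
    offset: int = 0

member = TarVFile(source="s3://bucket", path="images/cat.png",
                  size=2048, parent_path="shard-000.tar", offset=512)
print(member.parent_path, member.offset)
```

Under this scheme, documenting a new v-object type reduces to "subclass File and add your locator fields", rather than threading extra state through a separate `location` mechanism.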

As you note, the drawback is the inability to merge files and v-files, since they are two different DataModel classes. Is it useful to keep the raw tarfile records alongside the extracted objects? If so, is supporting two related classes in a dataset column something that needs to be supported generally?

@dberenbaum (Collaborator, Author) commented:

Just remembered another benefit to this PR: reading image files. The current wds examples won't work for model training since they don't read the file contents as images.

@@ -206,18 +133,13 @@ def parent(self):
@contextmanager
def open(self, mode: Literal["rb", "r"] = "rb") -> Iterator[Any]:
"""Open the file and return a file object."""
if self.location:
Member:
does it mean we can drop location field after this PR?

Collaborator Author:
Yup, it looks like we could. I opened #487 for that change, although since it's more "breaking," I'm not sure if/when it makes sense to merge.

@shcheklein (Member) commented:

I like the simplification and that we are relying on the existing abstractions (and optimization can be done on top even in this PR - model itself can stay simpler).

We lose the ability to merge files and v-files in the same dataset

@rlamy could you clarify please? what is the scenario / use case for this?

@dberenbaum dberenbaum mentioned this pull request Sep 30, 2024
@rlamy (Contributor) commented Oct 2, 2024

We lose the ability to merge files and v-files in the same dataset

@rlamy could you clarify please? what is the scenario / use case for this?

An example would look like:

base = DataChain.from_storage("s3://webdataset-stuff").gen(...)
extra = DataChain.from_storage("gs://raw-images").filter(C.path.glob("*.png")).map(...)
combined = (base | extra).filter(...)

I agree that using .location is clumsy, but I think that the ability to combine data from different sources without copies is essential functionality.

Now, .location is actually a work-around for the fact that we only support unions when the data models are exactly the same. So if we want to move on with this PR, I think we need to find a way to make DataChain.union() compatible with model inheritance (i.e. the schema of the union is the common base of the schemas, calling collect() on the union returns the original objects, etc.)
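The inheritance-aware union described above can be sketched with plain dataclasses (a conceptual illustration only; names are hypothetical and this is not DataChain's union implementation):

```python
from dataclasses import dataclass, fields

@dataclass
class File:
    path: str
    size: int

@dataclass
class TarVFile(File):
    offset: int

def union(rows_a, rows_b):
    """Inheritance-aware union sketch: the combined schema is that of the
    common base class, but the original objects are preserved, so
    iterating the union yields them unchanged (with their subclass
    fields intact). The common base is hardcoded here for simplicity."""
    base = File
    schema = [f.name for f in fields(base)]
    return schema, list(rows_a) + list(rows_b)

schema, rows = union(
    [File("raw/cat.png", 2048)],
    [TarVFile("shard-000.tar/cat.png", 2048, offset=512)],
)
print(schema)                  # columns of the common base schema
print(type(rows[1]).__name__)  # original subclass object survives
```

The key property is the last line: the union's schema narrows to `File`, but collecting still returns the original `TarVFile` instances, which is exactly the behavior rlamy asks for.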

@dberenbaum (Collaborator, Author) commented:

Closing for now since I don't see it progressing.

@dberenbaum dberenbaum closed this Oct 15, 2024