
TarVFile refactor #446

Closed
wants to merge 8 commits

Conversation

@dberenbaum (Collaborator)

Extracted from #441.

Refactored the TarVFile and webdataset functionality to work with the high-level File API and dropped the old functionality. This has a lot of overlap with #440 (Add datachain.lib.tar.process_tar() generator) and #438 (remove legacy udf decorator).

cloudflare-workers-and-pages bot commented Sep 13, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 7498e6b
Status: ✅ Deploy successful!
Preview URL: https://0abf92b5.datachain-documentation.pages.dev
Branch Preview URL: https://tarvfile.datachain-documentation.pages.dev


codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.04%. Comparing base (a925653) to head (7498e6b).
Report is 30 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #446      +/-   ##
==========================================
+ Coverage   86.80%   87.04%   +0.24%     
==========================================
  Files          93       92       -1     
  Lines        9800     9681     -119     
  Branches     2027     1995      -32     
==========================================
- Hits         8507     8427      -80     
+ Misses        936      906      -30     
+ Partials      357      348       -9     
Flag        Coverage Δ
datachain   86.97% <96.87%> (+0.22%) ⬆️


@dberenbaum dberenbaum marked this pull request as ready for review September 13, 2024 20:17
@dberenbaum dberenbaum closed this Sep 16, 2024
@dberenbaum dberenbaum reopened this Sep 16, 2024
@mattseddon (Member) left a comment:
looks fine to me, but I am an idiot

@rlamy (Contributor) left a comment:
I see several issues with this PR:

  • Removing FileSlice means that we lose the ability to perform random reads inside a tar archive. To read a single image inside a webdataset file, we now need to get the whole file.
  • We lose the ability to merge files and v-files in the same dataset - in fact, we're effectively doing away with the whole idea of v-files.
  • (less important) test_dir_expansion() has been quite useful in the past to debug issues with the "Unix commands".

I'm not necessarily against these changes, but they're quite significant and we need to explore the trade-offs. @dberenbaum, what issues do you see with the current TarVFile and how do you think we should implement new kinds of v-objects (e.g. image patches/bounding boxes)?

@dberenbaum (Collaborator, Author) commented Sep 23, 2024

@rlamy Yes, it is up to you if you think it's worth merging. You could also consider taking parts of it now or in the future.

In response to your specific concerns:

  • Removing FileSlice means that we lose the ability to perform random reads inside a tar archive. To read a single image inside a webdataset file, we now need to get the whole file.

I wonder how much performance actually suffers here; it doesn't appear much worse in the examples. I did try to optimize this a bit so that it streams the file and skips over each entry until hitting the requested image. The advantage is that it is simpler and relies on tarfile, which also means it will support things like compression. Supporting other archive types could probably be achieved in a similar way.
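The skip-ahead streaming approach described above can be sketched roughly like this with the stdlib tarfile module (this is an illustration of the idea, not the PR's actual implementation; `read_member_streaming` is a hypothetical name):

```python
import io
import tarfile

def read_member_streaming(fileobj, member_name):
    """Stream through a tar archive and return the bytes of one member.

    Mode "r|*" reads the archive sequentially (no seeking backwards) and
    transparently handles compression; entries before the requested one
    are skipped without being extracted.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.name == member_name:
                extracted = tar.extractfile(member)
                return extracted.read() if extracted else None
    return None

# Build a small in-memory archive to demonstrate.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"pretend this is a PNG"
    info = tarfile.TarInfo(name="images/cat.png")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf.seek(0)
print(read_member_streaming(buf, "images/cat.png"))
```

Because the stream mode never seeks backwards, this also works over non-seekable sources, at the cost of reading past every earlier entry's header.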

However, it is possible to drop this part of the PR and bring back FileSlice along with a TarVFile.offset attribute. The rest of the PR could remain the same.
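For contrast, the FileSlice alternative mentioned here amounts to a read-only file-like view over a byte range of the parent file, located by a stored offset. A minimal sketch (illustrative only, not the datachain implementation):

```python
import io

class FileSlice(io.RawIOBase):
    """Read-only view over a region of a parent file object.

    Sketch of how a FileSlice plus a stored offset enables random reads
    inside an archive without scanning it from the start.
    """

    def __init__(self, fileobj, offset, size):
        self.fileobj = fileobj  # seekable parent file
        self.offset = offset    # where the member's bytes begin
        self.size = size        # length of the member
        self.pos = 0            # position within the slice

    def read(self, n=-1):
        # Clamp the request to the slice boundary, then read from the parent.
        if n < 0 or n > self.size - self.pos:
            n = self.size - self.pos
        self.fileobj.seek(self.offset + self.pos)
        data = self.fileobj.read(n)
        self.pos += len(data)
        return data

parent = io.BytesIO(b"HEADER<image bytes>FOOTER")
sl = FileSlice(parent, offset=6, size=13)
print(sl.read())  # the slice sees only its own region
```

The trade-off versus the streaming approach is that the offset must be recorded at indexing time, and it does not compose with compressed archives, where member byte offsets are not directly seekable.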

  • We lose the ability to merge files and v-files in the same dataset - in fact, we're effectively doing away with the whole idea of v-files.

Yes, I guess that's the primary benefit and drawback of the PR besides dropping a lot of code. Now that there is mature support for different File and DataModel classes, is VFile needed? IMHO the approach in this PR is more straightforward, and it will be easier to implement new v-objects and document how to do this since it only requires a new File-like class.
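The "new File-like class" pattern can be illustrated with plain dataclasses (all names and fields here are hypothetical stand-ins, not datachain's actual models):

```python
from dataclasses import dataclass

@dataclass
class File:
    """Stand-in for a high-level File model (hypothetical fields)."""
    source: str
    path: str
    size: int = 0

@dataclass
class TarVFile(File):
    """A v-object as a File subclass: it only adds the fields needed to
    locate the member inside its parent archive."""
    parent_path: str = ""
    offset: int = 0

member = TarVFile(source="s3://bucket", path="images/cat.png",
                  size=2048, parent_path="shard-000.tar", offset=512)
print(member.parent_path, member.offset)
```

Under this scheme, documenting a new v-object type reduces to "subclass File and add your locator fields", rather than threading extra state through a separate `location` mechanism.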

As you note, the drawback is the inability to merge files and v-files, since they are two different DataModel classes. Is it useful to keep the raw tarfile records alongside the extracted objects? If so, is supporting two related classes in a dataset column something that needs to be supported generally?

@dberenbaum (Collaborator, Author) commented:

Just remembered another benefit to this PR: reading image files. The current wds examples won't work for model training since they don't read the file contents as images.

@@ -206,18 +133,13 @@ def parent(self):
@contextmanager
def open(self, mode: Literal["rb", "r"] = "rb") -> Iterator[Any]:
"""Open the file and return a file object."""
if self.location:
Member:
does it mean we can drop location field after this PR?

Collaborator Author:
Yup, it looks like we could. I opened #487 for that change, although since it's more "breaking," I'm not sure if/when it makes sense to merge.

@shcheklein (Member) commented:

I like the simplification and that we are relying on the existing abstractions (and optimization can be done on top even in this PR - model itself can stay simpler).

We lose the ability to merge files and v-files in the same dataset

@rlamy could you clarify please? what is the scenario / use case for this?

@dberenbaum dberenbaum mentioned this pull request Sep 30, 2024
@rlamy (Contributor) commented Oct 2, 2024

We lose the ability to merge files and v-files in the same dataset

@rlamy could you clarify please? what is the scenario / use case for this?

An example would look like:

base = DataChain.from_storage("s3://webdataset-stuff").gen(...)
extra = DataChain.from_storage("gs://raw-images").filter(C.path.glob("*.png")).map(...)
combined = (base | extra).filter(...)

I agree that using .location is clumsy, but I think that the ability to combine data from different sources without copies is essential functionality.

Now, .location is actually a work-around for the fact that we only support unions when the data models are exactly the same. So if we want to move on with this PR, I think we need to find a way to make DataChain.union() compatible with model inheritance (i.e. the schema of the union is the common base of the schemas, calling collect() on the union returns the original objects, etc.)
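The inheritance-aware union described above can be sketched with plain dataclasses (a conceptual illustration only; names are hypothetical and this is not DataChain's union implementation):

```python
from dataclasses import dataclass, fields

@dataclass
class File:
    path: str
    size: int

@dataclass
class TarVFile(File):
    offset: int

def union(rows_a, rows_b):
    """Inheritance-aware union sketch: the combined schema is that of the
    common base class, but the original objects are preserved, so
    iterating the union yields them unchanged (with their subclass
    fields intact). The common base is hardcoded here for simplicity."""
    base = File
    schema = [f.name for f in fields(base)]
    return schema, list(rows_a) + list(rows_b)

schema, rows = union(
    [File("raw/cat.png", 2048)],
    [TarVFile("shard-000.tar/cat.png", 2048, offset=512)],
)
print(schema)                  # columns of the common base schema
print(type(rows[1]).__name__)  # original subclass object survives
```

The key property is the last line: the union's schema narrows to `File`, but collecting still returns the original `TarVFile` instances, which is exactly the behavior rlamy asks for.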

@dberenbaum (Collaborator, Author) commented:

Closing for now since I don't see it progressing.

@dberenbaum dberenbaum closed this Oct 15, 2024