-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TarVFile
refactor
#446
TarVFile
refactor
#446
Conversation
296c136
to
0531f35
Compare
Deploying datachain-documentation with Cloudflare Pages
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #446 +/- ##
==========================================
+ Coverage 86.80% 87.04% +0.24%
==========================================
Files 93 92 -1
Lines 9800 9681 -119
Branches 2027 1995 -32
==========================================
- Hits 8507 8427 -80
+ Misses 936 906 -30
+ Partials 357 348 -9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
This comment was marked as resolved.
This comment was marked as resolved.
841d616
to
3993d42
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine to me, but I am an idiot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see several issues with this PR:
- Removing
FileSlice
means that we lose the ability to perform random reads inside a tar archive. To read a single image inside a webdataset file, we now need to get the whole file. - We lose the ability to merge files and v-files in the same dataset - in fact, we're effectively doing away with the whole idea of v-files.
- (less important)
test_dir_expansion()
has been quite useful in the past to debug issues with the "Unix commands".
I'm not necessarily against these changes, but they're quite significant and we need to explore the trade-offs. @dberenbaum, what issues do you see with the current TarVFile
and how do you think we should implement new kinds of v-objects (e.g. image patches/bounding boxes)?
@rlamy Yes, it is up to you if you think it's worth merging. You could also consider taking parts of it now or in the future. In response to your specific concerns:
I wonder how much the performance actually suffers here - it doesn't appear much worse in the examples. I did try to optimize this a bit so it streams the file and skips over each entry until hitting the image requested. The advantage would be that it is simpler and relies on However, it is possible to drop this part of the PR and bring back
Yes, I guess that's the primary benefit and drawback of the PR besides dropping a lot of code. Now that there is mature support for different As you note, the drawback is the inability to merge files and v-files since they are two different |
Just remembered another benefit to this PR: reading image files. The current wds examples won't work for model training since they don't read the file contents as images. |
@@ -206,18 +133,13 @@ def parent(self): | |||
@contextmanager | |||
def open(self, mode: Literal["rb", "r"] = "rb") -> Iterator[Any]: | |||
"""Open the file and return a file object.""" | |||
if self.location: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it mean we can drop location field after this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it looks like we could. I opened to make #487 for that change, although since it's more "breaking," I'm not sure if/when it makes sense to merge.
I like the simplification and that we are relying on the existing abstractions (and optimization can be done on top even in this PR - model itself can stay simpler).
@rlamy could you clarify please? what is the scenario / use case for this? |
@rlamy could you clarify please? what is the scenario / use case for this? |
An example would look like: base = DataChain.from_storage("s3://webdataset-stuff").gen(...)
extra = DataChain.from_storage("gs://raw-images").filter(C.path.glob("*.png").map(...)
combined = (base | extra).filter(...) I agree that using Now, |
Closing for now since I don't see it progressing. |
Extracted from #441.
Refactored
TarVFile
and webdataset functionalities to work with high-levelFile
api and dropped old functionality. This has a lot of overlap with Adddatachain.lib.tar.process_tar()
generator #440 and remove legacy udf decorator #438.