Conversation
💊 CI failures summary and remediationsAs of commit 91bd94f (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
|
@pmeier mind checking this PR and letting me know if I'm headed in the right direction? |
pmeier
left a comment
There was a problem hiding this comment.
Argh, I knew there was a reason I marked SBU in the tracker issue. When I gave you the go I just saw that there is something to download and assumed I was mistaken before. That is on me, my bad.
In general your solution looks really good and works, but has one major downside: We need to re-download every image for every iteration. While we plan to support streaming datasets from the internet, we are not there yet. Thus, we need to download everything.
That could be achieved with a OnDiskCacheHolder, but that would mean we would only download at runtime. All current datasets download everything upfront and I would keep it that way for now.
My solution is to put a custom preprocess method onto the resource and download everything there.
| with open(dataset_folder.joinpath(photo_urls_file), "w") as url_file, open( | ||
| dataset_folder.joinpath(photo_captions_file), "w" | ||
| ) as caption_file: | ||
| urls = [f"https://via.placeholder.com/{random.randint(100, 1000)}.jpg" for _ in range(num_samples)] |
There was a problem hiding this comment.
This is a really cool idea and I'm definitely going to use this webiste for other things in the future 🚀 Unfortunately, we cannot have an actual download during mock data generation for two reasons:
- Downloading these images takes quite some time and we want the tests to be fast.
- Meta internal test system do not have access to the internet and thus would fail here.
I propose I send a patch for the test suite that allows us to also only generate the already preprocessed files. Thus, we only add a SBUCaptionedPhotoDataset that already includes test images. I'll ping you on the PR.
There was a problem hiding this comment.
@pmeier I'll wait for the PR to get merged right? I can make the necessary changes after it.
There was a problem hiding this comment.
Yes, sorry for the delay. I'll try to get it merged soon.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
34b9775 to
91bd94f
Compare
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
fixes #5349