Description
🐛 Bug
Hello! I'm excited about using litdata.optimize to solve some long-standing data loading issues I've encountered many times before.
While trying to get it working with my dataset, I ran through the getting-started example, which writes binary files for a fake dataset of random-noise images, and pretty quickly hit an error that should've been caught.
Running `litdata.optimize()` completes without issue, and `streaming.py` also technically runs without issue. But if I try to do anything with the DataLoader defined at the end of `streaming.py`, I get a `default_collate` error because it is incapable of batching `PIL.Image` objects.
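For context, here's a minimal workaround sketch I'm using in the meantime: a custom `collate_fn` (the name `pil_collate` is mine, not a litdata API) that converts any `PIL.Image` in the sample to a tensor before delegating to `default_collate`. It assumes RGB images and dict-style samples like the getting-started example produces.

```python
import numpy as np
import torch
from torch.utils.data import default_collate
from PIL import Image


def pil_collate(batch):
    """Convert PIL images to CHW float tensors, then batch as usual.

    default_collate raises on PIL.Image, so we decode first.
    Assumes RGB (3-channel) images.
    """
    def to_tensor(x):
        if isinstance(x, Image.Image):
            arr = np.asarray(x, dtype=np.float32) / 255.0  # HWC in [0, 1]
            return torch.from_numpy(arr).permute(2, 0, 1)  # -> CHW
        return x

    if isinstance(batch[0], dict):
        batch = [{k: to_tensor(v) for k, v in sample.items()} for sample in batch]
    else:
        batch = [to_tensor(sample) for sample in batch]
    return default_collate(batch)
```

Passing this as `DataLoader(dataset, batch_size=4, collate_fn=pil_collate)` gets me past the error, but it feels like something the example itself should handle.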
Given that pretty much all of the examples I could find in the documentation rely heavily on PIL.Image, I'm wondering:
A. Can we update the getting-started example so that it at least defines a DataLoader that won't immediately throw an error?
B. Can we get some more explanation/exploration of whether it's best to store images as `PIL.Image` within the optimized files, or better to convert them to `torch.Tensor`/`np.ndarray` in advance so they aren't re-decoded every epoch? (My intuition leans toward the latter.)
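To illustrate option B, here's a sketch of a worker function that builds each sample as a tensor up front, so the stored sample batches cleanly with `default_collate`. The `make_sample` name and the image size are made up; the `ld.optimize(...)` call in the comment follows the signature shown in the getting-started example, so please correct me if that's not the recommended pattern.

```python
import numpy as np
import torch


def make_sample(index):
    """Hypothetical optimize() worker fn: store the image pre-decoded as a
    uint8 CHW tensor instead of a PIL.Image, so no per-epoch conversion is
    needed and default_collate can batch it directly."""
    img = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)  # fake HWC image
    return {
        "index": index,
        "image": torch.from_numpy(img).permute(2, 0, 1).contiguous(),
    }


# Sketch only (not run here), assuming the getting-started example's API:
# import litdata as ld
# ld.optimize(
#     fn=make_sample,
#     inputs=list(range(1000)),
#     output_dir="fast_data",
#     chunk_bytes="64MB",
# )
```

The trade-off I'd love documented is storage size (raw tensors vs. encoded images) versus the per-epoch decode cost.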
To Reproduce
If desired, I can provide more details on the environment I used to produce this error, but since it's based directly on the minimal working example from the docs, it shouldn't require much imagination to reproduce.