Using a streaming dataloader with an unbalanced dataset yields unexpected batch sizes. #199
Comments
Hey @esivonxay-cognitiv, Thanks for the reproducible script. I will have a look into it.
Thanks Thomas!
Hey @esivonxay-cognitiv, I am curious: what is your interest in and usage of LitData?
Yeah, I'm interested in LitData primarily for the ability to sample from multiple streams. I've got 2 datasets which are quite imbalanced (one is 100,000x larger than the other), and I'm trying to downsample one of them to reduce the imbalance by a couple of orders of magnitude. Naively, I could do this when constructing the dataset by throwing out datapoints. However, doing so would mean throwing out 90% or 99% of the data (to decrease the imbalance by 10x or 100x, respectively), and important samples could be lost in the process. My thought was to do this downsampling/rebalancing during dataloading, so the model at least has a chance to see each sample, just at a lower rate.
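A rough sketch of that idea, assuming the StreamingDataset, CombinedStreamingDataset, and StreamingDataLoader classes documented in the litdata README (the paths, weights, and batch size below are placeholders):

```python
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

# Two pre-optimized streaming datasets with a large size imbalance.
rare = StreamingDataset("data/rare")      # placeholder path
common = StreamingDataset("data/common")  # placeholder path, ~100,000x more samples

# Rebalance at load time instead of discarding samples when building the
# datasets: draw roughly 1 rare sample for every 99 common ones.
combined = CombinedStreamingDataset(
    datasets=[rare, common],
    weights=[0.01, 0.99],
    seed=42,
)
loader = StreamingDataLoader(combined, batch_size=8, num_workers=4)

for batch in loader:
    ...  # every rare sample still has a chance to be seen, just at a lower rate
```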
I recently encountered a similar issue while training a model with a batch normalization layer. Since batch normalization requires a batch size greater than 1 during training, the training process fails if a batch size of 1 is produced. There may be a potential solution discussed here, where using [...]. However, the relevant code is at litdata/src/litdata/streaming/dataloader.py, lines 597 to 605 in c4c9117.
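As a stopgap on the training side (not a fix for the dataloader itself), undersized batches can be filtered out before they reach the BatchNorm layers. A minimal sketch, assuming batches are dicts of tensors with a leading batch dimension:

```python
from typing import Dict, Iterable, Iterator

import torch


def skip_undersized_batches(
    loader: Iterable[Dict[str, torch.Tensor]], min_size: int = 2
) -> Iterator[Dict[str, torch.Tensor]]:
    """Yield only batches whose leading dimension is at least min_size.

    A workaround so a stray batch of size 1 never reaches BatchNorm in
    training mode, where it would fail because per-batch statistics need
    more than one sample.
    """
    for batch in loader:
        batch_size = next(iter(batch.values())).shape[0]
        if batch_size >= min_size:
            yield batch
```

Usage would then be `for batch in skip_undersized_batches(train_loader): ...` instead of iterating the dataloader directly.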
Hey @jackcyc @esivonxay-cognitiv, would either of you be willing to attempt a fix? The CombinedDataset isn't well thought out, IMO, and needs to be improved. It was designed for very large-scale training where only a few epochs are run, so your use case is kind of an edge case. I think we should rewrite it, taking PyTorch Lightning's CombinedLoader for inspiration: https://github.com/Lightning-AI/pytorch-lightning/blob/50af052b3129164e28efa8b9321d733311b7b459/src/lightning/pytorch/utilities/combined_loader.py#L222
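For reference, a minimal sketch of how that CombinedLoader handles differently sized loaders (Lightning 2.x API; the toy datasets are placeholders):

```python
from torch.utils.data import DataLoader
from lightning.pytorch.utilities.combined_loader import CombinedLoader

loaders = {
    "small": DataLoader(range(10), batch_size=4),
    "large": DataLoader(range(1000), batch_size=4),
}

# "max_size_cycle" keeps iterating until the largest loader is exhausted and
# restarts the smaller ones, so batch sizes stay governed by each DataLoader.
combined = CombinedLoader(loaders, mode="max_size_cycle")
_ = iter(combined)  # CombinedLoader needs an iterator created before len() is defined

for batch, batch_idx, dataloader_idx in combined:
    ...  # batch is a dict: {"small": tensor, "large": tensor}
```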
Hey Thomas, thanks for the follow-up. I haven't looked at the PyTorch Lightning implementation exhaustively, but thanks for bringing it to my attention. I don't currently have the bandwidth for this, but I'll put it on my list of todos and revisit fixing/rewriting this.
🐛 Bug
I have two datasets which are unbalanced, where one dataset is 1000x larger than the other. I would like to sample from the two datasets such that the ratio of samples from each is 1:100. When doing so, batches of irregular size are returned during iteration.
I think there are 2 issues which this test surfaces:

1. Batches of irregular size are returned during iteration.
2. drop_last does not appear to work as intended, since the last batch is not a full-sized batch.

I don't think this is related to #179, but it's possible.
I've been attempting to fix this, but I'm not sure what the root of the issue is. I would be very appreciative if you could fix this or point me in the right direction.
Thanks!
To Reproduce
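(The original reproduction script was attached to the issue; the sketch below only approximates it. Dataset sizes, output paths, and the sample format are placeholders, and the litdata calls used are the ones documented in its README.)

```python
import torch
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset, optimize


def make_sample(index: int) -> dict:
    # The payload does not matter here; each sample just wraps its index.
    return {"index": torch.tensor(index)}


if __name__ == "__main__":
    # Build two streaming datasets with a large size imbalance (placeholder sizes).
    optimize(fn=make_sample, inputs=list(range(100)), output_dir="data/small", chunk_bytes="64MB")
    optimize(fn=make_sample, inputs=list(range(100_000)), output_dir="data/large", chunk_bytes="64MB")

    combined = CombinedStreamingDataset(
        datasets=[StreamingDataset("data/small"), StreamingDataset("data/large")],
        weights=[0.01, 0.99],
        seed=42,
    )
    loader = StreamingDataLoader(combined, batch_size=8, drop_last=True, num_workers=2)

    batch_sizes = sorted({len(batch["index"]) for batch in loader})
    print(batch_sizes)  # expected: [8]; irregular sizes show up instead
```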
Expected behavior
All batch sizes should be the same.
Additional context
This issue is independent of whether drop_last, shuffle, and persistent_workers are set to True or False.
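A sketch of how those flag combinations can be swept, reusing the placeholder directories from the reproduction sketch above (the shuffle and drop_last keyword arguments are the ones litdata documents for these classes):

```python
import itertools

from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

for drop_last, shuffle, persistent_workers in itertools.product([True, False], repeat=3):
    combined = CombinedStreamingDataset(
        datasets=[
            StreamingDataset("data/small", shuffle=shuffle),
            StreamingDataset("data/large", shuffle=shuffle),
        ],
        weights=[0.01, 0.99],
        seed=42,
    )
    loader = StreamingDataLoader(
        combined,
        batch_size=8,
        drop_last=drop_last,
        num_workers=2,  # persistent_workers requires num_workers > 0
        persistent_workers=persistent_workers,
    )
    sizes = sorted({len(batch["index"]) for batch in loader})
    print(f"drop_last={drop_last} shuffle={shuffle} persistent_workers={persistent_workers} -> {sizes}")
```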