Properly support Iterables as an alternative to DataLoaders in Trainer.{fit, validate, ...} #10696

Closed · awaelchli opened this issue Nov 23, 2021 · 6 comments · Fixed by #16726
Labels: data handling · design · feature · trainer: fit

awaelchli (Contributor) commented Nov 23, 2021

🚀 Feature

  1. Support passing dataloaders of type Iterable to Trainer.{fit, validate, test, predict}
  2. Support returning dataloaders of type Iterable from LightningModule.*_dataloader() and LightningDataModule.*_dataloader() (see the sketch just below)
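
To illustrate point 2, here is a minimal sketch (assuming nothing beyond the public LightningModule API) of a train_dataloader hook that returns a plain generator instead of a DataLoader:

import torch
import pytorch_lightning as pl

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        # a plain generator, i.e. an Iterable, not a torch DataLoader
        return (torch.randn(4, 32) for _ in range(10))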

Motivation

Lightning today supports iterating over a DataLoader (by creating an iterator from it), as well as over multiple dataloaders. It has been claimed in the past that the Trainer also supports arbitrary Iterables, but this is not true today: a trivial example (such as the usage example in the Pitch below) fails. In any case, neither the Trainer methods nor the Lightning{Data}Module hooks reflect this in their signature types.
Some applications need to operate directly on an instance of Iterable, and wrapping it inside a torch DataLoader/Dataset would be undesirable or infeasible. An example from the medical domain: https://github.com/MIC-DKFZ/batchgenerators

Furthermore, supporting Iterables would most likely facilitate the integration of DataPipes (from pytorch/data), which are soon to be included in PyTorch.
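
For context, a DataPipe is itself an Iterable, so this feature would let one be passed to the Trainer directly. A minimal sketch, assuming torchdata's prototype IterableWrapper API:

from torchdata.datapipes.iter import IterableWrapper

# DataPipes compose lazily and iterate like any other Iterable
pipe = IterableWrapper(range(100)).map(lambda x: x * 2).batch(4)
# with this feature in place, the pipe could be passed directly:
# trainer.fit(model, train_dataloaders=pipe)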

Pitch

Support the Iterable type.

Usage example:

import torch
from pytorch_lightning import Trainer

trainer = Trainer()
model = Model()  # any LightningModule
data = [torch.randn(4, 32) for _ in range(3)]  # a plain list of batches, i.e. an Iterable
trainer.fit(model, train_dataloaders=data)

Several places inside data_loading.py would have to be updated with branching logic that skips the setup steps normally done for DataLoaders (see the limitations section below). Roughly:
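
The following is a hypothetical sketch of that branching, not actual Lightning source; prepare_dataloader stands in for the existing DataLoader setup logic:

from typing import Iterable, Union
from torch.utils.data import DataLoader

def process_dataloader(obj: Union[DataLoader, Iterable]) -> Union[DataLoader, Iterable]:
    if isinstance(obj, DataLoader):
        # existing path: replace samplers, re-instantiate with Trainer settings, ...
        return prepare_dataloader(obj)  # hypothetical stand-in for the current logic
    # plain Iterable: pass through untouched, skipping DataLoader-specific setup
    return obj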

Limitations and Challenges

Some features in Lightning will not work for Iterables. Most notably (see also the discussion below):

  • Automatic sampler replacement, e.g., injecting a DistributedSampler for distributed training, which Lightning can only do for DataLoader instances.
  • Fault-tolerant training, which relies on capturing and restoring DataLoader/sampler state.

Alternatives

  • Do not support Iterable. Users would have to wrap their Iterables in an IterableDataset and compose them into a DataLoader (see the sketch after this list).
  • Add support only in Lite. Drawback: users converting from Lite to Lightning will have questions.
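
For reference, the wrapping workaround from the first alternative might look like this (a minimal sketch; my_iterable is a placeholder for any Iterable of batches):

from torch.utils.data import DataLoader, IterableDataset

class WrapIterable(IterableDataset):
    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        return iter(self.iterable)

# batch_size=None hands items through without additional collation
loader = DataLoader(WrapIterable(my_iterable), batch_size=None)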

Additional context

#10279 attempted to add support for Iterable types in Lite, but it was later reverted due to incompleteness.


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.

cc @Borda @justusschock @awaelchli @ninginthecloud @tchaton @rohitgr7

awaelchli added the feature, data handling, trainer: fit, and design labels on Nov 23, 2021
justusschock (Member) commented

+1 for this.

Regarding the limitations:
Back when we still supported this, things like patching the sampler were only ever done for DataLoaders and their subclasses. Using other iterables required taking care of such things manually, and I think that is the only way we can actually support this, given the arbitrary implementations such iterables can have.

Same goes for fault tolerance, IMO.

Regarding the ambiguity of the loader types:
One option would be to only allow a single loader (which could then be a combined loader).
While this would be more explicit (and would also let us remove some arguments from the Trainer's init), it would be a step back in terms of UX. For dicts you could explicitly check for mappings, and for sequences maybe inspect the type of the first element (sketched below)?
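
(A rough illustration of that disambiguation idea, using a hypothetical helper; this is not actual Lightning code:)

from collections.abc import Mapping, Sequence
from torch.utils.data import DataLoader

def looks_like_multiple_loaders(obj):
    # heuristic: is obj a collection of dataloaders rather than one iterable of batches?
    if isinstance(obj, Mapping):
        return True  # e.g. {"a": loader_a, "b": loader_b}
    if isinstance(obj, Sequence) and len(obj) > 0:
        # inspect the first element to tell a sequence of loaders
        # apart from a sequence of batches
        return isinstance(obj[0], DataLoader)
    return False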

tchaton (Contributor) commented Nov 24, 2021

Hey @awaelchli,

Have you seen demand for such a feature from the community?

justusschock (Member) commented

@tchaton we once had support for it, and as I mentioned, I know a few people who don't use PyTorch dataloaders (sometimes including myself). I'm not sure when the regression happened, since we don't have tests for it, but I think this is a pretty strong restriction to impose.

stale bot commented Dec 26, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

stale bot added the won't fix label on Dec 26, 2021
awaelchli modified the milestones: 1.6, 1.7 on Dec 26, 2021
stale bot closed this as completed on Jan 9, 2022
awaelchli (Contributor, Author) commented

@Stale you blind??? This had a MILESTONE!!!

awaelchli removed the won't fix label on Jan 16, 2022
awaelchli reopened this on Jan 16, 2022
alessiamarcolini (Contributor) commented

Hey maintainers :) I'm really looking forward to this feature 🚀
