Redo Dataset and Dataloader #90

Open · Benny-Nottonson opened this issue May 16, 2024 · 2 comments
Labels: good first issue (Good for newcomers)

Comments

@Benny-Nottonson (Contributor)

No description provided.

@Benny-Nottonson Benny-Nottonson added the good first issue Good for newcomers label May 16, 2024
@josiahls commented May 23, 2024

I don't want to lick the cookie, but one of the things I'm excited about in Mojo is the type safety / memory management.

What are everyone's thoughts on torchdata? I experimented with building an RL framework, fastrl, on top of it.

Pros

  • Linking pipelines was very cool, and writing transforms was easy; I built some very complex pipelines with it, both personally and for work (see the sketch after this list).
  • Very horizontal inheritance. Learning how to do custom things / taking torchdata apart was very easy since the hierarchy was basically flat: all pipes inherited from IterDataPipe or MapDataPipe. I think onboarding new users is a lot easier because of this.
    • I was surprised how important this was. An issue I've seen with a lot of dataloading frameworks is that they turn into OOP hell and thus become very hard to extend. My understanding, from talking to research friends who tried using / extending it, is that Ray Data has this issue.
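
For reference, a minimal sketch of what that linking looks like with torchdata's datapipes API (0.x series); the Doubler pipe is a hypothetical custom transform, not a torchdata built-in:

```python
from torchdata.datapipes.iter import IterableWrapper, IterDataPipe

class Doubler(IterDataPipe):
    """Custom pipe: with the flat hierarchy, subclassing IterDataPipe is all you need."""
    def __init__(self, source_dp: IterDataPipe):
        self.source_dp = source_dp

    def __iter__(self):
        for x in self.source_dp:
            yield x * 2

pipe = IterableWrapper(range(10))         # A: source
pipe = Doubler(pipe)                      # B: custom transform
pipe = pipe.filter(lambda x: x % 3 == 0)  # C: built-in functional form

print(list(pipe))  # [0, 6, 12, 18]
```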

Cons

  • The future of torchdata is hazy, and the maintainers vaguely / unhelpfully noted that they need to redesign some things. Below are my guesses as to why.
  • Limitations related to Python:
    • How do you verify that pipeline A -> B -> C is valid in Python? E.g. how do we know those pipes plug into each other correctly? Python doesn't have type safety, so unless we somehow check the signatures in Python / use Pydantic, this doesn't appear possible (see the first sketch after this list).
    • How do you pass values / references between pipes reliably? E.g. you want to cache data at certain points in the pipeline, but don't want to duplicate the data from earlier in the pipeline.
    • If you have a pipeline and want to do multiprocessing, how do you nicely get around the Python GIL?
      • torchdata was recently testing a DataLoader2 that uses pub/sub messaging, but it doesn't look like that went anywhere?
  • Limitations not related to Python:
    • Exception messages in pipelines (torchdata or not) are simply awful. If you have a pipeline A -> B -> C and there is an exception in A, you get a long stack trace all the way up the pipeline. I feel like this might be the Achilles' heel of a lot of pipeline dataloader frameworks.
      • I think Mojo has inlining / nodebug capabilities that could make this not so bad (skip internal functions), which would otherwise not be possible in Python (?).
      • Probably needs an innovation here: modify the exception / stack trace when using the pipelines so the traces are easier to read (see the second sketch after this list).
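
On the type-safety point, here's a hedged sketch of what static pipe checking could look like in Python with generics: mypy can reject a mismatched chain even though the runtime can't. All names here are hypothetical.

```python
from typing import Generic, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")

class Pipe(Generic[A, B]):
    """A pipe that consumes items of type A and produces items of type B."""
    def __call__(self, items: Iterator[A]) -> Iterator[B]:
        raise NotImplementedError

def link(first: Pipe[A, B], second: Pipe[B, C]) -> "Pipe[A, C]":
    """Chain two pipes; the output type of `first` must match the input of `second`."""
    class Linked(Pipe[A, C]):
        def __call__(self, items: Iterator[A]) -> Iterator[C]:
            return second(first(items))
    return Linked()

class ParseInts(Pipe[str, int]):
    def __call__(self, items: Iterator[str]) -> Iterator[int]:
        return (int(s) for s in items)

class Square(Pipe[int, int]):
    def __call__(self, items: Iterator[int]) -> Iterator[int]:
        return (x * x for x in items)

pipeline = link(ParseInts(), Square())        # OK: str -> int -> int
# link(Square(), ParseInts())                 # mypy error: int is not str
print(list(pipeline(iter(["1", "2", "3"]))))  # [1, 4, 9]
```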
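
And one possible shape of the "rewrite the stack trace" idea: catch the error at the pipeline boundary and re-raise with the chained internal traceback suppressed. A hypothetical sketch, not a torchdata feature:

```python
def run_pipeline(pipe):
    """Drain a pipe, converting deep internal tracebacks into one short error."""
    try:
        return list(pipe)
    except Exception as err:
        # `from None` suppresses the chained "during handling of the above
        # exception" traceback, so the user sees the failing pipe's name and
        # message instead of every internal frame.
        raise RuntimeError(
            f"pipeline failed in {type(pipe).__name__}: {err}"
        ) from None
```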

Some things I'm seeing that would be needed from Mojo:

Major blockers

  • Iterable / Iterator / Gettable traits that pipes can implement (a rough Python analogue is sketched below).
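
A rough Python Protocol analogue of what those traits might look like; the names are hypothetical, and the eventual Mojo trait definitions would of course differ:

```python
from typing import Iterator, Protocol, TypeVar

T = TypeVar("T", covariant=True)

class IterablePipe(Protocol[T]):
    """Pipes that can be iterated over (streaming access)."""
    def __iter__(self) -> Iterator[T]: ...

class GettablePipe(Protocol[T]):
    """Pipes that support random access by index."""
    def __len__(self) -> int: ...
    def __getitem__(self, index: int) -> T: ...
```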

Minor needs

  • yield / coroutines. I think a working pipeline can hack around this for now (one such workaround is sketched below).
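
One way to hack around missing yield / coroutines is to implement the iterator protocol explicitly as a class carrying its own state; here in Python for illustration, with a hypothetical BatchPipe:

```python
class BatchPipe:
    """Group items from a source into fixed-size batches, without generators."""
    def __init__(self, source, batch_size: int):
        self.source = source
        self.batch_size = batch_size

    def __iter__(self):
        self._it = iter(self.source)
        return self

    def __next__(self):
        batch = []
        for _ in range(self.batch_size):
            try:
                batch.append(next(self._it))
            except StopIteration:
                break  # source exhausted; emit whatever we collected
        if not batch:
            raise StopIteration
        return batch

print(list(BatchPipe(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```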

I'm curious what other frameworks / libs people have used, liked, disliked.

@StijnWoestenborghs (Collaborator) commented May 24, 2024

Hi @josiahls, I think data pipelining is a complex, but important, topic that Basalt might not focus on in the near future. What I've read is that torchdata suffers from a lack of lower-level control over things like multiprocessing, and even though that should be possible in Mojo, other than algorithm.parallelize it doesn't have anything like a threading API (yet!).

For sure, type safety and safely passing references to the data without copies will, and must, be possible. And as a first rework of the current dataloader (which simply loads all the data into memory), I think the goal should be an ultra-simple pipeline that 'chunk-loads' the data into memory and passes it to the model like that (a rough sketch below). Additionally, Mojo might have an edge here with its very convenient and easy-to-use compile-time features. Are you perhaps interested in trying this out?
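
For example, a minimal chunk-loading sketch, in Python for illustration (the flat-binary file layout, the names, and the model.step call are all hypothetical):

```python
import numpy as np

def chunked_loader(path: str, n_rows: int, row_shape: tuple, chunk_rows: int = 1024):
    """Yield fixed-size chunks of a flat float32 binary file instead of loading it all."""
    row_size = int(np.prod(row_shape))
    with open(path, "rb") as f:
        for start in range(0, n_rows, chunk_rows):
            rows = min(chunk_rows, n_rows - start)
            buf = np.fromfile(f, dtype=np.float32, count=rows * row_size)
            yield buf.reshape(rows, *row_shape)

# for chunk in chunked_loader("train.bin", n_rows=60_000, row_shape=(28, 28)):
#     model.step(chunk)  # hypothetical training call
```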

Long-term thinking: I can see cloud storage integration & distributed computing being massively important here as well, and I wonder whether that was one of the redesign considerations for torchdata.
