
Turn on using tasks in dataloader #1530

Open · wants to merge 27 commits into master

Conversation

DhairyaLGandhi
Member

Allows using Channels and task-based iteration by default, making it easier to hook into the new Channels-based approach of CUDA as well, keeping the main thread free. Largely backwards compatible, and we can always have hooks. With this we can move preprocessing on GPU to a different task, so we don't have to block the main task, which can continue with training.
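Roughly, the idea looks like the following sketch (hypothetical names, not the PR's actual code): a producer task fills a bounded Channel with prepared batches while the main task keeps training.

# Sketch only: the producer task runs batch preparation concurrently with training.
function channel_loader(batches; buffersize = 4)
    Channel(buffersize; spawn = true) do ch
        for batch in batches
            put!(ch, batch)   # blocks only when the buffer is full
        end
    end
end

# The main task consumes batches as they become ready:
# for batch in channel_loader(data)
#     train_step!(model, batch)
# end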

@DhairyaLGandhi
Member Author

This currently switches on shuffling, so the tests would need to be modified.

@DhairyaLGandhi
Member Author

@maleadt do you think this would appropriately use the Task parallelism in CUDA?

DhairyaLGandhi changed the title from "Dg/data" to "Turn on using tasks in dataloader" on Mar 8, 2021
@darsnack
Member

darsnack commented Mar 8, 2021

I would suggest this effort is better spent on DataLoaders.jl, cleaning it up and preparing it for GPU tasks.

@maleadt
Collaborator

maleadt commented Mar 8, 2021

It's not clear to me where and what kind of GPU operations would be happening here. Assuming it's mostly memory operations, though, it will need a solution to JuliaGPU/CUDA.jl#735 before this can efficiently use the GPU, either in the form of a call to CUDA.pin(::Array) or a PinnedArray abstraction (CPU buffers need to be 'pinned' for CUDA to be able to perform memory operations asynchronously on them).
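For later readers, a hedged sketch of what this could look like; CUDA.pin was only a proposal at the time of this comment, so the call below is an assumption rather than a settled API:

using CUDA

x = rand(Float32, 1024, 1024)   # CPU-side staging buffer
CUDA.pin(x)                     # hypothetical here: register ('pin') the host memory
dx = CuArray(x)                 # the upload can now run asynchronously with compute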

@darsnack
Member

darsnack commented Mar 8, 2021

In the case of DataLoaders.jl, the memory is loaded into ring buffers under the control of the dataloader. So we can manually pin the memory.

I don't want to derail this PR too much, since it's WIP. Just wanted to add that the direction it is headed looks a lot like the work in DataLoaders.jl (i.e. channels/tasks by default, the getobs interface being reimplemented here, buffering, etc.). Only pointing it out so that work is not duplicated.
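For context, the getobs interface referred to here (from LearnBase/MLDataPattern, which DataLoaders.jl builds on) looks roughly like this; ImageDataset, load_image and load_image! are hypothetical stand-ins:

# A dataset exposes random access to observations via nobs/getobs
# (in practice these would extend LearnBase.nobs / LearnBase.getobs).
struct ImageDataset
    paths::Vector{String}
end

nobs(d::ImageDataset) = length(d.paths)
getobs(d::ImageDataset, i::Integer) = load_image(d.paths[i])   # hypothetical helper

# The buffered variant writes into preallocated memory (e.g. a ring buffer),
# which is also where the buffers could be pinned:
getobs!(buffer, d::ImageDataset, i::Integer) = load_image!(buffer, d.paths[i])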

@DhairyaLGandhi
Member Author

So the interface might be to have GPU operations happen either at getobs, or via a helper function f I had in there, where the majority of the preprocessing/validation can happen. I'd removed it prior to pushing, but we can reintroduce it so it's more obvious.

@DhairyaLGandhi
Member Author

Let me know what needs to happen for the pinning; we can add tests for it as well.

@maleadt
Collaborator

maleadt commented Mar 8, 2021

Let me know what needs to happen for the pinning; we can add tests for it as well.

Nothing is required, so you can go ahead and benefit from multitasking already for kernels and other operations. Once we have proper memory pinning, we can add the use of those APIs here.
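As an illustration of that point, work launched from separate Julia tasks runs on separate CUDA streams and can overlap on the device (a sketch, assuming CUDA.jl's per-task streams):

using CUDA

a = CUDA.rand(Float32, 4096, 4096)
b = CUDA.rand(Float32, 4096, 4096)
t1 = Threads.@spawn CUDA.@sync a .^ 2   # each task gets its own stream,
t2 = Threads.@spawn CUDA.@sync b .^ 2   # so these kernels can overlap
fetch(t1); fetch(t2)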

@CarloLucibello
Member

@DhairyaLGandhi it would be useful if you could review DataLoaders.jl and submit changes there if needed. As @darsnack said, a lot of work has already gone into it, so it would be good to consolidate effort instead of duplicating it. We can then bring the package into the org and re-export it from Flux.jl.

@DhairyaLGandhi
Member Author

@KristofferC is this an appropriate setup for using Channels to parallelise data loading, or do you think we would be better off doing the regular iteration?

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Mar 29, 2021

The utility here has a slightly different goal, which is to generalise to arbitrary tuples of data, while also making it possible to get the benefits of lazy loading or to manage loading functions via the first argument. Why can't we pin memory here? Seems reasonable to me.

The getobs is an artefact of the previous implementation here, but using tasks and the scheduler is definitely different. I think we want to set the API expectations here rather than externally.

@DhairyaLGandhi
Member Author

Cleaned up the case where it's a tuple of args. It also seems to have the side effect of making it slightly more usable.

for paths_to_sampled_images in DataLoader(sampling_function, args)
  for minibatch in DataLoader(gpu ∘ preprocess, paths_to_sampled_images)
    # ... train on minibatch ...
  end
end

This currently requires the lengths of all elements of the tuple to agree for proper zipping, but maybe we could automatically duplicate singleton values like reals or nothing.
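One way such singleton duplication could work (a speculative sketch, not what the PR implements):

# Expand scalar-like values to the batch length so zip sees equal lengths;
# collections pass through unchanged.
expand(x::Union{Real,Nothing}, n) = Iterators.repeated(x, n)
expand(x, n) = x   # assumed to already have n elements

xs = [rand(Float32, 28, 28) for _ in 1:100]
zipped = zip(expand.((xs, 0.5, nothing), 100)...)
first(zipped)      # -> (image, 0.5, nothing)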

@CarloLucibello
Member

Adding functionality unrelated to tasks shouldn't be part of this PR; it should be discussed separately.
Also, a few comments about why you are doing this at all instead of leveraging DataLoaders.jl have been ignored...

@DhairyaLGandhi
Member Author

This also came up in the DDP work, where we could tier and compose data loading of large datasets. It is handy to have a loader which can also handle GPU movement in the background. It's an improvement since it doesn't tie Flux to any specific tools for implementing data loaders, either.

mashu and others added 2 commits on December 1, 2021
Relax type of returned values by DataLoader in case we want to onehot and pad per batch so it's not the same as input.