
DataLoader with NamedTuple #1221

Merged (6 commits, Jun 16, 2020)

Conversation

@cossio (Contributor) commented Jun 12, 2020

Just a couple of small changes, so that DataLoader can be created with a NamedTuple of tensors instead of a Tuple. This way the tensors can be referred to by name. For example:

train_loader = DataLoader((images = Xtrain, labels = Ytrain), batchsize=16)
batch = first(train_loader)
y = model(batch.images)
logitcrossentropy(y, batch.labels)

If we only use tuples, then in datasets with multiple tensors one has to be careful about the order in which the tensors are fed into the DataLoader constructor, and be consistent with that order elsewhere. With NamedTuples one just has to be consistent about the names used, which I think is a minor improvement.
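To make the ordering point concrete, a minimal sketch (not part of the diff; the array shapes are made up, and DataLoader is taken from Flux.Data as in the version this PR targets):

using Flux.Data: DataLoader

Xtrain = rand(Float32, 10, 100)   # 10 features, 100 observations
Ytrain = rand(Float32, 2, 100)    # 2 targets for the same 100 observations

# With a plain Tuple the caller has to remember which slot is which:
x, y = first(DataLoader((Xtrain, Ytrain), batchsize=16))

# With a NamedTuple each tensor is retrieved by name, so the construction
# order never has to be remembered elsewhere:
batch = first(DataLoader((images = Xtrain, labels = Ytrain), batchsize=16))
size(batch.images), size(batch.labels)   # ((10, 16), (2, 16))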

CC @CarloLucibello

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable

I don't think this qualifies as an API change. It's just a minor feature addition, so a final review is probably not required.

  • Final review from @MikeInnes or @dhairyagandhi96 (for API changes).

@cossio cossio force-pushed the data branch 2 times, most recently from e1020c1 to 223a68b on June 12, 2020 at 00:27
Review threads on src/data/dataloader.jl (4) and test/data.jl were marked outdated and resolved.
@CarloLucibello (Member)

I like this. Interaction with train! is slightly problematic though, since fields will be splatted as positional arguments of the loss. Maybe we can consider this to be just OK.

@DhairyaLGandhi (Member)

Rather than the Unions, would it make more sense to have the NamedTuple constructors just forward to the regular ones?

Is there any other functional property that we are using here that warrants NamedTuple? Maybe it would be better to just expose these as kwargs in the first place, something like data and labels which could be their own tuples.
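For concreteness, one possible reading of that suggestion, as a purely hypothetical sketch inside Flux's dataloader.jl (this is not what the PR does, and it would give up the named access shown in the description):

# Hypothetical forwarding constructor: strip the names and delegate to the
# existing Tuple path. values(nt) returns the fields of a NamedTuple as a
# plain Tuple, so batches would come back without names.
DataLoader(data::NamedTuple; kws...) = DataLoader(values(data); kws...)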

@cossio (Contributor, Author) commented Jun 12, 2020

@CarloLucibello Hmm that's true. Maybe it's more consistent if we always call loss(d) instead of loss(d...), and let the user handle d directly? But probably better to discuss that elsewhere. In any case, using NamedTuples like this can be nicer for hand-written train loops.

@dhairyagandhi96 I don't understand? The idea is to be able to refer to the tensors by name, which can't be done if you convert the NamedTuple to a Tuple. Maybe I misunderstood your comment.

@DhairyaLGandhi (Member)

Looking at the code, the naming of said tensors is to allow users some convenience while sending the input, right? Or is the intention for the keys to be used inside the loss function?

@cossio (Contributor, Author) commented Jun 12, 2020

Looking at the code, the naming of said tensors is to allow users some convenience while sending the input, right? Or is the intention for the keys to be used inside the loss function?

I am using the keys inside the loss function (but I am also writing a training loop by hand). I think this could be a general use-case.

@cossio (Contributor, Author) commented Jun 13, 2020

What do you think of #1227? I could try a fix here.

@CarloLucibello (Member)

What do you think of #1227? I could try a fix here.

that should be fixed, but better do it in a separate PR

@CarloLucibello (Member)

In the end, DataLoader should end up supporting any type with some dataset-like interface. Changes here only involve overloading _nobs and _getobs for named tuples, which seems very lean and reasonable and goes in that direction, so I think we should merge this PR.
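For context, those overloads amount to something like the following sketch (the shape of the change, not the verbatim diff); the excerpt below shows the Tuple and AbstractArray counterparts:

# map over a NamedTuple preserves its keys, so a batch drawn from a NamedTuple
# of arrays is again a NamedTuple with the same field names.
_nobs(data::NamedTuple) = _nobs(values(data))                 # reuse the Tuple logic
_getobs(data::NamedTuple, i) = map(x -> _getobs(x, i), data)  # keys preserved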

end

_getobs(data::Tuple, i) = map(x -> _getobs(x, i), data)
_getobs(data::AbstractArray, i) = data[ntuple(i -> Colon(), Val(ndims(data) - 1))..., i]
Member:

Use the N from the type and drop the Val?

Author (Contributor):

Thanks for the suggestion, but why is that better?

Member:

It's largely for it to be cleaner; doing it like this doesn't seem to add any benefit and makes the code more complex to read.
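For readers following the thread, the suggested alternative would read roughly like this (a sketch of the reviewer's idea, not necessarily what was merged):

# Take N from the array's type parameter instead of wrapping ndims(data) - 1
# in a Val; either way, every leading dimension is selected in full and the
# last dimension is indexed by i.
_getobs(data::AbstractArray{T,N}, i) where {T,N} =
    data[ntuple(_ -> Colon(), N - 1)..., i]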

@cossio (Contributor, Author) commented Jun 16, 2020

Merge?

@CarloLucibello (Member)

needs a rebase

cossio and others added 6 commits June 16, 2020 13:31
@cossio (Contributor, Author) commented Jun 16, 2020

rebased

@MikeInnes (Member)

There are no 'minor' API changes that are allowed to go through without review; there are just API changes, and this is one of them. (The fact that you added documentation for it should be a giveaway).

I think this addition is fine though. It might be nice to generalise it (we could potentially reuse Functor here) but it's fine to make named tuples a special case wherever tuples are already, I think.

Interaction with train! is slightly problematic though, since fields will be splatted as positional arguments of the loss.

It would be helpful to add a test to make sure the behaviour is right, but I don't think this is the case, e.g.:

julia> +((a=1,b=2)...)
3

@cossio (Contributor, Author) commented Jun 16, 2020

@MikeInnes The problem with train! is that this line:

loss(d...)

doesn't propagate the tensor names. So the user has to be careful to define the loss function to take arguments in the correct order.
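To spell the concern out with a tiny, hypothetical example (the loss and field names are made up):

# Splatting a NamedTuple passes its fields positionally, so the names are
# dropped and the loss has to take its arguments in the field order used
# when the NamedTuple was built.
loss(x, y) = sum(abs2, x .- y)                  # positional loss; order matters

d = (images = [1.0, 2.0], labels = [1.0, 1.0])
loss(d...)                                      # same as loss(d.images, d.labels)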

@MikeInnes (Member)

Ah, I misread Carlo's post. Yes, if we want to avoid that we'd have to do something significantly more complex and it's better to keep this simple.

One option is to write train!(..., zip(DataLoader((a=a, b=b)))), in which case loss will be passed the named tuple directly.
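A quick sketch of the effect of that zip wrapper, using a hand-written loop in place of train! and a hypothetical loss:

using Flux.Data: DataLoader

# zip over a single iterator wraps every element in a 1-tuple, so after the
# loss(d...) splat that train! performs, the loss receives the NamedTuple itself.
loader = DataLoader((a = rand(Float32, 2, 10), b = rand(Float32, 2, 10)), batchsize = 5)
nt_loss(nt) = sum(abs2, nt.a .- nt.b)           # hypothetical loss over named fields

for d in zip(loader)
    nt_loss(d...)                               # d is a 1-tuple; splatting passes the NamedTuple through
end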

@cossio (Contributor, Author) commented Jun 16, 2020

In any case this train! issue can be dealt with in another PR / issue, right? I don't think more changes are needed here.

@MikeInnes (Member)

Depends; this PR is something of a decision point, because if we wanted train! to behave some other way with named tuples later, we'd have to break the behaviour added here. I think the behaviour here is probably the only sensible one, though.

@CarloLucibello can decide if he's happy with the details but the current API change LGTM, anyway.

@CarloLucibello (Member)

Once #1149 (implementing #1149 (comment)) gets merged, train! will pass the named tuple to the loss without splatting. Therefore inside train! we will have loss(nt), which I think is better than loss(nt...) and essentially on par with or even better than loss(; nt...). Users can also define loss(nt) = loss(; nt...) if they really want to use keyword arguments.

Let's merge this
bors r+

@cossio (Contributor, Author) commented Jun 16, 2020

I do like loss(; nt...) to handle this ...

@cossio cossio mentioned this pull request Jun 16, 2020
@bors bot (Contributor) commented Jun 16, 2020

Build succeeded.

@bors bors bot merged commit 19b45b4 into FluxML:master Jun 16, 2020
@cossio cossio deleted the data branch June 16, 2020 17:56