
DataLoader with NamedTuple #1221

Merged (6 commits, Jun 16, 2020)

Conversation

@cossio (Contributor) commented Jun 12, 2020

Just a couple of small changes, so that DataLoader can be created with a NamedTuple of tensors instead of a Tuple. This way the tensors can be referred to by name. For example:

train_loader = DataLoader((images = Xtrain, labels = Ytrain), batchsize=16)
batch = first(train_loader)
y = model(batch.images)
logitcrossentropy(y, batch.labels)

If we only use tuples, then in datasets with multiple tensors one has to be careful about the order in which the tensors are fed into the DataLoader constructor, and be consistent with that order elsewhere. With NamedTuples one just has to be consistent about the names used, which I think is a minor improvement.
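To make the ordering point concrete, a minimal sketch (not part of the diff; the array shapes are made up, and DataLoader is taken from Flux.Data as in the version this PR targets):

using Flux.Data: DataLoader

Xtrain = rand(Float32, 10, 100)   # 10 features, 100 observations
Ytrain = rand(Float32, 2, 100)    # 2 targets for the same 100 observations

# With a plain Tuple the caller has to remember which slot is which:
x, y = first(DataLoader((Xtrain, Ytrain), batchsize=16))

# With a NamedTuple each tensor is retrieved by name, so the construction
# order never has to be remembered elsewhere:
batch = first(DataLoader((images = Xtrain, labels = Ytrain), batchsize=16))
size(batch.images), size(batch.labels)   # ((10, 16), (2, 16))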

CC @CarloLucibello

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable

I don't think this qualifies as an API change. It's just a minor feature addition, so a final review is probably not required.

  • Final review from @MikeInnes or @dhairyagandhi96 (for API changes).

@cossio cossio force-pushed the data branch 2 times, most recently from e1020c1 to 223a68b on June 12, 2020 at 00:27
Review threads on src/data/dataloader.jl (4) and test/data.jl were marked outdated and resolved.
@CarloLucibello (Member)

I like this. Interaction with train! is slightly problematic though, since fields will be splatted as positional arguments of the loss. Maybe we can consider this to be just OK.

@DhairyaLGandhi (Member)

Rather than the Unions, would it make more sense to have the NamedTuple constructors just forward to the regular ones?

Is there any other functional property that we are using here that warrants NamedTuple? Maybe it would be better to just expose these as kwargs in the first place, something like data and labels which could be their own tuples.
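For concreteness, one possible reading of that suggestion, as a purely hypothetical sketch inside Flux's dataloader.jl (this is not what the PR does, and it would give up the named access shown in the description):

# Hypothetical forwarding constructor: strip the names and delegate to the
# existing Tuple path. values(nt) returns the fields of a NamedTuple as a
# plain Tuple, so batches would come back without names.
DataLoader(data::NamedTuple; kws...) = DataLoader(values(data); kws...)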

@cossio (Contributor, Author) commented Jun 12, 2020

@CarloLucibello Hmm that's true. Maybe it's more consistent if we always call loss(d) instead of loss(d...), and let the user handle d directly? But probably better to discuss that elsewhere. In any case, using NamedTuples like this can be nicer for hand-written train loops.

@dhairyagandhi96 I don't understand? The idea is to be able to refer to the tensors by name, which can't be done if you convert the NamedTuple to a Tuple. Maybe I misunderstood your comment.

@DhairyaLGandhi (Member)

Looking at the code, the naming of said tensors is to allow users some convenience while sending the input, right? Or is the intention for the keys to be used inside the loss function?

@cossio (Contributor, Author) commented Jun 12, 2020

Looking at the code, the naming of said tensors is to allow users some convenience while sending the input, right? Or is the intention for the keys to be used inside the loss function?

I am using the keys inside the loss function (but I am also writing a training loop by hand). I think this could be a general use-case.

@cossio (Contributor, Author) commented Jun 13, 2020

What do you think of #1227? I could try a fix here.

@CarloLucibello (Member)

What do you think of #1227? I could try a fix here.

that should be fixed, but better do it in a separate PR

@CarloLucibello (Member)

In the end, DataLoader should end up supporting any type with some dataset-like interface. Changes here only involve overloading _nobs and _getobs for named tuples, which seems very lean and reasonable and goes in that direction, so I think we should merge this PR.
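For context, those overloads amount to something like the following sketch (the shape of the change, not the verbatim diff); the excerpt below shows the Tuple and AbstractArray counterparts:

# map over a NamedTuple preserves its keys, so a batch drawn from a NamedTuple
# of arrays is again a NamedTuple with the same field names.
_nobs(data::NamedTuple) = _nobs(values(data))                 # reuse the Tuple logic
_getobs(data::NamedTuple, i) = map(x -> _getobs(x, i), data)  # keys preserved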

end

_getobs(data::Tuple, i) = map(x -> _getobs(x, i), data)
_getobs(data::AbstractArray, i) = data[ntuple(i -> Colon(), Val(ndims(data) - 1))..., i]
Member:

Use the N from the type and drop the Val?

Author (Contributor):

Thanks for the suggestion, but why is that better?

Member:

It's largely for it to be cleaner; doing it like this doesn't seem to add any benefit and makes the code more complex to read.
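For readers following the thread, the suggested alternative would read roughly like this (a sketch of the reviewer's idea, not necessarily what was merged):

# Take N from the array's type parameter instead of wrapping ndims(data) - 1
# in a Val; either way, every leading dimension is selected in full and the
# last dimension is indexed by i.
_getobs(data::AbstractArray{T,N}, i) where {T,N} =
    data[ntuple(_ -> Colon(), N - 1)..., i]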

@cossio (Contributor, Author) commented Jun 16, 2020

Merge?

@CarloLucibello (Member)

needs a rebase

cossio and others added 6 commits June 16, 2020 13:31
@cossio (Contributor, Author) commented Jun 16, 2020

rebased

@MikeInnes (Member)

There are no 'minor' API changes that are allowed to go through without review; there are just API changes, and this is one of them. (The fact that you added documentation for it should be a giveaway).

I think this addition is fine though. It might be nice to generalise it (we could potentially reuse Functor here) but it's fine to make named tuples a special case wherever tuples are already, I think.

Interaction with train! is slightly problematic though, since fields will be splatted as positional arguments of the loss.

It would be helpful to add a test to make sure the behaviour is right, but I don't think this is the case, e.g.:

julia> +((a=1,b=2)...)
3

@cossio (Contributor, Author) commented Jun 16, 2020

@MikeInnes The problem with train! is that this line:

loss(d...)

doesn't propagate the tensor names. So the user has to be careful to define the loss function to take arguments in the correct order.
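To spell the concern out with a tiny, hypothetical example (the loss and field names are made up):

# Splatting a NamedTuple passes its fields positionally, so the names are
# dropped and the loss has to take its arguments in the field order used
# when the NamedTuple was built.
loss(x, y) = sum(abs2, x .- y)                  # positional loss; order matters

d = (images = [1.0, 2.0], labels = [1.0, 1.0])
loss(d...)                                      # same as loss(d.images, d.labels)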

@MikeInnes (Member)

Ah, I misread Carlo's post. Yes, if we want to avoid that we'd have to do something significantly more complex and it's better to keep this simple.

One option is to write train!(..., zip(DataLoader((a=a, b=b)))), in which case loss will be passed the named tuple directly.
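A quick sketch of the effect of that zip wrapper, using a hand-written loop in place of train! and a hypothetical loss:

using Flux.Data: DataLoader

# zip over a single iterator wraps every element in a 1-tuple, so after the
# loss(d...) splat that train! performs, the loss receives the NamedTuple itself.
loader = DataLoader((a = rand(Float32, 2, 10), b = rand(Float32, 2, 10)), batchsize = 5)
nt_loss(nt) = sum(abs2, nt.a .- nt.b)           # hypothetical loss over named fields

for d in zip(loader)
    nt_loss(d...)                               # d is a 1-tuple; splatting passes the NamedTuple through
end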

@cossio (Contributor, Author) commented Jun 16, 2020

In any case this train! issue can be dealt with in another PR / issue, right? I don't think more changes are needed here.

@MikeInnes (Member)

Depends; this PR is something of a decision point, because if we wanted train! to behave some other way with named tuples later, we'd have to break the behaviour added here. I think the behaviour here is probably the only sensible one, though.

@CarloLucibello can decide if he's happy with the details but the current API change LGTM, anyway.

@CarloLucibello (Member)

Once #1149 (implementing #1149 (comment)) gets merged, train! will pass the named tuple to the loss without splatting. Therefore inside train! we will have loss(nt), which I think is better than loss(nt...) and essentially on par with or even better than loss(; nt...). Users can also define loss(nt) = loss(; nt...) if they really want to use keyword arguments.

Let's merge this
bors r+

@cossio (Contributor, Author) commented Jun 16, 2020

I do like loss(; nt...) to handle this ...

@cossio cossio mentioned this pull request Jun 16, 2020
@bors bot (Contributor) commented Jun 16, 2020

Build succeeded.

@bors bors bot merged commit 19b45b4 into FluxML:master Jun 16, 2020
@cossio cossio deleted the data branch June 16, 2020 17:56