Data loading & preprocessing pipeline feature #1282

Open · ageron opened this issue Jul 15, 2020 · 43 comments
@ageron commented Jul 15, 2020

The DataLoader is nice, but if I understand correctly it requires the dataset to fit in memory. For large datasets that don't fit in memory, it would be nice to have an easy way to load & preprocess the data efficiently, similar to TensorFlow's tf.data API. Maybe something like this exists already?

If not, perhaps one option would be to provide custom transducers to make it possible to write things like:

data = csv_file_paths |> Shuffle(length(csv_file_paths)) |> Interleave(CSV.File; threads=4) |>
       Map(preprocess_sample) |> Shuffle(100_000) |> Batch(32) |> Prefetch(1)

This would load records from multiple files in random file order, pick 4 files at random and interleave their records, preprocess every record, shuffle records using a 100,000-element buffer, batch them with batch size 32, and prefetch 1 batch (so the CPU can prepare the next batch while the GPU works on the previous one). The resulting data could then be used for training.
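
For reference, here is a rough sketch of the same dataflow using only Base iterators and a Channel for prefetching; this is not the proposed transducer API (no parallel interleaving or threaded reads), and csv_file_paths / preprocess_sample are the placeholders from the snippet above:

using Random, CSV

shuffled = shuffle(csv_file_paths)                              # random file order
records  = Iterators.flatten(CSV.File(p) for p in shuffled)     # read files lazily, one after another
samples  = (preprocess_sample(r) for r in records)              # per-record preprocessing
batches  = Iterators.partition(samples, 32)                     # batches of 32 records

# Prefetch one batch on a background task so the next batch is prepared
# while the current one is being consumed.
prefetched = Channel{Any}(1; spawn=true) do ch
    for b in batches
        put!(ch, collect(b))
    end
end

for batch in prefetched
    # train on `batch`
end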

@pxl-th (Member) commented Jul 16, 2020

Right now it is possible to use DataLoader with datasets that do not fit into memory, though it feels somewhat hacky... or maybe not.
Let's say we have a dataset of images that we want to load on demand.
We can define a custom struct subtyping AbstractArray.

using Formatting: FormatExpr, format  # assuming Formatting.jl for the path template
import FileIO, Images

struct Dataset{T, N} <: AbstractArray{T, N}
    frame_template::FormatExpr  # path template with an index placeholder
    targets::AbstractArray
end

Dataset{T}(frame_template, targets) where {T} =
    Dataset{T, 1}(frame_template, targets)

Then we define a getindex method that describes how to load one item:

function Base.getindex(d::Dataset{T}, i::Int) where {T}
    path = format(d.frame_template, i - 1)
    image = path |> FileIO.load |> Images.channelview .|> T
    image, d.targets[[i]]
end

and another method describing how to load a mini-batch:

function Base.getindex(d::Dataset{T}, ids::Array) where {T}
    x, y = d[ids[1]]
    xs_last_dim = ntuple(i -> Colon(), ndims(x))
    ys_last_dim = ntuple(i -> Colon(), ndims(y))

    xs = Array{T}(undef, size(x)..., length(ids))
    ys = Array{T}(undef, size(y)..., length(ids))

    xs[xs_last_dim..., 1] .= x
    ys[ys_last_dim..., 1] .= y

    for (i, id) in enumerate(ids[2:end])
        x, y = d[id]
        xs[xs_last_dim..., i + 1] .= x
        ys[ys_last_dim..., i + 1] .= y
    end
    xs, ys
end

Plus a few helper methods for the AbstractArray interface:

Base.IndexStyle(::Type{<:Dataset}) = IndexLinear()  # <:Dataset so it covers every parameterization
Base.size(d::Dataset) = (length(d.targets),)
Base.length(d::Dataset) = length(d.targets)

This in some sense mimics PyTorch's Dataset
and allows using DataLoader with datasets that do not fit into memory:

frame_template = FormatExpr(raw".\frames\frame-{:d}.jpg")  # Template path for images.
targets = load_from_txt(raw".\speed.txt")  # Array of targets

dataset = Dataset{Float32}(frame_template, targets)
loader = Flux.Data.DataLoader(dataset, batchsize=4, shuffle=true)
println("Loader length: $(length(loader))")
for (i, (x, y)) in enumerate(loader)
    i == 10 && break
    println("$i: $(size(x)) $(size(y))")
end

Output from my example:

Loader length: 240
1: (3, 160, 320, 4) (1, 4)
2: (3, 160, 320, 4) (1, 4)
3: (3, 160, 320, 4) (1, 4)
4: (3, 160, 320, 4) (1, 4)
5: (3, 160, 320, 4) (1, 4)
6: (3, 160, 320, 4) (1, 4)
7: (3, 160, 320, 4) (1, 4)
8: (3, 160, 320, 4) (1, 4)
9: (3, 160, 320, 4) (1, 4)

@CarloLucibello (Member)

For the time being, we can just document the interface that a "Dataset" should expose in order to be compatible with the DataLoader. @pxl-th a PR in this direction would be very welcome.

In the longer run, we should definitely consider reimplementing the DataLoader on top of transducers. Transducers are great and come fully packed with features, as @ageron showed.

@ageron (Author) commented Jul 18, 2020

Thanks for your detailed answer @pxl-th. I'm not sure whether my code example really makes sense, but it's the kind of API I would imagine, largely inspired by TF's tf.data API, with a transducers twist. I'm happy to help if you want.

@darleybarreto

I wonder if something more idiomatic could be done, like:

# I can call custom now and it will return three objects
@dataset (:train) image,target1,target2 function custom(path_arrays,idx)
    image = # load from the path_arrays[idx] + ...
    target1 = # load from the path_arrays[idx] + ...
    target2 = # load from the path_arrays[idx] + ...
end

or

# add another method to dataset
function dataset(path_arrays,idx, :train)
    image = # load from the path_arrays[idx] + ...
    target1 = # load from the path_arrays[idx] + ...
    target2 = # load from the path_arrays[idx] + ...

    image,target1,target2 
end

And another type called DataSet could be added, with an inner field (_data or something like that) that could hold either the array of paths (path_arrays) or the actual data, depending on the user's choice. The data inside the DataLoader would then be a DataSet, which samples from the corresponding dataset for :train, :val, :test or any other custom symbol defining a step (see the sketch below).

I am not a Julia expert, but I could help implementing it 😄 .
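
A rough sketch of how the second variant could be written in plain Julia, using Val types so that :train/:val/:test participates in dispatch. The DataSet type, its field, and the stand-in loading code are illustrative only, not an existing Flux API:

# Hypothetical container: `_data` holds either an array of paths or the data itself.
struct DataSet{T}
    _data::T
end

# One method per split; Val(:train) plays the role of the :train literal above,
# and :val / :test would get their own methods.
function getsample(ds::DataSet, idx, ::Val{:train})
    item    = ds._data[idx]          # a path (or an in-memory sample)
    image   = rand(Float32, 32, 32)  # stand-in for loading from `item`
    target1 = 0.0f0                  # stand-in target computations
    target2 = 0.0f0
    return image, target1, target2
end

# Usage: getsample(DataSet(["a.png", "b.png"]), 1, Val(:train))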

@paritosh5feb

Hi everyone! I am trying to write a CNN (U-Net) using the Flux ML library in Julia. The first hurdle I am facing is how to load images from a folder to train the model. I have searched the Internet extensively but found no method. All the CNN examples for Flux use library functions to import pre-existing datasets like MNIST. Please, can someone tell me how to load a custom image dataset from folders to train a CNN model written in Flux? I would be really grateful!

@DhairyaLGandhi (Member)

You can look at UNet.jl

@paritosh5feb

You can look at UNet.jl
Can you tell me how to load the images from the folder for training the model? Something like the flow_from_directory function in Keras (Python).

@DhairyaLGandhi (Member) commented Jun 21, 2021

There's a src/dataloader.jl in UNet.jl which talks about loading single images and also batches of images. You can also look at DataSets.jl, which helps keep track of directory structures. Basically, you need a way to point to the images you want to load, and then use packages such as Images.jl, FileIO.jl, ImageMagick.jl, ArchGDAL.jl, etc. to suit your specific directory structure and file types. Then you can put the data into the format Flux expects (see the tutorials on https://fluxml.ai). There's also Metalhead.preprocess, but that typically doesn't manage the directory structure; DataSets.jl can be used to mimic the Keras function you mentioned.

Most libraries would have a form of load or read function that you can pass a string or IO/IOBuffer type object to load it into Julia arrays.
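
As a concrete (if simplistic) illustration of that workflow, assuming a folder of same-sized RGB images at a hypothetical path, the load-and-stack step might look like:

using FileIO, Images

# Point at the image files, load each with FileIO/Images, and arrange them
# into the W×H×C×N layout Flux's convolutional layers expect.
paths = readdir("data/train/images"; join=true)
imgs  = [Float32.(channelview(load(p))) for p in paths]              # each image becomes C×H×W
x     = cat((permutedims(im, (3, 2, 1)) for im in imgs)...; dims=4)  # W×H×C×N batch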

@paritosh5feb

Can I use the load_batch function to produce x_train and y_train, so that both could be fed into
DataLoader(x_train, y_train, batchsize=N) to create the full dataset?

@DhairyaLGandhi (Member)

Quite possibly. It's hard to say without knowing how your dataset is structured. The function you mention assumes that the images and labels exist together in a directory, and that every directory corresponds to a different class. If you write a function that can generate valid paths to images and their corresponding labels, then you can use load_img too. It might be useful to see how that function is written anyway; it's quite simple.

@paritosh5feb

My data directory is structured as follows:
data/train/images contains the images, named id.png
data/train/masks contains the corresponding masks, named id.png
data/train.csv contains the names of all the images in data/train/images

@DhairyaLGandhi (Member)

So you'd use the CSV to generate valid paths to the data and corresponding labels, pass those to load_img (quite possibly broadcasting over a vector of strings or a vector of tuples of strings), and load the images and labels like https://github.com/DhairyaLGandhi/UNet.jl/blob/2b8f2393a8bf9895f69bb8493fbae84d7c9f9c35/src/dataloader.jl#L7

You can copy this function and replace that line with one that doesn't append the .tif extension.

@paritosh5feb commented Jun 21, 2021

Well, would this function be correct if I want to load all the images at once and then feed x_train and y_train to the DataLoader function?

function load_data(base_dir, n)
    img_dir = "/images/"
    mask_dir = "/masks/"
    x = zeros(Float32, rsize..., 1, n) # []
    y = zeros(Float32, rsize..., 1, n) # []
    for i in train_df[!, "id"]
        img = load(joinpath(base_dir, img_dir, i))
        mask = load(joinpath(base_dir, mask_dir, i))
        #img = imresize(img, rsize...)
        #mask = imresize(mask, rsize...)
        img = channelview(img)
        mask = channelview(mask)
        #img = reshape(img, rsize..., 1)
        #mask = reshape(mask, rsize..., 1)
        x′ = @view x[:,:,:,i]
        x′ .= img
        y′ = @view x[:,:,:,i]
        y′ .= mask
      end
    x, y
end

@paritosh5feb

So you'd use the CSV to generate valid paths to the data and corresponding labels, pass those to load_img (quite possibly broadcasting over a vector of strings or a vector of tuples of strings), and load the images and labels like https://github.com/DhairyaLGandhi/UNet.jl/blob/2b8f2393a8bf9895f69bb8493fbae84d7c9f9c35/src/dataloader.jl#L7

You can copy this function and replace that line with one that doesn't append the .tif extension.

But the load_img() function will load one image at a time, right? I want to load all the images at once into x_train, and all the masks into y_train. What should I do in that case?

@paritosh5feb commented Jun 21, 2021

I wrote this function based on the load_batch() function. But why is this function only returning arrays filled with zeros?

function load_data(base_dir, n, rsize = (101,101))
    img_dir = "images/"
    mask_dir = "masks/"
    x = zeros(Float64,rsize...,3, n) # []
    y = zeros(UInt8, rsize..., 1,n)
    count = 1
    for i in train_df[!, "id"]
        try
            if isfile(string(joinpath(base_dir, img_dir, i), ".png")) && isfile(string(joinpath(base_dir, mask_dir, i), ".png"))
                img = load(string(joinpath(base_dir, img_dir, i), ".png"))
                mask = load(string(joinpath(base_dir, mask_dir, i), ".png"))
                #img = imresize(img, rsize...)
                #mask = imresize(mask, rsize...)
                img = channelview(img)
                mask = channelview(mask)
                #img = reshape(img, rsize..., 1)
                #mask = reshape(mask, rsize..., 1)
                x′ = @view x[:,:,:,count]
                x′ .= img
                y′ = @view y[:,:,:,count]
                y′ .= mask
                append(x,x′)
                append(y, y′)
                count += 1
            end
        catch ArgumentError
        end
      end
    x, y
end

@DhairyaLGandhi (Member)

An approach such as #1530 combined with DataSets.jl works well with synchronised data-parallel training. That way we could nest loaders to pass subsets of the paths to different workers, which themselves loaded samples of data from S3 (or disk) to train with. We were able to amortise the cost of loading from file/network since the loading itself happened asynchronously, as did the data transfer. Neither all the samples nor the entire dataset could fit in memory. cc @c42f

@johnnychen94 (Contributor) commented Aug 30, 2021

Integrated with DataSets.jl, I'm imagining something like:

data = DataLoader(dataset("MNIST"))

for (x,y) in data
    ...
end

and let the under-the-hood implementation be something like:

blobtree = open(dataset("MNIST"))
data = DataLoader(mappedarray(load, blobtree))

but if JuliaComputing/DataSets.jl#17 is implemented, it would simplify to

data = DataLoader(open(dataset("MNIST")))

My eventual goal in JuliaML/MLDatasets.jl#73 (comment) is to have every dataset backed by DataSets.jl with some wrapper container (with downloader support; you can find some initial discussion in oxinabox/DataDeps.jl#144).

@DhairyaLGandhi (Member)

It probably doesn't even need to use MappedArrays, since there may be subtleties we're not capturing here; small datasets or custom data types come to mind. I want it to be somewhat composition-based, so DataSets.jl kicks in when it's advantageous to do so.

@johnnychen94 (Contributor) commented Aug 30, 2021

Just to clarify a bit, batch loading + lazy file loading can be implemented quite easily using MLDataPattern and MappedArrays:

using MLDataPattern
using MappedArrays          # provides mappedarray
using FileIO, TestImages
using ImageCore, ImageShow
using Flux                  # for gpu and Flux.stack

datadir = dirname(testimage("camera", download_only=true))
data = mappedarray(readdir(datadir, join=true)) do filename
    Gray{Float32}.(load(filename)[1:32, 1:32, 1])::Matrix{Gray{Float32}}
end;

dataset = batchview(data, size=2);
dataset[1] # a list of two images

for X in dataset
    X_gpu = gpu(Flux.stack(X, 3))  # stack the batch along a new 3rd dimension and move to GPU
    # do some training...
end

IMHO the more important purpose of DataLoader is to replace the "naive" batchview with 1) background threading support, which DataLoaders.jl already provides, and 2) being smart about preallocating a buffer to reduce the overhead in Flux.stack. DataLoader should be agnostic about how each observation is fetched; that can be implemented independently, as simply as a mappedarray or as elaborately as a ClassificationDataset/SegmentationDataset/... that has prior knowledge of how datasets are organized. That dataset-organization problem falls squarely into the scope of DataSets.jl; we just need to explore how to provide a convenient interface. Do we want to manage dataset organization as TOML files using predefined loaders, or do we want to write more Julia code to talk to DataSets.jl?


  2. be smart and preallocate a buffer to reduce the overhead in Flux.stack

Actually, I'm not sure about this one: most Flux networks require the data to be stacked into one big numerical array because the backends (e.g., cuDNN) require it. But for other training patterns that natively support vector-of-arrays layouts, there would be no such need.

@DhairyaLGandhi (Member)

Right, the argument is to generalise beyond the computer-vision case as well. Fetching, loading, preprocessing, and moving (to GPU) in the background also come up in the data-parallel context. Scaling this to uses outside of images with a clean API is arguably what motivated #1530.

@DhairyaLGandhi (Member) commented Aug 30, 2021

Having plug-and-play loaders is the most idiomatic approach, I think. It allows for uses that load large quantities of data, or that simply ingest arguments and sample data from them. But this should not come with a function to overload, IMO, since that can limit what works well with different parts of the pipeline, not just in terms of implementation but also for uncommon data sources and APIs. I am imagining needing this for non-image data as well, since the pipeline is the same either way, so I would like something that generalises rather than something specialised to images (which I believe can happen once we have a pipeline set up).

@DhairyaLGandhi (Member) commented Aug 30, 2021

Ideally, this should also not come with a strictly prescribed API to overload, but fall out of the general guidelines around iteration and so on. I'd prefer not to have to write overloads, since different datasets may require different APIs from one to the next. The MLDataPattern case seems like it would require adhering to its rules, lazy views/types, etc., which may not always scale well, or may require creating extra copies with AD. AtomGraphs (from AtomicGraphNets.jl) is the type to try alongside images.

@johnnychen94 (Contributor)

Ideally, this should also not come with strict set API to overload but fall out of the general set of guidelines around iteration and so on.

The reality is that without a set of traits/methods to dispatch on and overload, there will be ambiguities here and there. Letting DataLoader guess what each dimension in an Array{Float32, 4} means cannot work in general. It's just a question of dispatching on MLDataPattern's predefined methods or inventing Flux's own methods to dispatch on.

@darsnack (Member)

fall out of the general set of guidelines around iteration and so on

MLDataPattern is just getindex/length/iterate but with observation dimensions added on. It is what falls out. As @johnnychen94 mentioned, we should not guess what dimensions mean or force dimension ordering.

The case with MLDataPatterns seems like it would require adhering to its rules, and lazy views/ types etc, which may not always scale well

There are no rules that say you have to be lazy or use views. Much like the indexing/iteration interface in Base Julia, you only need to implement two functions: one to access a sample, and another to specify how many samples there are.
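
For concreteness, here is a minimal sketch of those two functions for a hypothetical file-backed dataset, using the LearnBase getobs/nobs generics that MLDataPattern dispatches on; the dataset type and the stand-in loader are illustrative only:

using LearnBase   # provides the getobs/nobs generic functions

struct ImageFolderDataset
    paths::Vector{String}
    labels::Vector{Int}
end

load_image(path) = rand(Float32, 32, 32)   # stand-in for an actual file load

# how many samples there are
LearnBase.nobs(d::ImageFolderDataset) = length(d.paths)

# how to access the i-th sample (eager here; it could just as well return a lazy wrapper)
LearnBase.getobs(d::ImageFolderDataset, i::Int) = (load_image(d.paths[i]), d.labels[i])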

@darsnack (Member)

FWIW I have been thinking about refactoring how observation dimensions are specified in MLDataPattern which might reduce the interface down to literally the Base indexing/iteration interface.

@DhairyaLGandhi (Member)

It is what falls out.

I still don't really see it. It seems to be rediscovering indexing/iteration from Base. Some common types we can host, and we do.

Much like the indexing/iteration interface in Base Julia, you only need to implement two functions

Right, and I'd like to overload the iteration protocol in Base and let that be the interface so to speak.

@ToucheSir (Member) commented Aug 30, 2021

We already know the functionality in Base is not sufficient, otherwise https://github.com/FluxML/Flux.jl/blob/master/src/data/dataloader.jl#L105-L119 would never have been written.

Where this additional functionality should live is the question. If Python ML libraries have taught us anything, it's that having your own siloed API for data containers is a surefire way to increase fragmentation and redundant work across the ecosystem.

@DhairyaLGandhi (Member)

#1530 goes on to clean up the code there.

Letting DataLoader guess what each dimension in an Array{Float32, 4} means cannot work in general.

Correct, that's why batchdim is a function in #1530.

If Python ML libraries have taught us anything ...

The discussion is geared towards flexible and general pipelining for a variety of ML workflows, not specific packages. For example, one may have to work with something like ArchGDAL.jl, and it's easier for users to use its API directly to handle coordinates, depending on how the data is laid out. By removing the need to overload a specific function, you reduce boilerplate. Specific implementations (with their own high-level APIs, such as MLDataPattern) can of course coexist with the more common image-loading cases, which can themselves be specialised into Segmentation/Localization etc. tasks.

@ToucheSir (Member)

Sure, that removes the need to implement a couple of non-Base functions, at the cost of having to implement a substantial chunk of the AbstractArray API, or alternatively re-implement the guts of DataLoader itself in #1530. By all means, let's discuss how to design these interfaces, but stuffing more non-composable functionality into Flux.Data.DataLoader is not the way to go.

@DhairyaLGandhi (Member)

It's fairly common to update implementations... If the argument is that it's not composable, then I'd invite you to help improve that PR.

@ToucheSir (Member)

First off, let me say that I really appreciate the time and thought you've put into revamping Flux's data APIs. It's certainly much easier to stand off to the side and comment than actually roll up one's sleeves and get working code.

That said, the reason I'm not motivated to improve PRs here is that I can, right this moment, run ] add DataLoaders and get a multi-threaded implementation which:

  • Works with arrays just as well and is already multi-threaded
  • Is better behaved with iteration (e.g. first(dl) == first(dl) holds if the input supports random access)
  • Exports composable pieces for all of its functionality which follow a common interface. In other words, you can mix, match and add your own functionality without requiring any internal changes.

There are of course sticking points with this approach. It's a third party dependency with a handful of transitive deps Flux doesn't rely on. It implements an interface that is neither ubiquitous nor universally agreed upon across the Julia data ecosystem. However, it is the closest thing we have to an agreed upon interface across different libraries at present. Concerns about having to implement yet another set of functions to make one's data source work are valid, but I'm not sure the remedy is to roll our own version.

@johnnychen94 (Contributor)

If there are concerns about future maintenance (e.g., breakage, commitment), we could ask whether @lorenzoh is willing to host DataLoaders.jl in FluxML, so as to better coordinate with Flux and FastAI. I ask because it's not clear to me at the moment what #1530 brings given the existence of DataLoaders.jl.

@lorenzoh (Member) commented Aug 31, 2021

I agree with @darsnack that standardizing on LearnBase.getobs/nobs is a good idea, and that Base.getindex/length/iterate are not enough. It makes it possible to define lazy datasets but doesn't require it; as an example, datasubset creates a lazy subset of a container by default, but datasubset(::Vector) simply creates a view. The approach gives full generality for unknown types while allowing specific, possibly more performant implementations for known types, so I think it scales pretty well. DataLoaders.jl also implements composable primitives, i.e. DataLoader(data, bs) is really eachobsparallel(batchviewcollated(data, bs)), where eachobsparallel implements the parallel, buffered loading and batchviewcollated is a lazy collated batch view around any data container, such that if getobs!(data, i) is implemented, buffered loading will also work for batchviewcollated.

For an example of how the LearnBase API can be used, also see this tutorial in the FastAI.jl docs.

  2. be smart and preallocate a buffer to reduce the overhead in Flux.stack

Since it hasn't been mentioned: DataLoaders.jl already does this for any data container that supports getobs!. Basically, it preallocates two collated batches and has every thread mutate a slice of one, thereby composing in-place, buffered loading and threading for arbitrary data containers. The collation logic can also be overloaded for custom types; out of the box it supports arrays, tuples, named tuples and dictionaries. It also provides a dispatch type that defines which dimension is considered the batch dimension.
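
A small sketch of the getobs! hook being described, assuming the LearnBase getobs/nobs/getobs! conventions used by MLDataPattern and DataLoaders.jl; the dataset type and the stand-in loading are illustrative:

using LearnBase

struct LazyImageDataset
    paths::Vector{String}
end

LearnBase.nobs(d::LazyImageDataset) = length(d.paths)

# allocating access
LearnBase.getobs(d::LazyImageDataset, i) = rand(Float32, 128, 128)   # stand-in for a file load

# in-place access into a preallocated buffer; DataLoaders.jl can then reuse its
# two collated batch buffers instead of allocating a fresh array per sample
function LearnBase.getobs!(buffer, d::LazyImageDataset, i)
    buffer .= LearnBase.getobs(d, i)
    return buffer
end

# loader = DataLoaders.DataLoader(LazyImageDataset(paths), 16)  # parallel, buffered batches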

we could ask if @lorenzoh is willing to host DataLoaders.jl in FluxML so as to better coordinate with Flux and FastAI

I'm for it; it's already used extensively in FastAI.jl anyway, and I think it supports everything people want to do. I'm not super familiar with DataSets.jl, but I imagine using it as the backend for a getobs data container should be straightforward as well.

@lorenzoh (Member)

@paritosh5feb
To answer the original question about loading data for semantic segmentation, do have a look at the FastAI.jl semantic segmentation tutorial, which loads data in a very similar format, and its data container tutorial, which should help you get started.

@DhairyaLGandhi (Member) commented Aug 31, 2021

It may not be the safest approach to create views on iteration; mutation of the data etc. can cause training over multiple epochs to pollute results. Having said that, it's trivial to change the getindex call to a view in #1530 as well. Its composition characteristics are noted in the PR: you can give it tuples of arrays or arguments that generate data on the fly, get asynchronous iteration/loading, and control the memory footprint, while network transfers happen in the background. It already works with DataSets.jl, so you can load various formats the same as any image/array.

@CarloLucibello mentioned this issue Feb 15, 2022
@fpartl (Contributor) commented Apr 8, 2022

Please help! After updating Flux to the latest version, pxl-th's brilliant solution stopped working... what should I do?

@darsnack (Member) commented Apr 8, 2022

What error are you getting? I don't immediately see any reason that it shouldn't work. For reference, just defining:

function Base.getindex(d::Dataset{T}, i::Int) where {T}
    path = format(d.frame_template, i - 1)
    image = path |> FileIO.load |> Images.channelview .|> T
    image, d.targets[[i]]
end
Base.length(d::Dataset) = length(d.targets)

is all that is necessary for it to work.

@fpartl (Contributor) commented Apr 9, 2022

After removing the ~/.julia/compiled directory, my tests are green again... weird. Anyway, sorry for the spam, and thank you. 👍

Or maybe there were some fixes in the past 17 hours? I am not versioning my manifest, so I do not know.

@fpartl (Contributor) commented Jun 15, 2022

Seriously... pxl-th's solution does not work anymore.

using Flux

struct Dataset{T, N} <: AbstractArray{T, N}
    length::Int
end

Dataset{T}(length) where {T} = Dataset{T, 1}(length)

function Base.getindex(d::Dataset{T}, i::Int) where {T}
    image = rand(UInt8, 32, 32) # Just random image.
    target = rand(range(1, 10)) # and random class.
    image, target
end

function Base.getindex(d::Dataset{T}, ids::Array) where {T}
    x, y = d[ids[1]]
    xs_last_dim = ntuple(i -> Colon(), ndims(x))
    ys_last_dim = ntuple(i -> Colon(), ndims(y))

    xs = Array{T}(undef, size(x)..., length(ids))
    ys = Array{T}(undef, size(y)..., length(ids))

    xs[xs_last_dim..., 1] .= x
    ys[ys_last_dim..., 1] .= y

    for (i, id) in enumerate(ids[2:end])
        x, y = d[id]
        xs[xs_last_dim..., i + 1] .= x
        ys[ys_last_dim..., i + 1] .= y
    end
    xs, ys
end

Base.IndexStyle(::Type{Dataset}) = IndexLinear()
Base.size(d::Dataset) = (d.length,)
Base.length(d::Dataset) = d.length


dataset = Dataset{Float32}(10)
loader = Flux.Data.DataLoader(dataset, batchsize=4, shuffle=true)

for (xs, xy) in loader # Fails here!
    println("$i: $(size(xs)) $(size(ys))")
end

Using Flux v0.13.3, this MWE produces the following error.

ERROR: MethodError: Cannot `convert` an object of type Tuple{Matrix{UInt8}, Int64} to an object of type Float32

Seems like the DataLoader is trying to setindex! in some temporary Dataset. Any suggestions?

@darsnack (Member)

Could you post a complete stack trace and the output of Pkg status?

@fpartl (Contributor) commented Jun 16, 2022

Sure, sorry. Now I've removed the ~/.julia directory and created a fresh project with Flux only.

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_EDITOR = code

(test) pkg> st
      Status `~/Dokumenty/Projekty/BSU/test/Project.toml`
  [587475ba] Flux v0.13.3

julia> using Flux
       
       struct Dataset{T, N} <: AbstractArray{T, N}
           length::Int
       end
       
       Dataset{T}(length) where {T} = Dataset{T, 1}(length)
       
       ... # same as the code above
       
       dataset = Dataset{Float32}(10)
       loader = Flux.Data.DataLoader(dataset, batchsize=4, shuffle=true)
MLUtils.DataLoader{Dataset{Float32, 1}, Random._GLOBAL_RNG}(Float32[(UInt8[0xdc 0x69 … 0xe9 0xdf; 0x94 0xa1 … 0x8a 0x1e; … ; 0x7d 0x73 … 0x6b 0xbc; 0x74 0x5f … 0xcf 0x7e], 6), (UInt8[0x1b 0x95 … 0x65 0xb6; 0xb7 0x4a … 0x84 0x4b; … ; 0xf9 0xde … 0x99 0x75; 0xe1 0x81 … 0xac 0xfd], 7), (UInt8[0xa2 0x35 … 0x7f 0x2a; 0xc5 0x0d … 0xd7 0x87; … ; 0x81 0xd8 … 0x04 0xc6; 0x6f 0xa5 … 0xaa 0xab], 8), (UInt8[0x26 0x75 … 0xf0 0xf6; 0x93 0x7d … 0xe4 0x49; … ; 0x19 0x4d … 0x60 0x91; 0x79 0x48 … 0x39 0xc0], 5), (UInt8[0x3f 0x19 … 0x5e 0x78; 0x77 0x7c … 0x1c 0x12; … ; 0xa8 0x16 … 0x01 0xa6; 0x93 0x68 … 0x37 0x65], 3), (UInt8[0x99 0x25 … 0x3a 0x2a; 0x6c 0x10 … 0x3b 0xbb; … ; 0x5d 0xde … 0xbf 0x16; 0xe8 0xbe … 0x3c 0xfa], 2), (UInt8[0x3e 0xf4 … 0x3d 0x86; 0xda 0xce … 0x29 0x07; … ; 0x17 0xf6 … 0x09 0xc5; 0xf5 0x09 … 0x1c 0xcb], 5), (UInt8[0x94 0xa2 … 0x88 0x87; 0x8d 0xb1 … 0xfa 0x98; … ; 0xc3 0x2d … 0xc9 0x41; 0x50 0x0c … 0x86 0x88], 4), (UInt8[0xb6 0x8d … 0x15 0xe2; 0xd5 0xa5 … 0x01 0x46; … ; 0x45 0xde … 0x19 0xc2; 0xe5 0x3c … 0xe1 0x2d], 5), (UInt8[0x5f 0x69 … 0x9b 0xc5; 0x7d 0x6f … 0xbd 0xd8; … ; 0x75 0xb4 … 0xbb 0xa7; 0x82 0x19 … 0xe5 0x79], 1)], 4, 10, true, true, Random._GLOBAL_RNG())

julia> for (xs, xy) in loader
           println("$i: $(size(xs)) $(size(ys))")
       end
ERROR: MethodError: Cannot `convert` an object of type Tuple{Matrix{UInt8}, Int64} to an object of type Float32
Closest candidates are:
  convert(::Type{T}, ::LLVM.GenericValue, ::LLVM.LLVMType) where T<:AbstractFloat at ~/.julia/packages/LLVM/WjSQG/src/execution.jl:39
  convert(::Type{T}, ::LLVM.ConstantFP) where T<:AbstractFloat at ~/.julia/packages/LLVM/WjSQG/src/core/value/constant.jl:111
  convert(::Type{T}, ::Static.StaticFloat64) where T<:AbstractFloat at ~/.julia/packages/Static/KC67x/src/float.jl:22
  ...
Stacktrace:
  [1] setindex!(A::Vector{Float32}, x::Tuple{Matrix{UInt8}, Int64}, i1::Int64)
    @ Base ./array.jl:903
  [2] macro expansion
    @ ./multidimensional.jl:867 [inlined]
  [3] macro expansion
    @ ./cartesian.jl:64 [inlined]
  [4] _unsafe_getindex!
    @ ./multidimensional.jl:862 [inlined]
  [5] _unsafe_getindex
    @ ./multidimensional.jl:853 [inlined]
  [6] _getindex
    @ ./multidimensional.jl:839 [inlined]
  [7] getindex
    @ ./abstractarray.jl:1218 [inlined]
  [8] getobs
    @ ~/.julia/packages/MLUtils/OojOS/src/observation.jl:96 [inlined]
  [9] getobs(A::MLUtils.BatchView{SubArray{Float32, 1, Dataset{Float32, 1}, Tuple{Vector{Int64}}, false}, SubArray{Float32, 1, Dataset{Float32, 1}, Tuple{Vector{Int64}}, false}}, i::Int64)
    @ MLUtils ~/.julia/packages/MLUtils/OojOS/src/batchview.jl:105
 [10] (::MLUtils.var"#34#36")(i::Int64)
    @ MLUtils ./none:0
 [11] iterate
    @ ./generator.jl:47 [inlined]
 [12] iterate(d::MLUtils.DataLoader{Dataset{Float32, 1}, Random._GLOBAL_RNG})
    @ MLUtils ~/.julia/packages/MLUtils/OojOS/src/dataloader.jl:91
 [13] top-level scope
    @ ~/Dokumenty/Projekty/BSU/test/dataloader.jl:42

julia> 

For completeness, I am adding a Manifest and a Project file.
Manifest.toml
Project.toml

@darsnack (Member)

That's because you made Dataset{Float32} subtype AbstractArray{Float32, 1}, but getindex returns elements of type Tuple. Something that is an AbstractArray{T, N} should return elements of type T when indexed individually. So the error here comes from an improper implementation of the array interface.

Presumably, you don't actually want to make your dataset type an array type. You don't need to implement the AbstractArray interface for DataLoader to work, just the indexing interface, which is getindex and length.

If you do actually want your type to be an array, then you can implement the array interface correctly, and things should work. Furthermore, if you want your array type to be multidimensional, you can implement multidimensional getindex as you normally would.
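
To make the first suggestion concrete, here is a minimal sketch of the non-AbstractArray version, assuming Flux v0.13 with the MLUtils-backed DataLoader; the random data stands in for real file loading:

using Flux

struct LazyDataset
    length::Int
end

# one observation: (image, target)
Base.getindex(d::LazyDataset, i::Int) = (rand(Float32, 32, 32), rand(1:10))

# a batch of observations, returned as a vector of samples
Base.getindex(d::LazyDataset, ids::AbstractVector{<:Integer}) = [d[i] for i in ids]

Base.length(d::LazyDataset) = d.length

loader = Flux.DataLoader(LazyDataset(10), batchsize=4, shuffle=true)
for batch in loader
    # each `batch` is a Vector of (image, target) tuples; stack as needed
end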

Last thing, since this thread is originally about out-of-memory data: MLDatasets.jl includes some experimental types to make this simpler. Based on your use case, you want FileDataset. I'm away from the computer right now, but I can post an example snippet later.

@fpartl (Contributor) commented Jun 16, 2022

I understand this error message. I'm just saying that pxl-th's solution no longer works, and lazy loading now has to be done after the batch (containing only sample metadata) is returned by the loader in the training loop.
