
Add pytorch style dataloader #463


Merged
merged 45 commits into from
Jan 12, 2022

Conversation

dayo05
Contributor

@dayo05 dayo05 commented Nov 27, 2021

In PyTorch there is DataLoader in torch.utils.data; TorchSharp has DataIterator in TorchSharp.Data.

In PyTorch, creating a Dataset is easy:

import os
import pandas as pd
from torch.utils.data import Dataset
from torchvision.io import read_image

class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

This is sample code from the official PyTorch tutorial.
With DataIterator, I don't know how to do the equivalent; it is not similar to the PyTorch style, and I could not find a tutorial about it.

So I made a DataLoader that is used like this.
The style is very similar to what PyTorch does.

This will make iterating over data much easier.

@nietras
Contributor

nietras commented Nov 29, 2021

This is "hard-coded" to only two tensors, which doesn't fit general scenarios. We have scenarios with many input and many output tensors. In Python this is handled by the dynamic type system and being able to handle tuples etc., I think :) This cannot be mirrored with a good experience in C#, in my view. The best generalization is to base this on Dictionary<string, Tensor> (or Dictionary<object, Tensor> if generalizing the key). Otherwise we need to go down a generic, reflection-based path, which is problematic in my view.

Sorry, editing an old comment since DataLoader is present here, but I wanted to say that since we need to batch, we need to be able to iterate over Tensors and produce the same structured output, which is why Dictionary<string, Tensor> seems apt.

However, we would then also very much like to be able to run data loading threaded, which involves handling reproducibility in the face of global variables, ensuring randomized sampling is reproducible. So shuffling should be fully reproducible and seeded; hence, the random generator should be created in the ctor. While we are at it, do a Fisher-Yates shuffle, and don't use object indices but statically typed ones. Hence, it should be GetTensor(int index). Or this could probably be Dictionary<string, Tensor> Get(int index), or even just an indexer.

I also don't think Dataset should have a GetDataEnumerable(); Dataset should not be iterable, only indexable, since that is how DataLoader can do parallel, shuffled loading.

I had been thinking about adding this myself, so it's very nice to see it being worked on 👍 However, as I see it, there are some fundamental issues regarding multi-threaded loading and reproducibility that need to be taken into account.
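
The seeded, ctor-created Fisher-Yates shuffle described here can be sketched as follows (Python for illustration; the function name is mine, not part of the PR):

```python
import random

def fisher_yates_indices(n, seed):
    """Return a reproducibly shuffled list of dataset indices [0, n)."""
    rng = random.Random(seed)        # RNG created once, up front ("at ctor")
    idx = list(range(n))
    for i in range(n - 1, 0, -1):    # classic Fisher-Yates: swap down from the end
        j = rng.randint(0, i)
        idx[i], idx[j] = idx[j], idx[i]
    return idx
```

Because the RNG is local and seeded, two loaders constructed with the same seed visit the dataset in the same order, which is what makes threaded loading reproducible.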

Contributor

@nietras nietras left a comment


I've made some review comments too. :) I have more, but these are just some quick remarks.

@dayo05
Contributor Author

dayo05 commented Nov 30, 2021

TODO: Write tests for it

@NiklasGustafsson
Contributor

This is such an important addition, and we must make sure to get it right. I would like to see some more extensive code examples on how to use this -- the user experience is essential.

@dayo05
Contributor Author

dayo05 commented Nov 30, 2021

I'm making a new example reflecting the review comments.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Dec 1, 2021

I'm going to add a few things to the code comments in a review, but I have some questions:

  1. Should we allow data loaders to accept custom shufflers? That is, if I think (I'd probably be wrong) that I have a better shuffle algorithm, shouldn't I be allowed to use that, instead?
  2. There's no way to specify the device on which the 'Current' tensors end up.
  3. Why does 'Current' return a dictionary? Is it input vs. labels?
  4. How do I split a data set into train, test, and validation data subsets?
  5. How do we add data augmentation support?
  6. When the dataset is not shuffled, we're still creating fresh batches every time through the data set. It would be great to think of some way to keep data in memory between epochs, as long as it fits, of course.

Also, it would be great to demonstrate how to use this by converting the Examples to use this instead of the ad hoc loading I added earlier.
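
On question 4: one common approach, independent of any particular Dataset type, is to split at the index level and feed each index subset to its own loader. A minimal Python sketch (illustrative only, not part of this PR):

```python
import random

def split_indices(n, fractions=(0.8, 0.1, 0.1), seed=0):
    """Partition dataset indices [0, n) into train/val/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # seeded, so splits are reproducible
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

An index-based Dataset can then serve each subset without copying any data.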

@NiklasGustafsson
Contributor

@dayo05 -- thanks for all the work so far. I'll be taking some time off in December, but I'm looking forward to integrating this when I come back.

Thanks,

Niklas

@dayo05
Contributor Author

dayo05 commented Dec 10, 2021

I'm busy now; I'm going to resume work two weeks from now :(

@dayo05
Contributor Author

dayo05 commented Dec 25, 2021

@NiklasGustafsson I finished all my work, please review.

@dayo05
Contributor Author

dayo05 commented Dec 25, 2021

I found that the shuffler is not working correctly. I'll fix it ASAP.

@dayo05
Contributor Author

dayo05 commented Dec 25, 2021

This is an example of classifying the fruits360 dataset with ImageSharp. Right now the shuffler is not working well, so the loss is large, but if I use the previous shuffler (from before I made the ShuffleGenerator class), it works fine.

public class Fruits360: Dataset
{
    private List<string> Labels = new();
    private List<string> images = new();
    public Fruits360(bool isTrain, Device device)
    {
        var root = "/home/dayo/datasets/fruits-360/" + (isTrain ? "Training" : "Test");
        //Labels.AddRange(Directory.GetDirectories(root));
        foreach(var x in Directory.GetDirectories(root))
            images.AddRange(Directory.GetFiles(x));
        Labels.AddRange(Directory.GetDirectories(root).Select(x => x.Split('/')[^1]));
    }

    public override long Count => images.Count;

    public override Dictionary<string, torch.Tensor> GetTensor(long index)
    {
        var image = Image.Load<Rgb24>(images[(int) index], new JpegDecoder());
        using var r = tensor(image.GetPixelMemoryGroup()[0].Span.ToArray().Select(x => x.R / 255.0f).ToList(),
            new long[] {1, 100, 100});
        using var g = tensor(image.GetPixelMemoryGroup()[0].Span.ToArray().Select(x => x.G / 255.0f).ToList(),
            new long[] {1, 100, 100});
        using var b = tensor(image.GetPixelMemoryGroup()[0].Span.ToArray().Select(x => x.B / 255.0f).ToList(),
            new long[] {1, 100, 100});
        return new()
        {
            {"image", cat(new List<Tensor> {r, g, b}, 0)},
            {"label", tensor(Labels.IndexOf(images[(int)index].Split('/')[^2]), ScalarType.Int64)}
        };
    }
}
using var trainDataset = new Fruits360(true, CUDA);
using var testDataset = new Fruits360(false, CUDA);
using var train = new DataLoader(trainDataset, 256, true, CUDA);
using var test = new DataLoader(testDataset, 512, false, CUDA);

var model = new Fruits360Model(CUDA);
var optimizer = optim.Adam(model.parameters(), learningRate: 0.01);

foreach (var epoch in Range(1, 1000))
{
    model.Train();
    var batchId = 1;
    Console.WriteLine($"Epoch{epoch} running");
    foreach (var x in train)
    {
        optimizer.zero_grad();

        var prediction = model.forward(x["image"]);
        var output = functional.nll_loss(reduction: Reduction.Mean)(prediction, x["label"]);
        
        output.backward();
        optimizer.step();
        
        Console.Write($"\rTrain: epoch {epoch} {batchId * 1.0 / train.Count:P2} [{batchId} / {train.Count}] Loss: {output.ToSingle():F9}");
        batchId++;
        
        prediction.Dispose();
        output.Dispose();
        GC.Collect();
    }

    using (no_grad())
    {
        model.Eval();
        
        var testLoss = 0.0;
        var correct = 0;
        var total = 0L;
        var idx = 0;
        foreach (var x in test)
        {
            idx++;
            Console.Write($"\rTest running: {idx * 1.0 / test.Count:P2}");
            var prediction = model.forward(x["image"]);
            var output = functional.nll_loss(reduction: Reduction.Sum)(prediction, x["label"]);
            testLoss += output.ToSingle();

            var pred = prediction.argmax(1);
            total += pred.size()[0];
            correct += pred.eq(x["label"]).sum().ToInt32();
            pred.Dispose();
            prediction.Dispose();
            output.Dispose();
            GC.Collect();
        }
        Console.WriteLine(
            $"\rTest set: Average loss {(testLoss / testDataset.Count):F9} | Accuracy {((double) correct / testDataset.Count):P2}");
    }
}


class Fruits360Model : Module
{
    private Module layer1 = Sequential(
        Conv2d(3, 32, 3),
        ReLU(),
        MaxPool2d(2, 2));

    private Module layer2 = Sequential(
        Conv2d(32, 64, 3),
        ReLU(),
        MaxPool2d(2, 2));

    private Module layer3 = Sequential(
        Conv2d(64, 64, 3),
        ReLU(),
        MaxPool2d(2, 2));

    private Module fc = Sequential(
        Flatten(),
        Linear(6400, 1024),
        ReLU(),
        Dropout(),
        Linear(1024, 625),
        ReLU(),
        Linear(625, 131));
    public Fruits360Model(Device? device) : base("fruits360")
    {
        RegisterComponents();
        to(device ?? CPU);
    }

    public override Tensor forward(Tensor t)
    {
        t = layer1.forward(t);
        t = layer2.forward(t);
        t = layer3.forward(t);
        t = fc.forward(t);
        return LogSoftmax(1).forward(t); // log-softmax over the class dimension, not the batch dimension
    }
}

@dayo05
Contributor Author

dayo05 commented Dec 27, 2021

I found that this shuffler depends on the seed. So I'll change to another shuffler and make it possible to use a custom shuffler by implementing IEnumerable.

@NiklasGustafsson
Contributor

@dayo05 -- A couple of comments:

  1. This looks really good now. I will be happy to approve this later.
  2. There seems to be some problem with the Azure pipelines that are doing the builds for MacOS. We'll have to wait for that to go away before merging.
  3. Before we merge, I would like to see at least one of the examples in this repo modified to use this API instead of the version I put together for temporary purposes.

@dayo05
Contributor Author

dayo05 commented Jan 7, 2022

I made the Fisher-Yates shuffler the default because I thought the default should be beginner-friendly, and the previous shuffler is not good for beginners with smaller datasets. But I kept the previous shuffler, which can be requested like this:

var data = new DataLoader(dataset, batchSize, new FastShuffler(dataset.Count), torch.CUDA);
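
The pluggable-shuffler idea above boils down to the loader consuming a shuffler as a stream of indices and grouping them into batches. A minimal Python sketch of that contract (illustrative only; the function name is mine, not the PR's API):

```python
def batches(shuffler, batch_size):
    """Group an index stream from any shuffler into batches.

    'shuffler' can be any iterable of dataset indices: a Fisher-Yates
    permutation, a FastShuffler-style stream, or plain range(n) when
    shuffling is disabled.
    """
    batch = []
    for idx in shuffler:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

With this shape, swapping shufflers never touches the batching or loading code.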

@dayo05
Contributor Author

dayo05 commented Jan 7, 2022

https://github.com/dayo05/DataLoaderExample/tree/master
I wrote example usage for this commit.
See Fruits360.cs and Program.cs.

@NiklasGustafsson
Contributor

https://github.com/dayo05/DataLoaderExample/tree/master I wrote example usage for this commit. See Fruits360.cs and Program.cs.

That's great. Could you take the MNIST example in this repo and convert it to use your API?

@dayo05
Contributor Author

dayo05 commented Jan 10, 2022

@NiklasGustafsson I updated it here: https://github.com/dayo05/DataLoaderExample/tree/master
It is in MNIST.cs and MNISTReader.cs.
I adapted it from the TorchSharp main repo source.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Jan 11, 2022

I've been trying to restart the MacOS builds for a couple of days now. I have no idea why they are failing, but it happens very early, way before it gets to building TorchSharp.

Update: @dayo05 -- the 'main' branch still builds just fine, so there must be something in your PR. I suspect it may have something to do with your changing the SDK version number, but I'm not sure.

@dayo05
Contributor Author

dayo05 commented Jan 12, 2022

@NiklasGustafsson I fixed it.

@NiklasGustafsson
Contributor

Okay, I'm going to merge this now. Let's still get some of the examples in this repository using the DataLoader API.

@NiklasGustafsson NiklasGustafsson merged commit 25d6ad2 into dotnet:main Jan 12, 2022