Dataset files get written to the wrong directory #1105

Closed
@Documidas

Description

Looking at the documentation here:

/// <summary>
/// MNIST Dataset
/// http://yann.lecun.com/exdb/mnist/
/// </summary>
/// <param name="root">Root directory of dataset where the MNIST .gz data files exist.</param>
/// <param name="train">If true, creates dataset from the 'train' files, otherwise from the 't10k' files.</param>
/// <param name="download">
/// If true, downloads the dataset from the internet and puts it in root directory.
/// If the dataset is already downloaded, it is not downloaded again.
/// </param>
/// <param name="target_transform">A function/transform that takes in the target and transforms it.</param>
/// <returns>An iterable dataset.</returns>
public static Dataset MNIST(string root, bool train, bool download = false, torchvision.ITransform target_transform = null)

The expected behavior is that if I pass foobar/ as the value for root, the dataset should be downloaded to foobar/mnist/some-archive-name.tar.gz. The actual behavior is that the dataset archive gets downloaded into the current working directory, then the method throws an exception because it fails to find the archive file at foobar/mnist/some-archive-name.tar.gz.

The bug is due to a small typo in the DownloadFile() method, found here:

protected void DownloadFile(string file, string target, string baseUrl)
{
    var filePath = JoinPaths(target, file);
    var netPath = baseUrl.EndsWith('/') ? $"{baseUrl}{file}" : $"{baseUrl}/{file}";
    if (!File.Exists(filePath)) {
        lock (_httpClient) {
            using var s = _httpClient.GetStreamAsync(netPath).Result;
            using var fs = new FileStream(file, FileMode.CreateNew);
            s.CopyToAsync(fs).Wait();
        }
    }
}

On line 95 of the source file (the new FileStream(file, FileMode.CreateNew) call), DownloadFile() writes to the path stored in file; it should write to the path stored in filePath.
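To make the proposed fix concrete, here is a minimal, offline sketch of the write step with the joined path used for the FileStream. It is not the TorchSharp implementation: the class name DownloadSketch and the method SaveStream are hypothetical, JoinPaths is assumed to behave like Path.Combine, and the stream is passed in directly so the example does not need network access (in the real method it would come from _httpClient.GetStreamAsync(netPath).Result). The Directory.CreateDirectory call is also an assumption added so the sketch runs on its own.

```csharp
using System;
using System.IO;

public static class DownloadSketch
{
    // Hypothetical stand-in for the write step of DownloadFile().
    // Assumption: JoinPaths(target, file) behaves like Path.Combine(target, file).
    public static string SaveStream(Stream source, string file, string target)
    {
        var filePath = Path.Combine(target, file);

        // Assumption for this sketch only: make sure the target directory exists.
        Directory.CreateDirectory(target);

        if (!File.Exists(filePath)) {
            // The fix: open the stream on filePath (target + file),
            // not on the bare file name, which resolves against the working directory.
            using var fs = new FileStream(filePath, FileMode.CreateNew);
            source.CopyTo(fs);
        }

        return filePath;
    }
}
```

Calling DownloadSketch.SaveStream(stream, "train-images-idx3-ubyte.gz", Path.Combine("foobar", "mnist")) would place the archive under foobar/mnist rather than in the current working directory.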

DownloadFile() is called here:

DownloadFile("train-images-idx3-ubyte.gz", sourceDir, baseUrl);

Assuming Download() was called with root = "foobar" and dataset = "mnist":

  1. Download() sets datasetPath = "foobar/mnist" and sourceDir = datasetPath
  2. DownloadFile() will be called with file = "train-images-idx3-ubyte.gz" and target = sourceDir = "foobar/mnist"
  3. Because of the typo, DownloadFile() writes to train-images-idx3-ubyte.gz instead of foobar/mnist/train-images-idx3-ubyte.gz
  4. DecompressFile() is later called with file = "train-images-idx3-ubyte" and sourceDir = "foobar/mnist"
  5. DecompressFile() expects the target file to be at foobar/mnist/train-images-idx3-ubyte.gz, but the file is actually at $(pwd)/train-images-idx3-ubyte.gz
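The mismatch in the steps above can be reproduced without TorchSharp by comparing how the two paths resolve. This is only an illustrative sketch: PathMismatchSketch and its two methods are hypothetical names, and JoinPaths is again assumed to behave like Path.Combine.

```csharp
using System;
using System.IO;

public static class PathMismatchSketch
{
    // Where the buggy FileStream call writes: a bare file name
    // resolves against the current working directory.
    public static string ActualLocation(string file) =>
        Path.GetFullPath(file);

    // Where DecompressFile() looks for the archive, assuming
    // JoinPaths behaves like Path.Combine: root/dataset/file.
    public static string ExpectedLocation(string root, string dataset, string file) =>
        Path.GetFullPath(Path.Combine(root, dataset, file));
}
```

With root = "foobar" and dataset = "mnist", ActualLocation("train-images-idx3-ubyte.gz") resolves to a path directly under the working directory, while ExpectedLocation("foobar", "mnist", "train-images-idx3-ubyte.gz") resolves to a path under foobar/mnist, so the two never match.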

I have not signed a CLA with Microsoft, the .NET Foundation, or the TorchSharp repository, so I'm reporting this issue as-is rather than opening a pull request. From what I've found, I believe this bug can be fixed with a four-letter change: replace file with filePath in the FileStream constructor call inside the DownloadFile() method.
