Provide a way to append/concatenate multiple IDataViews #4005

Open
@nicolehaugen

Description

System information

  • ML.NET - 1.2.0

Issue

There should be a way to append or concatenate multiple IDataViews together.

Here's the scenario:
The new ranking sample needs to train the model using two datasets that are each loaded from a separate text file and have the same schema. Specifically, there is (1) a training dataset and (2) a validation dataset that need to be combined. For example, see step 3 in the steps outlined below, which the sample is based on.

Here are the steps shown in the sample. Generally, the pattern to train, validate, and test a model includes the following steps:

  1. The model is trained on the training dataset. The model's metrics are then evaluated using the validation dataset.
  2. Step 1 is repeated by retraining and reevaluating the model until the desired metrics are achieved. The outcome of this step is a pipeline that applies the necessary data transformations and trainer.
  3. The pipeline is used to train on the combined training + validation datasets. The model's metrics are then evaluated on the testing dataset (exactly once) -- this is the final set of metrics used to measure the model's quality.
  4. The final step is to retrain the pipeline on all of the combined training + validation + testing datasets. This model is then ready to be deployed into production.

Today, to achieve this, the sample has to first load the data from each text file and then create an enumerable from each IDataView so that the datasets can be concatenated. This process would be greatly simplified if you could append/concatenate two IDataViews together:


// Requires: using System.Linq; using Microsoft.ML;

// Load training data (has a header)
IDataView trainData = mlContext.Data.LoadFromTextFile<SearchResultData>(TrainDatasetPath, separatorChar: '\t', hasHeader: true);

// Load validation data (no header)
IDataView validationData = mlContext.Data.LoadFromTextFile<SearchResultData>(ValidationDatasetPath, separatorChar: '\t', hasHeader: false);

// Combine the training and validation datasets by round-tripping through IEnumerable<T>.
var validationDataEnum = mlContext.Data.CreateEnumerable<SearchResultData>(validationData, reuseRowObject: false);
var trainDataEnum = mlContext.Data.CreateEnumerable<SearchResultData>(trainData, reuseRowObject: false);
var trainValidationDataEnum = validationDataEnum.Concat<SearchResultData>(trainDataEnum);
IDataView trainValidationData = mlContext.Data.LoadFromEnumerable<SearchResultData>(trainValidationDataEnum);
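
For illustration, the workaround above could be wrapped in a small helper so that callers do not repeat the enumerable round trip. This is only a sketch: `DataViewConcatExtensions` and `ConcatRows` are hypothetical names that do not exist in ML.NET, and the approach assumes every source view can be mapped to the same row class.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical helper, not part of ML.NET: concatenates the rows of several
// IDataViews by converting each one to IEnumerable<TRow> and back.
public static class DataViewConcatExtensions
{
    public static IDataView ConcatRows<TRow>(
        this DataOperationsCatalog data, params IDataView[] sources)
        where TRow : class, new()
    {
        // reuseRowObject: false so that each row is returned as a distinct object.
        IEnumerable<TRow> rows = sources.SelectMany(
            view => data.CreateEnumerable<TRow>(view, reuseRowObject: false));

        // Rows appear in the order the source views were passed in.
        return data.LoadFromEnumerable(rows);
    }
}
```

With this helper in scope, the combine step would read `IDataView trainValidationData = mlContext.Data.ConcatRows<SearchResultData>(trainData, validationData);`. It still pays the cost of materializing typed row objects, which is what a public row-appending IDataView would avoid.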

NOTE: I also considered creating a text loader to load multiple text files (as described [here](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.data.textloader.load?view=ml-dotnet#Microsoft_ML_Data_TextLoader_Load_Microsoft_ML_Data_IMultiStreamSource_)); however, one of the data files included a header while the other didn't. It looks like the file headers must be consistent across files in order to create a TextLoader for multiple files.
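
As a rough sketch of that alternative, a single TextLoader can be pointed at several files through a `MultiFileSource`. The wrapper class `MultiFileLoading` and method `LoadAll` are hypothetical names used only for this example. Because `hasHeader` is set once on the loader, it applies to every file, which is why this route does not fit the scenario above.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical wrapper for illustration; only the calls inside it are ML.NET APIs.
public static class MultiFileLoading
{
    public static IDataView LoadAll<TRow>(
        MLContext mlContext, bool hasHeader, params string[] paths)
    {
        // One loader configuration (separator, header setting) is shared by all files.
        TextLoader loader = mlContext.Data.CreateTextLoader<TRow>(
            separatorChar: '\t', hasHeader: hasHeader);

        // MultiFileSource presents the files to the loader as one multi-stream source.
        return loader.Load(new MultiFileSource(paths));
    }
}
```

This only helps when all files agree on whether they start with a header row; mixing a file with a header and one without still requires the enumerable workaround.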

Source code / logs

Note that there is an internal class today that provides the ability to append rows - we should consider exposing this publicly:

/// <summary>
/// This class provides the functionality to combine multiple IDataView objects which share the same schema
/// All sources must contain the same number of columns and their column names, sizes, and item types must match.
/// The row count of the resulting IDataView will be the sum over that of each individual.
///
/// An AppendRowsDataView instance is shuffleable iff all of its sources are shuffleable and their row counts are known.
/// </summary>
[BestFriend]
internal sealed class AppendRowsDataView : IDataView

Metadata

Labels

  • P2: Priority of the issue for triage purpose: Needs to be fixed at some point.
  • enhancement: New feature or request
