Provide a way to append/concatenate multiple IDataViews #4005

Open

Description

System information

  • ML.NET version: 1.2.0

Issue

There should be a way to append or concatenate multiple IDataViews together.

Here's the scenario:
The new ranking sample needs the ability to train the model using two datasets that are each loaded from a separate text file and have the same schema - specifically, there is a (1) Training dataset and (2) Validation dataset, that need to be combined. For example, refer to step #3 in the steps outlined below which the sample is based on.

Here are the steps shown in the sample. In general, the pattern to train, validate, and test a model includes the following steps:

  1. The model is trained on the training dataset. The model's metrics are then evaluated using the validation dataset.
  2. Step 1 is repeated by retraining and reevaluating the model until the desired metrics are achieved. The outcome of this step is a pipeline that applies the necessary data transformations and trainer.
  3. The pipeline is used to train on the combined training + validation datasets. The model's metrics are then evaluated on the testing dataset (exactly once) -- this is the final set of metrics used to measure the model's quality.
  4. The final step is to retrain the pipeline on all of the combined training + validation + testing datasets. This model is then ready to be deployed into production.

Today, to achieve this, the sample has to load each dataset from its text file, convert each IDataView to an enumerable, concatenate the enumerables, and load the result back into an IDataView. This process would be greatly simplified if two IDataViews could be appended or concatenated directly:


//Load training data (has a header)
IDataView trainData = mlContext.Data.LoadFromTextFile<SearchResultData>(TrainDatasetPath, separatorChar: '\t', hasHeader: true);

//Load validation data (no header)
IDataView validationData = mlContext.Data.LoadFromTextFile<SearchResultData>(ValidationDatasetPath, separatorChar: '\t', hasHeader: false);

// Combine the training and validation datasets.
var validationDataEnum = mlContext.Data.CreateEnumerable<SearchResultData>(validationData, false);
var trainDataEnum = mlContext.Data.CreateEnumerable<SearchResultData>(trainData, false);
var trainValidationDataEnum = validationDataEnum.Concat<SearchResultData>(trainDataEnum);
IDataView trainValidationData = mlContext.Data.LoadFromEnumerable<SearchResultData>(trainValidationDataEnum);
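
The workaround above can be wrapped in a small generic helper so that it is not repeated for each pair of datasets. The following is only a sketch: `DataViewConcatSketch`, `ConcatRows`, and `DemoRow` are hypothetical names introduced for illustration and are not part of ML.NET. It relies solely on the public `CreateEnumerable` and `LoadFromEnumerable` methods already used in the sample, and it assumes every input view can be mapped to the same row type.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type, used only to make this sketch self-contained.
public class DemoRow
{
    public float Value { get; set; }
}

// Hypothetical helper; not an ML.NET API.
public static class DataViewConcatSketch
{
    // Appends the rows of each view, in argument order, into one IDataView.
    // Assumes all views can be read as TRow (same schema).
    public static IDataView ConcatRows<TRow>(MLContext mlContext, params IDataView[] views)
        where TRow : class, new()
    {
        // Read every view as strongly typed rows and materialize them.
        // This keeps all rows in memory, which may not suit large datasets.
        List<TRow> rows = views
            .SelectMany(view => mlContext.Data.CreateEnumerable<TRow>(view, reuseRowObject: false))
            .ToList();

        // Wrap the combined rows back into an IDataView.
        return mlContext.Data.LoadFromEnumerable(rows);
    }
}
```

With such a helper, the combination step in the sample might reduce to `DataViewConcatSketch.ConcatRows<SearchResultData>(mlContext, validationData, trainData)`. This still round-trips through in-memory objects, so it does not replace the built-in append operation this issue asks for.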

NOTE: I also considered creating a text loader to load multiple text files (as described [here](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.data.textloader.load?view=ml-dotnet#Microsoft_ML_Data_TextLoader_Load_Microsoft_ML_Data_IMultiStreamSource_)); however, one of the data files included a header while the other didn't. It looks like a TextLoader for multiple files requires the header setting to be consistent across all files.

Source code / logs

Note that there is an internal class today that provides the ability to append rows. We should consider exposing it publicly:

/// <summary>
/// This class provides the functionality to combine multiple IDataView objects which share the same schema
/// All sources must contain the same number of columns and their column names, sizes, and item types must match.
/// The row count of the resulting IDataView will be the sum over that of each individual.
///
/// An AppendRowsDataView instance is shuffleable iff all of its sources are shuffleable and their row counts are known.
/// </summary>
[BestFriend]
internal sealed class AppendRowsDataView : IDataView

Labels: P2 (needs to be fixed at some point), enhancement (new feature or request)