Description
System information
- ML.NET version: 1.2.0
Issue
There should be a way to append or concatenate multiple IDataViews together.
Here's the scenario:
The new ranking sample needs to train the model using two datasets that are each loaded from a separate text file and have the same schema: specifically, a (1) training dataset and a (2) validation dataset that need to be combined. For example, see step #3 in the steps outlined below, on which the sample is based.
Here are the steps shown in the sample. Generally, the pattern to train, validate, and test a model includes the following steps:
1. The model is trained on the training dataset. The model's metrics are then evaluated using the validation dataset.
2. Step #1 is repeated by retraining and reevaluating the model until the desired metrics are achieved. The outcome of this step is a pipeline that applies the necessary data transformations and trainer.
3. The pipeline is used to train on the combined training + validation datasets. The model's metrics are then evaluated on the testing dataset (exactly once); this is the final set of metrics used to measure the model's quality.
4. The final step is to retrain the pipeline on all of the combined training + validation + testing datasets. This model is then ready to be deployed into production.
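To make it concrete where the missing append operation is needed, the steps above can be sketched roughly as follows. This is only an illustration, not the sample's code: `TrainValidateTestSketch`, `Run`, and the `concat` and `evaluate` delegates are hypothetical names, and `concat` stands in for whatever mechanism combines two IDataViews.

```csharp
using System;
using Microsoft.ML;

public static class TrainValidateTestSketch
{
    // Hypothetical helper: walks through the train / validate / test pattern once.
    public static ITransformer Run(
        IEstimator<ITransformer> pipeline,
        IDataView train,
        IDataView validation,
        IDataView test,
        Func<IDataView, IDataView, IDataView> concat, // assumption: caller-supplied append operation
        Action<string, IDataView> evaluate)           // assumption: caller-supplied metric reporting
    {
        // Train on the training data and evaluate on the validation data.
        // In practice this is repeated while the pipeline is being tuned.
        evaluate("validation", pipeline.Fit(train).Transform(validation));

        // Train on training + validation, then evaluate on the test data exactly once.
        IDataView trainValidation = concat(train, validation);
        evaluate("test", pipeline.Fit(trainValidation).Transform(test));

        // Retrain on all of the data; this is the model that would be deployed.
        return pipeline.Fit(concat(trainValidation, test));
    }
}
```

Both calls to `concat` are places where a built-in way to append IDataViews would remove the need for a workaround.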
Today, to combine the datasets, the sample has to first load the data from each text file and then create an enumerable from each IDataView so that the rows can be concatenated. This process would be greatly simplified if you could append/concatenate two IDataViews together:
//Load training data (has a header)
IDataView trainData = mlContext.Data.LoadFromTextFile<SearchResultData>(TrainDatasetPath, separatorChar: '\t', hasHeader: true);
//Load validation data (no header)
IDataView validationData = mlContext.Data.LoadFromTextFile<SearchResultData>(ValidationDatasetPath, separatorChar: '\t', hasHeader: false);
// Combine the training and validation datasets.
var validationDataEnum = mlContext.Data.CreateEnumerable<SearchResultData>(validationData, false);
var trainDataEnum = mlContext.Data.CreateEnumerable<SearchResultData>(trainData, false);
var trainValidationDataEnum = validationDataEnum.Concat<SearchResultData>(trainDataEnum);
IDataView trainValidationData = mlContext.Data.LoadFromEnumerable<SearchResultData>(trainValidationDataEnum);
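Until something like this is available, the same workaround could be wrapped in a small reusable helper. The sketch below is an assumption about how that might look, not an ML.NET API: `DataViewConcatSketch` and `ConcatRows` are hypothetical names. It uses only `CreateEnumerable` and `LoadFromEnumerable`, as in the snippet above, and needs `System.Linq`.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class DataViewConcatSketch
{
    // Hypothetical helper (not part of ML.NET): concatenates data views that can
    // all be mapped to the same row type by round-tripping through IEnumerable<TRow>.
    public static IDataView ConcatRows<TRow>(MLContext mlContext, params IDataView[] views)
        where TRow : class, new()
    {
        // reuseRowObject: false so that each row is a distinct object.
        // Rows are yielded in the order the views are passed in.
        IEnumerable<TRow> rows = views.SelectMany(
            view => mlContext.Data.CreateEnumerable<TRow>(view, reuseRowObject: false));

        return mlContext.Data.LoadFromEnumerable(rows);
    }
}
```

With this in place, the combining step would become `DataViewConcatSketch.ConcatRows<SearchResultData>(mlContext, validationData, trainData)`. It still converts every row to an object and back, which is the overhead a native append would avoid.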
NOTE: I also considered creating a text loader to load multiple text files (as described [here](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.data.textloader.load?view=ml-dotnet#Microsoft_ML_Data_TextLoader_Load_Microsoft_ML_Data_IMultiStreamSource_)); however, one of the data files included a header while the other didn't. It looks like a TextLoader that reads multiple files requires the headers to be consistent across files.
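For reference, the multi-file approach would look roughly like the sketch below when every file does follow the same header convention. `MultiFileLoadSketch` and `LoadAll` are hypothetical names, and the row type is assumed to carry `[LoadColumn]` attributes. The point is that a single `hasHeader` value is set on the loader, so there is no per-file setting.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public static class MultiFileLoadSketch
{
    // Hypothetical helper: loads several tab-separated files into one IDataView.
    // The hasHeader flag is configured once on the loader and applies to every path.
    public static IDataView LoadAll<TRow>(MLContext mlContext, bool hasHeader, params string[] paths)
    {
        TextLoader loader = mlContext.Data.CreateTextLoader<TRow>(
            separatorChar: '\t',
            hasHeader: hasHeader);

        // Load(params string[]) treats the paths as a single multi-file source.
        return loader.Load(paths);
    }
}
```

In the scenario above this does not help directly, since one file has a header and the other does not.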
Source code / logs
Note that there is a method today that provides the ability to append rows - we should consider exposing this publicly:
machinelearning/src/Microsoft.ML.Data/DataView/AppendRowsDataView.cs (lines 23 to 31 in 70ef7ec)