As part of #702's "Fix TextLoader" task, I was looking into improving ML.NET's TextLoader support for CSV, and I opened the PR dotnet/machinelearning#5125. As explained there, ML.NET already supports loading regular CSV files; the only thing it couldn't load was newline characters inside quoted fields (which the PR fixes).
So, of the datasets mentioned in #702 under "Fix TextLoader", ML.NET couldn't open only those with newlines inside quoted fields: some of the Airbnb datasets and some of the jigsaw datasets.
On the other hand, ML.NET was able to load all the other datasets even without the fix in my PR, i.e.:
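To illustrate what "newlines inside quoted fields" means, here is a minimal Python sketch (not ML.NET code; the sample data and function names are made up for illustration). It contrasts splitting on line breaks first with letting a CSV parser decide where each record ends:

```python
import csv
import io

# Hypothetical sample: the quoted field on the first data row contains a line break.
SAMPLE = 'id,comment\n1,"first line\nsecond line"\n2,"plain"\n'


def naive_rows(text):
    """Split on line breaks first, ignoring quotes entirely."""
    return [line.split(",") for line in text.splitlines()]


def quoted_rows(text):
    """Let a CSV parser decide where each record ends."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print(len(naive_rows(SAMPLE)))   # the quoted field is cut in two
    print(len(quoted_rows(SAMPLE)))  # header plus two records
```

The naive reader sees one physical line too many, because it breaks the quoted field apart; the quote-aware reader keeps it as a single record.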
- The remaining datasets from Airbnb (including the one from "TSV Dataset with commas doesn't work" #327) and jigsaw
- text-emotion (EDIT: I can actually load this file in ML.NET AND in ModelBuilder, so I don't know why it was included in "ML.NET Asks [Tracking]" #702)
- sentiment140
- titanic train.csv (from "CSV or TSV Dataset with double quotes doesn't work" #452)
Despite this, ModelBuilder is unable to use them and fails with the following error message:
Unrecognized data format. Please check the input file to make sure it is a valid comma or tab separated file
at Microsoft.ML.ModelBuilder.DataSources.FileDataSource.GetCorrectDelimiter(String selectedFileName)
at Microsoft.ML.ModelBuilder.DataSources.FileDataSource.GetListOfColumns(String selectedFileName)
at Microsoft.ML.ModelBuilder.ToolWindows.DataTabDataContext.GetDataLoadDimensions()
at Microsoft.ML.ModelBuilder.ToolWindows.TextDataControl.SelectFileButton_Click(Object sender, RoutedEventArgs e)
These datasets don't include newlines inside quoted fields, so this is a separate issue.
After experimenting, I found that deleting the commas inside quoted fields only in the first data line (the one right after the header) was enough for ModelBuilder to load the file and work with it, even if other lines still had commas inside quoted fields.
After getting the generated code from ModelBuilder, I ran the training code against the original datasets (without the deleted commas), and it all worked fine in ML.NET. This worked even without the changes in my PR, which means the problem has never been in ML.NET's TextLoader.
I guess the problem is in ModelBuilder (or perhaps in AutoML.NET?), somewhere the file format is checked by looking only at the first row, and that check mistakenly treats commas inside quoted fields as invalid.
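As a purely hypothetical illustration of this kind of bug (I haven't seen the actual `GetCorrectDelimiter` code, so this is an assumption about how such a check could go wrong, written in Python with made-up sample data and function names), compare a check that counts raw delimiters in the header and first data row with one that parses the quotes first:

```python
import csv
import io

# Hypothetical sample: comma-delimited, with commas inside a quoted field
# on the first data row only.
SAMPLE = 'id,name,city\n1,"Doe, John, Jr.",Seattle\n2,Jane,Boston\n'


def naive_is_consistent(text, delimiter=","):
    """Compare raw delimiter counts in the header and the first data row."""
    header, first_row = text.splitlines()[:2]
    return header.count(delimiter) == first_row.count(delimiter)


def quote_aware_is_consistent(text, delimiter=","):
    """Compare parsed field counts, so delimiters inside quotes are ignored."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader)
    first_row = next(reader)
    return len(header) == len(first_row)


if __name__ == "__main__":
    print(naive_is_consistent(SAMPLE))        # rejects a valid file
    print(quote_aware_is_consistent(SAMPLE))  # accepts it
```

A check like the naive one would match what I observed: the file is rejected only when the first data row has commas inside quotes, and removing them from that one row is enough to get past it.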
Please, let me know if there are still reasons to believe that this is a problem outside ModelBuilder/AutoML.NET (and perhaps particularly in TextLoader), so that I can try to look into it asap. Thanks! 😄