Description
System information
- OS version/distro: macOS 10.14.6
- .NET Version (eg., dotnet --info): 5.0.101
Issue
SVMLightLoader fails when you load >128 dense rows.
Internally, the feature column is represented in sparse format when its sparsity is >0.25, and in dense format otherwise. SVMLightLoader works if either the column is sparse (many missing values) or the number of rows is < 128.
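To exercise the dense path described above without the original dataset, a small synthetic file may be enough. The following is a minimal sketch in Python that builds fully dense SVM Light rows. The function name, the row count of 200, and the feature count of 10 are illustrative assumptions, not taken from the repro project, and it has not been confirmed that this exact input triggers the failure.

```python
import random

def make_dense_svmlight(num_rows, num_features, seed=0):
    """Return SVM Light lines in which every feature index is present (dense rows).

    Hypothetical helper for building a repro input; not part of ML.NET.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(num_rows):
        label = rng.choice([-1, 1])
        # Feature indices are written 1-based and in ascending order,
        # with strictly non-zero values so no feature can be treated as missing.
        feats = " ".join(
            f"{i}:{rng.uniform(0.1, 1.0):.4f}" for i in range(1, num_features + 1)
        )
        lines.append(f"{label} {feats}")
    return lines

# More than 128 rows, all dense: the condition the issue reports as failing.
lines = make_dense_svmlight(num_rows=200, num_features=10)
```

Writing `"\n".join(lines)` to a file gives an input that can be passed to the loader. Dropping `num_rows` below 128 should, per the description above, take the working path.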
Error
Fails with one of three errors (dataset dependent):
- System.InvalidOperationException: Duplicate keys found in dataset
- System.ArgumentException: Destination is too short. (Parameter 'destination')
- System.IndexOutOfRangeException: Index was outside the bounds of the array.
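Since one of the errors is "Duplicate keys found in dataset", it can help to rule out the input file itself. The sketch below is an independent Python check that parses SVM Light lines and reports any feature index appearing more than once on a row. It assumes integer feature indices, an optional `qid:` token, and `#` trailing comments; the function names are hypothetical and this is not how ML.NET parses the format.

```python
def parse_svmlight_line(line):
    """Split one SVM Light line into (label, [(index, value), ...])."""
    tokens = line.split("#", 1)[0].split()  # drop any trailing comment
    label = float(tokens[0])
    pairs = []
    for tok in tokens[1:]:
        key, _, val = tok.partition(":")
        if key == "qid":  # query id is not a feature
            continue
        pairs.append((int(key), float(val)))
    return label, pairs

def duplicate_keys(line):
    """Return the sorted feature indices that occur more than once on a line."""
    seen, dups = set(), set()
    for idx, _ in parse_svmlight_line(line)[1]:
        if idx in seen:
            dups.add(idx)
        seen.add(idx)
    return sorted(dups)

def rows_with_duplicates(lines):
    """Return (row number, duplicate indices) for each non-empty offending row."""
    return [
        (n, duplicate_keys(l))
        for n, l in enumerate(lines, start=1)
        if l.strip() and duplicate_keys(l)
    ]
```

If `rows_with_duplicates` returns an empty list for the file, the duplicate-key error is more likely to originate in the loader than in the data.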
Stack trace:
Unhandled exception. System.InvalidOperationException: Splitter/consolidator worker encountered exception while consuming source data
 ---> System.InvalidOperationException: Duplicate keys found in dataset
   at Microsoft.ML.Data.SvmLightLoader.OutputMapper.MapCore(VBuffer`1& keys, VBuffer`1& values, Output output)
   at Microsoft.ML.Data.SvmLightLoader.OutputMapper.Map(IntermediateOut intermediate, Output output)
   at Microsoft.ML.Transforms.CustomMappingTransformer`2.Mapper.<>c__DisplayClass5_0.<Microsoft.ML.Data.IRowMapper.CreateGetters>b__0()
   at Microsoft.ML.Transforms.CustomMappingTransformer`2.Mapper.<>c__DisplayClass6_0`1.<GetDstGetter>b__0(T& dst)
   at Microsoft.ML.Data.DataViewUtils.Splitter.InPipe.Impl`1.Fill()
   at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass7_1.<ConsolidateCore>b__2()
   --- End of inner exception stack trace ---
   at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
   at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
   at Microsoft.ML.Data.RootCursorBase.MoveNext()
   at Microsoft.ML.Data.SynchronizedCursorBase.MoveNext()
   at SVMLightLoaderTest.Program.PrintData(IDataView svmData) in /Users/justinormont/Projects/SVMLightLoaderTest/SVMLightLoaderTest/Program.cs:line 121
   at SVMLightLoaderTest.Program.Main() in /Users/justinormont/Projects/SVMLightLoaderTest/SVMLightLoaderTest/Program.cs:line 45
Points to:
machinelearning/src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs
Lines 364 to 388 in 5dbfd8a
Side note: It looks like Visual Studio on macOS is not loading the symbols (or source) for ML․NET.
Source code / logs
Repro:
- .NET Fiddle -- https://dotnetfiddle.net/WbKlzS
- Visual Studio Solution: SVMLightLoaderTest.zip
The bug exists in ML․NET v1.5.0 through v1.5.4 (current). SvmLightLoader was added in v1.5.0.
Background
I was attempting to run AutoML․NET on an SVM Light dataset (download) using the CLI. Because AutoML․NET lacks SVM Light support, I tried to convert the SVM Light file to a sparse TSV so that AutoML․NET could read the converted file, but the conversion failed.
Using MAML in v1.5.4: (fails)
dotnet ./bin/AnyCPU.Release/Microsoft.ML.Console/netcoreapp2.1/MML.dll SaveData data=Day0.svm loader=SvmLightLoader{} xf=SelectColumns{keep=Label keep=Features} saver=Text{schema=- dense=-} dout=Day0.tsv
This fails with the errors above, because the current SvmLightLoader cannot load the file.
Using TLC's MAML: (works)
maml.exe SaveData data=Day0.svm loader=SvmLightLoader{} xf=KeepColumns{col=Label col=Features} saver=Text{schema=- dense=-} dout=Day0.tsv
The old internal version of ML․NET (TLC) correctly reads the SVM Light format and writes a TSV. This implies that a bug was introduced when we released SvmLightLoader with v1.5.0 of ML․NET.