SVMLightLoader Fails above 128 dense rows #5566

Open

Description

System information

  • OS version/distro: macOS 10.14.6
  • .NET Version (eg., dotnet --info): 5.0.101

Issue

SVMLightLoader fails when you load >128 dense rows.

When the feature column sparsity is >0.25, the column is represented internally in sparse format; otherwise it is dense. SVMLightLoader works if the column is sparse (many missing values) or if the number of rows is < 128.
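To probe the dense/sparse and row-count conditions described above, a small generator for synthetic SVM Light data can help. The following is a minimal Python sketch, not part of ML.NET: the function names, the row and feature counts, and the output file name are all hypothetical, and the issue does not confirm that files produced this way trigger the failure.

```python
import random


def make_svmlight_rows(n_rows, n_features, density, seed=0):
    """Return n_rows SVM Light lines; each feature is present with probability `density`."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        label = rng.choice([-1, 1])
        # Feature indices are 1-based and emitted in increasing order within a row.
        pairs = [
            f"{idx}:{rng.random():.4f}"
            for idx in range(1, n_features + 1)
            if rng.random() < density
        ]
        rows.append(" ".join([str(label)] + pairs))
    return rows


def write_svmlight(path, rows):
    """Write the generated rows to a text file, one row per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(rows) + "\n")


# Hypothetical sizes: 200 rows is above the 128-row mark named in this issue.
dense_rows = make_svmlight_rows(n_rows=200, n_features=10, density=1.0)
sparse_rows = make_svmlight_rows(n_rows=200, n_features=10, density=0.1)
```

Calling `write_svmlight("dense.svm", dense_rows)` would produce a file to pass to the loader; varying `n_rows` around 128 and `density` around the threshold is one way to narrow down which combination fails.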

Error

Fails with one of three errors, depending on the dataset:

  • System.InvalidOperationException: Duplicate keys found in dataset

  • System.ArgumentException: Destination is too short. (Parameter 'destination')

  • System.IndexOutOfRangeException: Index was outside the bounds of the array.

Stack trace:

Unhandled exception. System.InvalidOperationException: Splitter/consolidator worker encountered exception while consuming source data
 ---> System.InvalidOperationException: Duplicate keys found in dataset
   at Microsoft.ML.Data.SvmLightLoader.OutputMapper.MapCore(VBuffer`1& keys, VBuffer`1& values, Output output)
   at Microsoft.ML.Data.SvmLightLoader.OutputMapper.Map(IntermediateOut intermediate, Output output)
   at Microsoft.ML.Transforms.CustomMappingTransformer`2.Mapper.<>c__DisplayClass5_0.<Microsoft.ML.Data.IRowMapper.CreateGetters>b__0()
   at Microsoft.ML.Transforms.CustomMappingTransformer`2.Mapper.<>c__DisplayClass6_0`1.<GetDstGetter>b__0(T& dst)
   at Microsoft.ML.Data.DataViewUtils.Splitter.InPipe.Impl`1.Fill()
   at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass7_1.<ConsolidateCore>b__2()
   --- End of inner exception stack trace ---
   at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
   at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
   at Microsoft.ML.Data.RootCursorBase.MoveNext()
   at Microsoft.ML.Data.SynchronizedCursorBase.MoveNext()
   at SVMLightLoaderTest.Program.PrintData(IDataView svmData) in /Users/justinormont/Projects/SVMLightLoaderTest/SVMLightLoaderTest/Program.cs:line 121
   at SVMLightLoaderTest.Program.Main() in /Users/justinormont/Projects/SVMLightLoaderTest/SVMLightLoaderTest/Program.cs:line 45

Points to:

private void MapCore(ref VBuffer<uint> keys, ref VBuffer<float> values, Output output)
{
    Contracts.Check(keys.Length == values.Length, "number of keys does not match number of values.");
    // Both of these inputs should be dense, but still work even if they're not.
    VBufferUtils.Densify(ref keys);
    VBufferUtils.Densify(ref values);
    var keysValues = keys.GetValues();
    var valuesValues = values.GetValues();
    // The output vector could be sparse, so we use BufferBuilder here.
    _bldr.Reset((int)_keyMax, false);
    _indexUsed.SetAll(false);
    for (int i = 0; i < keys.Length; ++i)
    {
        var key = keysValues[i];
        if (key == 0 || key > _keyMax)
            continue;
        if (_indexUsed[(int)key - 1])
            throw Contracts.Except("Duplicate keys found in dataset");
        _bldr.AddFeature((int)key - 1, valuesValues[i]);
        _indexUsed[(int)key - 1] = true;
    }
    _bldr.GetResult(ref output.Features);
}
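To rule out the data itself as the source of the "Duplicate keys" error, the per-row check that MapCore performs can be run directly against the input file. The sketch below is a standalone Python validator, not ML.NET code; the function name is hypothetical, and it assumes the usual SVM Light layout of a label followed by `index:value` pairs with an optional trailing `#` comment.

```python
def find_duplicate_keys(lines):
    """Return (row_number, key) for each feature index repeated within one row."""
    duplicates = []
    for row_number, line in enumerate(lines, start=1):
        # Drop a trailing "# comment" and skip blank lines.
        body = line.split("#", 1)[0].strip()
        if not body:
            continue
        seen = set()
        # The first token is the label; the rest are "index:value" pairs.
        for token in body.split()[1:]:
            key_text = token.partition(":")[0]
            # Skip non-numeric keys such as "qid".
            if not key_text.isdigit():
                continue
            key = int(key_text)
            if key in seen:
                duplicates.append((row_number, key))
            seen.add(key)
    return duplicates
```

For example, `find_duplicate_keys(open("Day0.svm", encoding="utf-8"))` returns an empty list when no row repeats a feature index. If that holds for a file that still raises the exception, the duplicate would appear to originate in the loader rather than in the data, though that conclusion is an inference and not something the stack trace proves.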

Side note: It looks like Visual Studio on macOS is not loading the symbols (or source) for ML.NET.

Source code / logs

Repro:

The bug exists in ML.NET v1.5.0 through v1.5.4 (current). SvmLightLoader was added in v1.5.0.

Background

I was attempting to run AutoML.NET on an SVM Light dataset (download) using the CLI. AutoML.NET lacks SVM Light support, so I tried to convert the SVM Light file to a sparse TSV for AutoML.NET to read, but the conversion failed.

Using MAML in v1.5.4: (fails)
dotnet ./bin/AnyCPU.Release/Microsoft.ML.Console/netcoreapp2.1/MML.dll SaveData data=Day0.svm loader=SvmLightLoader{} xf=SelectColumns{keep=Label keep=Features} saver=Text{schema=- dense=-} dout=Day0.tsv

This fails with the errors above because the current SvmLightLoader cannot load the file.

Using TLC's MAML: (works)
maml.exe SaveData data=Day0.svm loader=SvmLightLoader{} xf=KeepColumns{col=Label col=Features} saver=Text{schema=- dense=-} dout=Day0.tsv

The old internal version of ML.NET (TLC) reads the SVM Light format and writes a TSV correctly. This implies that a bug was introduced when we released SvmLightLoader with v1.5.0 of ML.NET.

Metadata

Labels

  • P1: Priority of the issue for triage purpose: Needs to be fixed soon.
  • bug: Something isn't working
  • loadsave: Bugs related loading and saving data or models