Skip to content

Adding KeyToValueTransformers before finishing exporting to ONNX #4841

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Feb 25, 2020
Merged
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
73 changes: 73 additions & 0 deletions src/Microsoft.ML.OnnxConverter/SaveOnnxCommand.cs
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
using Microsoft.ML.Internal.Utilities;
using Microsoft.ML.Model.OnnxConverter;
using Microsoft.ML.Runtime;
using Microsoft.ML.Transforms;
using Newtonsoft.Json;
using static Microsoft.ML.Model.OnnxConverter.OnnxCSharpToProtoWrapper;

Expand Down Expand Up @@ -241,6 +242,52 @@ private static void AddSlotNames(OnnxContextImpl ctx, DataViewSchema.Column colu
ctx.AddOutputVariable(NumberDataViewType.Int64, labelEncoderOutput);
}

// Checks if a column has KeyValues Annotations of any type,
// So to know if it is safe to use KeyToValue Transformer on it.
private bool HasKeyValues(DataViewSchema.Column column)
{
if (column.Type.GetItemType() is KeyDataViewType keyType)
{
var metaColumn = column.Annotations.Schema.GetColumnOrNull(AnnotationUtils.Kinds.KeyValues);
return metaColumn != null &&
metaColumn.Value.Type is VectorDataViewType vectorType &&
keyType.Count == (ulong)vectorType.Size;
}

return false;
}

// Get the names of the KeyDataViewType columns that aren't affected by the pipeline that is being exported to ONNX.
private HashSet<string> GetPassThroughKeyDataViewTypeColumnsNames(IDataView source, IDataView end)
{
var inputKeyDataViewTypeColumnsNames = new HashSet<string>();
foreach (var col in source.Schema)
if (col.IsHidden == false && HasKeyValues(col))
inputKeyDataViewTypeColumnsNames.Add(col.Name);

var passThroughColumnNames = new HashSet<string>();
var outputColumnNames = new HashSet<string>();
foreach (var col in end.Schema)
{
if (outputColumnNames.Contains(col.Name))
{
// "Pass through" column names appear only once in the output schema
passThroughColumnNames.Remove(col.Name);
}
else
{
// We are only interested in the KeyDataViewType outpus columns
if (col.IsHidden == false && HasKeyValues(col))
passThroughColumnNames.Add(col.Name);
}
outputColumnNames.Add(col.Name);
}

// Only count those columns that were in the input of the pipeline
passThroughColumnNames.IntersectWith(inputKeyDataViewTypeColumnsNames);
return passThroughColumnNames;
}

private void Run(IChannel ch)
{
ILegacyDataLoader loader = null;
Expand Down Expand Up @@ -308,6 +355,20 @@ private void Run(IChannel ch)
Host.Assert(scorePipe.Source == end);
end = scorePipe;
transforms.AddLast(scoreOnnx);

if(rawPred.PredictionKind == PredictionKind.BinaryClassification || rawPred.PredictionKind == PredictionKind.MulticlassClassification)
{
// Check if the PredictedLabel Column is a KeyDataViewType and has KeyValue Annotations.
// If it does, add a KeyToValueMappingTransformer, to enable NimbusML to get the values back
// when using an ONNX model, as described in https://github.com/dotnet/machinelearning/pull/4841
var predictedLabelColumn = scorePipe.Schema.GetColumnOrNull(DefaultColumnNames.PredictedLabel);
if (predictedLabelColumn.HasValue && HasKeyValues(predictedLabelColumn.Value))
{
var outputData = new KeyToValueMappingTransformer(Host, DefaultColumnNames.PredictedLabel).Transform(scorePipe);
end = outputData;
transforms.AddLast(outputData as ITransformCanSaveOnnx);
}
}
}
else
{
Expand All @@ -322,6 +383,18 @@ private void Run(IChannel ch)
nameof(Arguments.LoadPredictor), "We were explicitly told to load the predictor but one was not present.");
}

// Convert back to values the KeyDataViewType "pass-through" columns
// (i.e those that remained untouched by the model). This is done to enable NimbusML to get these values
// as described in https://github.com/dotnet/machinelearning/pull/4841

var passThroughColumnNames = GetPassThroughKeyDataViewTypeColumnsNames(source, end);
foreach (var name in passThroughColumnNames)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passThroughColumnNames [](start = 33, length = 22)

Here is another edge case: If the input has a categorical column, say A, and the pipeline has a CopyTransform that copies A into a column named B.
The way it is written now, we will not apply key-to-value on B. Should we?
On one hand, we should, because A and B should be copies of each other, so it does not make sense if the ONNX model has different types and values for them... On the other hand, I don't see an easy rule that would help decide when we do need to apply key-to-value to B (when it is the output of Copy), versus when we don't (for example, when it is the output of ToKey).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar issue with Concat: say the input has two categorical columns A and B, and the pipeline has a concat transform that concatenates them to produce a column C...


In reply to: 383580868 [](ancestors = 383580868)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I think we should apply KeyToValueTransformer to B. But, as you say, I can't think of an easy way to decide when to do this when it's the output of Copy. So I wouldn't know if we should handle this particular edge case.

Thoughts, @ganik , @harishsk ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, that is a bit strange side-effect. Perhaps in next iterations we can get more details on how AutoML prefers this. Maybe we wont need to add KeyToValue for passthrough columns at all and let them surface as keys. For now we can document this.

{
var outputData = new KeyToValueMappingTransformer(Host, name).Transform(end);
end = outputData;
transforms.AddLast(end as ITransformCanSaveOnnx);
}

var model = ConvertTransformListToOnnxModel(ctx, ch, source, end, transforms, _inputsToDrop, _outputsToDrop);

using (var file = Host.CreateOutputFile(_outputModelPath))
Expand Down