Samples for categorical transform estimators #3179

Merged
merged 5 commits into from
Apr 4, 2019
@@ -1,10 +1,10 @@
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using static Microsoft.ML.Transforms.OneHotEncodingEstimator;
@shmoradims shmoradims Apr 4, 2019

What is this for? Can we remove it? #Resolved

Member Author

This is for the OutputKind.Key parameter that we use in the example below.
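
For context on the directive discussed in this thread: `using static` on a type brings that type's static members and nested types into scope, which is why the sample can write `OutputKind.Key` without the `OneHotEncodingEstimator.` prefix. A minimal sketch of the language feature follows; the `Sketch` namespace and the `Estimator` class are hypothetical stand-ins for illustration, not ML.NET types.

```csharp
using System;
using static Sketch.Estimator; // brings Estimator's nested types (and static members) into scope

namespace Sketch
{
    // Hypothetical stand-in for a class that declares a nested enum.
    public sealed class Estimator
    {
        public enum OutputKind { Bag, Indicator, Key, Binary }
    }

    public static class Program
    {
        public static void Main()
        {
            // Without the using static directive this would need to be Estimator.OutputKind.Key.
            OutputKind kind = OutputKind.Key;
            Console.WriteLine(kind); // prints "Key"
        }
    }
}
```

The alternative is to drop the directive and spell out the full name at each use, which is what the hash encoding sample later in this PR does.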


namespace Microsoft.ML.Samples.Dynamic
namespace Samples.Dynamic
{
public static class OneHotEncoding
{
@@ -17,53 +17,39 @@ public static void Example()
// Get a small dataset as an IEnumerable.
var samples = new List<DataPoint>()
{
new DataPoint(){ Label = 0, Education = "0-5yrs" },
new DataPoint(){ Label = 1, Education = "0-5yrs" },
new DataPoint(){ Label = 45, Education = "6-11yrs" },
new DataPoint(){ Label = 50, Education = "6-11yrs" },
new DataPoint(){ Label = 50, Education = "11-15yrs" },
new DataPoint(){ Education = "0-5yrs" },
new DataPoint(){ Education = "0-5yrs" },
new DataPoint(){ Education = "6-11yrs" },
new DataPoint(){ Education = "6-11yrs" },
new DataPoint(){ Education = "11-15yrs" },
};

// Convert training data to IDataView.
var trainData = mlContext.Data.LoadFromEnumerable(samples);
var data = mlContext.Data.LoadFromEnumerable(samples);

// A pipeline for one hot encoding the Education column.
var bagPipeline = mlContext.Transforms.Categorical.OneHotEncoding("EducationOneHotEncoded", "Education", OutputKind.Bag);
Member

@sfilipi sfilipi Apr 3, 2019

", OutputKind.Bag);

I would leave it, so that it makes sense why we call it bagPipeline. #Closed

Member Author

I am using the default (which uses Indicator, not bagging). I also renamed it to pipeline.
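
To make the distinction in this thread concrete, here is a rough sketch of what the indicator and key outputs look like for the sample's `Education` values. It is plain C# with no ML.NET dependency. The class and method names are hypothetical, and assigning keys in order of first occurrence is an assumption of the sketch that happens to reproduce the values shown in the sample comments.

```csharp
using System;
using System.Collections.Generic;

public static class OutputKindSketch
{
    // Assigns 1-based keys in order of first occurrence (an assumption of this sketch).
    public static Dictionary<string, uint> BuildKeys(IEnumerable<string> values)
    {
        var keys = new Dictionary<string, uint>();
        foreach (var value in values)
        {
            if (!keys.ContainsKey(value))
                keys[value] = (uint)(keys.Count + 1);
        }
        return keys;
    }

    // Indicator output: a vector with a single 1 in the slot that belongs to the key.
    public static float[] ToIndicator(uint key, int slotCount)
    {
        var vector = new float[slotCount];
        vector[(int)key - 1] = 1f;
        return vector;
    }

    public static void Main()
    {
        var education = new[] { "0-5yrs", "0-5yrs", "6-11yrs", "6-11yrs", "11-15yrs" };
        var keys = BuildKeys(education);

        foreach (var value in education)
        {
            uint key = keys[value];
            float[] indicator = ToIndicator(key, keys.Count);
            Console.WriteLine($"{value}\tkey={key}\tindicator={string.Join(" ", indicator)}");
        }
    }
}
```

With three distinct categories, the indicator vectors have three slots and the keys run from 1 to 3, which lines up with the expected-output comments in the sample.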



// Fit to data.
var bagTransformer = bagPipeline.Fit(trainData);
var pipeline = mlContext.Transforms.Categorical.OneHotEncoding("EducationOneHotEncoded", "Education");

// Get transformed data
var bagTransformedData = bagTransformer.Transform(trainData);
// Getting the data of the newly created column, so we can preview it.
var bagEncodedColumn = bagTransformedData.GetColumn<float[]>("EducationOneHotEncoded");
// Fit and transform the data.
var oneHotEncodedData = pipeline.Fit(data).Transform(data);

PrintDataColumn(oneHotEncodedData, "EducationOneHotEncoded");
// We have 3 slots, because there are three categories in the 'Education' column.
// 1 0 0
// 1 0 0
// 0 1 0
// 0 1 0
// 0 0 1

// A pipeline for one hot encoding the Education column (using keying).
var keyPipeline = mlContext.Transforms.Categorical.OneHotEncoding("EducationOneHotEncoded", "Education", OutputKind.Key);
// Fit to data.
var keyTransformer = keyPipeline.Fit(trainData);

// Get transformed data
var keyTransformedData = keyTransformer.Transform(trainData);
// Getting the data of the newly created column, so we can preview it.
var keyEncodedColumn = keyTransformedData.GetColumn<uint>("EducationOneHotEncoded");
// Fit and Transform data.
oneHotEncodedData = keyPipeline.Fit(data).Transform(data);

Console.WriteLine("One Hot Encoding based on the bagging strategy.");
foreach (var row in bagEncodedColumn)
{
for (var i = 0; i < row.Length; i++)
Console.Write($"{row[i]} ");
}

// data column obtained post-transformation.
// Since there are only two categories in the Education column of the trainData, the output vector
// for one hot will have two slots.
//
// 0 0 0
// 0 0 0
// 0 0 1
// 0 0 1
// 0 1 0
var keyEncodedColumn = oneHotEncodedData.GetColumn<uint>("EducationOneHotEncoded");

Console.WriteLine("One Hot Encoding with key type output.");
Console.WriteLine("One Hot Encoding of single column 'Education', with key type output.");
foreach (var element in keyEncodedColumn)
Console.WriteLine(element);

Expand All @@ -72,13 +58,20 @@ public static void Example()
// 2
// 2
// 3

}
Member

@sfilipi sfilipi Apr 3, 2019

make it a separate example, because the multi-output is a different API. #Closed

Member Author

made separate example for multi input



private static void PrintDataColumn(IDataView transformedData, string columnName)
{
var countSelectColumn = transformedData.GetColumn<float[]>(transformedData.Schema[columnName]);

foreach (var row in countSelectColumn)
{
for (var i = 0; i < row.Length; i++)
Console.Write($"{row[i]}\t");
Console.WriteLine();
}
}
private class DataPoint
{
public float Label { get; set; }

public string Education { get; set; }
}
}
@@ -0,0 +1,65 @@
using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
public static class OneHotEncodingMultiColumn
{
public static void Example()
{
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var mlContext = new MLContext();

// Get a small dataset as an IEnumerable.
var samples = new List<DataPoint>()
{
new DataPoint(){ Education = "0-5yrs", ZipCode = "98005" },
new DataPoint(){ Education = "0-5yrs", ZipCode = "98052" },
new DataPoint(){ Education = "6-11yrs", ZipCode = "98005" },
new DataPoint(){ Education = "6-11yrs", ZipCode = "98052" },
new DataPoint(){ Education = "11-15yrs", ZipCode = "98005" },
};

// Convert training data to IDataView.
var data = mlContext.Data.LoadFromEnumerable(samples);

// Multi column example : A pipeline for one hot encoding two columns 'Education' and 'ZipCode'
var multiColumnKeyPipeline = mlContext.Transforms.Categorical.OneHotEncoding(
new InputOutputColumnPair[] {
new InputOutputColumnPair("Education"),
new InputOutputColumnPair("ZipCode"),
});

// Fit and Transform data.
var transformedData = multiColumnKeyPipeline.Fit(data).Transform(data);

var convertedData = mlContext.Data.CreateEnumerable<TransformedData>(transformedData, true);

Console.WriteLine("One Hot Encoding of two columns 'Education' and 'ZipCode'.");
foreach (var item in convertedData)
Console.WriteLine("{0}\t\t\t{1}", string.Join(" ", item.Education), string.Join(" ", item.ZipCode));

// 1 0 0 1 0
// 1 0 0 0 1
// 0 1 0 1 0
// 0 1 0 0 1
// 0 0 1 1 0
}

private class DataPoint
{
public string Education { get; set; }

public string ZipCode { get; set; }
}

private class TransformedData
{
public float[] Education { get; set; }

public float[] ZipCode { get; set; }
}
}
}
@@ -0,0 +1,84 @@
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

namespace Samples.Dynamic
{
public static class OneHotHashEncoding
{
public static void Example()
{
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var mlContext = new MLContext();

// Get a small dataset as an IEnumerable.
var samples = new List<DataPoint>()
{
new DataPoint(){ Education = "0-5yrs" },
new DataPoint(){ Education = "0-5yrs" },
new DataPoint(){ Education = "6-11yrs" },
new DataPoint(){ Education = "6-11yrs" },
new DataPoint(){ Education = "11-15yrs" },
};

// Convert training data to IDataView.
var data = mlContext.Data.LoadFromEnumerable(samples);

// A pipeline for one hot hash encoding the 'Education' column.
var pipeline = mlContext.Transforms.Categorical.OneHotHashEncoding("EducationOneHotHashEncoded", "Education", numberOfBits: 3);

// Fit and transform the data.
var hashEncodedData = pipeline.Fit(data).Transform(data);

PrintDataColumn(hashEncodedData, "EducationOneHotHashEncoded");
// We have 8 slots, because we used numberOfBits = 3.

// 0 0 0 1 0 0 0 0
// 0 0 0 1 0 0 0 0
// 0 0 0 0 1 0 0 0
// 0 0 0 0 1 0 0 0
// 0 0 0 0 0 0 0 1

// A pipeline for one hot hash encoding the 'Education' column (using keying strategy).
var keyPipeline = mlContext.Transforms.Categorical.OneHotHashEncoding("EducationOneHotHashEncoded", "Education",
outputKind: OneHotEncodingEstimator.OutputKind.Key,
numberOfBits: 3);

// Fit and transform the data.
var hashKeyEncodedData = keyPipeline.Fit(data).Transform(data);

// Getting the data of the newly created column, so we can preview it.
var keyEncodedColumn = hashKeyEncodedData.GetColumn<uint>("EducationOneHotHashEncoded");

Console.WriteLine("One Hot Hash Encoding of single column 'Education', with key type output.");
foreach (var element in keyEncodedColumn)
Console.WriteLine(element);

// 4
// 4
// 5
// 5
// 8
}
Member

@sfilipi sfilipi Apr 3, 2019

separate example. #Resolved

Member Author

created separate example for multi column




private static void PrintDataColumn(IDataView transformedData, string columnName)
{
var countSelectColumn = transformedData.GetColumn<float[]>(transformedData.Schema[columnName]);

foreach (var row in countSelectColumn)
{
for (var i = 0; i < row.Length; i++)
Console.Write($"{row[i]}\t");
Console.WriteLine();
}
}

private class DataPoint
{
public string Education { get; set; }
}
}
}
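
The comment in the sample above says that `numberOfBits: 3` yields 8 slots. The sketch below illustrates why: the hash of each value is reduced to 3 bits, so there are 2^3 = 8 possible slots regardless of how many distinct categories appear. This is an illustration only. FNV-1a is used as a stand-in hash and is an assumption of the sketch, not the hash function ML.NET uses, so the slot positions will not match the expected output in the sample. All names here are hypothetical.

```csharp
using System;
using System.Text;

public static class HashEncodingSketch
{
    // FNV-1a (32-bit), used only as a stand-in; ML.NET's own hash differs.
    public static uint Fnv1a(string text)
    {
        uint hash = 2166136261;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            hash ^= b;
            hash *= 16777619; // wraps around on overflow (unchecked by default)
        }
        return hash;
    }

    // Maps a value to an indicator vector with 2^numberOfBits slots.
    public static float[] HashEncode(string value, int numberOfBits)
    {
        int slotCount = 1 << numberOfBits;
        int slot = (int)(Fnv1a(value) & (uint)(slotCount - 1)); // keep the low bits
        var vector = new float[slotCount];
        vector[slot] = 1f;
        return vector;
    }

    public static void Main()
    {
        foreach (var value in new[] { "0-5yrs", "6-11yrs", "11-15yrs" })
            Console.WriteLine($"{value}\t{string.Join(" ", HashEncode(value, 3))}");
    }
}
```

Unlike dictionary-based one hot encoding, the vector width is fixed up front, and two different categories can land in the same slot when the number of bits is small.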
@@ -0,0 +1,65 @@
using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
public static class OneHotHashEncodingMultiColumn
{
public static void Example()
{
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var mlContext = new MLContext();

// Get a small dataset as an IEnumerable.
var samples = new List<DataPoint>()
{
new DataPoint(){ Education = "0-5yrs", ZipCode = "98005" },
new DataPoint(){ Education = "0-5yrs", ZipCode = "98052" },
new DataPoint(){ Education = "6-11yrs", ZipCode = "98005" },
new DataPoint(){ Education = "6-11yrs", ZipCode = "98052" },
new DataPoint(){ Education = "11-15yrs", ZipCode = "98005" },
};

// Convert training data to IDataView.
var data = mlContext.Data.LoadFromEnumerable(samples);

// Multi column example : A pipeline for one hot hash encoding two columns 'Education' and 'ZipCode'
var multiColumnKeyPipeline = mlContext.Transforms.Categorical.OneHotHashEncoding(
new InputOutputColumnPair[] { new InputOutputColumnPair("Education"), new InputOutputColumnPair("ZipCode") },
numberOfBits: 3);

// Fit and Transform the data.
var transformedData = multiColumnKeyPipeline.Fit(data).Transform(data);

var convertedData = mlContext.Data.CreateEnumerable<TransformedData>(transformedData, true);

Console.WriteLine("One Hot Hash Encoding of two columns 'Education' and 'ZipCode'.");
foreach (var item in convertedData)
Console.WriteLine("{0}\t\t\t{1}", string.Join(" ", item.Education), string.Join(" ", item.ZipCode));

// We have 8 slots, because we used numberOfBits = 3.

// 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
// 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
// 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
// 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
// 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
}

private class DataPoint
{
public string Education { get; set; }

public string ZipCode { get; set; }
}

private class TransformedData
{
public float[] Education { get; set; }

public float[] ZipCode { get; set; }
}
}
}
20 changes: 19 additions & 1 deletion src/Microsoft.ML.Transforms/CategoricalCatalog.cs
@@ -29,7 +29,7 @@ public static class CategoricalCatalog
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[RPCA](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Categorical/OneHotEncoding.cs)]
/// [!code-csharp[OneHotEncoding](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Categorical/OneHotEncoding.cs)]
/// ]]></format>
/// </example>
public static OneHotEncodingEstimator OneHotEncoding(this TransformsCatalog.CategoricalTransforms catalog,
@@ -53,6 +53,12 @@ public static OneHotEncodingEstimator OneHotEncoding(this TransformsCatalog.Cate
/// If <see cref="ValueToKeyMappingEstimator.KeyOrdinality.ByValue"/>, items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param>
/// <param name="keyData">Specifies an ordering for the encoding. If specified, this should be a single column data view,
/// and the key-values will be taken from that column. If unspecified, the ordering will be determined from the input data upon fitting.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[OneHotEncoding](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Categorical/OneHotEncodingMultiColumn.cs)]
/// ]]></format>
/// </example>
public static OneHotEncodingEstimator OneHotEncoding(this TransformsCatalog.CategoricalTransforms catalog,
InputOutputColumnPair[] columns,
OneHotEncodingEstimator.OutputKind outputKind = OneHotEncodingEstimator.Defaults.OutKind,
@@ -103,6 +109,12 @@ internal static OneHotEncodingEstimator OneHotEncoding(this TransformsCatalog.Ca
/// Text representations of original values are stored in the slot names of the metadata for the new column. Hashing, as such, can map many initial values to one.
/// <paramref name="maximumNumberOfInverts"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[OneHotHashEncoding](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Categorical/OneHotHashEncoding.cs)]
/// ]]></format>
/// </example>
public static OneHotHashEncodingEstimator OneHotHashEncoding(this TransformsCatalog.CategoricalTransforms catalog,
string outputColumnName,
string inputColumnName = null,
@@ -127,6 +139,12 @@ public static OneHotHashEncodingEstimator OneHotHashEncoding(this TransformsCata
/// Text representations of original values are stored in the slot names of the metadata for the new column. Hashing, as such, can map many initial values to one.
/// <paramref name="maximumNumberOfInverts"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[OneHotHashEncoding](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Categorical/OneHotHashEncodingMultiColumn.cs)]
/// ]]></format>
/// </example>
public static OneHotHashEncodingEstimator OneHotHashEncoding(this TransformsCatalog.CategoricalTransforms catalog,
InputOutputColumnPair[] columns,
OneHotEncodingEstimator.OutputKind outputKind = OneHotEncodingEstimator.OutputKind.Indicator,