Created sample for 'TokenizeIntoCharactersAsKeys' API. #3123

Merged
merged 4 commits into from
Mar 28, 2019
@@ -0,0 +1,63 @@
using System;
using System.Collections.Generic;
using System.Text;

namespace Microsoft.ML.Samples.Dynamic
{
    public static class TokenizeIntoCharactersAsKeys
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for exception tracking and logging,
            // as well as the source of randomness.
            var mlContext = new MLContext();

            // Create an empty list of data samples. 'TokenizeIntoCharactersAsKeys' does not require training data because
            // the estimator it creates ('TokenizingByCharactersEstimator') is not a trainable estimator.
            // The empty list is only needed to pass the input schema to the pipeline.
            var emptySamples = new List<TextData>();

            // Convert the sample list to an empty IDataView.
            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

            // A pipeline for converting text into a vector of characters.
            // 'TokenizeIntoCharactersAsKeys' produces its result as a key type.
            // 'MapKeyToValue' is needed to map the keys back to their original values.
            var textPipeline = mlContext.Transforms.Text.TokenizeIntoCharactersAsKeys("CharTokens", "Text", useMarkerCharacters: false)
                .Append(mlContext.Transforms.Conversion.MapKeyToValue("CharTokens"));

            // Fit to data.
            var textTransformer = textPipeline.Fit(emptyDataView);

            // Create the prediction engine to get the character vector from the input text/string.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(textTransformer);

            // Call the prediction API to convert the text into characters.
            var data = new TextData() { Text = "ML.NET's TokenizeIntoCharactersAsKeys API splits text/string into characters." };
            var prediction = predictionEngine.Predict(data);

            // Print the length of the character vector.
            Console.WriteLine($"Number of tokens: {prediction.CharTokens.Length}");

            // Print the character vector.
            Console.WriteLine($"\nCharacter Tokens: {string.Join(",", prediction.CharTokens)}");

            // Expected output:
            // Number of tokens: 77
            // Character Tokens: M,L,.,N,E,T,',s,<?>,T,o,k,e,n,i,z,e,I,n,t,o,C,h,a,r,a,c,t,e,r,s,A,s,K,e,y,s,<?>,A,P,I,<?>,
            // s,p,l,i,t,s,<?>,t,e,x,t,/,s,t,r,i,n,g,<?>,i,n,t,o,<?>,c,h,a,r,a,c,t,e,r,s,.
            //
            // <?> is a Unicode control character used in place of spaces ('\u2400').
        }

        public class TextData
        {
            public string Text { get; set; }
        }

        public class TransformedTextData : TextData
        {
            public string[] CharTokens { get; set; }
        }
    }
}
7 changes: 7 additions & 0 deletions src/Microsoft.ML.Transforms/Text/TextCatalog.cs
@@ -57,6 +57,13 @@ public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.Text
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="useMarkerCharacters">Whether to prepend a marker character, <see langword="0x02"/>, to the beginning,
/// and append another marker character, <see langword="0x03"/>, to the end of the output vector of characters.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[TokenizeIntoCharacters](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/TokenizeIntoCharactersAsKeys.cs)]
/// ]]>
/// </format>
/// </example>
public static TokenizingByCharactersEstimator TokenizeIntoCharactersAsKeys(this TransformsCatalog.TextTransforms catalog,
string outputColumnName,
string inputColumnName = null,
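The sample in this PR passes `useMarkerCharacters: false`, so the effect of the marker option described in the doc comment is not shown. A minimal sketch of the documented behavior follows, in plain C# with no ML.NET dependency. `MarkerCharacterSketch` and `SplitIntoCharacters` are hypothetical names invented for illustration, not part of ML.NET. The real transform emits key-typed values rather than `char`, so this only approximates where the `0x02` and `0x03` markers are placed.

```csharp
using System;
using System.Collections.Generic;

public static class MarkerCharacterSketch
{
    // Marker values as stated in the 'useMarkerCharacters' parameter documentation.
    private const char StartMarker = '\u0002';
    private const char EndMarker = '\u0003';

    // Hypothetical helper (not an ML.NET API): splits text into characters and,
    // when requested, wraps the result in the start and end markers.
    public static char[] SplitIntoCharacters(string text, bool useMarkerCharacters)
    {
        var tokens = new List<char>(text.Length + 2);
        if (useMarkerCharacters)
            tokens.Add(StartMarker);
        tokens.AddRange(text);
        if (useMarkerCharacters)
            tokens.Add(EndMarker);
        return tokens.ToArray();
    }

    public static void Main()
    {
        const string text = "ML.NET";
        // Without markers the vector holds one entry per input character.
        Console.WriteLine(SplitIntoCharacters(text, useMarkerCharacters: false).Length); // prints 6
        // With markers the vector grows by two: one prepended, one appended.
        Console.WriteLine(SplitIntoCharacters(text, useMarkerCharacters: true).Length);  // prints 8
    }
}
```

Under this reading, enabling the markers in the PR's sample would presumably raise the reported token count from 77 to 79. That is an inference from the parameter documentation, not a result verified against the library.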