Created sample for 'TokenizeIntoCharactersAsKeys' API. #3123
// Expected output:
// Number of tokens: 112
// Character Tokens: M,L,.,N,E,T,',s,<?>,T,o,k,e,n,i,z,e,I,n,t,o,C,h,a,r,a,c,t,e,r,s,A,s,K,e,y,s,<?>,A,P,I,<?>,
// s,p,l,i,t,s,<?>,t,e,x,t,/,s,t,r,i,n,g,<?>,i,n,t,o,<?>,c,h,a,r,a,c,t,e,r,s,.
Do we really present a space as <?>? #Resolved
It's a unit separator special character:
private const ushort UnitSeparator = 0x1f;
Sorry! This is the control character used instead of spaces. Please disregard my previous comments.
bldr.Append((char)(c + '\u2400'));
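For readers following this thread, here is a minimal sketch of the rendering step being discussed. It assumes, based only on the two quoted lines, that a control character such as the unit separator (0x1f) is shifted into the Unicode Control Pictures block by adding U+2400 and then wrapped in angle brackets. The class and method names (`ControlPictureSketch`, `Render`) are hypothetical and are not ML.NET's actual code. A console that cannot display U+241F will typically print `?` in its place, which would explain the `<?>` in the expected output.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; not the ML.NET implementation.
public static class ControlPictureSketch
{
    // Value quoted in the review thread above.
    private const ushort UnitSeparator = 0x1f;

    // Renders one character for display. Control characters are mapped to
    // their Unicode "control picture" (code point + 0x2400) and wrapped in
    // angle brackets; every other character is returned unchanged.
    public static string Render(char c)
    {
        if (!char.IsControl(c))
            return c.ToString();

        var bldr = new StringBuilder();
        bldr.Append('<');
        bldr.Append((char)(c + '\u2400'));
        bldr.Append('>');
        return bldr.ToString();
    }

    public static void Main()
    {
        string rendered = Render((char)UnitSeparator);

        // Print the code point rather than the glyph, so the result does not
        // depend on the console's encoding.
        Console.WriteLine($"U+{(int)rendered[1]:X4}");
    }
}
```

Printing the code point instead of the glyph sidesteps the console-encoding question that caused the confusion in this thread.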
namespace Microsoft.ML.Samples.Dynamic
{
    public static class TokenizeIntoCharacters
TokenizeIntoCharacters: let's name the file and class the same as the API. Please make sure to update the name in the XML reference when you rename. #Resolved
// Create an empty data sample list. The 'TokenizeIntoCharactersAsKeys' does not require training data as
// the estimator ('TokenizingByCharactersEstimator') created by 'TokenizeIntoCharactersAsKeys' API is not a trainable estimator.
// The empty list is only needed to pass input schema to the pipeline.
var samples = new List<TextData>();
samples: let's call this emptySamples to complement the comments above. #Resolved
var samples = new List<TextData>();

// Convert sample list to an empty IDataView.
var dataview = mlContext.Data.LoadFromEnumerable(samples);
dataview: also emptyDataview. #Resolved
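For context, here is a minimal sketch of how the sample might read with the renames from this thread applied, assuming the Microsoft.ML NuGet package is referenced. The class name `TokenizeIntoCharactersAsKeysSketch`, the `Tokenize` helper, the `CharTokens` column, the `TransformedTextData` class, the short input text, and the `useMarkerCharacters: false` argument are illustrative choices and are not taken from the diff under review.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

// Illustrative sketch only; names other than the ML.NET API calls are assumptions.
public static class TokenizeIntoCharactersAsKeysSketch
{
    public class TextData
    {
        public string Text { get; set; }
    }

    public class TransformedTextData : TextData
    {
        public string[] CharTokens { get; set; }
    }

    public static string[] Tokenize(string text)
    {
        var mlContext = new MLContext();

        // The estimator is not trainable, so an empty list is enough:
        // it only supplies the input schema to the pipeline.
        var emptySamples = new List<TextData>();
        var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

        // Split the "Text" column into one key-typed token per character.
        var textPipeline = mlContext.Transforms.Text.TokenizeIntoCharactersAsKeys(
            "CharTokens", "Text", useMarkerCharacters: false);

        // Fit against the empty data to obtain the transformer.
        var textTransformer = textPipeline.Fit(emptyDataView);

        // A prediction engine applies the transformer to a single example.
        var predictionEngine = mlContext.Model
            .CreatePredictionEngine<TextData, TransformedTextData>(textTransformer);

        return predictionEngine.Predict(new TextData { Text = text }).CharTokens;
    }

    public static void Main()
    {
        var tokens = Tokenize("ML.NET");
        Console.WriteLine($"Number of tokens: {tokens.Length}");
        Console.WriteLine($"Character Tokens: {string.Join(",", tokens)}");
    }
}
```

Naming the variables `emptySamples` and `emptyDataView` makes the code agree with the comment that the list exists only to carry the schema.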
Codecov Report
@@            Coverage Diff            @@
##           master    #3123     +/-   ##
=========================================
- Coverage   72.52%   72.51%   -0.01%
=========================================
  Files         808      808
  Lines      144665   144665
  Branches    16198    16198
=========================================
- Hits       104913   104903     -10
- Misses      35342    35349      +7
- Partials     4410     4413      +3
Thanks!
Related to #1209.