Learning with counts (Dracula) transformer #4514
Conversation
@@ -188,9 +187,9 @@ private static VersionInfo GetVersionInfo()
            IDataView input,
            string name,
            string source = null,
-           bool join = Defaults.Join,
+           bool join = Defaults.Combine,
bool join => bool combine. And update the xml docs. #Resolved
Actually, I am rethinking this comment. What's weird to me is that this is the "hash joining transform", and we are calling this behavior "combine". That seems inconsistent. We should use the same term to mean the same thing. Either always use "join" or always use "combine". #Resolved
I was actually thinking that the name HashJoiningTransform is not very informative. What do you think about VectorHashingTransform?
In reply to: 356308668 [](ancestors = 356308668)
VectorHashingTransform implies each element is going to be hashed separately (which is the default behavior). With respect to @eerhardt's comment that we should use either Join or Combine consistently, a better name might be HashCombiningTransform.
In reply to: 356607566 [](ancestors = 356607566,356308668)
I have reverted the changes in this file, since it is no longer being used in this PR. This transform is no longer used directly anywhere in the code; it is kept around only for backward compatibility of maml and for its entry point.
In reply to: 427561266 [](ancestors = 427561266,356607566,356308668)
}
}

public static class Dracula
No need for this to be public, it only has an internal method. #Resolved
}
}

public static class CountTable
should be internal #Resolved
/// <summary>
/// Signature for CountTableBuilder.
/// </summary>
public delegate void SignatureCountTableBuilder();
Does this need to be public? #Resolved
@yaeldekel Shall we give it a better name as opposed to calling it "Dracula"? Seems a little informal to me. #Resolved
private static VersionInfo GetVersionInfo()
{
    return new VersionInfo(
        modelSignature: "CM CT",
"CM CT", [](start = 32, length = 11)
what kind of model signature is this? Why is there a tab in between? #Resolved
namespace Microsoft.ML.Transforms
{
    public class DraculaEstimator : IEstimator<DraculaTransformer>
Moving @codemzs' comment to a thread:
@yaeldekel Shall we give it a better name as opposed to calling it "Dracula"? Seems a little informal to me.
If Dracula is not used, CountTargetEncoder might be suitable.
Our old blog post lists DRACULA as "Distributed Robust Algorithm for Count-based Learning": https://blogs.technet.microsoft.com/machinelearning/2015/02/17/big-learning-made-easy-with-counts/
As an aside, there are also public slides:
https://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft
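For readers of this thread who are not familiar with the technique, here is a minimal sketch of the count-based target encoding idea that DRACULA refers to: accumulate per-class label counts for each categorical value on training data, and emit those counts (plus a smoothed log-odds) as features. All names below are illustrative, not the ML.NET implementation.

using System;
using System.Collections.Generic;

internal static class CountEncodingSketch
{
    // Accumulate per-category positive/negative label counts from labeled training rows.
    internal static Dictionary<string, (long Pos, long Neg)> BuildTable(
        IEnumerable<(string Category, bool Label)> rows)
    {
        var table = new Dictionary<string, (long Pos, long Neg)>();
        foreach (var (category, label) in rows)
        {
            table.TryGetValue(category, out var counts);
            table[category] = label ? (counts.Pos + 1, counts.Neg) : (counts.Pos, counts.Neg + 1);
        }
        return table;
    }

    // Encode a category as its raw counts plus a smoothed log-odds feature;
    // priorCount keeps rare or unseen categories from producing extreme values.
    internal static float[] Encode(
        Dictionary<string, (long Pos, long Neg)> table, string category, float priorCount = 1f)
    {
        table.TryGetValue(category, out var counts);
        var logOdds = (float)Math.Log((counts.Pos + priorCount) / (counts.Neg + priorCount));
        return new[] { (float)counts.Pos, (float)counts.Neg, logOdds };
    }
}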
Good suggestion. Changing the name to CountTargetEncoder, unless someone has a different idea?
Codecov Report
@@            Coverage Diff             @@
##           master    #4514      +/-   ##
==========================================
- Coverage   75.65%   73.52%    -2.14%
==========================================
  Files         990     1016       +26
  Lines      179573   190109    +10536
  Branches    19308    20467     +1159
==========================================
+ Hits       135853   139774     +3921
- Misses      38453    44768     +6315
- Partials     5267     5567      +300
📝 The OutOfMemoryException test failures appear to be true failures #Resolved
namespace Microsoft.ML.Transforms
{
    public sealed class CountTableEstimator : IEstimator<CountTableTransformer>
Does this class need to be public? How can an external user use this class? #Resolved
// Assumes features are filled with the respective counts.
// Adds laplacian noise if set and returns the sum of the counts.
private float AddLaplacianNoisePerLabel(int iCol, Random rand, Span<float> counts)
Is the noise still added once the ML model is fully created?
@davidbrownellWork was mentioning that noise is added at prediction time. This complicates the ONNX conversion. The AutoML team is looking to use Dracula for its target encoding, and we need it in ML.NET, NimbusML, and all the way to ONNX. #Resolved
The noise is only used at prediction time. From your experience using this transform, do you think it is needed?
In reply to: 356322013 [](ancestors = 356322013)
I looked at the EstimatorChain class, and it seems to me that it would be possible to add a method, similar to AppendCacheCheckpoint, that would add noise to a requested column. Here is the code that is currently in the Fit method of EstimatorChain:
IDataView current = input;
var xfs = new ITransformer[_estimators.Length];
for (int i = 0; i < _estimators.Length; i++)
{
    var est = _estimators[i];
    xfs[i] = est.Fit(current);
    current = xfs[i].Transform(current);
    if (_needCacheAfter[i] && i < _estimators.Length - 1)
    {
        Contracts.AssertValue(_host);
        current = new CacheDataView(_host, current, null);
    }
}
If we add, similarly to how we have current = new CacheDataView(...), something like current = new NoiseDataView(...), then we can let the user write something like
var pipeline = mlContext.Transforms.CountTargetEncoding()
    .AppendNoise("Features")
    .Append(mlContext.Binary.Sdca());
and then the SDCA component in the pipeline will see the Features column with noise added to it when training; however, the resulting TransformerChain will not have noise as part of the model.
Together with good samples and documentation emphasizing the recommended way to use the CountTargetEncodingEstimator (which is either to train the transformer in a separate pipeline from the trainer and on different data, or to use the AppendNoise API), I think that may be a better solution than having a noise option as part of the estimator constructor.
Please let me know what you think of this solution and whether I should go ahead and implement that.
In reply to: 357085107 [](ancestors = 357085107,356322013)
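To make the proposal above concrete, here is a rough sketch of how the quoted Fit loop could be extended. NoiseDataView, _addNoiseAfter, and _noiseColumns are hypothetical names implied by this discussion, not existing ML.NET APIs.

IDataView current = input;
var xfs = new ITransformer[_estimators.Length];
for (int i = 0; i < _estimators.Length; i++)
{
    var est = _estimators[i];
    xfs[i] = est.Fit(current);
    current = xfs[i].Transform(current);
    if (_needCacheAfter[i] && i < _estimators.Length - 1)
    {
        Contracts.AssertValue(_host);
        current = new CacheDataView(_host, current, null);
    }
    // Hypothetical: AppendNoise("Features") would set _addNoiseAfter[i] and record the column name,
    // so downstream estimators train on a noised view while the returned TransformerChain
    // contains no noise step.
    if (_addNoiseAfter[i] && i < _estimators.Length - 1)
        current = new NoiseDataView(_host, current, _noiseColumns[i]);
}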
Hi Yael, what is the name of this transform in TLC? I went to our TLC telemetry to look for internal users of the Dracula transform, but I did not see any usage for a transformer named "Dracula". #Resolved
@gvashishtha In the TLC GUI it's called "Dracula Transform" (under "Featurizers: Supervised") and in the TLC code it is called DraculaTransform, but I wouldn't know if other names are used in the telemetry. #Resolved
If I recall, the log-odds are returned approximately as:
CountPositive = CountPositive + LaplacianNoise(); // Add noise to pos count
CountNegative = CountNegative + LaplacianNoise(); // Add noise to neg count
logOdds = ln( (CountPositive + PriorCount) / (CountPositive + CountNegative + 2*PriorCount) ); // Includes 2x noise, plus prior as smoothing
I don't think we'll be able to just add noise outside of the Dracula transform and end up with the same result.
My recommendation is to keep the noise computation as is and disable noise for the ONNX unit tests, or to add a window (e.g. +/- 10%) of acceptable values when checking for (near) equality of the outputs. #Resolved
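As a standalone illustration of the computation described above, here is a small sketch; the inverse-CDF Laplace sampling and the parameter names are assumptions for illustration, not the transform's actual code.

using System;

internal static class NoisyLogOddsSketch
{
    // Draw a sample from Laplace(0, scale) via the inverse CDF of a uniform variate.
    internal static double LaplacianNoise(Random rand, double scale)
    {
        double u = rand.NextDouble() - 0.5; // uniform in [-0.5, 0.5)
        return -scale * Math.Sign(u) * Math.Log(1 - 2 * Math.Abs(u));
    }

    // Noised, smoothed log-odds following the (approximate) formula quoted above.
    internal static double NoisyLogOdds(
        double posCount, double negCount, double priorCount, Random rand, double noiseScale)
    {
        posCount += LaplacianNoise(rand, noiseScale); // add noise to the positive count
        negCount += LaplacianNoise(rand, noiseScale); // add noise to the negative count
        return Math.Log((posCount + priorCount) / (posCount + negCount + 2 * priorCount));
    }
}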
The two possible solutions as I see it are:
- Leaving the computation as it is right now.
  - Pros: easy solution for fitting a pipeline that has a trainer after the Dracula transform, without having to split into two pipelines and train each part on a different dataset.
  - Cons: if the ONNX export of this transform will not include the noise part (which it likely won't), then this will result in a model that gives different results in the ML.NET format and in the ONNX format.
- Making the transformer apply the noise, but not serialize the noise parameter, i.e. after deserialization it won't add noise.
  - Pros: consistency with the ONNX version of the model. Also, the noise is really not needed at prediction time; it is only needed to train the next component in the pipeline.
  - Cons: if you call Transform on the pipeline with the same data right after it is trained, and again after serializing and deserializing the model, it will give different results.
I don't particularly like either of these solutions; I think they are both problematic, but currently I can't think of a solution that will not have any of the cons listed above... Which solution do you think I should go with?
In reply to: 365470116 [](ancestors = 365470116)
The introduction of noise is a transformation step that is only used when the featurizer is included in a pipeline that is used during training time. Because ONNX exports are only used for inference, it seems OK that the export doesn't include noise, as the usage in ONNX will be the same as when this featurizer is used in an inferencing ML.NET pipeline.
It seems that the first option is the way to go, as long as the code is written such that noise is not introduced when the featurizer is part of an inferencing pipeline. #Resolved
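For illustration, a sketch of the split-pipeline usage recommended earlier in this thread (training the encoder separately from the trainer), which avoids relying on noise at prediction time. The CountTargetEncoding extension name and the data variables (countData, trainData, testData) are assumptions based on this discussion, not a confirmed public API.

// Fit the count encoder on data reserved for building the count tables, then train the
// downstream model on separately featurized data; no noise is needed at prediction time.
var encoder = mlContext.Transforms.CountTargetEncoding("Features") // assumed extension name
    .Fit(countData);                                               // data used only for the count tables
var featurizedTrainData = encoder.Transform(trainData);

var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression();
var model = trainer.Fit(featurizedTrainData);

// At scoring time the chain is encoder -> model, and neither stage adds noise.
var scored = model.Transform(encoder.Transform(testData));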
}
}

public sealed class CountTargetEncodingTransformer : ITransformer
CountTargetEncodingTransformer [](start = 24, length = 30)
Documentation? #Resolved
namespace Microsoft.ML.Transforms
{
    public class CountTargetEncodingEstimator : IEstimator<CountTargetEncodingTransformer>
CountTargetEncodingEstimator [](start = 17, length = 28)
Documentation? #Resolved
@@ -0,0 +1,553 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
Please rename the file to a more relevant name. If we are calling it CountTargetEncoding then we should not be seeing Dracula in the codebase. #Resolved
We have had a number of backward compatibility issues to address with MurmurHash when trying to export to ONNX. Can we defer making this public until we have implemented the ONNX export functionality for this transformer? #Resolved
Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:646 in 91887da. [](commit_id = 91887da, deletion_comment = False)
This class was never converted to
In reply to: 582084074 [](ancestors = 582084074)
Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:35 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)
I think that if we add the option to hash a vector into a single hash to
In reply to: 582085386 [](ancestors = 582085386)
Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:647 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)
Is there a plan to switch to something other than MurmurHash?
In reply to: 582086748 [](ancestors = 582086748)
Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:646 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)
No, there is no plan to switch to something other than MurmurHash. But we have had a number of discussions on how to keep compatibility between the version of MurmurHash in ML.NET and the one in ONNX. We have had to increment the version of the HashingEstimator to make them compatible. Since the HashJoiningTransform also uses MurmurHash, we should discuss its ONNX export mechanism and implement it before making this public, so that if any changes are required in the model versions or on the ONNX side, we make them only once.
In reply to: 582428008 [](ancestors = 582428008,582086748)
Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:646 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System.Collections.Generic;
Can you please rename this file to not call it Dracula? #Resolved
Fixes #4016.