
Learning with counts (Dracula) transformer #4514

Merged
merged 27 commits into from
Jun 4, 2020

Conversation

yaeldekel

Fixes #4016.

@yaeldekel yaeldekel requested a review from a team as a code owner December 2, 2019 09:53
@@ -188,9 +187,9 @@ private static VersionInfo GetVersionInfo()
IDataView input,
string name,
string source = null,
bool join = Defaults.Join,
bool join = Defaults.Combine,
Member

@eerhardt eerhardt Dec 2, 2019


bool join => bool combine. And update the xml docs #Resolved

Member

@eerhardt eerhardt Dec 10, 2019


Actually, I am rethinking this comment. What's weird to me is that this is the "hash joining transform", and we are calling this behavior "combine". That seems inconsistent. We should use the same term to mean the same thing. Either always use "join" or always use "combine". #Resolved

Author


I was actually thinking that the name HashJoiningTransform is not very informative. What do you think about VectorHashingTransform?


In reply to: 356308668 [](ancestors = 356308668)

Contributor


VectorHashingTransform implies each element is going to be hashed separately (which is the default behavior). With respect to @eerhardt's comment that we should use either Join or Combine consistently, a better name might be HashCombiningTransform.


In reply to: 356607566 [](ancestors = 356607566,356308668)

Author


I have reverted the changes in this file, since it is no longer being used in this PR. This transform is now not being used directly anywhere in the code, and just being kept around for backward compatibility of maml, and for its entry point.


In reply to: 427561266 [](ancestors = 427561266,356607566,356308668)

}
}

public static class Dracula
Member

@eerhardt eerhardt Dec 2, 2019


No need for this to be public, it only has an internal method. #Resolved

}
}

public static class CountTable
Member

@eerhardt eerhardt Dec 2, 2019


should be internal #Resolved

/// <summary>
/// Signature for CountTableBuilder.
/// </summary>
public delegate void SignatureCountTableBuilder();
Member

@eerhardt eerhardt Dec 2, 2019


Does this need to be public? #Resolved

@codemzs
Member

codemzs commented Dec 2, 2019

@yaeldekel Shall we give it a better name as opposed to calling it "Dracula"? Seems a little informal to me. #Resolved

private static VersionInfo GetVersionInfo()
{
return new VersionInfo(
modelSignature: "CM CT",
Member

@codemzs codemzs Dec 2, 2019


"CM CT",

what kind of model signature is this? Why is there a tab in between? #Resolved


namespace Microsoft.ML.Transforms
{
public class DraculaEstimator : IEstimator<DraculaTransformer>
Contributor


Moving @codemzs' comment to a thread:

@yaeldekel Shall we give it a better name as opposed to calling it "Dracula"? Seems a little informal to me.

Contributor


If Dracula is not used, CountTargetEncoder might be suitable.

Our old blog post lists DRACULA as "Distributed Robust Algorithm for Count-based Learning": https://blogs.technet.microsoft.com/machinelearning/2015/02/17/big-learning-made-easy-with-counts/

As an aside, there's also public slides:
https://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft
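For readers following along, count-based target encoding ("learning with counts") replaces a high-cardinality categorical feature with its per-label counts and a smoothed log-odds derived from them. A minimal Python sketch of the idea, under stated assumptions — the function names, the prior, and the binary-label setup here are illustrative stand-ins, not the ML.NET API:

```python
import math
from collections import defaultdict

def fit_count_table(categories, labels):
    """Count occurrences of each binary label (0/1) per category value."""
    table = defaultdict(lambda: [0, 0])
    for cat, y in zip(categories, labels):
        table[cat][y] += 1
    return table

def encode(table, cat, prior=1.0):
    """Replace a category with [pos count, neg count, smoothed log-odds]."""
    neg, pos = table.get(cat, [0, 0])
    log_odds = math.log((pos + prior) / (neg + prior))
    return [pos, neg, log_odds]

table = fit_count_table(["a", "a", "b", "a"], [1, 1, 0, 0])
print(encode(table, "a"))  # pos=2, neg=1, log-odds ln(3/2)
```

Unseen categories fall back to the prior alone, which is one reason the smoothing term matters.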

Author


Good suggestion. Changing the name to CountTargetEncoder, unless someone has a different idea?

@codecov

codecov bot commented Dec 3, 2019

Codecov Report

Merging #4514 into master will decrease coverage by 2.13%.
The diff coverage is 93.99%.

@@            Coverage Diff             @@
##           master    #4514      +/-   ##
==========================================
- Coverage   75.65%   73.52%   -2.14%     
==========================================
  Files         990     1016      +26     
  Lines      179573   190109   +10536     
  Branches    19308    20467    +1159     
==========================================
+ Hits       135853   139774    +3921     
- Misses      38453    44768    +6315     
- Partials     5267     5567     +300     
Flag Coverage Δ
#Debug 73.52% <93.99%> (-2.14%) ⬇️
#production 69.27% <93.02%> (-2.29%) ⬇️
#test 87.61% <100.00%> (-1.29%) ⬇️
Impacted Files Coverage Δ
src/Microsoft.ML.Core/Data/IEstimator.cs 84.53% <ø> (ø)
test/Microsoft.ML.TestFramework/TestCommandBase.cs 44.56% <ø> (+0.40%) ⬆️
src/Microsoft.ML.Transforms/Dracula/CountTable.cs 86.76% <86.76%> (ø)
...Microsoft.ML.Transforms/Dracula/MultiCountTable.cs 88.51% <88.51%> (ø)
...oft.ML.Transforms/Dracula/CountTableTransformer.cs 90.08% <90.08%> (ø)
...crosoft.ML.Transforms/Dracula/CountTableBuilder.cs 93.75% <93.75%> (ø)
...ansforms/Dracula/CountTargetEncodingTransformer.cs 93.75% <93.75%> (ø)
src/Microsoft.ML.Transforms/Dracula/Featurizer.cs 96.66% <96.66%> (ø)
...rc/Microsoft.ML.Transforms/Dracula/CMCountTable.cs 98.82% <98.82%> (ø)
...c/Microsoft.ML.Data/DataLoadSave/EstimatorChain.cs 89.65% <100.00%> (+0.36%) ⬆️
... and 235 more

@sharwell
Member

sharwell commented Dec 10, 2019

📝 The OutOfMemoryException test failures appear to be true failures #Resolved


namespace Microsoft.ML.Transforms
{
public sealed class CountTableEstimator : IEstimator<CountTableTransformer>
Member

@eerhardt eerhardt Dec 10, 2019


Does this class need to be public? How can an external user use this class? #Resolved

Member


Also, xml comments on all public surface area.


In reply to: 356307263 [](ancestors = 356307263)


// Assumes features are filled with the respective counts.
// Adds laplacian noise if set and returns the sum of the counts.
private float AddLaplacianNoisePerLabel(int iCol, Random rand, Span<float> counts)
Contributor

@justinormont justinormont Dec 10, 2019


Is the noise still added once the ML model is fully created?

@davidbrownellWork was mentioning that there is added noise at the prediction time. This complicates the ONNX conversion. The AutoML team is looking to use Dracula for its target encoding and we need it in ML.NET, NimbusML, and all the way to ONNX. #Resolved

Author


The noise is only used at prediction time. From your experience using this transform, do you think it is needed?


In reply to: 356322013 [](ancestors = 356322013)

Author

@yaeldekel yaeldekel Jan 8, 2020


I looked at the EstimatorChain class, and it seems to me that it would be possible to add a method similar to AppendCacheCheckPoint that would add noise to a requested column. Here is the code that is currently in the Fit method of EstimatorChain:

            IDataView current = input;
            var xfs = new ITransformer[_estimators.Length];
            for (int i = 0; i < _estimators.Length; i++)
            {
                var est = _estimators[i];
                xfs[i] = est.Fit(current);
                current = xfs[i].Transform(current);
                if (_needCacheAfter[i] && i < _estimators.Length - 1)
                {
                    Contracts.AssertValue(_host);
                    current = new CacheDataView(_host, current, null);
                }
            }

If we add, similarly to how we have current = new CacheDataView(...), something like current = new NoiseDataView(...), then we can let the user write something like

var pipeline = mlContext.Transforms.CountTargetEncoding()
    .AppendNoise("Features").Append(mlContext.Binary.Sdca());

then the SDCA component in the pipeline will see the Features column with noise added to it when training, however, the resulting TransformerChain will not have noise as part of the model.

Together with good samples and documentation emphasizing the recommended way to use the CountTargetEncodingEstimator (which is either to train the transformer in a separate pipeline from the trainer and on different data, or to use the AppendNoise API), I think that may be a better solution than having a noise option as part of the estimator constructor.

Please let me know what you think of this solution and whether I should go ahead and implement that.


In reply to: 357085107 [](ancestors = 357085107,356322013)
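To make the proposal concrete, here is a minimal Python sketch of a Fit loop like the one quoted above, with a training-time-only noise step injected between stages. All names here (fit_pipeline, ScaleEstimator, noise_cols) are hypothetical stand-ins for illustration, not ML.NET API:

```python
import random

class ScaleEstimator:
    """Toy estimator/transformer pair standing in for an IEstimator."""
    def __init__(self, col, factor):
        self.col, self.factor = col, factor
    def fit(self, data):
        return self  # stateless for the sketch
    def transform(self, data):
        return [{**row, self.col: row[self.col] * self.factor} for row in data]

def fit_pipeline(estimators, data, noise_cols=(), noise_scale=0.1, rng=None):
    """Mirrors an EstimatorChain.Fit-style loop, but perturbs the named
    columns between stages during training only. The returned transformers
    carry no noise step, so the fitted model stays deterministic."""
    rng = rng or random.Random(0)
    transformers, current = [], data  # rows as dicts stand in for IDataView
    for est in estimators:
        xf = est.fit(current)
        transformers.append(xf)
        current = xf.transform(current)
        # Training-time-only noise; analogous to the proposed NoiseDataView.
        current = [{k: v + rng.uniform(-noise_scale, noise_scale)
                       if k in noise_cols else v
                    for k, v in row.items()}
                   for row in current]
    return transformers
```

The key property is that the noise affects only what downstream estimators see during fitting; transforming fresh data through the fitted chain is noise-free.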

Contributor

@gvashishtha gvashishtha Jan 9, 2020


Hi Yael, what is the name of this transform in TLC? I went to our TLC telemetry to look for internal users of the dracula transform but I did not see any usage for a transformer named "Dracula" #Resolved

Member

@antoniovs1029 antoniovs1029 Jan 9, 2020


@gvashishtha In the TLC GUI it's called "Dracula Transform" (under "Featurizers: Supervised") and in the TLC code it is called DraculaTransform… but I wouldn't know if other names are used in the telemetry. #Resolved

Contributor

@justinormont justinormont Jan 10, 2020


If I recall, the log-odds are returned approximately as:

CountPositive = CountPositive + LaplacianNoise(); // Add noise to pos count
CountNegative = CountNegative + LaplacianNoise();  // Add noise to neg count
logOdds = ln( (CountPositive + PriorCount) / (CountPositive + CountNegative + 2*PriorCount) );  // Includes 2x noise, plus prior as smoothing

I don't think we'll be able to just add noise outside of the Dracula transform and end up with the same result.

My recommendation is keeping the noise computation as is, and disable noise for the ONNX unit tests, or add a window (e.g. +/- 10%) of acceptable values when checking for (near) equality of the outputs. #Resolved
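A hedged Python sketch of the noisy log-odds computation described above. The Laplace sampler and parameter names are illustrative, not the actual Dracula implementation, and the formula is the approximate one quoted in this thread:

```python
import math
import random

def noisy_log_odds(pos, neg, prior_count=1.0, noise_scale=0.0, rng=None):
    """Log-odds from per-label counts, optionally perturbing each count
    with Laplacian noise first, mirroring the pseudocode above."""
    if noise_scale > 0:
        rng = rng or random.Random()
        def laplace(b):
            # Symmetric Laplace(0, b) sample via inverse CDF; the max()
            # guards against log(0) at the tail.
            u = rng.random() - 0.5
            return -b * math.copysign(math.log(max(1 - 2 * abs(u), 1e-12)), u)
        pos += laplace(noise_scale)
        neg += laplace(noise_scale)
    return math.log((pos + prior_count) / (pos + neg + 2 * prior_count))

print(noisy_log_odds(30, 10))  # noise off: ln(31/42)
```

With noise enabled, two independent Laplace draws perturb the numerator and denominator, which is why adding noise outside the transform cannot reproduce the same distribution.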

Author


The two possible solutions as I see it are:

  • Leaving the computation as it is right now.
    • Pros: easy solution for fitting a pipeline that has a trainer after the Dracula transform without having to split into two pipelines and training each part on a different dataset.
    • Cons: if the ONNX export of this transform will not include the noise part (which it likely won't), then this will result in a model that gives different results in the ML.NET format and in the ONNX format.
  • Making the transformer apply the noise, but not serialize the noise parameter, i.e. after deserialization it won't add noise.
    • Pros: consistency with the ONNX version of the model. Also, the noise is really not needed at prediction time, it is only needed to train the next component in the pipeline.
    • Cons: if you call Transform on the pipeline with the same data right after it is trained, and after serializing and deserializing the model, it will give different results.

I don't particularly like either of these solutions; I think they are both problematic. But currently I can't think of a solution that avoids all of the cons listed above... Which solution do you think I should go with?

cc @justinormont, @eerhardt


In reply to: 365470116 [](ancestors = 365470116)


@davidbrownellWork davidbrownellWork Jan 27, 2020


The introduction of noise is a transformation step that is only used when the featurizer is included in a pipeline that is used during training time. Because ONNX exports are only used for inference, it seems OK that the export doesn't include noise, as the usage in ONNX will be the same as when this featurizer is used in an inferencing ML.Net pipeline.

It seems that the first point is the solution, as long as the code is written such that noise is not introduced when the featurizer is part of an inferencing pipeline. #Resolved

}
}

public sealed class CountTargetEncodingTransformer : ITransformer
Member

@codemzs codemzs Dec 26, 2019


CountTargetEncodingTransformer

documentation? #Resolved


namespace Microsoft.ML.Transforms
{
public class CountTargetEncodingEstimator : IEstimator<CountTargetEncodingTransformer>
Member

@codemzs codemzs Dec 26, 2019


CountTargetEncodingEstimator

Documentation? #Resolved

@@ -0,0 +1,553 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
Member

@codemzs codemzs Dec 26, 2019


Please rename the file to a more relevant name. If we are calling it CountTargetEncoding then we should not be seeing Dracula in the codebase. #Resolved

@harishsk
Contributor

harishsk commented Feb 4, 2020

{

This class doesn't follow the typical pattern of using a Mapper. Any particular reason why?
#Resolved


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:35 in 91887da. [](commit_id = 91887da, deletion_comment = False)

@harishsk
Contributor

harishsk commented Feb 4, 2020

{

How is this different from HashingEstimator? #Resolved


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:35 in 91887da. [](commit_id = 91887da, deletion_comment = False)

@harishsk
Contributor

harishsk commented Feb 4, 2020

            };

Internally this too uses the MurmurHash that is used by HashingTransformer. Is it possible to rationalize this transform with the HashingTransformer? #Resolved


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:647 in 91887da. [](commit_id = 91887da, deletion_comment = False)

@harishsk
Contributor

harishsk commented Feb 4, 2020

                return Hashing.MurmurHash(seed, sb, 0, sb.Length);

We have had a number of backward compatibility issues to address with MurmurHash when trying to export to onnx. Can we defer making this public until we have implemented the onnx export functionality for this transformer? #Resolved


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:646 in 91887da. [](commit_id = 91887da, deletion_comment = False)

@yaeldekel
Author

{

This class was never converted to IEstimator/ITransformer because it was never a priority. It is currently only used in the entry points for train-test and cv split. The cv command and the public train-test and cv APIs all use the HashingEstimator.
There are a couple of differences between HashJoiningTransform and HashingEstimator:

  • HashJoiningTransform hashes floats/doubles using murmur hash, and the rest of the types by converting the value to string and then hashing. HashingEstimator has a hash function for the simple data view types: floats, doubles, all the integer types, all the unsigned integer types, all the key types, bool and text. I think that it was implemented that way for simplicity, not because of some kind of advantage of hashing strings over other types.
  • HashJoiningTransform has an option to hash a vector of values into a single hash value, as opposed to the HashingEstimator, which always handles vectors by hashing each slot separately and producing a vector of hashes. This is an advantage that is useful for cv and train-test splits, since these splits apply a RangeFilter on top of the output of HashingTransformer, which would cause an exception if given a vector column.
  • HashJoiningTransform has an option to specify subsets of the vector input that should be hashed together. For example, the user can specify that they wish to hash slots 1,3,5 into a single hash value, and slots 0,2,4 into a single hash value (this option is where the name of the transform came from :-)). I think that this capability was added specifically for the dracula transform, although I don't know how useful it is (@justinormont, can you say anything about that?).

In reply to: 582084074 [](ancestors = 582084074)


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:35 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)
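The per-slot versus combined hashing distinction described above can be sketched in a few lines of Python. A toy hash function stands in for MurmurHash; the names are illustrative, not ML.NET API:

```python
def toy_hash(value, seed=42, bits=10):
    """Stand-in for MurmurHash: hashes a value's string form to `bits` bits."""
    h = seed
    for byte in str(value).encode():
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h % (1 << bits)

def hash_per_slot(vector):
    """HashingEstimator-style: one hash per slot, output is a vector."""
    return [toy_hash(v) for v in vector]

def hash_combined(vector):
    """HashJoiningTransform-style 'join': whole vector to one scalar hash."""
    return toy_hash(tuple(vector))

v = [1.5, 2.5, 3.5]
print(hash_per_slot(v))  # a vector of three hash values
print(hash_combined(v))  # a single scalar hash
```

The combined form is what makes the output usable by a downstream RangeFilter, which expects a scalar column.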

@yaeldekel
Author

            };

I think that if we add the option to hash a vector into a single hash to HashingTransformer then we can switch the CountTargetEncodingTransformer to use it instead of HashJoiningTransform, but I would like to wait for @justinormont to comment about the importance of the Combine option first.


In reply to: 582085386 [](ancestors = 582085386)


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:647 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)

@yaeldekel
Author

                return Hashing.MurmurHash(seed, sb, 0, sb.Length);

Is there a plan to switch HashingTransformer to use something other than Murmur hash?


In reply to: 582086748 [](ancestors = 582086748)


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:646 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)

@harishsk
Contributor

                return Hashing.MurmurHash(seed, sb, 0, sb.Length);

No, there is no plan to switch to something other than MurmurHash. But we have had a number of discussions on how to keep compatibility between the version of MurmurHash in ML.NET and the one in Onnx. We have had to increment the version of the HashingEstimator to make them compatible.

Since the HashJoiningTransform also uses MurmurHash we should discuss its Onnx export mechanism and implement it before making this public so that if any changes are required in the model versions or on the onnx side, we make it only once.


In reply to: 582428008 [](ancestors = 582428008,582086748)


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:646 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)

@harishsk
Contributor

            };

It would be great if the CountTargetEncodingTransformer also used the HashingTransformer.


In reply to: 582427616 [](ancestors = 582427616,582085386)


Refers to: src/Microsoft.ML.Transforms/HashJoiningTransform.cs:647 in c3df0e0. [](commit_id = c3df0e0, deletion_comment = False)

// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System.Collections.Generic;
Contributor

@harishsk harishsk May 19, 2020


Can you please rename this file to not call it Dracula? #Resolved

Contributor

@harishsk harishsk left a comment


:shipit:

@yaeldekel yaeldekel merged commit c0eeea9 into dotnet:master Jun 4, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Development

Successfully merging this pull request may close these issues.

Add a "learning with counts" transformer
10 participants