Skip to content

[Part 3] Added convenience constructors for set of transforms. #520

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 16, 2018

Conversation

zeahmed
Copy link
Contributor

@zeahmed zeahmed commented Jul 10, 2018

This PR fixes #518. The convenience constructors were added for following transforms.

  • GroupTransform.cs
  • HashJoinTransform.cs
  • KeyToBinaryVectorTransform.cs
  • LoadTransform.cs
  • MissingValueIndicatorTransform.cs
  • MutualInformationFeatureSelectionTransform.cs
  • NADropTransform.cs
  • NAHandleTransform.cs
  • NAIndicatorTransform.cs
  • NAReplaceTransform.cs
  • OptionalColumnTransform.cs
  • RffTransform.cs
  • UngroupTransform.cs
  • WhiteningTransform.cs

@@ -79,6 +79,18 @@ public sealed class Arguments : TransformInputBase

private readonly SchemaImpl _schemaImpl;

/// <summary>
/// Convenience constructor for public facing API.
Copy link
Contributor

@TomFinley TomFinley Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convenience constructor for public facing API. [](start = 12, length = 46)

We might want to improve this description, since from the point of view of the user it doesn't really help them much to know that they're using the public facing API really. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... I see this was mostly done via copy-paste, I'm afraid that an actual description will be necessary. Look on the bright side, I think that an understanding of how it could be used can only help you imagine the best way in which the convenience should be structured.


In reply to: 201910848 [](ancestors = 201910848)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably obvious, but just in case not clear this is intended as a general comment.


In reply to: 201912669 [](ancestors = 201912669,201910848)

Copy link
Contributor Author

@zeahmed zeahmed Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking about adding more description to the constructors and wanted to open a separate issue to include clear description to each of the convenience constructors. What do you say?


In reply to: 201912748 [](ancestors = 201912748,201912669,201910848)

Copy link
Contributor Author

@zeahmed zeahmed Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened issue #524 for addressing this comment. Because it needs to be done for all the transforms we have worked so far.


In reply to: 202117587 [](ancestors = 202117587,201912748,201912669,201910848)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, sounds good @zeahmed


In reply to: 202121318 [](ancestors = 202121318,202117587,201912748,201912669,201910848)

/// Convenience constructor for public facing API.
/// </summary>
/// <param name="env">Host Environment.</param>
/// <param name="input">Input <see cref="IDataView"/>. This is the output from previous transform or loader.</param>
Copy link
Contributor

@TomFinley TomFinley Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Input . This is the output from previous transform or loader [](start = 32, length = 83)

This on the other hand might be over-description. Certainly this is not the best place for people to learn how to instantiate IDataViews if they don't already know how to do so. in It's also not quite correct, since there are certainly other data views other than loaders and transforms. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will update it as a part of this issue #524.


In reply to: 201911091 [](ancestors = 201911091)

@TomFinley
Copy link
Contributor

TomFinley commented Jul 12, 2018

    }

Descriptions of these might be good.

I see that the summary comment on the class itself already has some documentation. The relevant section where it explains the enum should be moved, and formatted appropriately, etc. #Closed


Refers to: src/Microsoft.ML.Transforms/UngroupTransform.cs:69 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

/// <param name="input">Input <see cref="IDataView"/>. This is the output from previous transform or loader.</param>
/// <param name="name">Name of the output column.</param>
/// <param name="source">Name of the column to be transformed. If this is null '<paramref name="name"/>' will be used.</param>
/// <param name="newDim">The number of random Fourier features to create.</param>
Copy link
Contributor

@TomFinley TomFinley Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// The number of random Fourier features to create. [](start = 8, length = 81)

This ought to be required. #Closed

/// <param name="name">Name of the output column.</param>
/// <param name="source">Name of the column to be transformed. If this is null '<paramref name="name"/>' will be used.</param>
/// <param name="newDim">The number of random Fourier features to create.</param>
/// <param name="useSin">create two features for every random Fourier frequency? (one for cos and one for sin).</param>
Copy link
Contributor

@TomFinley TomFinley Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create two features for every random Fourier frequency? [](start = 33, length = 55)

So, if newDim was 10, and I set this to true, does the output become 20 or something?

This guy is a little complex. I wonder if we can get away with not including it in the convenience? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it creates 2 * newDim features when useSin=true. Yes, we can remove useSin parameter as it's default value is also false.


In reply to: 201911445 [](ancestors = 201911445)

@TomFinley
Copy link
Contributor

TomFinley commented Jul 12, 2018

"", "ProduceIdTransform", "ProduceId")]

My thinking is that transforms that are hidden probably don't need convenience constructors. This thing in particular is mostly something I have for debugging. #Closed


Refers to: src/Microsoft.ML.Transforms/ProduceIdTransform.cs:12 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

@TomFinley
Copy link
Contributor

TomFinley commented Jul 12, 2018

public class OptionalColumnTransform : RowToRowMapperTransformBase

I'm not sure what this class is, ad why it isn't sealed. Do you understand it? #Closed


Refers to: src/Microsoft.ML.Transforms/OptionalColumnTransform.cs:29 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

@TomFinley
Copy link
Contributor

TomFinley commented Jul 12, 2018

        Maximum,

Now that these are being used as programmatic constructs maybe migrate the documentation above. #Closed


Refers to: src/Microsoft.ML.Transforms/NAHandleTransform.cs:43 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

}

internal static string RegistrationName = "MutualInformationFeatureSelectionTransform";

/// <summary>
/// A helper method to create <see cref="MutualInformationFeatureSelectionTransform"/> for public facing API.
Copy link
Contributor

@TomFinley TomFinley Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MutualInformationFeatureSelectionTransform [](start = 49, length = 42)

You can't be creating it, it's a static class. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comments as before, when documenting we ought to describe what it is actually useful for, not just describing its return values (which are part of the method signature, and we aren't even doing that quite right).


In reply to: 201912038 [](ancestors = 201912038)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add detailed comments as part of #524.


In reply to: 201912123 [](ancestors = 201912123,201912038)

@TomFinley
Copy link
Contributor

TomFinley commented Jul 12, 2018

"", "MissingValueIndicatorTransform", "MissingValueTransform", "MissingTransform", "Missing")]

Same, this is hidden, and mostly deprecated in favor of the NAIndicatorTransform. We only really have it around for backwards compatibility reasons... #Closed


Refers to: src/Microsoft.ML.Transforms/MissingValueIndicatorTransform.cs:16 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

@TomFinley
Copy link
Contributor

TomFinley commented Jul 12, 2018

/// Tom  [Table, Kitten]

Oh, memories. :D #Closed


Refers to: src/Microsoft.ML.Transforms/GroupTransform.cs:50 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

/// <param name="input">Input <see cref="IDataView"/>. This is the output from previous transform or loader.</param>
/// <param name="groupKey">Columns to group by</param>
/// <param name="columns">Columns to group together</param>
public GroupTransform(IHostEnvironment env, IDataView input, string[] groupKey, params string[] columns)
Copy link
Contributor

@TomFinley TomFinley Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string[] groupKey [](start = 69, length = 17)

Considering the cases I'm aware of where it is used, I have to think that a single string would be preferable for the convenience constructor. #Closed

@zeahmed
Copy link
Contributor Author

zeahmed commented Jul 12, 2018

public class OptionalColumnTransform : RowToRowMapperTransformBase

After a long discussion yesterday on a separate thread I got to know the purpose of this transform.

This is the transform used to mark some of the columns (e.g. Label) optional during training. So that the columns is not required during scoring. At scoring time, TLC checks to see if the data schema for scoring matches the data used for training except for the optional columns.

I am not sure why its not sealed. Looking at all other transforms, I assume it should be sealed as well. I am not changing it right now unless you see it feasible to do so.


In reply to: 404392058 [](ancestors = 404392058)


Refers to: src/Microsoft.ML.Transforms/OptionalColumnTransform.cs:29 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

@zeahmed
Copy link
Contributor Author

zeahmed commented Jul 12, 2018

/// Tom  [Table, Kitten]

hahaha...easy to catch culprits this way...:D


In reply to: 404392839 [](ancestors = 404392839)


Refers to: src/Microsoft.ML.Transforms/GroupTransform.cs:50 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

private static class Defaults
{
public const string Column = "Id";
}
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably don't need this Defaults class any longer. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we might as well revert the whole file probably.


In reply to: 202426903 [](ancestors = 202426903)

@@ -34,12 +34,30 @@ namespace Microsoft.ML.Runtime.Data
/// </summary>
public static class NAHandleTransform
{
/// <summary>
///
/// </summary>
public enum ReplacementKind
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty? #Closed

@TomFinley
Copy link
Contributor

public class OptionalColumnTransform : RowToRowMapperTransformBase

May as well. And since you've gone to the effort of digging up what it's actually useful for maybe just paste that into an XML comment.


In reply to: 404611599 [](ancestors = 404611599,404392058)


Refers to: src/Microsoft.ML.Transforms/OptionalColumnTransform.cs:29 in d9c4a29. [](commit_id = d9c4a29, deletion_comment = False)

public enum UngroupMode
{
/// <summary>
/// A number of output rows are equal to the minimum length of pivot columns
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these "A number" would read better as "The number". #Closed

@TomFinley
Copy link
Contributor

TomFinley commented Jul 13, 2018

Getting really close thanks @zeahmed! #Closed

/// <summary>
/// This transform is used to mark some of the columns (e.g. Label) optional during training So that the columns is not required during scoring..
/// At scoring time, it is checked if the data schema for scoring matches the data used for training except for the optional columns.
/// </summary>
public class OptionalColumnTransform : RowToRowMapperTransformBase
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public class [](start = 4, length = 12)

Sealed? :) #Closed

Outer,

/// <summary>
/// The number of output rows are equal to the length of the first pivot column.
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are [](start = 42, length = 3)

is equal #Closed

Inner,

/// <summary>
/// The number of output rows are equal to the maximum length of pivot columns
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are [](start = 42, length = 3)

is equal #Closed

public enum UngroupMode
{
/// <summary>
/// The number of output rows are equal to the minimum length of pivot columns
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are [](start = 42, length = 3)

is equal #Closed

@@ -26,6 +26,10 @@

namespace Microsoft.ML.Runtime.DataPipe
{
/// <summary>
/// This transform is used to mark some of the columns (e.g. Label) optional during training So that the columns is not required during scoring..
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So [](start = 97, length = 2)

lower case "so" since this is in the middle of a sentence? #Closed

@@ -26,6 +26,10 @@

namespace Microsoft.ML.Runtime.DataPipe
{
/// <summary>
/// This transform is used to mark some of the columns (e.g. Label) optional during training So that the columns is not required during scoring..
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columns is [](start = 109, length = 10)

"columns is" => "columns are" #Closed

@@ -26,6 +26,10 @@

namespace Microsoft.ML.Runtime.DataPipe
{
/// <summary>
/// This transform is used to mark some of the columns (e.g. Label) optional during training So that the columns is not required during scoring..
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. [](start = 147, length = 2)

One period instead of two. #Closed

@@ -26,6 +26,10 @@

namespace Microsoft.ML.Runtime.DataPipe
{
/// <summary>
/// This transform is used to mark some of the columns (e.g. Label) optional during training So that the columns is not required during scoring..
/// At scoring time, it is checked if the data schema for scoring matches the data used for training except for the optional columns.
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At scoring time, it is checked if the data schema for scoring matches the data used for training except for the optional columns. [](start = 8, length = 129)

I'm not sure I actually understand this. What it really does is, if the column is not there, it will make it appear as if it is there with metadata filled in, and if the column is requested will return entirely default values. Or so I understand the code in the cursor. Is there some way to describe it in a way that communicates that? I'm not sure this sentence here actually communicates that. #Closed

public class OptionalColumnTransform : RowToRowMapperTransformBase
/// <summary>
/// This transform is used to mark some of the columns (e.g. Label) optional during training so that the columns are not required during scoring.
/// When applied to new data, if optional columns are not present a meta column is created having the same properties (e.g. 'name', 'type' etc.) as used during training.
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta column [](start = 72, length = 11)

The phrase meta-column is a bit odd, I'm not sure that's a good description. Maybe dummy column? Mock column? #Closed

/// - scalar for scalar column
/// - totally sparse vector for vector column.
/// If value of the column is requested the default value will be returned.
/// </summary>
Copy link
Contributor

@TomFinley TomFinley Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last line seems redundant. #Closed

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Member

@codemzs codemzs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@zeahmed
Copy link
Contributor Author

zeahmed commented Jul 16, 2018

Thanks @TomFinley and @codemzs!

@zeahmed zeahmed merged commit c491651 into dotnet:master Jul 16, 2018
@zeahmed zeahmed deleted the convenience_constructor3 branch July 17, 2018 17:42
eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Part 3] Create convenience constructor for the listed Transforms.
3 participants