Skip to content

Xml docs for trainers and a minor infrastructure changes #455

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 2, 2018

Conversation

sfilipi
Copy link
Member

@sfilipi sfilipi commented Jun 29, 2018

Addresses: #388

This PR adds an EntryPointInfo attribute that will contain the XML documentation to append to the summary: "Description", "References", "See also" sections.
It also modifies the C# generation code to append the content of this new attribute to the summary.

sfilipi added 2 commits June 29, 2018 14:22
…ted Summary class XML.

Adding documentation details and references for another batch of the trainers.
@sfilipi
Copy link
Member Author

sfilipi commented Jun 29, 2018

This is a first commit, to allow starting to review the changes.
I'll add another commit with more trainers documentation. #Resolved

internal const string Summary = "K-means is a popular clustering algorithm. With K-means, the data is clustered into a specified "
+ "number of clusters in order to minimize the within-cluster sum of squares. K-means++ improves upon K-means by using a better "
+ "method for choosing the initial cluster centers.";
internal const string Summary = "KMeans++ clustering algorithm";
Copy link
Contributor

@TomFinley TomFinley Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal const string Summary = "KMeans++ clustering algorithm"; [](start = 8, length = 64)

This summary isn't quite as useful. I understand that we're adding more descriptive text on remarks for the API, but this is a temporary solution as noted elsewhere, so why are we degrading the text that we do have? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other summaries are just one liner, like: ''Averaged Perceptron Binary Classifier." or "Train a field-aware factorization machine for binary classification".
This felt like the outlier. Restoring it.


In reply to: 199548768 [](ancestors = 199548768)

Copy link
Contributor

@TomFinley TomFinley Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I don't know. I think people don't write good summaries because they're lazy, I wouldn't want to normalize that laziness when someone actually did a good job writing a summary. :) #Resolved

@@ -11,7 +11,7 @@
using Microsoft.ML.Runtime.LightGBM;
using Microsoft.ML.Runtime.Model;

[assembly: LoadableClass(LightGbmBinaryTrainer.Summary, typeof(LightGbmBinaryTrainer), typeof(LightGbmArguments),
[assembly: LoadableClass(LightGbmBinaryTrainer.UserName, typeof(LightGbmBinaryTrainer), typeof(LightGbmArguments),
Copy link
Contributor

@TomFinley TomFinley Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UserName [](start = 47, length = 8)

The first thing I believe is the summary, or should be. If it was badly written, that's fine, we ought to improve it, but to use the user-name as the first argument is wrong.

Also the thing that is actually being filled in as the user name is a string literal (that somehow differs from the field declared as UserName. #Resolved

Elastic net regularization can be specified by the l2_weight and l1_weight parameters. Note that the l2_weight has an effect on the rate of convergence.
In general, the larger the l2_weight, the faster SDCA converges.";
The results depends on the order of the training data. For reproducible results, it is recommended that one sets <paramref>shuffle</paramref> to
False and <paramref>NumThreads</paramref> to 1.
Copy link
Contributor

@TomFinley TomFinley Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumThreads [](start = 10, length = 31)

Is the correct form not <paramref href=?

Also this seems like kind of a strange solution... #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's actually . Thanks for spotting it.


In reply to: 199549798 [](ancestors = 199549798)

@@ -82,6 +82,31 @@ public abstract class FastTreeTrainerBase<TArgs, TPredictor> :

protected string InnerArgs => CmdParser.GetSettings(Host, Args, new TArgs());

internal const string Remarks = @"<remarks>
<para>FastTrees is an implementation of FastRank. FastRank is an efficient implementation of the <a href='https://arxiv.org/abs/1505.01866'>MART</a> gradient boosting algorithm.
Copy link
Contributor

@TomFinley TomFinley Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"FastTrees is" -> "FastTree is"... Also they will have no idea what FastRank is. Bringing in FastRank just complicates the picture a bit. The long history of Microsoft's internal tools, while possibly interesting to some people, does not belong in the first few sentences of user document. #Resolved

In general, the larger the l2_weight, the faster SDCA converges.";
The results depends on the order of the training data. For reproducible results, it is recommended that one sets <paramref>shuffle</paramref> to
False and <paramref>NumThreads</paramref> to 1.
Elastic net regularization can be specified by the l2_weight and l1_weight parameters. Note that the <paramref>l2_weight</paramref> has an effect on the rate of convergence.
Copy link
Contributor

@TomFinley TomFinley Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are l2_weight and l1_weight parameters? Those do not conform to .NET naming guidelines. Are they really named that? Also same note about <paramref href, maybe serach for <paramref> in your code to find and fix. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, spotted the l2_weight and l1_weight and changed them to the actual L2Const and L1Threeshold.

The usage of paramref in the codebase is correct.


In reply to: 199558256 [](ancestors = 199558256)

@TomFinley
Copy link
Contributor

TomFinley commented Jul 2, 2018

So, just curious @sfilipi , if I had this task of generating temporary XML comments for these classes, I wouldn't have done it this way. I would have written a separate file in the API itself, with the classes I wanted to document as partial classes (they already are partial to allow for these sorts of shenanigans), with comments directly on the partial classes. We already use the partial trick class for this sort of thing. That will have a few advantages:

  1. Doesn't require this elaborate (and frankly strange) infrastructure,
  2. When we get rid of the old API, we can just copy the comments over more or less directly, rather than reforming strings against as XML comments.

What do you think?

@sfilipi
Copy link
Member Author

sfilipi commented Jul 2, 2018

The documentation that i was aiming at improving was in the generated CSharpApi; the classes in Microsoft.ML.Trainers and Microsoft.ML.Transforms namespaces.
So the comments would still need to live somewhere, and the CSharp generator would need a pointer to them through the EntryPoints infrastructure.
That's why I kept them with the classes themselves and wired them to the generated C#Api, through the EntryPoints.

Writting another class with all the partial classes by hand just for the comments seems like prone to oversight when it comes to adding the documentation.

Reading more about the C# docs, there is an tag that we can leverage, we'd keep the XML docs in the docs side, in XML, and just use this tag to wire the path to those files to the CSharp Api.
@shauheen what is your opinion?
Example: https://docs.microsoft.com/en-us/dotnet/csharp/codedoc#ltincludegt


In reply to: 401866820 [](ancestors = 401866820)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepted provisionally, that we work on making this better at a later time.

internal const string DetailedSummary = @"Perceptron is a classification algorithm that makes its predictions based on a linear function.
internal const string Summary = "Averaged Perceptron Binary Classifier.";
internal const string Remarks = @"<remarks>
Perceptron is a classification algorithm that makes its predictions based on a linear function.
I.e., for an instance with feature values f0, f1,..., f_D-1, , the prediction is given by the sign of sigma[0,D-1] ( w_i * f_i), where w_0, w_1,...,w_D-1 are the weights computed by the algorithm.
Copy link
Contributor

@shauheen shauheen Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e., for an instance [](start = 0, length = 21)

This is very hard to read. No need to fix now though. #Closed

Copy link
Contributor

@shauheen shauheen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@sfilipi sfilipi merged commit b9086b6 into dotnet:master Jul 2, 2018
@sfilipi sfilipi deleted the xmlDocs branch July 2, 2018 23:15
eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
* updating the C# api generator to append the Remarks XML to the generated Summary class XML.
Adding documentation details and references for another batch of the trainers.

* correcting test trigger condition

* typo

* More documentation

* substituting the <see> and <seealso> tags with <a> tags, since there is no official documentation on using href with those tags.

* adressing PR comments.
Fixing new line character in the ep_list and manifest.

* Fixing the list and code generation merges from master
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants