
support sweeping multiline option in AutoML #5148

Merged
9 commits, merged May 21, 2020

Conversation

LittleLittleCloud
Contributor

No description provided.

@LittleLittleCloud LittleLittleCloud requested a review from a team as a code owner May 20, 2020 19:08
var foundAny = false;
var result = default(ColumnSplitResult);
foreach (var perm in (from _allowSparse in sparse
                      from _allowQuote in quote
                      from _tryMultiline in tryMultiline
                      from _sep in separatorCandidates
                      select new { _allowSparse, _allowQuote, _tryMultiline, _sep }))
Contributor

The naming pattern would be:

Suggested change
from _tryMultiline in tryMultiline
from _allowMultiline in multiline

",1,2



Copy link
Contributor

@justinormont justinormont May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contributor Author

Resolved

Contributor

There are still some missing, like:

public static ColumnInferenceResults InferColumns(MLContext context, string path, string labelColumn,
char? separatorChar, bool? allowQuotedStrings, bool? supportSparse, bool trimWhitespace, bool groupColumns)

Contributor Author

I prefer not to expose the readMultiline option in these APIs; simply sweeping it in the infer-split step should be enough.
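
(For context, a minimal sketch of what sweeping the multiline option during split inference could look like. Only the cross-join shape mirrors the diff above; the candidate lists, the TextLoader.Options construction, and the loop body are illustrative assumptions, not the PR's actual code.)

using System.Linq;
using Microsoft.ML.Data;

// Candidate values to sweep over (illustrative).
bool[] sparse = { true, false };
bool[] quote = { true, false };
bool[] tryMultiline = { true, false };
char[] separatorCandidates = { ',', '\t', ';' };

foreach (var perm in (from _allowSparse in sparse
                      from _allowQuote in quote
                      from _tryMultiline in tryMultiline
                      from _sep in separatorCandidates
                      select new { _allowSparse, _allowQuote, _tryMultiline, _sep }))
{
    // Build loader options for this permutation; ReadMultilines is the new
    // knob being swept so that quoted fields may span multiple lines.
    var options = new TextLoader.Options
    {
        AllowSparse = perm._allowSparse,
        AllowQuoting = perm._allowQuote,
        ReadMultilines = perm._tryMultiline,
        Separators = new[] { perm._sep },
    };

    // A real implementation would parse a sample of the file with these
    // options and keep the first configuration that yields a consistent
    // column count across rows.
}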

Comment on lines +1 to +3
id,Column1,Column2,Column3
1,this is a description, 1,2
2,"this is a quote description",1,2
Member

(pinned to unrelated code to allow threaded discussion)

@harishsk and I were under the impression that, in order for ModelBuilder to support the new readMultilines option, some changes might be needed in the CodeGenerator, since the code generated by ModelBuilder should now also include readMultilines = true whenever that parameter turns out to be necessary.

Are these changes not necessary?

Contributor Author

Yes and no. The CodeGen project should only refer to ML.NET packages that are already published. That's why we want to pin CodeGen to a specific version of ML.NET, to avoid any API changes breaking CodeGen. From that aspect we do need to upgrade CodeGen. At the same time, we can't change CodeGen before the release, because updating CodeGen would require referring to a nightly build of ML.NET for the generated project, which is not available externally.

My suggestion is to not update CodeGen this time, because it won't affect much. The code-generated project uses the LoadFromTextFile API with a user-defined ModelInput class, which should work even if there's a newline between quotes, if I recall correctly.

Member

I see, then it's ok.

> The code-generated project uses the LoadFromTextFile API with a user-defined ModelInput class, which should work even if there's a newline between quotes, if I recall correctly.

Using the generic LoadFromTextFile API you mention will still use the default value for readMultilines, which is false. It would need to be set explicitly to true where necessary.

Notice that until recently LoadFromTextFile accepted only a handful of parameters in its signature, and I couldn't add the new readMultilines parameter there because that would be a breaking API change (which we are not allowed to make until ML.NET version 2). So instead, what I did was add a new LoadFromTextFile(path, options) overload that accepts a TextLoader.Options object (see #5134).

So if users want to use the generic LoadFromTextFile<> and also use the new options (such as readMultilines, or the ones being introduced in #5147 or #5145), they will have to use the new public overload. In this sense, I think the CodeGenerator should also start using this other overload to have access to the new options.
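
(To make the distinction concrete, a minimal sketch contrasting the two overloads; this is not code from the PR, and the ModelInput class and file path are made-up placeholders.)

using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical schema for a CSV whose quoted fields may contain newlines.
public class ModelInput
{
    [LoadColumn(0)] public float Id { get; set; }
    [LoadColumn(1)] public string Description { get; set; }
}

public static class LoaderExample
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Parameter-based overload: there is no readMultilines parameter,
        // so the loader keeps its default of false.
        var withoutMultiline = mlContext.Data.LoadFromTextFile<ModelInput>(
            "data.csv", separatorChar: ',', hasHeader: true, allowQuoting: true);

        // Options-based overload: ReadMultilines can be set explicitly, so a
        // quoted field spanning several physical lines is read as one value.
        var withMultiline = mlContext.Data.LoadFromTextFile<ModelInput>(
            "data.csv",
            new TextLoader.Options
            {
                Separators = new[] { ',' },
                HasHeader = true,
                AllowQuoting = true,
                ReadMultilines = true
            });
    }
}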

Contributor

@justinormont justinormont May 20, 2020

Unless I'm missing part of it, if the CLI / Model Builder calls this new version of AutoML, you will have to update CodeGen.

Otherwise the multi-line files will be read by the AutoML code, but the CodeGen'd code will not read the user's file.
(Update: I see @antoniovs1029 added a comment before me which also explains this.)

> The CodeGen project should only refer to ML.NET packages that are already published.

All of Microsoft.ML.* should be self-consistent with each other. No package pinning needed. Almost all changes to the AutoML code should have a mirroring change in the CodeGen code.

When I import Microsoft.ML.CodeGeneration@0.17.0 into my project, I'm expecting it to reference Microsoft.ML@1.5.0 & Microsoft.ML.AutoML@0.17.0. Let's say my project has Microsoft.ML@1.5.0 to do prediction, and then I want to call CodeGen (yes, only an internal call currently). Then my project needs both Microsoft.ML@1.5.0 and Microsoft.ML@1.4.x, causing a version conflict. Consistent releases are better.

> At the same time, we can't change CodeGen before the release, because updating CodeGen would require referring to a nightly build of ML.NET for the generated project, which is not available externally.

What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version?

Contributor Author

@LittleLittleCloud LittleLittleCloud May 21, 2020

> What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version?

It's not that CLI/ModelBuilder would depend on a nightly build; in short, it's that the project generated by CLI/ModelBuilder needs to depend on a released build of ML.NET.

> All of Microsoft.ML.* should be self-consistent with each other. No package pinning needed. Almost all changes to the AutoML code should have a mirroring change in the CodeGen code.

I can hardly agree. There's no guarantee that every change in AutoML/ML.NET will be released, while projects generated by CodeGen should refer to a released version of ML.NET. If we mirror every change in AutoML/ML.NET into CodeGen, CodeGen will be in an unusable state between releases, because the ML.NET/AutoML that CodeGen refers to is different from what its generated projects refer to.

> When I import Microsoft.ML.CodeGeneration@0.17.0 into my project, I'm expecting it to reference Microsoft.ML@1.5.0 & Microsoft.ML.AutoML@0.17.0. Let's say my project has Microsoft.ML@1.5.0 to do prediction, and then I want to call CodeGen (yes, only an internal call currently). Then my project needs both Microsoft.ML@1.5.0 and Microsoft.ML@1.4.x, causing a version conflict. Consistent releases are better.

I don't see how that's related: if you want to call CodeGen, you need to call a CodeGen that is pinned to Microsoft.ML@1.5.0 instead of Microsoft.ML@1.4.0.

In fact, if we can make sure that before each release CodeGen is the final package to be updated, and that between two releases we don't make any changes to CodeGen, then a consistent release works. But if there is any breaking API change between two releases, then we both must and must not update the CodeGenerator, for the reason I mention above: either changing or not changing CodeGen leaves it in a broken state.

One solution to this dilemma is to pin CodeGen to a released version, but that breaks the consistent release and we would need to publish twice per release cycle: once for the ML.NET packages and once for CodeGen.

Another solution is to make sure ModelBuilder pins to a released version of AutoML/CodeGen, but that means we can't fix bugs caused by CodeGen between two releases, and again there's no guarantee of that.

Meanwhile, we have the same dilemma for AutoML too, but unlike CodeGen we can use an internal build for AutoML, and most of its bugs are not breaking bugs, so that's not a big deal.

A few more words, sorry to be verbose here. I don't see why CodeGen has to refer to AutoML/ML.NET; the only relationship between CodeGen and AutoML is the Pipeline class (and maybe also the TextLoaderOption class), which should be defined in CodeGen as a contract class. On the contrary, AutoML should refer to CodeGen to tell CodeGen how to generate projects. By doing that we can make CodeGen independent of ML.NET and resolve the dilemma we have now.

Member

@antoniovs1029 antoniovs1029 left a comment

LGTM.

Great to see that, in order to fix the InferColumns() scenario mentioned here, it wasn't necessary to update all the methods in the internal InferColumns API. So this now actually fixes #4460.

Let's just remember to update CodeGen once the release is done, to address my comment here.

And regarding this comment by @justinormont, I see that you didn't update it everywhere his searches pointed to, but I'm guessing that isn't necessary for the ModelBuilder team. In any case, since this PR already fixes #4460, I'm fine with the current state of the PR and would only expect @justinormont's sign-off on that other point.


public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowSparse, int columnCount)
public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount)
Contributor

Following the naming convention, it would be:

Suggested change
public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount)
public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowMultiline, bool allowSparse, int columnCount)

Contributor Author

TextLoader.Options uses ReadMultilines (as it does for allowQuote and allowSparse), and we'd better follow its naming style.
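
(For reference, a small sketch of the TextLoader.Options fields that these parameter names mirror; the values are illustrative.)

using Microsoft.ML.Data;

// The swept parameters follow the TextLoader.Options field names.
var options = new TextLoader.Options
{
    AllowQuoting = true,    // mirrored by allowQuote
    AllowSparse = false,    // mirrored by allowSparse
    ReadMultilines = true   // mirrored by readMultilines
};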

Contributor Author

Resolved

Co-authored-by: Justin Ormont <justinormont@users.noreply.github.com>
@harishsk harishsk merged commit e3ca7e0 into dotnet:master May 21, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022