support sweeping multiline option in AutoML #5148
Conversation
var foundAny = false;
var result = default(ColumnSplitResult);
foreach (var perm in (from _allowSparse in sparse
                      from _allowQuote in quote
                      from _tryMultiline in tryMultiline
                      from _sep in separatorCandidates
                      select new { _allowSparse, _allowQuote, _sep }))
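For readers unfamiliar with this pattern: each nested `from` clause multiplies the sweep space, so every combination of the candidate flags (including the newly added multiline one) gets tried. Below is a minimal standalone sketch of the same cross-product sweep, with hypothetical candidate values rather than the actual AutoML candidate sets:

```csharp
using System;
using System.Linq;

static class SweepSketch
{
    // Mirrors the nested `from` clauses in InferSplit: LINQ query syntax
    // expands the candidate arrays into their full cross product.
    public static int CountCombos()
    {
        bool[] sparse = { true, false };
        bool[] quote = { true, false };
        bool[] tryMultiline = { true, false };   // the newly swept option
        char[] separatorCandidates = { ',', '\t', ';' };

        var combos = from _allowSparse in sparse
                     from _allowQuote in quote
                     from _tryMultiline in tryMultiline
                     from _sep in separatorCandidates
                     select new { _allowSparse, _allowQuote, _tryMultiline, _sep };

        return combos.Count();
    }

    public static void Main() => Console.WriteLine(CountCombos()); // prints 24 (2 * 2 * 2 * 3)
}
```

Adding one more boolean `from` clause doubles the number of parser configurations the sweep evaluates.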
The naming pattern would be:
- from _tryMultiline in tryMultiline
+ from _allowMultiline in multiline
There are a variety of other places where the multi-line option needs to be added.
Basically, anywhere "sparse" is found within AutoML/CodeGen:
- https://github.com/dotnet/machinelearning/search?q=sparse+path%3Asrc%2FMicrosoft.ML.AutoML&unscoped_q=sparse+path%3Asrc%2FMicrosoft.ML.AutoML
- https://github.com/dotnet/machinelearning/search?q=sparse+path%3Asrc%2FMicrosoft.ML.CodeGenerator&unscoped_q=sparse+path%3Asrc%2FMicrosoft.ML.CodeGenerator
(pinned to line of code to allow for discussion)
Resolved
There are still some missing, like:
machinelearning/src/Microsoft.ML.AutoML/ColumnInference/ColumnInferenceApi.cs
Lines 31 to 32 in c025270
public static ColumnInferenceResults InferColumns(MLContext context, string path, string labelColumn,
    char? separatorChar, bool? allowQuotedStrings, bool? supportSparse, bool trimWhitespace, bool groupColumns)
I prefer not exposing the readMultiline option in these APIs; simply sweeping it in InferSplit should be enough.
id,Column1,Column2,Column3
1,this is a description, 1,2
2,"this is a quote description",1,2
(pinned to unrelated code to allow threaded discussion)
@harishsk and I were under the impression that, in order for ModelBuilder to support the new readMultilines option, some changes might be needed in the CodeGenerator, since the code generated by ModelBuilder should now also include readMultilines = true whenever that parameter turns out to be necessary.
Are these changes not necessary?
Yes and no... The CodeGen project should only refer to ML.NET packages that are already published. That's why we want to pin CodeGen to a specific version of ML.NET, to avoid any API changes breaking CodeGen. From that aspect, we do need to upgrade CodeGen. Meanwhile, we can't change CodeGen before the release, because if we update CodeGen, we would need to refer to a nightly build of ML.NET for the generated project, which is not available externally.
My suggestion is not to update CodeGen this time, because it won't matter much. The CodeGen'd project uses the LoadFromTextFile API with a user-defined ModelInput class, which should work even if there are newlines between quotes, if I recall correctly.
I see, then it's ok.
CodeGenerated project uses LoadFromTextFile API with user-defined ModelInput Class, which should work even if there's newline between quotes if I recall correctly.
Using the generic LoadFromTextFile API you mention will still use the default value for readMultilines, which is false; you would actually need to set it explicitly to true when necessary.
Notice that, up until recently, LoadFromTextFile only accepted certain parameters in its signature, and I couldn't add the new readMultilines parameter there because that would be a breaking API change (which we are not allowed to make until ML.NET version 2). So instead, what I did was add a new LoadFromTextFile(path, options) overload, which accepts a TextLoader.Options object (see #5134).
So if users want to use the generic LoadFromTextFile<> together with the new options (such as readMultilines, or the ones being introduced in #5147 or #5145), they will have to use the new public overload. In this sense, I think the CodeGenerator should also start using this other overload to gain access to the new options.
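To make this concrete, here is a sketch of what generated loading code could look like when using the options-based overload (the file path and the ModelInput columns are hypothetical, chosen only for illustration):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public class ModelInput
{
    [LoadColumn(0)] public float Id { get; set; }
    [LoadColumn(1)] public string Description { get; set; }
}

public static class LoaderSketch
{
    public static TextLoader.Options BuildOptions() => new TextLoader.Options
    {
        HasHeader = true,
        Separators = new[] { ',' },
        AllowQuoting = true,
        // Must be set explicitly: the default is false, so quoted fields
        // containing newlines would otherwise fail to parse.
        ReadMultilines = true
    };

    public static IDataView Load(MLContext mlContext, string path)
        // The overload from #5134 that accepts a TextLoader.Options object,
        // rather than the older fixed parameter list.
        => mlContext.Data.LoadFromTextFile<ModelInput>(path, BuildOptions());
}
```

The older parameter-list overload has no way to reach ReadMultilines, which is why generated code would need to switch to this form to honor the swept option.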
Unless I'm missing part of it, if the CLI / Model Builder calls this new version of AutoML, you will have to update CodeGen.
Otherwise the multi-line files will be read by the AutoML code, but the CodeGen'd code will not read the user's file.
(update: I see @antoniovs1029 added a comment before me which also explains this)
CodeGen project should only refer to ML.Net packages that are already published.
All of Microsoft.ML.* should be self-consistent with each other. No package pinning needed. Almost all changes to the AutoML code should have a mirroring change to the CodeGen code.
When I import Microsoft.ML.CodeGeneration@0.17.0 into my project, I'm expecting it to reference Microsoft.ML@1.5.0 & Microsoft.ML.AutoML@0.17.0. Let's say my project has Microsoft.ML@1.5.0 to do prediction, then I want to call CodeGen (yes, only an internal call currently). Then my project needs both Microsoft.ML@1.5.0 and Microsoft.ML@1.4.x, causing a version conflict. Consistent releases are better.
Meanwhile we can't change CodeGen before the release. Because if we update CodeGen, we need to refer to a nightly build ML.Net for generating project, which is not available externally.
What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version?
What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version?
It's not that CLI/ModelBuilder depends on a nightly build; in short, it's that the project generated by CLI/ModelBuilder needs to depend on a released build of ML.NET.
All of Microsoft.ML.* should be self consistent with each other. No package pinning needed. Almost all changes to the AutoML code should have a mirroring change to the CodeGen code.
I can hardly agree. There's no guarantee that every change in AutoML/ML.NET will be released, while projects generated by CodeGen should refer to a released version of ML.NET. If we mirror every change in AutoML/ML.NET to CodeGen, that will leave CodeGen in an unusable state between releases, because the ML.NET/AutoML that CodeGen refers to differs from what its generated project refers to.
When I import Microsoft.ML.CodeGeneration@0.17.0 into my project, I'm expecting it to reference Microsoft.ML@1.5.0 & Microsoft.ML.AutoML@0.17.0. Lets say my project has Microsoft.ML@1.5.0 to do prediction, then I want to call CodeGen (yes, only an internal call currently). Then my project needs both Microsoft.ML@1.5.0 and Microsoft.ML@1.4.x, causing a version conflict. Consistent releases are better.
I don't see why that's related; if you want to call CodeGen, you need to call the CodeGen that pins to Microsoft.ML@1.5.0 instead of Microsoft.ML@1.4.0.
In fact, if we can make sure that before each release CodeGen is the final package to update, and that between two releases we make no changes to CodeGen, then consistent releases work. But if there is any breaking API change between two releases, then for the reasons above we both must and must not update the CodeGenerator: either changing or not changing CodeGen will leave it in a broken state.
One solution to this dilemma is to pin CodeGen to a released version, but that breaks consistent releases, and we would need to publish twice per release cycle: once for the ML.NET packages and once for CodeGen.
Another solution is to make sure ModelBuilder pins to a released version of AutoML/CodeGen, but that means we can't fix bugs caused by CodeGen between two releases, and again there's no guarantee.
Meanwhile, we have the same dilemma for AutoML, but unlike CodeGen we can use an internal build for AutoML, and most of its bugs are not breaking bugs, so that's not a big deal.
A few more words, sorry to be verbose here.
I don't see why CodeGen has to refer to AutoML/ML.NET; the only relationship between CodeGen and AutoML is the Pipeline class (and maybe also the TextLoaderOption class), which should be defined in CodeGen as a contract class. Conversely, AutoML should refer to CodeGen to tell it how to generate projects. By doing that, we can make CodeGen independent of ML.NET and resolve the dilemma we have now.
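A minimal sketch of what such contract classes could look like, with entirely hypothetical names and members (these are not existing CodeGen types):

```csharp
// Hypothetical contract types that CodeGen would own, removing its
// compile-time dependency on ML.NET/AutoML. AutoML would reference
// CodeGen and fill these in when asking it to generate a project.
public sealed class TextLoaderOptionContract
{
    public char Separator { get; set; }
    public bool AllowQuoting { get; set; }
    public bool AllowSparse { get; set; }
    public bool ReadMultilines { get; set; }
}

public sealed class PipelineContract
{
    // Names of the transforms and the trainer as plain data,
    // rather than live ML.NET objects.
    public string[] TransformNames { get; set; } = new string[0];
    public string TrainerName { get; set; } = "";
    public TextLoaderOptionContract LoaderOptions { get; set; } = new TextLoaderOptionContract();
}
```

Because the contract carries only plain data, CodeGen could emit source text from it without linking against any particular ML.NET version.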
LGTM.
Great to see that, in order to fix the InferColumns() scenario mentioned here, it wasn't necessary to update all the methods in the internal InferColumns API. So this now actually fixes #4460.
Let's just remember to update CodeGen once the release is done, to address my comment here.
And regarding this comment by @justinormont, I see that you didn't update it everywhere his searches pointed to, but I'm guessing that isn't necessary for the ModelBuilder team. In any case, since this PR already fixes #4460, I'm fine with the current state of the PR and would only expect @justinormont's sign-off on that other regard.
- public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowSparse, int columnCount)
+ public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount)
Following the naming convention would be:
- public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount)
+ public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowMultiline, bool allowSparse, int columnCount)
The TextLoader Options uses ReadMultilines (as it does allowQuote and allowSparse), so we'd better follow its naming style.
Resolved
Co-authored-by: Justin Ormont <justinormont@users.noreply.github.com>