support sweeping multiline option in AutoML #5148
Conversation
var foundAny = false;
var result = default(ColumnSplitResult);
foreach (var perm in (from _allowSparse in sparse
                      from _allowQuote in quote
                      from _tryMultiline in tryMultiline
                      from _sep in separatorCandidates
                      select new { _allowSparse, _allowQuote, _sep }))
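For readers unfamiliar with this pattern: each nested `from` clause multiplies the sweep space, so every combination of the candidate flags (including the newly added multiline one) gets tried. Below is a minimal standalone sketch of the same cross-product sweep, with hypothetical candidate values rather than the actual AutoML candidate sets:

```csharp
using System;
using System.Linq;

static class SweepSketch
{
    // Mirrors the nested `from` clauses in InferSplit: LINQ query syntax
    // expands the candidate arrays into their full cross product.
    public static int CountCombos()
    {
        bool[] sparse = { true, false };
        bool[] quote = { true, false };
        bool[] tryMultiline = { true, false };   // the newly swept option
        char[] separatorCandidates = { ',', '\t', ';' };

        var combos = from _allowSparse in sparse
                     from _allowQuote in quote
                     from _tryMultiline in tryMultiline
                     from _sep in separatorCandidates
                     select new { _allowSparse, _allowQuote, _tryMultiline, _sep };

        return combos.Count();
    }

    public static void Main() => Console.WriteLine(CountCombos()); // prints 24 (2 * 2 * 2 * 3)
}
```

Adding one more boolean `from` clause doubles the number of parser configurations the sweep evaluates.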
The naming pattern would be:
- from _tryMultiline in tryMultiline
+ from _allowMultiline in multiline
There are a variety of other places where the multi-line option needs to be added.
Basically, anywhere "sparse" is found within AutoML/CodeGen:
- https://github.com/dotnet/machinelearning/search?q=sparse+path%3Asrc%2FMicrosoft.ML.AutoML&unscoped_q=sparse+path%3Asrc%2FMicrosoft.ML.AutoML
- https://github.com/dotnet/machinelearning/search?q=sparse+path%3Asrc%2FMicrosoft.ML.CodeGenerator&unscoped_q=sparse+path%3Asrc%2FMicrosoft.ML.CodeGenerator
(pinned to line of code to allow for discussion)
Resolved
There are still some missing, like:
machinelearning/src/Microsoft.ML.AutoML/ColumnInference/ColumnInferenceApi.cs
Lines 31 to 32 in c025270
public static ColumnInferenceResults InferColumns(MLContext context, string path, string labelColumn,
    char? separatorChar, bool? allowQuotedStrings, bool? supportSparse, bool trimWhitespace, bool groupColumns)
I prefer not exposing the readMultiline option in these APIs; simply sweeping it in InferSplit should be enough.
id,Column1,Column2,Column3
1,this is a description, 1,2
2,"this is a quote description",1,2
(pinned to unrelated code to allow threaded discussion)
@harishsk and I were under the impression that, in order for ModelBuilder to support the new readMultilines option, some changes might be needed in the CodeGenerator, since the code generated by ModelBuilder should now also include readMultilines = true whenever that parameter turns out to be necessary.
Are these changes not necessary?
Yes and no... The CodeGen project should only refer to ML.NET packages that are already published. That's why we want to pin CodeGen to a specific version of ML.NET, to avoid any API changes breaking CodeGen. From that aspect, we do need to upgrade CodeGen. Meanwhile, we can't change CodeGen before the release, because if we update CodeGen, we would need to refer to a nightly build of ML.NET for the generated project, which is not available externally.
My suggestion is not to update CodeGen this time, because it won't matter much. The CodeGen'd project uses the LoadFromTextFile API with a user-defined ModelInput class, which should work even if there are newlines between quotes, if I recall correctly.
I see, then it's ok.
CodeGenerated project uses LoadFromTextFile API with user-defined ModelInput Class, which should work even if there's newline between quotes if I recall correctly.
Using the generic LoadFromTextFile API you mention will still use the default value for readMultilines, which is false; you would actually need to set it explicitly to true when necessary.
Notice that, up until recently, LoadFromTextFile only accepted certain parameters in its signature, and I couldn't add the new readMultilines parameter there because that would be a breaking API change (which we are not allowed to make until ML.NET version 2). So instead, what I did was add a new LoadFromTextFile(path, options) overload, which accepts a TextLoader.Options object (see #5134).
So if users want to use the generic LoadFromTextFile<> together with the new options (such as readMultilines, or the ones being introduced in #5147 or #5145), they will have to use the new public overload. In this sense, I think the CodeGenerator should also start using this other overload to gain access to the new options.
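To make this concrete, here is a sketch of what generated loading code could look like when using the options-based overload (the file path and the ModelInput columns are hypothetical, chosen only for illustration):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public class ModelInput
{
    [LoadColumn(0)] public float Id { get; set; }
    [LoadColumn(1)] public string Description { get; set; }
}

public static class LoaderSketch
{
    public static TextLoader.Options BuildOptions() => new TextLoader.Options
    {
        HasHeader = true,
        Separators = new[] { ',' },
        AllowQuoting = true,
        // Must be set explicitly: the default is false, so quoted fields
        // containing newlines would otherwise fail to parse.
        ReadMultilines = true
    };

    public static IDataView Load(MLContext mlContext, string path)
        // The overload from #5134 that accepts a TextLoader.Options object,
        // rather than the older fixed parameter list.
        => mlContext.Data.LoadFromTextFile<ModelInput>(path, BuildOptions());
}
```

The older parameter-list overload has no way to reach ReadMultilines, which is why generated code would need to switch to this form to honor the swept option.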
Unless I'm missing part of it, if the CLI / Model Builder calls this new version of AutoML, you will have to update CodeGen.
Otherwise the multi-line files will be read by the AutoML code, but the CodeGen'd code will not read the user's file.
(update: I see @antoniovs1029 added a comment before me which also explains this)
CodeGen project should only refer to ML.Net packages that are already published.
All of Microsoft.ML.* should be self-consistent with each other. No package pinning needed. Almost all changes to the AutoML code should have a mirroring change to the CodeGen code.
When I import Microsoft.ML.CodeGeneration@0.17.0 into my project, I'm expecting it to reference Microsoft.ML@1.5.0 & Microsoft.ML.AutoML@0.17.0. Let's say my project has Microsoft.ML@1.5.0 to do prediction, then I want to call CodeGen (yes, only an internal call currently). Then my project needs both Microsoft.ML@1.5.0 and Microsoft.ML@1.4.x, causing a version conflict. Consistent releases are better.
Meanwhile we can't change CodeGen before the release. Because if we update CodeGen, we need to refer to a nightly build ML.Net for generating project, which is not available externally.
What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version?
What's stopping CodeGen from being updated before the release? Why would a change to CodeGen cause CLI/ModelBuilder to depend on a nightly build instead of the released version?
It's not that CLI/ModelBuilder depends on a nightly build; in short, it's that the project generated by CLI/ModelBuilder needs to depend on a released build of ML.NET.
All of Microsoft.ML.* should be self consistent with each other. No package pinning needed. Almost all changes to the AutoML code should have a mirroring change to the CodeGen code.
I can hardly agree. There's no guarantee that every change in AutoML/ML.NET will be released, while projects generated by CodeGen should refer to a released version of ML.NET. If we mirror every change in AutoML/ML.NET to CodeGen, that will leave CodeGen in an unusable state between releases, because the ML.NET/AutoML that CodeGen refers to differs from what its generated project refers to.
When I import Microsoft.ML.CodeGeneration@0.17.0 into my project, I'm expecting it to reference Microsoft.ML@1.5.0 & Microsoft.ML.AutoML@0.17.0. Lets say my project has Microsoft.ML@1.5.0 to do prediction, then I want to call CodeGen (yes, only an internal call currently). Then my project needs both Microsoft.ML@1.5.0 and Microsoft.ML@1.4.x, causing a version conflict. Consistent releases are better.
I don't see why that's related; if you want to call CodeGen, you need to call the CodeGen that pins to Microsoft.ML@1.5.0 instead of Microsoft.ML@1.4.0.
In fact, if we can make sure that before each release CodeGen is the final package to update, and that between two releases we make no changes to CodeGen, then consistent releases work. But if there is any breaking API change between two releases, then for the reasons above we both must and must not update the CodeGenerator: either changing or not changing CodeGen will leave it in a broken state.
One solution to this dilemma is to pin CodeGen to a released version, but that breaks consistent releases, and we would need to publish twice per release cycle: once for the ML.NET packages and once for CodeGen.
Another solution is to make sure ModelBuilder pins to a released version of AutoML/CodeGen, but that means we can't fix bugs caused by CodeGen between two releases, and again there's no guarantee.
Meanwhile, we have the same dilemma for AutoML, but unlike CodeGen we can use an internal build for AutoML, and most of its bugs are not breaking bugs, so that's not a big deal.
A few more words, sorry to be verbose here.
I don't see why CodeGen has to refer to AutoML/ML.NET; the only relationship between CodeGen and AutoML is the Pipeline class (and maybe also the TextLoaderOption class), which should be defined in CodeGen as a contract class. Conversely, AutoML should refer to CodeGen to tell it how to generate projects. By doing that, we can make CodeGen independent of ML.NET and resolve the dilemma we have now.
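A minimal sketch of what such contract classes could look like, with entirely hypothetical names and members (these are not existing CodeGen types):

```csharp
// Hypothetical contract types that CodeGen would own, removing its
// compile-time dependency on ML.NET/AutoML. AutoML would reference
// CodeGen and fill these in when asking it to generate a project.
public sealed class TextLoaderOptionContract
{
    public char Separator { get; set; }
    public bool AllowQuoting { get; set; }
    public bool AllowSparse { get; set; }
    public bool ReadMultilines { get; set; }
}

public sealed class PipelineContract
{
    // Names of the transforms and the trainer as plain data,
    // rather than live ML.NET objects.
    public string[] TransformNames { get; set; } = new string[0];
    public string TrainerName { get; set; } = "";
    public TextLoaderOptionContract LoaderOptions { get; set; } = new TextLoaderOptionContract();
}
```

Because the contract carries only plain data, CodeGen could emit source text from it without linking against any particular ML.NET version.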
LGTM.
Great to see that, in order to fix the InferColumns() scenario mentioned here, it wasn't necessary to update all the methods in the internal InferColumns API. So this now actually fixes #4460.
Let's just remember to update CodeGen once the release is done, to address my comment here.
And regarding this comment by @justinormont, I see that you didn't update it everywhere his searches pointed to, but I'm guessing that isn't necessary for the ModelBuilder team. In any case, since this PR already fixes #4460, I'm fine with the current state of the PR and would only expect @justinormont's sign-off on that other regard.
- public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowSparse, int columnCount)
+ public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount)
Following the naming convention would be:
- public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool readMultilines, bool allowSparse, int columnCount)
+ public ColumnSplitResult(bool isSuccess, char? separator, bool allowQuote, bool allowMultiline, bool allowSparse, int columnCount)
The TextLoader Options uses ReadMultilines (as it does allowQuote and allowSparse), so we'd better follow its naming style.
Resolved
Co-authored-by: Justin Ormont <justinormont@users.noreply.github.com>