Svmlight loader and saver #4190

yaeldekel · 2019-09-08T15:01:04Z

Fixes #4014.

yaeldekel · 2019-09-08T15:12:36Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+    /// of the form "column=Name" to identify a column to convert. Ideally there would be some
+    /// other way to specify this other than hacking arguments. The intent of this is to allow
+    /// things like string keys, a common variant of the format, but one emphatically not allowed
+    /// by the original format.


This variation existed in the internal version of this loader, but I have not been able to find any references to it online. How important do you think it is to support this? #Resolved

As we discussed on the call, I think it is important to support this. Since SvmLightLoader is implemented as a data loader, it is typically the first stage in the pipeline. If someone has data in this format, they will not be able to load the data into ML.NET if we don't support this.

In reply to: 322014712 [](ancestors = 322014712)

yaeldekel · 2019-09-08T15:12:57Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoaderSaverCatalog.cs

+    public static class SvmLightLoaderSaverCatalog
+    {
+        /// <summary>
+        ///


/// [](start = 8, length = 3)

Comments coming soon. #Resolved

codemzs · 2019-10-01T17:03:02Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+// <copyright company="Microsoft Corporation">
+//     Copyright (c) Microsoft Corporation. All rights reserved.
+// </copyright>
+//------------------------------------------------------------------------------


Please use the ML .NET header. #Resolved

codemzs · 2019-10-02T02:10:03Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoaderSaverCatalog.cs

+
+namespace Microsoft.ML.Transforms
+{
+    public static class SvmLightLoaderSaverCatalog


SvmLightLoaderSaverCatalog [](start = 24, length = 26)

Can we please add samples for all of these APIs? #Resolved

codemzs · 2019-10-02T02:10:43Z

@yaeldekel Can you please sync to master and push your changes so we can analyze the code coverage? thanks. #Resolved

harishsk · 2019-10-07T19:50:02Z

test/Microsoft.ML.Tests/SvmLightTests.cs

+
+        [Fact]
+        public void TestSvmLightLoaderAndSaver()
+        {


It appears that this function is encapsulating several tests. Is it possible to refactor them into different tests? #Resolved

Good suggestion, I'll do that in the next iteration.

In reply to: 332204210 [](ancestors = 332204210)

harishsk · 2019-10-07T19:55:11Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoaderSaverCatalog.cs

+        /// <param name="numberOfRows">The number of rows from the sample to be used for determining the set of feature names.</param>
+        /// <param name="dataSample">A data sample to be used for determining the set of features names.</param>
+        public static SvmLightLoader CreateSvmLightLoaderWithFeatureNames(this DataOperationsCatalog catalog,
+            long? numberOfRows = null,


CreateTextLoader has two versions with different parameters. Would it be better practice to follow that pattern and call both functions CreateSvmLightLoader?

Do you think I should have one method that takes an SvmLightLoader.FeatureIndices as a parameter?
In that case, perhaps we could come up with a better name for that enum? Any suggestions?

In reply to: 332206307 [](ancestors = 332206307)

Yes, I think having a method that takes FeatureIndices as a parameter would be helpful to distinguish the two cases. One option would be to name the enum Features and name its values ZeroBasedIndices, OneBasedIndices and Names.
Also, both of the methods above have numberOfRows as a parameter, but in different positions. If the two functions have the same parameter, it would be good to keep it in the same position.

In reply to: 333653705 [](ancestors = 333653705,332206307)

I've addressed the parameter-order part of your comment, but I had second thoughts about having a single API that specifies whether the file contains feature indices or feature names. The inputSize parameter is only relevant in the feature-indices case and not in the feature-names case, and I'm not sure that a parameter that is sometimes ignored is the best thing to have. Do you have any thoughts on that?

In reply to: 337250107 [](ancestors = 337250107,333653705,332206307)

Based on your explanation, would reordering and renaming the params as follows work?

// CreateSvmLightLoader, to use with indices
public static SvmLightLoader CreateSvmLightLoader(
    this DataOperationsCatalog catalog,
    long? numberOfRows = null,
    IMultiStreamSource dataSample = null,
    Features features,
    int featureSize,
);

// CreateSvmLightLoaderWithNames, to use with names
public static SvmLightLoader CreateSvmLightLoader(
    this DataOperationsCatalog catalog,
    long? numberOfRows = null,
    IMultiStreamSource dataSample = null,
    Features features
);

The other option is to keep only one version (whichever you think would be the more common one) and change the other one to take the SvmLightLoader.Options class directly as a parameter. That too has been a pattern in ML.NET.

In either case I think input from @eerhardt would be useful for the public API surface.

In reply to: 343611210 [](ancestors = 343611210,337250107,333653705,332206307)

harishsk · 2019-10-07T19:55:36Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoaderSaverCatalog.cs

+        /// <param name="path">The path to the file.</param>
+        /// <param name="numberOfRows">The number of rows from the sample to be used for determining the set of feature names.</param>
+        public static IDataView LoadFromSvmLightFileWithFeatureNames(this DataOperationsCatalog catalog,
+            string path,


Similar comment as above.

harishsk · 2019-10-07T19:59:19Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+            public VBuffer<int> FeatureKeys;
+
+            public static void ParseIndicesToOneBased(IntermediateInput input, Indices output)
+            {


The implementations of the two functions in this class are nearly identical. Can they be refactored into a single function that takes as a parameter the offset to use? #Resolved
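To illustrate the suggestion, here is a minimal sketch of a single parsing routine that takes the index base as a parameter. It is not the PR's code: `IndexParsing`, `ParseIndices` and the choice to skip unparseable tokens are assumptions made for the example, and it works on plain strings rather than on the loader's `IntermediateInput` and `Indices` types.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of ML.NET.
public static class IndexParsing
{
    // Parses raw index tokens and shifts each one by `offset` so that the
    // result is zero-based. Pass offset = 1 for one-based input and
    // offset = 0 for zero-based input. Tokens that do not parse, or that
    // are smaller than the offset, are skipped.
    public static List<int> ParseIndices(IEnumerable<string> tokens, int offset)
    {
        var result = new List<int>();
        foreach (var token in tokens)
        {
            if (int.TryParse(token, NumberStyles.Integer, CultureInfo.InvariantCulture, out int index)
                && index >= offset)
            {
                result.Add(index - offset);
            }
        }
        return result;
    }
}
```

With this shape, the one-based and zero-based variants become two call sites, `ParseIndices(tokens, 1)` and `ParseIndices(tokens, 0)`, instead of two nearly identical method bodies.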

harishsk · 2019-10-07T20:00:33Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+            }
+
+            public static void ParseIndicesToZeroBased(IntermediateInput input, Indices output)
+            {


What happens if the caller asks that the data be parsed with a one-based index, but the data has a zero in its list of keys? #Resolved

The way it is written now, the feature will be ignored (see line 390).
Do you think we should throw?

In reply to: 332208505 [](ancestors = 332208505)

harishsk · 2019-10-07T20:01:51Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+                {
+                    if (Conversions.Instance.TryParse(in inputValues[i], out int index) && index >= 0)
+                        editor.Values[i] = index;
+                    else


What is the desired behavior if the data is bad? E.g. if the data has duplicate keys? (Maybe throw an exception if the other data loaders throw an exception when the data is loaded?) #Resolved

I added support for keeping track which indices exist in each row (see the OutputMapper class in line 359). Currently, I am not throwing in case of duplicate keys, just using the value in the first appearance.
Do you think I should throw instead?

In reply to: 332209060 [](ancestors = 332209060)

For both of the questions above, I would defer to precedent. If other data loaders throw when bad data is encountered, we should throw here as well.

In reply to: 333656937 [](ancestors = 333656937,332209060)
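The two policies discussed in this thread (skip bad entries and keep the first occurrence of a duplicate, versus throw) can be compared side by side in a small sketch. Everything here is hypothetical: `RowParsing`, `ParseFeatures` and the `strict` flag are names invented for the example and do not reflect the loader's actual implementation or the behavior of other ML.NET loaders.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of ML.NET.
public static class RowParsing
{
    // Parses "index:value" tokens with one-based indices into a map keyed
    // by zero-based index. In strict mode, invalid tokens and duplicate
    // indices throw; otherwise invalid tokens are skipped and the first
    // occurrence of a duplicate index wins.
    public static Dictionary<int, float> ParseFeatures(string[] tokens, bool strict)
    {
        var features = new Dictionary<int, float>();
        foreach (var token in tokens)
        {
            int colon = token.IndexOf(':');
            if (colon <= 0
                || !int.TryParse(token.Substring(0, colon), NumberStyles.Integer, CultureInfo.InvariantCulture, out int index)
                || !float.TryParse(token.Substring(colon + 1), NumberStyles.Float, CultureInfo.InvariantCulture, out float value)
                || index < 1)
            {
                if (strict)
                    throw new FormatException($"Invalid feature token '{token}'.");
                continue; // lenient: ignore the entry
            }

            int zeroBased = index - 1;
            if (features.ContainsKey(zeroBased))
            {
                if (strict)
                    throw new FormatException($"Duplicate feature index '{index}'.");
                continue; // lenient: keep the first occurrence
            }
            features.Add(zeroBased, value);
        }
        return features;
    }
}
```

Calling `ParseFeatures(tokens, strict: false)` mirrors the lenient behavior described above, while `strict: true` shows what deferring to a throwing precedent would look like.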

harishsk · 2019-10-07T20:02:56Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+        /// <see cref="IntermediateOut"/> is used.
+        /// </summary>
+        private sealed class IntermediateOutKeys
+        {


Do we need these two nearly identical classes? Or can they be made into a generic class? #Resolved

harishsk · 2019-10-07T20:03:46Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+        }
+
+        private sealed class IntermediateOut
+        {


Or as we discussed, is it possible to have indices be uint always, since the indices into the array cannot be negative? #Resolved

harishsk · 2019-10-07T20:04:42Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+        private sealed class Output
+        {
+            public VBuffer<float> Features;
+        }


I think you mentioned that this is a dense array. Would it be better to keep it as a sparse array so that the algorithm that is making use of this data can decide at a later time if it needs to convert this into a dense array?

I switched to using a BufferBuilder instead of the VBufferEditor, since it knows how to handle indices out of order. There might be perf implications though, not sure which is better to use in this case.

In reply to: 332210125 [](ancestors = 332210125)
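As a rough illustration of the sparse-versus-dense question, the sketch below builds a sparse row (parallel index and value arrays) from entries that arrive out of order, and converts to dense only on request. It is a plain-C# stand-in, not ML.NET's BufferBuilder or VBuffer; `SparseRow`, `Build` and `ToDense` are hypothetical names, and sorting with LINQ is just one simple way to handle unordered input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ML.NET.
public static class SparseRow
{
    // Sorts (index, value) entries by index and returns them as parallel
    // arrays, the usual layout for a sparse vector.
    public static (int[] Indices, float[] Values) Build(IEnumerable<KeyValuePair<int, float>> entries)
    {
        var sorted = entries.OrderBy(e => e.Key).ToArray();
        return (sorted.Select(e => e.Key).ToArray(), sorted.Select(e => e.Value).ToArray());
    }

    // Expands a sparse row into a dense array of the given length;
    // slots without an entry stay at 0.
    public static float[] ToDense(int[] indices, float[] values, int length)
    {
        var dense = new float[length];
        for (int i = 0; i < indices.Length; i++)
            dense[indices[i]] = values[i];
        return dense;
    }
}
```

Keeping the row sparse until a consumer calls something like `ToDense` leaves the densification decision to the downstream algorithm, which is the trade-off raised in the question above.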

harishsk · 2019-10-07T20:10:18Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+        }
+
+        private sealed class Output
+        {


Is there a use case (say, after inference or during evaluation) for mapping these indices back to keys or the original strings? In that case, does it make sense to provide a separate function/helper to extract the keys/strings back? #Resolved

In case of using string keys to load the data, the resulting IDataView schema will have slot names annotations for the Features column, so the feature names can be extracted this way.

In reply to: 332212446 [](ancestors = 332212446)

(See the if (keyIndices) part inside CreateOutputTransformer.)

In reply to: 343596251 [](ancestors = 343596251,332212446)
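The idea behind recovering names from slot positions can be sketched without any ML.NET types. The class below is purely illustrative: `FeatureNameMap`, `GetOrAdd` and `NameOf` are invented for the example and are not the loader's implementation or the slot-names annotation API. It assigns each new feature name the next free slot and keeps the reverse list so a slot can be mapped back to its name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of ML.NET.
public sealed class FeatureNameMap
{
    private readonly Dictionary<string, int> _slotByName = new Dictionary<string, int>();
    private readonly List<string> _nameBySlot = new List<string>();

    // Number of distinct feature names seen so far.
    public int Count => _nameBySlot.Count;

    // Returns the slot for a name, assigning the next free slot the
    // first time the name is seen.
    public int GetOrAdd(string name)
    {
        if (!_slotByName.TryGetValue(name, out int slot))
        {
            slot = _nameBySlot.Count;
            _slotByName.Add(name, slot);
            _nameBySlot.Add(name);
        }
        return slot;
    }

    // Reverse lookup: the name that was assigned to a slot.
    public string NameOf(int slot) => _nameBySlot[slot];
}
```

In this sketch the reverse list plays the role that the slot names play in the loaded schema: it is what lets a consumer turn a feature position back into the original string key.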

harishsk · 2019-10-07T20:11:26Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+        private readonly IHost _host;
+        private readonly ITransformer _keyVectorsToIndexVectors;
+        private readonly FeatureIndices _indicesKind;
+        private readonly ulong _featureCount;


_featureCount [](start = 31, length = 13)

Is it useful to make this public? #Resolved

I am not sure there is a need to do that. You can get the length of the feature vector when you call GetOutputSchema() on the loader.

In reply to: 332212933 [](ancestors = 332212933)

I'll mark this as resolved, if you think it's important to make public, please re-activate and let me know.

In reply to: 333650195 [](ancestors = 333650195,332212933)

codecov · 2019-10-08T09:28:37Z

Codecov Report

Merging #4190 into master will increase coverage by 0.16%.
The diff coverage is 97.83%.

@@            Coverage Diff             @@
##           master    #4190      +/-   ##
==========================================
+ Coverage   75.12%   75.29%   +0.16%     
==========================================
  Files         909      913       +4     
  Lines      160240   161210     +970     
  Branches    17252    17329      +77     
==========================================
+ Hits       120377   121377    +1000     
+ Misses      35049    35016      -33     
- Partials     4814     4817       +3

Flag	Coverage Δ
#Debug	`75.29% <97.83%> (+0.16%)`	⬆️
#production	`70.66% <95.98%> (+0.14%)`	⬆️
#test	`90.43% <100%> (+0.14%)`	⬆️

Impacted Files	Coverage Δ
....Transforms/SvmLight/SvmLightLoaderSaverCatalog.cs	`100% <100%> (ø)`
test/Microsoft.ML.TestFramework/TestCommandBase.cs	`43.51% <100%> (+2.15%)`	⬆️
test/Microsoft.ML.Tests/SvmLightTests.cs	`100% <100%> (ø)`
.../Microsoft.ML.Transforms/SvmLight/SvmLightSaver.cs	`95.69% <95.69%> (ø)`
...Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs	`95.75% <95.75%> (ø)`
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs	`79.48% <0%> (-20.52%)`	⬇️
....ML.AutoML/PipelineSuggesters/PipelineSuggester.cs	`83.19% <0%> (-3.37%)`	⬇️
src/Microsoft.ML.AutoML/Sweepers/ISweeper.cs	`67.08% <0%> (-2.54%)`	⬇️
src/Microsoft.ML.AutoML/Sweepers/Parameters.cs	`84.32% <0%> (-0.85%)`	⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`84.9% <0%> (-0.41%)`	⬇️
... and 12 more

codemzs · 2019-10-08T17:11:25Z

@yaeldekel Thanks for syncing to master so that the code coverage report is generated. Looking at the report, it seems there are branches in your code that are not covered by a test. Can you please take a look here: https://codecov.io/gh/dotnet/machinelearning/pull/4190/diff (uncovered lines are highlighted in red on the left side)? Thanks! #Resolved

yaeldekel · 2019-10-10T17:47:05Z

Yes, I'll take a look. Question - we normally don't test all the cases where exceptions are thrown, do we?

In reply to: 539612360 [](ancestors = 539612360)

harishsk · 2019-10-10T18:08:29Z

Code coverage as part of continuous integration has recently come in so I think we get to collectively define what the policy should be going forward and your thoughts and opinions on the issue would be most welcome.

In reply to: 540696889 [](ancestors = 540696889,539612360)

codemzs · 2019-10-10T18:17:13Z

We definitely need to test for negative cases, i.e., cases where an exception is thrown. I can show you where in the code base we already do that. @eerhardt and @sharwell can probably also chip in and provide feedback.

In reply to: 540705504 [](ancestors = 540705504,540696889,539612360)

harishsk · 2019-10-10T18:30:42Z

I am sure there are tests for negative cases in our code bases where exceptions are thrown. My question is, whether we have been doing it consistently throughout so far. And if not, is that the new policy you are proposing?

In reply to: 540708878 [](ancestors = 540708878,540705504,540696889,539612360)

sharwell · 2019-10-10T18:52:48Z

I generally encourage the creation of tests to cover these cases. Discretion should be used if there is a case that is not easy to hit (we don't want to spend more than a reasonable amount of time trying to find a way to hit it). Where possible, the negative tests should use the normal API entry points, as opposed to side-stepping the API for direct calls to internals/privates. This helps ensure that the error condition which leads to the negative result does not result in abnormal behavior prior to the expected error handling locations. #Resolved
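A small sketch of that advice, under stated assumptions: `TinyLoader` is a made-up stand-in for a loader (not SvmLightLoader), and `NegativeTests.Throws` is a hand-rolled replacement for a test framework's exception assertion, so the example stays self-contained. The negative test goes through the public `LoadLabels` entry point rather than calling the private `ParseLabel` directly.

```csharp
using System;
using System.Globalization;

// Hypothetical loader used only for this illustration.
public static class TinyLoader
{
    // Public entry point: the only method the tests call.
    public static float[] LoadLabels(string text)
    {
        if (text == null)
            throw new ArgumentNullException(nameof(text));
        var lines = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
        var labels = new float[lines.Length];
        for (int i = 0; i < lines.Length; i++)
            labels[i] = ParseLabel(lines[i]);
        return labels;
    }

    // Internal detail: negative tests reach this through LoadLabels.
    private static float ParseLabel(string line)
    {
        string first = line.Trim().Split(' ')[0];
        if (!float.TryParse(first, NumberStyles.Float, CultureInfo.InvariantCulture, out float label))
            throw new FormatException($"Could not parse label '{first}'.");
        return label;
    }
}

public static class NegativeTests
{
    // Returns true if the action throws TException; any other exception
    // type propagates to the caller.
    public static bool Throws<TException>(Action action) where TException : Exception
    {
        try { action(); }
        catch (TException) { return true; }
        return false;
    }
}
```

A negative test would then read `NegativeTests.Throws<FormatException>(() => TinyLoader.LoadLabels("1 1:0.5\nabc 2:1"))`, which exercises the same code path a user would hit with a malformed file.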

codemzs · 2019-10-10T19:59:13Z

@sharwell Agreed. @yaeldekel let's try to add negative tests. #Resolved

eerhardt · 2019-12-10T22:00:17Z

// Licensed to the .NET Foundation under one or more agreements.

Any reason these are in ML.Transforms and not ML.Data? All our other "loaders" are in .Data, correct?

Refers to: src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs:1 in 7358266. [](commit_id = 7358266, deletion_comment = False)

eerhardt · 2019-12-10T22:01:24Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+            public long? NumberOfRows;
+        }
+
+#pragma warning disable 0649 // Disable warnings about unused members. They are used through reflection.


Can you just do this around the unused members instead of a huge block? #Resolved

eerhardt · 2019-12-10T22:01:37Z

src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs

+            Names
+        }
+
+        public sealed class Options


Please add XML comments on all new public APIs.

Do these types (FeatureIndices and Options) need to be public? I don't see them being used publicly.

In reply to: 356301711 [](ancestors = 356301711)

Currently they don't need to be public. There was a discussion about whether there should be a public API that takes an Options parameter, or a FeatureIndices parameter (instead of having one called CreateSvmLightLoader and one called CreateSvmLightLoaderWithFeatureNames). Do you have a preference?

In reply to: 356304532 [](ancestors = 356304532,356301711)

Do all combinations of enums and options make sense together? Or do some options only make sense for certain values of the enum?

If they all make sense together then I would expose the Options. But if there are options that don’t make sense together, then I would keep it like you have it.

yaeldekel · 2019-12-11T11:39:31Z

// Licensed to the .NET Foundation under one or more agreements.

The loader uses some components from Microsoft.ML.Transforms (specifically, CustomMappingTransformer). How critical do you think it is for SvmLightLoader not to be in the Transforms assembly? The only other solution I can think of is to add a new project that would reference both ML.Data and ML.Transforms.

In reply to: 564278518 [](ancestors = 564278518)

Refers to: src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs:1 in 7358266. [](commit_id = 7358266, deletion_comment = False)

eerhardt · 2019-12-11T13:20:12Z

I don’t think the assembly it lives in is critical. It just struck me as odd. Thanks for answering my question.

codemzs · 2019-12-26T18:14:28Z

docs/samples/Microsoft.ML.Samples/Dynamic/DataOperations/LoadingSvmLight.cs

+{
+    public static class LoadingSvmLight
+    {
+        // This examples shows all the ways to load data with TextLoader.


TextLoader. [](start = 62, length = 11)

Text loader?

yaeldekel requested review from harishsk and eerhardt September 8, 2019 15:01

yaeldekel requested a review from a team as a code owner September 8, 2019 15:01

codemzs reviewed Oct 1, 2019

codemzs reviewed Oct 2, 2019

harishsk reviewed Oct 7, 2019

harishsk suggested changes Oct 7, 2019

yaeldekel force-pushed the svmlight branch from 4f01da3 to 4c2d59c Compare October 8, 2019 08:33

yaeldekel force-pushed the svmlight branch from 7e6c828 to 6a1c7fd Compare November 7, 2019 09:14

eerhardt reviewed Dec 10, 2019

yaeldMS added 8 commits December 11, 2019 13:42

SvmLightLoader

722e71e

SvmLightLoader, SvmLightSaver and tests

8d106cf

Add comments to public API methods

cb2e2e1

Address code review comments and add a sample

b4cffed

address code review comments and add more tests

1fdb289

Code review comments

ee1b17e

SvmLightLoader

d7ae208

SvmLightLoader, SvmLightSaver and tests

d2619ed

yaeldekel force-pushed the svmlight branch from 7358266 to d2619ed Compare December 11, 2019 11:44

codemzs reviewed Dec 26, 2019

codemzs approved these changes Dec 26, 2019

codemzs merged commit 65e6acd into dotnet:master Dec 26, 2019

ghost locked as resolved and limited conversation to collaborators Mar 20, 2022
