Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Svmlight loader and saver #4190

Merged
merged 8 commits into from
Dec 26, 2019
Merged

Svmlight loader and saver #4190

merged 8 commits into from
Dec 26, 2019

Conversation

yaeldekel
Copy link

Fixes #4014 .

@yaeldekel yaeldekel requested a review from a team as a code owner September 8, 2019 15:01
/// of the form "column=Name" to identify a column to convert. Ideally there would be some
/// other way to specify this other than hacking arguments. The intent of this is to allow
/// things like string keys, a common variant of the format, but one emphatically not allowed
/// by the original format.
Copy link
Author

@yaeldekel yaeldekel Sep 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variation existed in the internal version of this loader, but I have not been able to find any references to it online. How important do you think it is to support this? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed on the call, I think it is important to support this. Since SvmLightLoader is implemented as data loader, it is typically the first stage in the pipeline. If someone has data in this format, they will not be able to load the data into ML.NET if we don't support this.


In reply to: 322014712 [](ancestors = 322014712)

public static class SvmLightLoaderSaverCatalog
{
/// <summary>
///
Copy link
Author

@yaeldekel yaeldekel Sep 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// [](start = 8, length = 3)

Comments coming soon. #Resolved

// <copyright company="Microsoft Corporation">
// Copyright (c) Microsoft Corporation. All rights reserved.
// </copyright>
//------------------------------------------------------------------------------
Copy link
Member

@codemzs codemzs Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the ML .NET header. #Resolved


namespace Microsoft.ML.Transforms
{
public static class SvmLightLoaderSaverCatalog
Copy link
Member

@codemzs codemzs Oct 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SvmLightLoaderSaverCatalog [](start = 24, length = 26)

Can we please add samples for all of these APIs? #Resolved

@codemzs
Copy link
Member

codemzs commented Oct 2, 2019

@yaeldekel Can you please sync to master and push your changes so we can analyze the code coverage? thanks. #Resolved


[Fact]
public void TestSvmLightLoaderAndSaver()
{
Copy link
Contributor

@harishsk harishsk Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that this function is encapsulating several tests. Is it possible to refactor them into different tests? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, I'll do that in the next iteration.


In reply to: 332204210 [](ancestors = 332204210)

/// <param name="numberOfRows">The number of rows from the sample to be used for determining the set of feature names.</param>
/// <param name="dataSample">A data sample to be used for determining the set of features names.</param>
public static SvmLightLoader CreateSvmLightLoaderWithFeatureNames(this DataOperationsCatalog catalog,
long? numberOfRows = null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateTextLoader two versions with different parameters. Is that a better practice to follow and call both functions CreateSvmLightLoader?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think I should have one method that takes an SvmLightLoader.FeatureIndices as a parameter?
In that case, perhaps we could come up with a better name for that enum? Any suggestions?


In reply to: 332206307 [](ancestors = 332206307)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think having a method that takes FeatureIndices as a parameter would be helpful to distinguish the two cases. One option would be to just call the enum as Features and call the enum values as ZeroBasedIndices, OneBaseIndices and Names.
Also, both the methods aboe have numberOfRows as a parameter, but they are in different positions. If the two functions have the same parameter, it would be good to keep them in the same position.


In reply to: 333653705 [](ancestors = 333653705,332206307)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed the parameter-order part of your comment, but I had second thoughts about having a single API that specifies whether the file contains feature indices or feature names - there is a parameter inputSize which is only relevant in the feature indices case and not in the feature names case, and I'm not sure that having a parameter that is sometimes ignored is the best thing to have - do you have any thoughts on that?


In reply to: 337250107 [](ancestors = 337250107,333653705,332206307)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your explanation, would reordering and renaming the params as follows work?

        // For namesCreateSvmLightLoader to use with indices
        public static SvmLightLoader CreateSvmLightLoader(
            this DataOperationsCatalog catalog,
            long? numberOfRows = null,
            IMultiStreamSource dataSample = null,
            Features features,
            int featureSize,
            );

        // For namesCreateSvmLightLoaderWithNames
        public static SvmLightLoader CreateSvmLightLoader( 
            this DataOperationsCatalog catalog,
            long? numberOfRows = null,
            IMultiStreamSource dataSample = null,
            Features features
            );

The other option is to keep only one version, (whichever you think would be the more common one and change the other one to have the SvmLightLoader.Options class directly as a param. That too has been a pattern in ML.NET.

In either case I think input from @eerhardt would be useful for the public API surface.


In reply to: 343611210 [](ancestors = 343611210,337250107,333653705,332206307)

/// <param name="path">The path to the file.</param>
/// <param name="numberOfRows">The number of rows from the sample to be used for determining the set of feature names.</param>
public static IDataView LoadFromSvmLightFileWithFeatureNames(this DataOperationsCatalog catalog,
string path,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment as above.

public VBuffer<int> FeatureKeys;

public static void ParseIndicesToOneBased(IntermediateInput input, Indices output)
{
Copy link
Contributor

@harishsk harishsk Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementations of the two functions in this class are nearly identical. Can they be refactored into a single function that takes as a parameter the offset to use? #Resolved

}

public static void ParseIndicesToZeroBased(IntermediateInput input, Indices output)
{
Copy link
Contributor

@harishsk harishsk Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the caller asks that the data be be parsed with a one based index, but the data has a zero in its list of keys? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it is written now, the feature will be ignored. (see line 390).
Do you think we should throw?


In reply to: 332208505 [](ancestors = 332208505)

{
if (Conversions.Instance.TryParse(in inputValues[i], out int index) && index >= 0)
editor.Values[i] = index;
else
Copy link
Contributor

@harishsk harishsk Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the desired behavior if the data is bad? E.g. if the data has duplicate keys? (Maybe throw an exception if the other data loaders throw an exception when the data is loaded?) #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added support for keeping track which indices exist in each row (see the OutputMapper class in line 359). Currently, I am not throwing in case of duplicate keys, just using the value in the first appearance.
Do you think I should throw instead?


In reply to: 332209060 [](ancestors = 332209060)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the both the questions above, I would defer to precedence. If other data loaders throw when bad data is encountered, we should throw here as well.


In reply to: 333656937 [](ancestors = 333656937,332209060)

/// <see cref="IntermediateOut"/> is used.
/// </summary>
private sealed class IntermediateOutKeys
{
Copy link
Contributor

@harishsk harishsk Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these two nearly identical classes? Or can they be made into a generic class? #Resolved

}

private sealed class IntermediateOut
{
Copy link
Contributor

@harishsk harishsk Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or as we discussed, is it possible to have indices be uint always, since the indices into the array cannot be negative? #Resolved

private sealed class Output
{
public VBuffer<float> Features;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mentioned that this is a dense array. Would it be better to keep it as a sparse array so that the algorithm that is making use of this data can decide at a later time if it needs to convert this into a dense array?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to using a BufferBuilder instead of the VBufferEditor, since it knows how to handle indices out of order. There might be perf implications though, not sure which is better to use in this case.


In reply to: 332210125 [](ancestors = 332210125)

}

private sealed class Output
{
Copy link
Contributor

@harishsk harishsk Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a use case to (say after inference, or during evaluation) to map these indices back to keys or the original strings? In that case, does it make sense to provide a separate function/helper to extract the keys/strings back? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of using string keys to load the data, the resulting IDataView schema will have slot names annotations for the Features column, so the feature names can be extracted this way.


In reply to: 332212446 [](ancestors = 332212446)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see in CreateOutputTransformer inside the if (keyIndices) part).


In reply to: 343596251 [](ancestors = 343596251,332212446)

private readonly IHost _host;
private readonly ITransformer _keyVectorsToIndexVectors;
private readonly FeatureIndices _indicesKind;
private readonly ulong _featureCount;
Copy link
Contributor

@harishsk harishsk Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_featureCount [](start = 31, length = 13)

Is it useful to make this public? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure there is a need to do that. You can get the length of the feature vector when you call GetOutputSchema() on the loader.


In reply to: 332212933 [](ancestors = 332212933)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll mark this as resolved, if you think it's important to make public, please re-activate and let me know.


In reply to: 333650195 [](ancestors = 333650195,332212933)

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

@codecov
Copy link

codecov bot commented Oct 8, 2019

Codecov Report

Merging #4190 into master will increase coverage by 0.16%.
The diff coverage is 97.83%.

@@            Coverage Diff             @@
##           master    #4190      +/-   ##
==========================================
+ Coverage   75.12%   75.29%   +0.16%     
==========================================
  Files         909      913       +4     
  Lines      160240   161210     +970     
  Branches    17252    17329      +77     
==========================================
+ Hits       120377   121377    +1000     
+ Misses      35049    35016      -33     
- Partials     4814     4817       +3
Flag Coverage Δ
#Debug 75.29% <97.83%> (+0.16%) ⬆️
#production 70.66% <95.98%> (+0.14%) ⬆️
#test 90.43% <100%> (+0.14%) ⬆️
Impacted Files Coverage Δ
....Transforms/SvmLight/SvmLightLoaderSaverCatalog.cs 100% <100%> (ø)
test/Microsoft.ML.TestFramework/TestCommandBase.cs 43.51% <100%> (+2.15%) ⬆️
test/Microsoft.ML.Tests/SvmLightTests.cs 100% <100%> (ø)
.../Microsoft.ML.Transforms/SvmLight/SvmLightSaver.cs 95.69% <95.69%> (ø)
...Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs 95.75% <95.75%> (ø)
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs 79.48% <0%> (-20.52%) ⬇️
....ML.AutoML/PipelineSuggesters/PipelineSuggester.cs 83.19% <0%> (-3.37%) ⬇️
src/Microsoft.ML.AutoML/Sweepers/ISweeper.cs 67.08% <0%> (-2.54%) ⬇️
src/Microsoft.ML.AutoML/Sweepers/Parameters.cs 84.32% <0%> (-0.85%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.9% <0%> (-0.41%) ⬇️
... and 12 more

@codemzs
Copy link
Member

codemzs commented Oct 8, 2019

@yaeldekel Thanks for syncing to master so that code coverage report is generated. Looking at code coverage report it seems there are branches in your code that is not covered by a test, can you please take a look here https://codecov.io/gh/dotnet/machinelearning/pull/4190/diff (on left side it will highlight lines in red) thanks! #Resolved

@yaeldekel
Copy link
Author

Yes, I'll take a look. Question - we normally don't test all the cases where exceptions are thrown, do we?


In reply to: 539612360 [](ancestors = 539612360)

@harishsk
Copy link
Contributor

Code coverage as part of continuous integration has recently come in so I think we get to collectively define what the policy should be going forward and your thoughts and opinions on the issue would be most welcome.


In reply to: 540696889 [](ancestors = 540696889,539612360)

@codemzs
Copy link
Member

codemzs commented Oct 10, 2019

We definitely need to test for negative cases, i.e cases where exception is thrown, I can show you in the code base where we already do that. @eerhardt and @sharwell can probably also chip in and provide feedback.


In reply to: 540705504 [](ancestors = 540705504,540696889,539612360)

@harishsk
Copy link
Contributor

I am sure there are tests for negative cases in our code bases where exceptions are thrown. My question is, whether we have been doing it consistently throughout so far. And if not, is that the new policy you are proposing?


In reply to: 540708878 [](ancestors = 540708878,540705504,540696889,539612360)

@sharwell
Copy link
Member

sharwell commented Oct 10, 2019

I generally encourage the creation of tests to cover these cases. Discretion should be used if there is a case that is not easy to hit (we don't want to spend more than a reasonable amount of time trying to find a way to hit it). Where possible, the negative tests should use the normal API entry points, as opposed to side-stepping the API for direct calls to internals/privates. This helps ensure that the error condition which leads to the negative result does not result in abnormal behavior prior to the expected error handling locations. #Resolved

@codemzs
Copy link
Member

codemzs commented Oct 10, 2019

@sharwell Agreed. @yaeldekel lets try to add negative test. #Resolved

@eerhardt
Copy link
Member

// Licensed to the .NET Foundation under one or more agreements.

Any reason these are in ML.Transforms and not ML.Data? All our other "loaders" are in .Data, correct?


Refers to: src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs:1 in 7358266. [](commit_id = 7358266, deletion_comment = False)

public long? NumberOfRows;
}

#pragma warning disable 0649 // Disable warnings about unused members. They are used through reflection.
Copy link
Member

@eerhardt eerhardt Dec 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just do this around the unused members instead of a huge block? #Resolved

Names
}

public sealed class Options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xml comments on all new public API

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these types (FeatureIndices and Options) need to be public? I don't see them being used publicly.


In reply to: 356301711 [](ancestors = 356301711)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently they don't need to be public. There was a discussion about whether there should be a public API that takes an Options parameter, or a FeatureIndices parameter (instead of having one called CreateSvmLightLoader and one called CreateSvmLightLoaderWithFeatureNames). Do you have a preference?


In reply to: 356304532 [](ancestors = 356304532,356301711)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all combinations of enums and options make sense together? Or do some options only make sense for certain values of the enum?

If they all make sense together then I would expose the Options. But if there are options that don’t make sense together, then I would keep it like you have it.

@yaeldekel
Copy link
Author

// Licensed to the .NET Foundation under one or more agreements.

The loader uses some components from Microsoft.ML.Transforms (specifically, CustomMappingTransformer). How critical is it do you think for SvmLightLoader not to be in the transformers assembly? The only other solution I can think of, is to add a new project that would reference both ML.Data and ML.Transforms.


In reply to: 564278518 [](ancestors = 564278518)


Refers to: src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs:1 in 7358266. [](commit_id = 7358266, deletion_comment = False)

@eerhardt
Copy link
Member

I don’t think the assembly it lives in is critical. It just struck me as odd. Thanks for answering my question.

{
public static class LoadingSvmLight
{
// This examples shows all the ways to load data with TextLoader.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextLoader. [](start = 62, length = 11)

Text loader?

Copy link
Member

@codemzs codemzs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@codemzs codemzs merged commit 65e6acd into dotnet:master Dec 26, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a loader/saver for SVMLight file format
6 participants