-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Svmlight loader and saver #4190
Conversation
/// of the form "column=Name" to identify a column to convert. Ideally there would be some | ||
/// other way to specify this other than hacking arguments. The intent of this is to allow | ||
/// things like string keys, a common variant of the format, but one emphatically not allowed | ||
/// by the original format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variation existed in the internal version of this loader, but I have not been able to find any references to it online. How important do you think it is to support this? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed on the call, I think it is important to support this. Since SvmLightLoader is implemented as data loader, it is typically the first stage in the pipeline. If someone has data in this format, they will not be able to load the data into ML.NET if we don't support this.
In reply to: 322014712 [](ancestors = 322014712)
public static class SvmLightLoaderSaverCatalog | ||
{ | ||
/// <summary> | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// [](start = 8, length = 3)
Comments coming soon. #Resolved
// <copyright company="Microsoft Corporation"> | ||
// Copyright (c) Microsoft Corporation. All rights reserved. | ||
// </copyright> | ||
//------------------------------------------------------------------------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the ML .NET header. #Resolved
|
||
namespace Microsoft.ML.Transforms | ||
{ | ||
public static class SvmLightLoaderSaverCatalog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SvmLightLoaderSaverCatalog [](start = 24, length = 26)
Can we please add samples for all of these APIs? #Resolved
@yaeldekel Can you please sync to master and push your changes so we can analyze the code coverage? thanks. #Resolved |
|
||
[Fact] | ||
public void TestSvmLightLoaderAndSaver() | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that this function is encapsulating several tests. Is it possible to refactor them into different tests? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, I'll do that in the next iteration.
In reply to: 332204210 [](ancestors = 332204210)
/// <param name="numberOfRows">The number of rows from the sample to be used for determining the set of feature names.</param> | ||
/// <param name="dataSample">A data sample to be used for determining the set of features names.</param> | ||
public static SvmLightLoader CreateSvmLightLoaderWithFeatureNames(this DataOperationsCatalog catalog, | ||
long? numberOfRows = null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateTextLoader two versions with different parameters. Is that a better practice to follow and call both functions CreateSvmLightLoader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think I should have one method that takes an SvmLightLoader.FeatureIndices
as a parameter?
In that case, perhaps we could come up with a better name for that enum? Any suggestions?
In reply to: 332206307 [](ancestors = 332206307)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think having a method that takes FeatureIndices as a parameter would be helpful to distinguish the two cases. One option would be to just call the enum as Features
and call the enum values as ZeroBasedIndices
, OneBaseIndices
and Names
.
Also, both the methods aboe have numberOfRows as a parameter, but they are in different positions. If the two functions have the same parameter, it would be good to keep them in the same position.
In reply to: 333653705 [](ancestors = 333653705,332206307)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've addressed the parameter-order part of your comment, but I had second thoughts about having a single API that specifies whether the file contains feature indices or feature names - there is a parameter inputSize
which is only relevant in the feature indices case and not in the feature names case, and I'm not sure that having a parameter that is sometimes ignored is the best thing to have - do you have any thoughts on that?
In reply to: 337250107 [](ancestors = 337250107,333653705,332206307)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on your explanation, would reordering and renaming the params as follows work?
// For namesCreateSvmLightLoader to use with indices
public static SvmLightLoader CreateSvmLightLoader(
this DataOperationsCatalog catalog,
long? numberOfRows = null,
IMultiStreamSource dataSample = null,
Features features,
int featureSize,
);
// For namesCreateSvmLightLoaderWithNames
public static SvmLightLoader CreateSvmLightLoader(
this DataOperationsCatalog catalog,
long? numberOfRows = null,
IMultiStreamSource dataSample = null,
Features features
);
The other option is to keep only one version, (whichever you think would be the more common one and change the other one to have the SvmLightLoader.Options class directly as a param. That too has been a pattern in ML.NET.
In either case I think input from @eerhardt would be useful for the public API surface.
In reply to: 343611210 [](ancestors = 343611210,337250107,333653705,332206307)
/// <param name="path">The path to the file.</param> | ||
/// <param name="numberOfRows">The number of rows from the sample to be used for determining the set of feature names.</param> | ||
public static IDataView LoadFromSvmLightFileWithFeatureNames(this DataOperationsCatalog catalog, | ||
string path, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as above.
public VBuffer<int> FeatureKeys; | ||
|
||
public static void ParseIndicesToOneBased(IntermediateInput input, Indices output) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementations of the two functions in this class are nearly identical. Can they be refactored into a single function that takes as a parameter the offset to use? #Resolved
} | ||
|
||
public static void ParseIndicesToZeroBased(IntermediateInput input, Indices output) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the caller asks that the data be be parsed with a one based index, but the data has a zero in its list of keys? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way it is written now, the feature will be ignored. (see line 390).
Do you think we should throw?
In reply to: 332208505 [](ancestors = 332208505)
{ | ||
if (Conversions.Instance.TryParse(in inputValues[i], out int index) && index >= 0) | ||
editor.Values[i] = index; | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the desired behavior if the data is bad? E.g. if the data has duplicate keys? (Maybe throw an exception if the other data loaders throw an exception when the data is loaded?) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added support for keeping track which indices exist in each row (see the OutputMapper
class in line 359). Currently, I am not throwing in case of duplicate keys, just using the value in the first appearance.
Do you think I should throw instead?
In reply to: 332209060 [](ancestors = 332209060)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the both the questions above, I would defer to precedence. If other data loaders throw when bad data is encountered, we should throw here as well.
In reply to: 333656937 [](ancestors = 333656937,332209060)
/// <see cref="IntermediateOut"/> is used. | ||
/// </summary> | ||
private sealed class IntermediateOutKeys | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need these two nearly identical classes? Or can they be made into a generic class? #Resolved
} | ||
|
||
private sealed class IntermediateOut | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or as we discussed, is it possible to have indices be uint always, since the indices into the array cannot be negative? #Resolved
private sealed class Output | ||
{ | ||
public VBuffer<float> Features; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mentioned that this is a dense array. Would it be better to keep it as a sparse array so that the algorithm that is making use of this data can decide at a later time if it needs to convert this into a dense array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to using a BufferBuilder
instead of the VBufferEditor
, since it knows how to handle indices out of order. There might be perf implications though, not sure which is better to use in this case.
In reply to: 332210125 [](ancestors = 332210125)
} | ||
|
||
private sealed class Output | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a use case to (say after inference, or during evaluation) to map these indices back to keys or the original strings? In that case, does it make sense to provide a separate function/helper to extract the keys/strings back? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of using string keys to load the data, the resulting IDataView
schema will have slot names annotations for the Features column, so the feature names can be extracted this way.
In reply to: 332212446 [](ancestors = 332212446)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(see in CreateOutputTransformer
inside the if (keyIndices)
part).
In reply to: 343596251 [](ancestors = 343596251,332212446)
private readonly IHost _host; | ||
private readonly ITransformer _keyVectorsToIndexVectors; | ||
private readonly FeatureIndices _indicesKind; | ||
private readonly ulong _featureCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_featureCount [](start = 31, length = 13)
Is it useful to make this public? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure there is a need to do that. You can get the length of the feature vector when you call GetOutputSchema()
on the loader.
In reply to: 332212933 [](ancestors = 332212933)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll mark this as resolved, if you think it's important to make public, please re-activate and let me know.
In reply to: 333650195 [](ancestors = 333650195,332212933)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
Codecov Report
@@ Coverage Diff @@
## master #4190 +/- ##
==========================================
+ Coverage 75.12% 75.29% +0.16%
==========================================
Files 909 913 +4
Lines 160240 161210 +970
Branches 17252 17329 +77
==========================================
+ Hits 120377 121377 +1000
+ Misses 35049 35016 -33
- Partials 4814 4817 +3
|
@yaeldekel Thanks for syncing to master so that code coverage report is generated. Looking at code coverage report it seems there are branches in your code that is not covered by a test, can you please take a look here https://codecov.io/gh/dotnet/machinelearning/pull/4190/diff (on left side it will highlight lines in red) thanks! #Resolved |
Yes, I'll take a look. Question - we normally don't test all the cases where exceptions are thrown, do we? In reply to: 539612360 [](ancestors = 539612360) |
Code coverage as part of continuous integration has recently come in so I think we get to collectively define what the policy should be going forward and your thoughts and opinions on the issue would be most welcome. In reply to: 540696889 [](ancestors = 540696889,539612360) |
I am sure there are tests for negative cases in our code bases where exceptions are thrown. My question is, whether we have been doing it consistently throughout so far. And if not, is that the new policy you are proposing? In reply to: 540708878 [](ancestors = 540708878,540705504,540696889,539612360) |
I generally encourage the creation of tests to cover these cases. Discretion should be used if there is a case that is not easy to hit (we don't want to spend more than a reasonable amount of time trying to find a way to hit it). Where possible, the negative tests should use the normal API entry points, as opposed to side-stepping the API for direct calls to internals/privates. This helps ensure that the error condition which leads to the negative result does not result in abnormal behavior prior to the expected error handling locations. #Resolved |
@sharwell Agreed. @yaeldekel lets try to add negative test. #Resolved |
public long? NumberOfRows; | ||
} | ||
|
||
#pragma warning disable 0649 // Disable warnings about unused members. They are used through reflection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just do this around the unused members instead of a huge block? #Resolved
Names | ||
} | ||
|
||
public sealed class Options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xml comments on all new public API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these types (FeatureIndices and Options) need to be public? I don't see them being used publicly.
In reply to: 356301711 [](ancestors = 356301711)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently they don't need to be public. There was a discussion about whether there should be a public API that takes an Options
parameter, or a FeatureIndices
parameter (instead of having one called CreateSvmLightLoader
and one called CreateSvmLightLoaderWithFeatureNames
). Do you have a preference?
In reply to: 356304532 [](ancestors = 356304532,356301711)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all combinations of enums and options make sense together? Or do some options only make sense for certain values of the enum?
If they all make sense together then I would expose the Options. But if there are options that don’t make sense together, then I would keep it like you have it.
The loader uses some components from In reply to: 564278518 [](ancestors = 564278518) Refers to: src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs:1 in 7358266. [](commit_id = 7358266, deletion_comment = False) |
I don’t think the assembly it lives in is critical. It just struck me as odd. Thanks for answering my question. |
{ | ||
public static class LoadingSvmLight | ||
{ | ||
// This examples shows all the ways to load data with TextLoader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextLoader. [](start = 62, length = 11)
Text loader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes #4014 .