Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ComponentCatalog refactor #970

Merged
merged 15 commits into from
Sep 25, 2018
Merged

Conversation

eerhardt
Copy link
Member

This executes on the plan outlined in #208 (comment) copied here for easy reading:

New Proposal

  1. We will move ComponentCatalog from being a static class to being an instance member on Environment. This has been a planned refactoring for ML.NET for a while, but hasn't been funded until now.
  2. We will completely remove any implicit scanning for components in ComponentCatalog itself. It will have public APIs to register components, but will not do any registering itself - neither by loading assemblies from disk, nor by scanning loaded assemblies.
  3. Other subsystems (like the GUI, command-line, Entry Points, and model loading) will be responsible for registering the components they require in the manner they require.
  4. During model saving, we will write the Assembly.FullName into the model file. We will then register that assembly with the env.ComponentCatalog when loading the model.
    • Any model that was saved with a previous version of ML.NET, and loaded using the API, will need to explicitly register the components before loading the model. (Or they will need to save the model again with a current version of ML.NET that will save the assembly names.)

Under normal circumstances, API users won't have to explicitly register components with the ComponentCatalog. Using the API to train models won't require looking up components from a catalog - you just create .NET objects like normal. Loading a trained model from disk will register the components inside of it by loading the Assembly and scanning it for LoadableClass assembly attributes.

Fix #208

Write the AssemblyName into the model, and use it to register the assembly during model load.
Ensure all loaded assemblies are registered in Experiment to maintain compability.
Fix tests to not use ComponentCatalog but direct instantiation instead.
@eerhardt eerhardt force-pushed the ComponentCatalogRefactor branch 5 times, most recently from cd5e241 to cde333f Compare September 21, 2018 15:54
@justinormont
Copy link
Contributor

I logged a possibly related issue: #975 "ComponentCatalog logging errors"

{
// Couldn't load as an assembly and not a zip, so warn the user.
ex = ex ?? e;
Console.Error.WriteLine("Warning: Could not load '{0}': {1}", path, ex.Message);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not be writing to the Console from libraries. And more generally, why do we swallow this error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was copied out of the ComponentCatalog and moved to an internal class that is only called from the command line. (see src/Microsoft.ML.Maml/HelpCommand.cs and src/Microsoft.ML.ResultProcessor/ResultProcessor.cs) I wasn't going to change the behavior, because I'm sure someone is depending on it.

}
catch (Exception e)
{
ex = e;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we try to load zip files? Why not either load or unzip depending on the extension?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above - existing behavior from the command line code that was removed from ComponentCatalog.

throw Contracts.ExceptIO(e, "Extracting extra assembly zip failed: '{0}'", path);
}

LoadAssembliesInDir(env, dir, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use zip files and just blast them into a folder instead using Nuget? It only works if the extensions don't use other dependencies. The moment they use other dependencies, we need to worry about versions. Nuget (theoretically) handles all of this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is existing behavior that I am refactoring out of ComponentCatalog so it isn't part of the public API anymore. It is only called from 2 places in the command-line (HelpCommand and ResultProcessor).

ComponentCatalog.CacheClassesExtra(_extraAssemblies);

// extra DLLs for dynamic loading
[Argument(ArgumentType.Multiple, HelpText = "Extra DLLs", ShortName = "dll")]
public string[] ExtraAssemblies = null;

}
}

public static IDisposable CreateAssemblyRegistrar(IHostEnvironment env, string loadAssembliesPath = null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the only thing that can be done with the registrar is to dispose it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. This is an internal method that is only invoked on that command-line. It creates an object that listens for new assemblies to be loaded, and registers them automatically.

{
Contracts.CheckValue(env, nameof(env));

foreach (Assembly a in AppDomain.CurrentDomain.GetAssemblies())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will attempt to register lots of assemblies (e.g. all framework assemblies). Isn't it wasteful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I brought back the code that checks if the assembly references the assembly containing the LoadableClassAttributeBase class.

bool found = false;
var targetName = target.GetName();
foreach (var name in assembly.GetReferencedAssemblies())
{
if (name.Name == targetName.Name)
{
found = true;
break;
}
}
if (!found)
continue;

}

LoadAssembliesInDir(env, dir, false);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who deletes the temp path and when?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess nobody. This is an internal API that is only called from 2 places: HelpCommand and ResultProcessor, so it isn't part of our public API after this change.


foreach (var s in _filePrefixesToAvoid)
{
if (name.StartsWith(s))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should use the invariant culture/comparison

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I changed this to be StringComparison.OrdinalIgnoreCase

string name = Path.GetFileName(path).ToLowerInvariant();
switch (name)
{
case "cqo.dll":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did we come up with the list? Are we should the list does not have items that are not needed anymore? Might be at least worth to add a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the list in the current code. I just moved the list out of the ComponentCatalog class.

private static bool ShouldSkipPath(string path)
{
string name = Path.GetFileName(path).ToLowerInvariant();
switch (name)
{
case "cqo.dll":
case "fasttreenative.dll":
case "libiomp5md.dll":
case "libvw.dll":
case "matrixinterf.dll":
case "microsoft.ml.neuralnetworks.gpucuda.dll":
case "mklimports.dll":
case "microsoft.research.controls.decisiontrees.dll":
case "microsoft.ml.neuralnetworks.sse.dll":
case "neuraltreeevaluator.dll":
case "optimizationbuilderdotnet.dll":
case "parallelcommunicator.dll":
case "microsoft.ml.runtime.runtests.dll":
case "scopecompiler.dll":
case "tbb.dll":
case "internallearnscope.dll":
case "unmanagedlib.dll":
case "vcclient.dll":
case "libxgboost.dll":
case "zedgraph.dll":
case "__scopecodegen__.dll":
case "cosmosClientApi.dll":

var paths = Directory.EnumerateFiles(dir, "*.dll");
foreach (string path in paths)
{
if (filter && ShouldSkipPath(path))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should at least log/trace when we skip files. If somebody creates an extension that starts with "Clr" (not very far fetched) it will silently fail to load.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've logged an Info message for this.

private static void LoadAssembliesInDir(IHostEnvironment env, string dir, bool filter)
{
if (!Directory.Exists(dir))
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you silently return? Why not throw or at least debug assert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there are cases where this is called on potentially non-existing directories, for example here where we try calling it on the AutoLoad folder, which may not exist. (again that was all existing behavior that I'm refactoring out of ComponentCatalog.)

env.ComponentCatalog.RegisterAssembly(assembly);
return assembly;
}
catch (Exception)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, why do we try to swallow all these errors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to only swallow errors from Assembly.LoadFrom, and log an error, which is the existing behavior.

@@ -385,7 +385,8 @@ private static VersionInfo GetVersionInfo()
verWrittenCur: 0x00010001, // Initial
verReadableCur: 0x00010001,
verWeCanReadBack: 0x00010001,
loaderSignature: LoaderSignature);
loaderSignature: LoaderSignature,
loaderAssemblyName: typeof(MultiOutputRegressionPerInstanceEvaluator).Assembly.FullName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typeof(MultiOutputRegressionPerInstanceEvaluator).Assembly.FullName [](start = 36, length = 67)

So, would it be at all possible for this to have, if left as default or null, for Assembly.GetCallingAssembly to be invoked, since every time this is constructed I think it should be in the same class and assembly as is calling this constructor?

Not sure if this is too dangerous with inlining.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a small benchmark of using Assembly.GetExecutingAssembly, and it was considerably slower than using the typeof approach, which is why I opt'ed to go with the typeof approach.

I didn't consider the Assembly.GetCallingAssembly, but it could possibly work in the default cases, but I'm not sure if we should rely on it all over. My thinking was that it is best to be explicit.

I can make the change if you feel strongly about it.

Copy link
Contributor

@TomFinley TomFinley Sep 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I don't know, what would have been a change in a handful of files has instead changed to a change in a hundred plus files, because of this choice. So would it not be better to not do this?

Also regarding timings, generally this sort of thing (serialization or deserialization) does not happen in a tight loop so I'm not too worried about it. How slow is it, really?


In reply to: 219941373 [](ancestors = 219941373)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the msdn Remarks section

If the method that calls the GetCallingAssembly method is expanded inline by the just-in-time (JIT) compiler, or if its caller is expanded inline, the assembly that is returned by GetCallingAssembly may differ unexpectedly.

I think explicitly specifying the assembly name is probably the most correct thing to do here.

// These are private since they change over time. If we make them public we risk
// another assembly containing a "copy" of their value when the other assembly
// was compiled, which might not match the code that can load this.
private const uint VerWrittenCur = 0x00010001;
//private const uint VerWrittenCur = 0x00010001; // Initial
private const uint VerWrittenCur = 0x00010002; // Added AssemblyName
private const uint VerReadableCur = 0x00010001;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x00010001 [](start = 44, length = 10)

Hi @eerhardt, this looks mostly good thank you.

I have tested out your change locally, and noticed that when I try to load a new model in the old code, it fails with invalid format. It should fail with a more informative message that it is an invalid version, and I suspect the reason why it does not is you did not update this.

VerReadableCur should be bumped, if you did not intend the new format to be readable by the old software. If you did intend the new format to still be readable by the old software, you I have I think a bug somewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hypothetically though you could allow this, if you had written the assembly name in such a way that it did not change the format of the stream itself -- perhaps either by putting it into the unused portion of the header, or else in a side stream of the model (that is, another tiny file in the same directory in the zip archive, that will only be read and looked for if the loadname is unrecognized). This might be a nice property. While forwards compatibility is not a goal, neither do we go out of our way to absolutely guarantee that old versions can't read the new formats.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, there isn't room in the unused portion of the header as the assembly names are rather long strings.

There are non-repository scenarios to worry about right? I don't have access to the Repository in the TryLoadModelCore method, where it tries to create the instance from the ComponentCatalog.

For now, I've bumped the VerReadableCur value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well all right. It might have been nice to do but whatever, forwards compatibility I guess isn't a primary goal. We've just never broken it so starkly as we have here. 😄 But perhaps it's warranted.

@eerhardt
Copy link
Member Author

Thanks for the review. I've responded to all the current feedback.

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eerhardt !

Copy link
Member

@codemzs codemzs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants