-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ComponentCatalog refactor #970
Conversation
Write the AssemblyName into the model, and use it to register the assembly during model load.
Also fix a couple more tests.
Ensure all loaded assemblies are registered in Experiment to maintain compability. Fix tests to not use ComponentCatalog but direct instantiation instead.
cd5e241
to
cde333f
Compare
cde333f
to
430c4a7
Compare
I logged a possibly related issue: #975 |
{ | ||
// Couldn't load as an assembly and not a zip, so warn the user. | ||
ex = ex ?? e; | ||
Console.Error.WriteLine("Warning: Could not load '{0}': {1}", path, ex.Message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not be writing to the Console from libraries. And more generally, why do we swallow this error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was copied out of the ComponentCatalog and moved to an internal class that is only called from the command line. (see src/Microsoft.ML.Maml/HelpCommand.cs and src/Microsoft.ML.ResultProcessor/ResultProcessor.cs) I wasn't going to change the behavior, because I'm sure someone is depending on it.
} | ||
catch (Exception e) | ||
{ | ||
ex = e; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we try to load zip files? Why not either load or unzip depending on the extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above - existing behavior from the command line code that was removed from ComponentCatalog.
throw Contracts.ExceptIO(e, "Extracting extra assembly zip failed: '{0}'", path); | ||
} | ||
|
||
LoadAssembliesInDir(env, dir, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we use zip files and just blast them into a folder instead using Nuget? It only works if the extensions don't use other dependencies. The moment they use other dependencies, we need to worry about versions. Nuget (theoretically) handles all of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is existing behavior that I am refactoring out of ComponentCatalog so it isn't part of the public API anymore. It is only called from 2 places in the command-line (HelpCommand and ResultProcessor).
ComponentCatalog.CacheClassesExtra(_extraAssemblies); |
machinelearning/src/Microsoft.ML.ResultProcessor/ResultProcessor.cs
Lines 334 to 336 in 160b0df
// extra DLLs for dynamic loading | |
[Argument(ArgumentType.Multiple, HelpText = "Extra DLLs", ShortName = "dll")] | |
public string[] ExtraAssemblies = null; |
} | ||
} | ||
|
||
public static IDisposable CreateAssemblyRegistrar(IHostEnvironment env, string loadAssembliesPath = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the only thing that can be done with the registrar is to dispose it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. This is an internal method that is only invoked on that command-line. It creates an object that listens for new assemblies to be loaded, and registers them automatically.
{ | ||
Contracts.CheckValue(env, nameof(env)); | ||
|
||
foreach (Assembly a in AppDomain.CurrentDomain.GetAssemblies()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will attempt to register lots of assemblies (e.g. all framework assemblies). Isn't it wasteful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I brought back the code that checks if the assembly references the assembly containing the LoadableClassAttributeBase
class.
machinelearning/src/Microsoft.ML.Core/ComponentModel/ComponentCatalog.cs
Lines 410 to 421 in 6812cb5
bool found = false; | |
var targetName = target.GetName(); | |
foreach (var name in assembly.GetReferencedAssemblies()) | |
{ | |
if (name.Name == targetName.Name) | |
{ | |
found = true; | |
break; | |
} | |
} | |
if (!found) | |
continue; |
} | ||
|
||
LoadAssembliesInDir(env, dir, false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who deletes the temp path and when?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess nobody. This is an internal API that is only called from 2 places: HelpCommand and ResultProcessor, so it isn't part of our public API after this change.
src/Common/AssemblyLoadingUtils.cs
Outdated
|
||
foreach (var s in _filePrefixesToAvoid) | ||
{ | ||
if (name.StartsWith(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should use the invariant culture/comparison
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. I changed this to be StringComparison.OrdinalIgnoreCase
string name = Path.GetFileName(path).ToLowerInvariant(); | ||
switch (name) | ||
{ | ||
case "cqo.dll": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did we come up with the list? Are we should the list does not have items that are not needed anymore? Might be at least worth to add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the list in the current code. I just moved the list out of the ComponentCatalog class.
machinelearning/src/Microsoft.ML.Core/ComponentModel/ComponentCatalog.cs
Lines 307 to 333 in 6812cb5
private static bool ShouldSkipPath(string path) | |
{ | |
string name = Path.GetFileName(path).ToLowerInvariant(); | |
switch (name) | |
{ | |
case "cqo.dll": | |
case "fasttreenative.dll": | |
case "libiomp5md.dll": | |
case "libvw.dll": | |
case "matrixinterf.dll": | |
case "microsoft.ml.neuralnetworks.gpucuda.dll": | |
case "mklimports.dll": | |
case "microsoft.research.controls.decisiontrees.dll": | |
case "microsoft.ml.neuralnetworks.sse.dll": | |
case "neuraltreeevaluator.dll": | |
case "optimizationbuilderdotnet.dll": | |
case "parallelcommunicator.dll": | |
case "microsoft.ml.runtime.runtests.dll": | |
case "scopecompiler.dll": | |
case "tbb.dll": | |
case "internallearnscope.dll": | |
case "unmanagedlib.dll": | |
case "vcclient.dll": | |
case "libxgboost.dll": | |
case "zedgraph.dll": | |
case "__scopecodegen__.dll": | |
case "cosmosClientApi.dll": |
src/Common/AssemblyLoadingUtils.cs
Outdated
var paths = Directory.EnumerateFiles(dir, "*.dll"); | ||
foreach (string path in paths) | ||
{ | ||
if (filter && ShouldSkipPath(path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should at least log/trace when we skip files. If somebody creates an extension that starts with "Clr" (not very far fetched) it will silently fail to load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've logged an Info
message for this.
private static void LoadAssembliesInDir(IHostEnvironment env, string dir, bool filter) | ||
{ | ||
if (!Directory.Exists(dir)) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you silently return? Why not throw or at least debug assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because there are cases where this is called on potentially non-existing directories, for example here where we try calling it on the AutoLoad
folder, which may not exist. (again that was all existing behavior that I'm refactoring out of ComponentCatalog.)
src/Common/AssemblyLoadingUtils.cs
Outdated
env.ComponentCatalog.RegisterAssembly(assembly); | ||
return assembly; | ||
} | ||
catch (Exception) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, why do we try to swallow all these errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it to only swallow errors from Assembly.LoadFrom
, and log an error, which is the existing behavior.
@@ -385,7 +385,8 @@ private static VersionInfo GetVersionInfo() | |||
verWrittenCur: 0x00010001, // Initial | |||
verReadableCur: 0x00010001, | |||
verWeCanReadBack: 0x00010001, | |||
loaderSignature: LoaderSignature); | |||
loaderSignature: LoaderSignature, | |||
loaderAssemblyName: typeof(MultiOutputRegressionPerInstanceEvaluator).Assembly.FullName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typeof(MultiOutputRegressionPerInstanceEvaluator).Assembly.FullName [](start = 36, length = 67)
So, would it be at all possible for this to have, if left as default or null
, for Assembly.GetCallingAssembly
to be invoked, since every time this is constructed I think it should be in the same class and assembly as is calling this constructor?
Not sure if this is too dangerous with inlining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a small benchmark of using Assembly.GetExecutingAssembly
, and it was considerably slower than using the typeof
approach, which is why I opt'ed to go with the typeof
approach.
I didn't consider the Assembly.GetCallingAssembly
, but it could possibly work in the default cases, but I'm not sure if we should rely on it all over. My thinking was that it is best to be explicit.
I can make the change if you feel strongly about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I don't know, what would have been a change in a handful of files has instead changed to a change in a hundred plus files, because of this choice. So would it not be better to not do this?
Also regarding timings, generally this sort of thing (serialization or deserialization) does not happen in a tight loop so I'm not too worried about it. How slow is it, really?
In reply to: 219941373 [](ancestors = 219941373)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the msdn Remarks section
If the method that calls the GetCallingAssembly method is expanded inline by the just-in-time (JIT) compiler, or if its caller is expanded inline, the assembly that is returned by GetCallingAssembly may differ unexpectedly.
I think explicitly specifying the assembly name is probably the most correct thing to do here.
// These are private since they change over time. If we make them public we risk | ||
// another assembly containing a "copy" of their value when the other assembly | ||
// was compiled, which might not match the code that can load this. | ||
private const uint VerWrittenCur = 0x00010001; | ||
//private const uint VerWrittenCur = 0x00010001; // Initial | ||
private const uint VerWrittenCur = 0x00010002; // Added AssemblyName | ||
private const uint VerReadableCur = 0x00010001; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0x00010001 [](start = 44, length = 10)
Hi @eerhardt, this looks mostly good thank you.
I have tested out your change locally, and noticed that when I try to load a new model in the old code, it fails with invalid format. It should fail with a more informative message that it is an invalid version, and I suspect the reason why it does not is you did not update this.
VerReadableCur
should be bumped, if you did not intend the new format to be readable by the old software. If you did intend the new format to still be readable by the old software, you I have I think a bug somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hypothetically though you could allow this, if you had written the assembly name in such a way that it did not change the format of the stream itself -- perhaps either by putting it into the unused portion of the header, or else in a side stream of the model (that is, another tiny file in the same directory in the zip archive, that will only be read and looked for if the loadname is unrecognized). This might be a nice property. While forwards compatibility is not a goal, neither do we go out of our way to absolutely guarantee that old versions can't read the new formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, there isn't room in the unused portion of the header as the assembly names are rather long strings.
There are non-repository scenarios to worry about right? I don't have access to the Repository
in the TryLoadModelCore
method, where it tries to create the instance from the ComponentCatalog
.
For now, I've bumped the VerReadableCur
value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well all right. It might have been nice to do but whatever, forwards compatibility I guess isn't a primary goal. We've just never broken it so starkly as we have here. 😄 But perhaps it's warranted.
Thanks for the review. I've responded to all the current feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eerhardt !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This executes on the plan outlined in #208 (comment) copied here for easy reading:
New Proposal
ComponentCatalog
from being a static class to being an instance member onEnvironment
. This has been a planned refactoring for ML.NET for a while, but hasn't been funded until now.ComponentCatalog
itself. It will have public APIs to register components, but will not do any registering itself - neither by loading assemblies from disk, nor by scanning loaded assemblies.Assembly.FullName
into the model file. We will then register that assembly with theenv.ComponentCatalog
when loading the model.Under normal circumstances, API users won't have to explicitly register components with the
ComponentCatalog
. Using the API to train models won't require looking up components from a catalog - you just create .NET objects like normal. Loading a trained model from disk will register the components inside of it by loading the Assembly and scanning it forLoadableClass
assembly attributes.Fix #208