-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Fix crossgen2 ArgIterator for SystemV Amd64 Unix ABI #499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The ArgIterator was missing proper handling of the SystemV struct classification. There were also several minor issues.
private bool getSystemVAmd64PassStructInRegisterDescriptor(CORINFO_CLASS_STRUCT_* structHnd, SYSTEMV_AMD64_CORINFO_STRUCT_REG_PASSING_DESCRIPTOR* structPassInRegDescPtr) | ||
{ | ||
TypeDesc typeDesc = HandleToObject(structHnd); | ||
|
||
return _systemVStructClassificator.getSystemVAmd64PassStructInRegisterDescriptor(typeDesc, structPassInRegDescPtr); | ||
SystemVStructClassificator.GetSystemVAmd64PassStructInRegisterDescriptor(typeDesc, out *structPassInRegDescPtr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this become a problem when PR #495 that enables multi-threaded compilation is merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one way to reword this is to ask whether the below classification dictionary should become a ConcurrentDictionary or whether we rather need to put locks around its population.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also add the classification to the TypeDesc instead of having a separate cache. We could then set a reference to the classification structure using compare exchange (if we allocated it separately) or assign the passedInRegisters as the last member and use a proper memory barrier / volatile access to ensure it is observed as executed as the last by concurrent threads of execution.
That would actually be similar to the runtime where we have the classification stored in the EEClass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must admit I'm not super familiar with the Linux x64 calling convention specifics but I know this has been a pre-existing deficiency of the Crossgen2 compiler - thanks for fixing this!
{ | ||
private Dictionary<TypeDesc, SYSTEMV_AMD64_CORINFO_STRUCT_REG_PASSING_DESCRIPTOR> _classificationCache = new Dictionary<TypeDesc, SYSTEMV_AMD64_CORINFO_STRUCT_REG_PASSING_DESCRIPTOR>(); | ||
private static Dictionary<TypeDesc, SYSTEMV_AMD64_CORINFO_STRUCT_REG_PASSING_DESCRIPTOR> s_classificationCache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't make statics that refer to type system entities (or NodeCache nodes) in the compiler. Doing so limits reusability of the JitInterface. We already have use cases where one can have multiple type system contexts in the same process: in those cases, static cache is just another word for "memory leak". I would prefer this stays an instance field (it also avoids the problems with races Sergiy pointed out).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that means that in multi-threaded compilation, we will need to run the classification for the same type upto as many times as there are compilation threads and we would have multiple copies of the classification structure. And I would also need one more classifier for the ArgIterator that doesn't seem to be bound to any JitInterface.
So the idea of adding the classification to the TypeDesc and removing the cache seems to be much better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a compromise. We try to keep TypeDesc as small as possible because there's a lot of them.
SystemV classification is only needed on Unix and only for types that are used as method arguments. Is it worth burning sizeof(SYSTEMV_AMD64_CORINFO_STRUCT_REG_PASSING_DESCRIPTOR)
(aligned) bytes on each TypeDesc (or a pointer to the struct, whichever is smaller) to avoid recomputing the same information ~9 times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, for the ArgIterator, it is worse, as we create an instance for every function call to generate GCRefMap. So we would repeat the classification over and over and also possibly create a lot of memory garbage unless we can somehow get the current thread's CorInfoImpl (or propagate the current CorInfoImpl down to the place where we create the ArgIterator instance). I'm not sure how reasonable that would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we create an instance for every function call
We don't have means to pass an extra argument with the classificator to gcrefmap computation?
I'm okay adding this to TypeDesc if the math checks out. But since we only need this information for
- value types,
- used as arguments,
- and on Unix only
my gut feel is that side cache is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, I'll look into passing the classifier to the gcrefmap computation. And I am actually also running an experiment at the moment to check what would happen to the build time of all the pri1 tests and frame work assemblies together if there is no caching of that stuff at all. That might also be an option if it turns out to be causing only a minor difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, disabling the caching completely results in virtually zero difference in compilation time of all the framework assemblies and pri1 tests. In fact, the run without the caching was slightly faster, which I assume is just noise.
With the cache:
Total build time: 5962387 msecs
Framework time: 133722 msecs
Compilation time: 4017379 msecs
Execution time: 1811279 msecs
Without the cache:
Total build time: 5944179 msecs
Framework time: 133107 msecs
Compilation time: 3981134 msecs
Execution time: 1829931 msecs
So I am going to remove the cache.
@janvorli I am confused, where are we currently using ArgIterator on unix. This seems like a problem as we should be missing support in many places across the runtime/jit. |
@jashook the ArgIterator in crossgen2 is used only to generate GCRefMap for arguments of calls. We already had support for the classification on the jit interface before. |
Running without the cache showed no difference in the total build time of all the coreclr pri1 tests and all the runtime assemblies, so I am removing the cache.
@jashook - please also remember that Crossgen2 still hasn't shipped, this is part of the work to get it to preview quality. CoreCLR native code in the legacy Crossgen[1] naturally has this arg iterator logic already. |
The ArgIterator was missing proper handling of the SystemV amd64 struct classification.
There were also several minor issues.
The state with this fix should match the native version.
To successfully compile framework with SuperILC, I had to disable System.Runtime.Serialization.Formatters for now, as we hit an assert in JIT when trying to compile it.