
Don't perform extra inference during incremental image creation #48054

Merged 1 commit from kf/48047 into master on Jan 5, 2023

Conversation

@Keno (Member) commented Dec 30, 2022

As noted in #48047, we're currently attempting to infer extra methods during incremental image saving, which causes us to miss edges in the image. In particular, in the case of #48047, Cthulhu had the `compile=min` option set, which caused the code instance for `do_typeinf!` not to be inferred. However, it was later nevertheless queued for precompilation, causing inference to occur at an inopportune time.

This PR simply prevents code instances that don't explicitly have the `->precompile` flag set (e.g. the guard instance created for the interpreter) from being enqueued for precompilation. It is not clear that this is necessarily the correct behavior; we may in fact want to infer these method instances, just earlier, before we set up the serializer state. But for now this fixes #48047 for me.

I also included an appropriate test and a warning message if we attempt to enter inference when this is not legal, so any revisit of what should be happening here can hopefully make use of those.
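
As a rough illustration of the kind of guard described above, here is a minimal, self-contained sketch; the flag `writing_out_image` and both functions are hypothetical stand-ins, not the symbols this PR actually adds:

```c
#include <stdio.h>
#include <stdatomic.h>

/* Hypothetical model of the inference-entry guard: once the serializer
 * raises the flag, any attempt to enter inference is refused with a warning. */
static atomic_int writing_out_image;

static void *enter_inference(void)
{
    if (atomic_load_explicit(&writing_out_image, memory_order_relaxed)) {
        /* Inferring now would create edges the image cannot record. */
        fprintf(stderr, "Attempted to enter inference while writing out image.\n");
        return NULL;
    }
    /* ...normal type-inference entry would happen here... */
    return (void *)1;
}

int main(void)
{
    atomic_store(&writing_out_image, 1); /* serializer raises the flag */
    enter_inference();                   /* takes the warning path */
    return 0;
}
```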

@Keno added the "backport 1.9" label (change should be backported to release-1.9) on Dec 30, 2022

```diff
@@ -188,7 +188,7 @@ static int precompile_enq_specialization_(jl_method_instance_t *mi, void *closure)
                 (jl_ir_inlining_cost((jl_array_t*)inferred) == UINT16_MAX)) {
             do_compile = 1;
         }
-        else if (jl_atomic_load_relaxed(&codeinst->invoke) != NULL || jl_atomic_load_relaxed(&codeinst->precompile)) {
+        else if (jl_atomic_load_relaxed(&codeinst->precompile)) {
```
A reviewer (Member) commented:

Maybe instead we can check that invoke is not set to jl_fptr_interpret_call?
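
For concreteness, a toy model of the two conditions under discussion; the struct and the trampoline below are simplifications of `jl_code_instance_t` and `jl_fptr_interpret_call`, not the real definitions:

```c
#include <stddef.h>

typedef void (*invoke_t)(void);
static void interpreter_trampoline(void) {} /* stand-in for jl_fptr_interpret_call */

typedef struct {
    invoke_t invoke;  /* entry point; NULL if nothing has been set up */
    int precompile;   /* the ->precompile flag */
} codeinst_model;

/* The condition as merged: enqueue only explicit ->precompile requests. */
static int enqueue_merged(const codeinst_model *ci)
{
    return ci->precompile;
}

/* The suggested variant: keep the old check, but exclude instances
 * whose only entry point is the interpreter trampoline. */
static int enqueue_suggested(const codeinst_model *ci)
{
    return ci->invoke != interpreter_trampoline &&
           (ci->invoke != NULL || ci->precompile);
}
```

Either way, a guard instance created for the interpreter (whose `invoke` is set but which was never inferred) would no longer qualify; the two variants differ only in whether other already-compiled instances without the flag still do.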

@Keno (Member, Author) replied:

Yes, that would work for this case, but then I don't really understand the purpose and semantics of the `->precompile` field if we basically just ignore it here.

A reviewer (Member) replied:

@timholy wrote this particular section of code.

@Keno (Member, Author) commented Jan 2, 2023

I'm planning to merge this today so Revise starts working again. Obviously, if @timholy has alternative ideas about the correct criterion here, we should do that as a followup.

@Keno force-pushed the kf/48047 branch 2 times, most recently from 8c71279 to 756db23, on January 2, 2023 13:09
@Keno (Member, Author) commented Jan 2, 2023

Looks like the condition here isn't quite enough. Revise triggers the error message I added:

```julia
julia> using Revise
[ Info: Precompiling Revise [295af30f-e4ad-537b-8983-00126c2a3abe]
ERROR: Attempted to enter inference while writing out image.
```

@Keno (Member, Author) commented Jan 2, 2023

Looks like @vtjnash was right and we just can't have that case at all (if we want it, we need to do the re-inference earlier).

@Keno (Member, Author) commented Jan 4, 2023

Sigh, we have a test that fails when some of these get removed from the sysimage. I'm just going to turn off the warning I added here, put back the logic that adds them to the list, and then rely on the check I added at the inference entry to avoid actually doing the inference. Not pretty, but it should fix the bug, and hopefully @vtjnash or @timholy will fix this properly soon.

@Keno merged commit 80aeebe into master on Jan 5, 2023
@Keno deleted the kf/48047 branch on January 5, 2023 12:47
@timholy (Member) commented Jan 5, 2023

Sorry about this. For the last few months I've been ignoring my gmail (where my GitHub notifications go), and I only saw this today.

I think the code in question is old; for example, here it is in 1.8 (julia/src/precompile.c, lines 288 to 311 at 0efc765):

```c
static int precompile_enq_specialization_(jl_method_instance_t *mi, void *closure)
{
    assert(jl_is_method_instance(mi));
    jl_code_instance_t *codeinst = mi->cache;
    while (codeinst) {
        int do_compile = 0;
        if (codeinst->invoke != jl_fptr_const_return) {
            if (codeinst->inferred && codeinst->inferred != jl_nothing &&
                jl_ir_flag_inferred((jl_array_t*)codeinst->inferred) &&
                !jl_ir_flag_inlineable((jl_array_t*)codeinst->inferred)) {
                do_compile = 1;
            }
            else if (codeinst->invoke != NULL || codeinst->precompile) {
                do_compile = 1;
            }
        }
        if (do_compile) {
            jl_array_ptr_1d_push((jl_array_t*)closure, (jl_value_t*)mi);
            return 1;
        }
        codeinst = jl_atomic_load_relaxed(&codeinst->next);
    }
    return 1;
}
```

It seems to be identical to what you started with before this PR.

I think the big difference is that we're calling it from more paths now than we used to be. And probably at a later stage: I moved the time-of-call to after we'd done the system traversal to make sure that the native code for external methods (methods we add to pre-existing modules) and external specializations (new specializations of methods we don't own) also got written, so I was just trying to re-use the work we had already done on system traversal. In rough terms, that moved the call location from somewhere around here:

(julia/src/precompile.c, lines 108 to 116 at dc2b4d9)

```c
void *native_code = NULL;
bool_t emit_native = jl_options.outputo || jl_options.outputbc || jl_options.outputunoptbc || jl_options.outputasm;
bool_t emit_split = jl_options.outputji && emit_native;
ios_t *s = NULL;
ios_t *z = NULL;
int64_t srctextpos = 0;
```

to now being called from here:

```c
*_native_data = jl_precompile_worklist(worklist, extext_methods, new_specializations);
```

The intent is simply to cache "what we've got." I think anything that has precompile set (EDIT: see important clarification in #48054 (comment)) but actually isn't precompiled already is just stuck; in my view, module-level compilation settings have priority over tagging (the tag is set when precompilation occurs, so we expect all the inference to be done by then if it's ever going to be done). The flag is a cache_me rather than a "please compile me" tag. It is intended to make sure we have a way to say "cache me even if we don't have backedges to the worklist we're caching now."
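
To make that concrete, a minimal sketch of the intended semantics; the struct and helpers are illustrative assumptions, not Julia's actual serializer code:

```c
/* Toy model: the flag records work already done; it never requests
 * prospective compilation. Names here are illustrative only. */
typedef struct {
    int precompile; /* effectively "cache_me" */
} codeinst_model;

/* Set at the moment precompilation actually compiles the instance... */
static void on_compiled(codeinst_model *ci)
{
    ci->precompile = 1;
}

/* ...and consulted later: cache it even without backedges into the
 * current worklist, but never infer or compile anything new for it. */
static int should_cache(const codeinst_model *ci, int has_backedge_to_worklist)
{
    return has_backedge_to_worklist || ci->precompile;
}
```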

Does that provide enough of a roadmap to know how to fix this?

@vtjnash (Member) commented Jan 5, 2023

After that system traversal, any code execution will invalidate all of that result, since the system has changed.

@timholy (Member) commented Jan 5, 2023

I think we're agreeing, but I'm not certain I know what you mean. What execution are you referring to?

To clarify, my point is that we shouldn't do any additional inference; if we discover anything that needs inference before we can pass it to LLVM, we shouldn't infer and we shouldn't pass to LLVM. The only reason to pass anything through LLVM (again) is because we need to generate a clean LLVM module for caching, but if we didn't need to do that then I'd say "no more compilation at all."

The flag on the codeinst is misleading (EDIT: see #48054 (comment)) because it should probably be named cacheme rather than precompile. It gets set under certain circumstances to ensure that we cache what we already precompiled, even if there aren't any backedges; it's not prospectively saying "you should precompile this."

@vtjnash (Member) commented Jan 5, 2023

That is fine, but you have to re-run `jl_prepare_serialization_data` after, since the previously prepared data is no longer valid.

@timholy (Member) commented Jan 5, 2023

So are you basically proposing that we call `jl_prepare_serialization_data` again after the `jl_precompile_worklist`? That sounds straightforward. But I thought it was also bad to infer things that the module settings said shouldn't be aggressively compiled? Thus I'm wondering whether just deleting #48054 (comment) is indeed the right answer.
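
A toy model of the call ordering in question; only the two function names come from this thread, and the sequence below is an assumption about the proposal, not the committed fix:

```c
#include <stdio.h>

/* Stand-ins for jl_prepare_serialization_data / jl_precompile_worklist. */
static void prepare_serialization_data(const char *pass)
{
    printf("traverse the system and snapshot its state (%s pass)\n", pass);
}

static void precompile_worklist(void)
{
    printf("compile the worklist; may run inference and mutate the system\n");
}

int main(void)
{
    prepare_serialization_data("first");
    precompile_worklist();
    /* The proposed extra step: snapshot again, because compilation
     * invalidated the result of the first traversal. */
    prepare_serialization_data("second");
    return 0;
}
```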

Understand that because of my LLVM-ignorance I am a little vague about what the various code/opaque-pointer fields of a codeinst actually mean and what their settings can be. So my vague, general statement is that "if we already emitted LLVM code for this thing, we need to re-do that in the clean module; if we didn't, then perhaps we shouldn't now." I trust others know more than I about what to actually change to make it happen that way, and I'll be grateful if someone besides me implements the actual fix. Not because I'm lazy.

@timholy (Member) commented Jan 6, 2023

Ah shoot, I need to correct one thing:

https://github.com/JuliaLang/julia/blame/db00cc1a8455ceb4a8dc3cb8dd6ded3d62e46dcb/src/julia.h#L372

vs

https://github.com/JuliaLang/julia/blame/f8d3bd22f68f10bd2dc3f1be3b74bfa7cb125bc5/src/julia.h#L387

The one I added (to jl_method_instance_t) means "cache me"; I can't speak for the one Jeff added. And I think the one causing trouble here is the one Jeff added, but only because of the new way it's being used. So I'm less sure of the proper solution here; maybe @vtjnash's hint that we just scoop up the whole system a second time is what's required? Not sure.

KristofferC pushed a commit that referenced this pull request Jan 10, 2023

(cherry picked from commit 80aeebe)
@KristofferC removed the "backport 1.9" label on Jan 17, 2023
Successfully merging this pull request may close: Missing backedge somewhere on master? (#48047)

6 participants