JIT: Streaming mode for SPMI #98440
Conversation
Create a new SPMI (replay) mode that binds an SPMI process to a collection, a jit, and an initial set of options, and then repeatedly reads method numbers and overriding jit options from a file until the file is closed or contains a line beginning with "stop". Clients can poll the stdout from this mode to extract output from the jit for each invocation, end-delimited by `[streaming] Done.` This seems to be quite a bit faster than launching a new process for each invocation.

For example, given a replay file like

```
49974!JitRLCSE=!JitRLCSEAlpha=0.02!JitRandomCSE=161
49974!JitRLCSE=!JitRLCSEAlpha=0.02!JitRandomCSE=171
49974!JitRLCSE=!JitRLCSEAlpha=0.02!JitRandomCSE=161!JitReplayCSE=1,0!JitReplayCSEReward=0,0
49974!JitRLCSE=!JitRLCSEAlpha=0.02!JitRandomCSE=171!JitReplayCSE=0!JitReplayCSEReward=0
stop
```

one can now do something like:

```
%% superpmi.exe -v q -jitoption JitMetrics=1 clrjit.dll collection.mch -streaming 49974-replay.txt
; Total bytes of code 119, prolog size 6, PerfScore 30.75, instruction count 35, allocated bytes for code 119, num cse 1 num cand 1 RL Policy Gradient Stochastic seq 1,0 likelihoods 0.500,1.000 baseLikelihoods 0.000,0.500,1.000,0.500 spmi index 49974 (MethodHash=870c8ba4) for method System.Threading.Channels.AsyncOperation`1[System.__Canon]:GetResult(short):System.__Canon:this (Tier1)
[streaming] Done.
; Total bytes of code 117, prolog size 6, PerfScore 32.50, instruction count 34, allocated bytes for code 117, num cse 0 num cand 1 RL Policy Gradient Stochastic seq 0 likelihoods 0.500 baseLikelihoods 0.000,0.500,1.000,0.500 spmi index 49974 (MethodHash=870c8ba4) for method System.Threading.Channels.AsyncOperation`1[System.__Canon]:GetResult(short):System.__Canon:this (Tier1)
[streaming] Done.
; Total bytes of code 119, prolog size 6, PerfScore 30.75, instruction count 35, allocated bytes for code 119, num cse 1 num cand 1 RL Policy Gradient Update seq 1,0 updatedparams 0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000 spmi index 49974 (MethodHash=870c8ba4) for method System.Threading.Channels.AsyncOperation`1[System.__Canon]:GetResult(short):System.__Canon:this (Tier1)
[streaming] Done.
; Total bytes of code 117, prolog size 6, PerfScore 32.50, instruction count 34, allocated bytes for code 117, num cse 0 num cand 1 RL Policy Gradient Update seq 0 updatedparams 0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000 spmi index 49974 (MethodHash=870c8ba4) for method System.Threading.Channels.AsyncOperation`1[System.__Canon]:GetResult(short):System.__Canon:this (Tier1)
[streaming] Done.
```

The input format is as shown above: a method number, then option settings each prefixed by `!`. Lines starting with `#` are ignored. `-streaming stdin` is also supported, for your interactive SPMI needs.
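For illustration, parsing one line of this request format could look roughly like the sketch below. This is not the actual SPMI implementation; `StreamingRequest` and `ParseStreamingLine` are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: split a streaming request line of the form
// "<mcIndex>{!<name>=<value>}*" into a method context index plus a
// list of (name, value) jit option pairs.
struct StreamingRequest
{
    int methodIndex = 0;
    std::vector<std::pair<std::string, std::string>> options;
};

bool ParseStreamingLine(const std::string& line, StreamingRequest& req)
{
    if (line.empty() || line[0] == '#')
        return false; // blank lines and '#' comments are ignored

    size_t bang = line.find('!');
    req.methodIndex = std::atoi(line.substr(0, bang).c_str());
    req.options.clear();

    // Walk the '!'-separated "name=value" settings after the index.
    while (bang != std::string::npos)
    {
        size_t next  = line.find('!', bang + 1);
        size_t stop  = (next == std::string::npos) ? line.size() : next;
        std::string token = line.substr(bang + 1, stop - bang - 1);
        size_t eq = token.find('=');
        if (eq != std::string::npos)
        {
            req.options.emplace_back(token.substr(0, eq), token.substr(eq + 1));
        }
        bang = next;
    }

    // A non-numeric line like "stop" parses to index 0 and is rejected.
    return req.methodIndex > 0;
}
```

Note that under this sketch a "stop" line simply fails to parse as a request, which is one way the terminating line could be detected.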
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@BruceForstall PTAL. Not sure this is entirely done, but it is working, and I'd like to get some perspective. For instance, I could add a command verb here so other SPMI capabilities can be added later, or try to get parallel mode working, etc. This mode could also be useful for bisection and so on. I am tempted to use a separate file for output, since stdout/stderr can contain other stray writes that mess up the "protocol". I would need to pass this through to the JIT, and currently you can't reset jitstdout without unloading and reloading. Speaking of which, there are other bits of static state in the JIT that make rerunning like this a bit tricky, in particular the static caches for some config settings that now need to change. I fixed a few of these in CSE but there are likely more. I have this hooked up in RLCSE and it speeds things up by about 4x on windows over launching a process per request. I think it will be even better once I iron out some issues on the RLCSE side.
```diff
@@ -2219,7 +2219,7 @@ void CSE_HeuristicReplay::ConsiderCandidates()
         return;
     }

-static ConfigIntArray JitReplayCSEArray;
+ConfigIntArray JitReplayCSEArray;
```
Do these functions get called multiple times per compilation? Now, `EnsureInit` will always re-build them.
Yes, they do.
I can make these into fields on the heuristic object so they only get set once per compilation. Currently I only expect they'll be read in "training" modes; the greedy policy will have baked-in settings.
```diff
@@ -2075,7 +2075,8 @@ void CodeGen::genEmitMachineCode()
     {
         printf("; ============================================================\n\n");
     }
-    printf(""); // in our logic this causes a flush
+
+    fflush(jitstdout());
```
So the comment above was wrong? (about flushing)
Perhaps? I could not find where `jitstdout` got flushed (it is set to unbuffered in some cases).
```cpp
{
    if (++i >= argc)
    {
        LogError("'-streaming' must be followed by a file name or stdin.");
```
```suggestion
        LogError("'-streaming' must be followed by a file name or `stdin`.");
```
```cpp
printf(" Streaming mode. Read and execute work requests from indicated file (can be 'stdin').\n");
printf(" Each line is a method context number and additional force jit options for that method.\n");
printf(" Blank line or EOF terminates\n");
```
Does streaming work with parallel? If not, can the parser below check that during validation?

The format of the streaming file needs to be better described (it doesn't mention `!` separation, for example).

It seems like the format maybe should be multi-line, with an empty line or something like `[streaming go]` kicking off the run. E.g.,

```
[streaming mc=12345]   // start of streaming parameter block. Maybe mc could be a range? e.g., 1-2,4,6-70
JitRLCSE=
# try 0.02 (comment...)
JitRLCSEAlpha=0.02
JitRandomCSE=161
[streaming go]
```
There could be lots more flexibility here, but the way I'm using it now I do not know very far in advance what the next run will look like, so there aren't opportunities for batching.
Well, it's not about batching; it's about allowing multiple lines for configuration instead of one long single-line unreadable block of text.
Currently all the streaming inputs are written by RLCSE, so a compact format works well. I guess if streaming inputs are authored by hand then readability matters a bit more. How about something like:

```
REPLAY
METHODS
... (one or more lines of method numbers, ranges, etc)
OPTIONS
... (one option per line)
GO
(server carries out command and replies)
```

and later this can be expanded to handle more things, e.g. switching jits or collections or ...
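To make the proposal concrete, a parser for this verb-based format might be shaped roughly like the sketch below. It is purely illustrative; `VerbRequest` and `ParseVerbRequest` are hypothetical names, not SPMI code.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: accumulate METHODS/OPTIONS payload lines until a
// GO line marks the request as ready to run. EOF before GO terminates.
struct VerbRequest
{
    std::vector<std::string> methods;
    std::vector<std::string> options;
    bool complete = false;
};

bool ParseVerbRequest(std::istream& in, VerbRequest& req)
{
    std::string line;
    std::vector<std::string>* target = nullptr;
    while (std::getline(in, line))
    {
        if (line.empty() || line[0] == '#')
            continue;                 // blank lines and comments ignored
        if (line == "GO")
        {
            req.complete = true;      // request is ready; server runs it
            return true;
        }
        else if (line == "REPLAY")
            target = nullptr;         // command verb; payload sections follow
        else if (line == "METHODS")
            target = &req.methods;
        else if (line == "OPTIONS")
            target = &req.options;
        else if (target != nullptr)
            target->push_back(line);  // payload line for current section
    }
    return false;                     // stream closed before GO
}
```

A shape like this also leaves room for the later expansion mentioned above (new verbs alongside `REPLAY`).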
That format seems good to me. I wonder if the "commands" need to use magic characters to make sure the things between them can't be ambiguous, e.g. `:METHODS` or `[METHODS]`. That could be changed later if necessary.
```cpp
// Syntax is dddd { ! <jit-option>=value }*
// Lines starting with '#' are ignored
//
while (fgets(line, sizeof(line), streamFile) != nullptr)
```
IMO, this function could use some refactoring, so we end up with, e.g.:

```cpp
while (ParseNextStreamingRequest())
{
    InvokeJITOnStreamingRequest();
}
```
Seems like there's no point implementing parallel support if the only point of streaming is to reduce process launch overhead, and parallel launches N processes.
The way I'm using it now is to just launch N of these server processes and implement a work distribution queue on the client side. Results with RLCSE show about a 6x improvement over launching a process (current approach will launch a new single-method SPMI process for each minibatch run, so 25 in the example below).
If I create more servers than CPUs (might make sense if SPMI spends time blocked on I/O)
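For illustration, the client-side step of splitting a server's accumulated stdout into per-request chunks at the `[streaming] Done.` marker could look like the sketch below (the real RLCSE client lives in jitutils; this function name is hypothetical).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: split accumulated server stdout into one chunk
// per request, using the "[streaming] Done." end marker as delimiter.
// Text after the last marker (an in-progress reply) is not returned.
std::vector<std::string> SplitByDoneMarker(const std::string& output)
{
    const std::string marker = "[streaming] Done.";
    std::vector<std::string> chunks;
    size_t pos = 0;
    size_t end;
    while ((end = output.find(marker, pos)) != std::string::npos)
    {
        chunks.push_back(output.substr(pos, end - pos));
        pos = end + marker.size();
    }
    return chunks;
}
```

A dispatcher could hand the next queued request to whichever server's stdout most recently produced a complete chunk, which matches the idle-server pool described above.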
Diff results for #98440

Throughput diffs:
- linux/arm64 ran on windows/x64: MinOpts (-0.00% to +0.01%)
- osx/arm64 ran on windows/x64: MinOpts (-0.00% to +0.01%)
- windows/arm64 ran on windows/x64: MinOpts (-0.01% to +0.00%)

Details here
```cpp
reader->Reset(&index, 1);
MethodContextBuffer mcb = reader->GetNextMethodContext();
```
In your scenario, it looks like you re-run compilations on the same MC many times. Instead of re-reading (and re-hydrating) the same MC over and over, why don't you first see if the `mc` you have is the one you want, and re-use it. You could imagine having a small cache of loaded `mc` if that makes sense. You still need to delete (and null out) `mc->cr` after every run.
Seems like it's the `MethodContextBuffer` that is the key thing to cache. Let me try this.

I was also going to have the client run MCS and produce a smaller MCH with just the methods it wants to repeatedly evaluate, but am not sure how to match up the indexes from the big and small sets. Does MCH subsetting keep things in relative order or in index-relative order, or is there no relationship? That is, do we end up with

```
;; old->new index
3,5,2 -> 1,2,3 (arg order)
3,5,2 -> 2,3,1 (index order)
```

or something other?
Not sure about the order; you'll have to check.

As for caching, doesn't the `MethodContext` (not `MethodContextBuffer`) contain the hydrated types?
Yes, added caching.
The only messy part was that MC's always need to have some kind of CR attached, so I now delete the old one and attach a new empty one.
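The caching idea discussed above can be sketched generically as below. This is illustrative only; `MethodContextCache` is a hypothetical name, and the loader callback stands in for the expensive read-plus-rehydrate step (the actual change also has to swap in a fresh CR per run, which this sketch omits).

```cpp
#include <cassert>
#include <functional>
#include <map>

// Hypothetical sketch: cache loaded method contexts by index so repeated
// streaming requests for the same method skip re-reading and re-hydrating.
template <typename TContext>
class MethodContextCache
{
    std::map<int, TContext> m_cache;

public:
    int loads = 0; // counts how many times the loader actually ran

    // Return the cached context for `index`, invoking `load` only on a miss.
    TContext& Get(int index, const std::function<TContext(int)>& load)
    {
        auto it = m_cache.find(index);
        if (it == m_cache.end())
        {
            ++loads;
            it = m_cache.emplace(index, load(index)).first;
        }
        return it->second;
    }
};
```

In the repeated-evaluation scenario (the same `spmi index` compiled with many option sets), the loader runs once per distinct index rather than once per request.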
Diff results for #98440

Throughput diffs:
- linux/arm64 ran on windows/x64: MinOpts (-0.01% to +0.00%)
- windows/arm64 ran on windows/x64: MinOpts (-0.01% to +0.01%)

Details here
Add support to use streaming SPMI servers. When enabled, RLCSE will launch one SPMI server process per core. Work dispatched to servers picks an idle server from the pool. This greatly reduces the overhead of doing single-method evaluations. SPMI streaming mode requires the changes in dotnet/runtime#98440. Update the main Policy Gradient work loop and the greedy status loop to use streaming mode, if enabled. MCMC can leverage this too, but I haven't done that just yet. Add the ability to track server status. Add the ability to log server activity; the intent is that these logs can be replayed by a server instance to diagnose problems, which was useful during bring-up. Add a running log of the Policy Gradient parameter values, so we can see how they change over time. Refactor out some of the code in the main Policy Gradient loops.
I have run a bunch of client-server experiments in conjunction with dotnet/jitutils#397 and it all seems to be working well enough that this is a usable checkpoint. @BruceForstall I would prefer to defer further refactoring since I need to keep the client in sync, am actively using this, and want to focus on some other changes first. Deferred bits: