refactor(prover): combine proxy and worker #1688
SantiagoPittella
left a comment
Looks amazing, I really like this simplification and the new design.
Left some minor nit comments.
Regarding the gRPC schema, should we remove the references to the proxy and worker from the schemas? It doesn't make much sense to keep them now IMO; we could just have one service definition. And we probably need to extend the service status schema to include the capacity.
I think we should, but I'd like to do that in a different PR (if we get consensus). This would also impact the network monitor and I'd like to isolate that change.
```rust
// Blocking in place is fairly safe since we guarantee that only a single request is
// processed at a time.
//
// This has the downside that requests being proven cannot be cancelled since we are now
// outside the async runtime. This could occur if the server timeout is exceeded, or
// the client cancels the request. A different approach is technically possible, but
// would require more complex logic to handle cancellation in tandem with sync.
tokio::task::block_in_place(|| prover.prove(request)).map(tonic::Response::new)
```
So this is something I'd like to address somehow. But we can try that in a follow-up PR because I think it would complicate this review.
As a summary:
- We have a server.timeout which should cancel a request if it's been enqueued for longer than the timeout. This only works for tasks in the async runtime.
- block_in_place moves the current thread off the async runtime, making it uncancelable.
So right now what happens is that if there is a very long prove request, it completes, but likely all other requests in the queue get timed out. This is of course unfair and not really what we want.
I think what we want is to bound the time a single request spends actually proving. We already have "back-pressure" in the form of the bounded queue capacity so there is no real benefit to punishing those already in the queue.
Cancelling a sync task isn't trivial, but I think we can achieve this using a dedicated prover thread + a manager task which aborts the thread if it takes too long.
Question: why do we need the task here to be sync? Is this because the prove() function is sync? If so, this will change shortly (in v0.21 VM, prove() is an async function).
Oh.. yes, because prove() is (expensive) sync work and would otherwise block the runtime.
In which case I'm happy to leave this as is until we get async prove.
This PR effectively rewrites the miden-remote-prover, removing the distinction between proxy and worker mode by adding a configurable request queue to the "worker". This removes the need for pingora, as we now essentially replace it with a basic queuing mechanism. This was proposed in #1171 (comment).
Given the scale of the refactor, I don't think the diff view will be of much use. Instead, a from-scratch review is probably necessary, though I think (hope) it should be quite straightforward.
Prover gRPC schema
I didn't touch the gRPC schema of the prover. The only real problem here is that we have both a proxy and a worker status definition, and it's now unclear which one they refer to.
This isn't such a big deal, though I will note that the gRPC reflection on the miden-remote-prover now erroneously claims it supports the proxy status because it's part of the same file.
One potential option is to only have a single status implementation, i.e. the proxy one, and the "worker" just always sends an array of one element in its impl. This would also let us add this as a method instead of as a separate service. The original intention here was to host it on internal ports, but I think at this point that intention no longer matters (imo).
Closes #1171
Closes #1341
Closes #935