Delayed RPC Send Using Tokens #5923
base: unstable
Conversation
# Conflicts: # beacon_node/lighthouse_network/src/rpc/handler.rs
…me protocol per peer
# Conflicts: # beacon_node/lighthouse_network/tests/rpc_tests.rs
This is ready for another review. 🙏 I have added a concurrency limit on the self-limiter. Now, the self-limiter limits outbound requests based on both the number of concurrent requests and tokens (optional). Whether we also need to limit tokens in the self-limiter is still under discussion. Let me know if you have any ideas. (FYI) I also ran lighthouse (this branch) on the testnet for ~24 hours. During this time, the LH node responded with 21 RateLimited errors due to the number of active requests. These errors appear in the logs like the example below. Note that this is about inbound requests, not the self-limiter (outbound requests).
# Conflicts: # beacon_node/lighthouse_network/src/rpc/mod.rs # beacon_node/lighthouse_network/src/rpc/self_limiter.rs
@pawanjay176 @jxs @dapplion @jimmygchen - If anyone has any spare time, I think this is a good one to get in.
I removed
mod self_limiter;

static NEXT_REQUEST_ID: AtomicUsize = AtomicUsize::new(1);

// Maximum number of concurrent requests per protocol ID that a client may issue.
const MAX_CONCURRENT_REQUESTS: usize = 2;
This means we can have at most two blocks_by_range ReqResp requests per peer?
Yes, that's correct. Each peer can have at most two concurrent blocks_by_range ReqResp requests.
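For illustration, here is a minimal, self-contained sketch of the kind of per-(peer, protocol) concurrency check being discussed. The PeerId and Protocol types below are simplified stand-ins, not the actual Lighthouse types, and the real ActiveRequestsLimiter may be structured differently.

use std::collections::HashMap;

// Simplified stand-ins for Lighthouse's PeerId and Protocol types.
type PeerId = u64;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Protocol {
    BlocksByRange,
    BlobsByRange,
}

const MAX_CONCURRENT_REQUESTS: usize = 2;

/// Tracks how many requests are in flight per (peer, protocol) pair.
#[derive(Default)]
struct ActiveRequestsLimiter {
    active: HashMap<(PeerId, Protocol), usize>,
}

impl ActiveRequestsLimiter {
    /// Returns true (and records the request) if the peer is below the limit for this protocol.
    fn allows(&mut self, peer: PeerId, protocol: Protocol) -> bool {
        let count = self.active.entry((peer, protocol)).or_insert(0);
        if *count >= MAX_CONCURRENT_REQUESTS {
            false
        } else {
            *count += 1;
            true
        }
    }

    /// Called when a request finishes (or its response stream terminates).
    fn request_completed(&mut self, peer: PeerId, protocol: Protocol) {
        if let Some(count) = self.active.get_mut(&(peer, protocol)) {
            *count = count.saturating_sub(1);
        }
    }
}

fn main() {
    let mut limiter = ActiveRequestsLimiter::default();
    assert!(limiter.allows(1, Protocol::BlocksByRange));
    assert!(limiter.allows(1, Protocol::BlocksByRange));
    // A third concurrent blocks_by_range request from the same peer is rejected.
    assert!(!limiter.allows(1, Protocol::BlocksByRange));
    // A different protocol from the same peer is counted separately.
    assert!(limiter.allows(1, Protocol::BlobsByRange));
    limiter.request_completed(1, Protocol::BlocksByRange);
    assert!(limiter.allows(1, Protocol::BlocksByRange));
}

Releasing the slot when the request or its response stream completes is what lets the next request from that peer proceed.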
Overall architecture looks good! Just some comments.
Could you add a metric to track the time our own requests are idling in the self-rate limiter? It will help inform whether sync performance is hindered by this new rate limit policy. We should also track the time outbound responses are idling in the rate limiter.
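As a rough sketch of what such a metric could look like if wired up with the prometheus crate directly (Lighthouse registers metrics through its own helpers, so the actual registration code in the PR will differ; the metric name reuses the one listed under Metrics in the PR description below):

use prometheus::{Histogram, HistogramOpts, Registry};
use std::time::Instant;

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();
    let idle_time = Histogram::with_opts(HistogramOpts::new(
        "self_limiter_request_idling_seconds",
        "Time our own requests spent queued in the self rate limiter",
    ))?;
    registry.register(Box::new(idle_time.clone()))?;

    // Record when a request enters the limiter's queue...
    let queued_at = Instant::now();
    // ...and observe the elapsed time once the request is actually sent.
    idle_time.observe(queued_at.elapsed().as_secs_f64());
    Ok(())
}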
peer_id,
connection_id,
substream_id,
response,
We consider the memory cost of storing responses too low to worry? The worst case is buffering 2 x data_columns_by_range for all possible indices for all peers. For 9 blobs per block, that's 131072 * 2 * 9 * 100 / 1e6 = 235 MB.
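A quick sanity check of that arithmetic, under the same assumed inputs as the comment above (~131072 bytes per buffered item, 2 concurrent data_columns_by_range requests, 9 blobs per block, 100 peers):

fn main() {
    // Back-of-the-envelope worst-case buffer size, matching the figures quoted above.
    let bytes = 131_072u64 * 2 * 9 * 100;
    println!("{} bytes ≈ {} MB", bytes, bytes / 1_000_000); // prints ≈ 235 MB
}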
Is the point of this PR to not have an additional global rate limiter? It would help to reduce the worst-case scenario.
Thanks for your feedback! I hadn't considered a global rate limiter, and your point about its potential to mitigate worst-case memory usage is well taken. 💡 I'll look into what configuration values would be optimal.
@@ -139,8 +178,14 @@ impl<Id: ReqId, E: EthSpec> SelfRateLimiter<Id, E> {
    if let Entry::Occupied(mut entry) = self.delayed_requests.entry((peer_id, protocol)) {
        let queued_requests = entry.get_mut();
        while let Some(QueuedRequest { req, request_id }) = queued_requests.pop_front() {
If you remove a QueuedRequest item from queued_requests and the limiter doesn't allow sending, is this request lost?
The QueuedRequest that the limiter did not allow is requeued in queued_requests here:
lighthouse/beacon_node/lighthouse_network/src/rpc/self_limiter.rs (lines 189 to 195 in 1f521d6)
Err((rate_limited_req, wait_time)) => {
    let key = (peer_id, protocol);
    self.next_peer_request.insert(key, wait_time);
    queued_requests.push_front(rate_limited_req);
    // If one fails just wait for the next window that allows sending requests.
    return;
}
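To make the requeue behaviour easier to follow, here is a simplified, self-contained model of the pop_front / push_front pattern above. The QueuedRequest and Limiter types are toy stand-ins, not the real types in self_limiter.rs.

use std::collections::VecDeque;
use std::time::Duration;

// Toy stand-ins for the real queued-request and rate-limiter types.
struct QueuedRequest {
    req: &'static str,
    request_id: usize,
}

/// Pretend limiter: allows the first `budget` requests, then asks the caller to wait.
struct Limiter {
    budget: usize,
}

impl Limiter {
    fn allows(&mut self) -> Result<(), Duration> {
        if self.budget > 0 {
            self.budget -= 1;
            Ok(())
        } else {
            Err(Duration::from_secs(1))
        }
    }
}

/// Mirrors the pattern discussed above: pop a queued request, try to send it, and if
/// the limiter refuses, push it back at the *front* of the queue so it is not lost,
/// then stop until the next window.
fn flush_queue(queue: &mut VecDeque<QueuedRequest>, limiter: &mut Limiter) {
    while let Some(queued) = queue.pop_front() {
        match limiter.allows() {
            Ok(()) => {
                println!("sending request {} ({})", queued.request_id, queued.req);
            }
            Err(wait_time) => {
                // Requeue the rate-limited request and wait for the next window.
                queue.push_front(queued);
                println!("rate limited, retrying in {wait_time:?}");
                return;
            }
        }
    }
}

fn main() {
    let mut queue: VecDeque<_> = (0..3)
        .map(|i| QueuedRequest { req: "blocks_by_range", request_id: i })
        .collect();
    let mut limiter = Limiter { budget: 2 };
    flush_queue(&mut queue, &mut limiter);
    assert_eq!(queue.len(), 1); // the third request is still queued, not lost
}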
Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com>
Hi Akihito, thanks for following this. I left some comments, especially regarding limiting our responses.
I think we should just rate limit incoming requests.
@@ -314,9 +326,10 @@ where
    if matches!(self.state, HandlerState::Deactivated) {
        // we no longer send responses after the handler is deactivated
        debug!(self.log, "Response not sent. Deactivated handler";
            "response" => %response, "id" => inbound_id);
            "response" => %response, "id" => inbound_id);
nit: this indentation is not correct, right?
/// Rate limiter
limiter: Option<RateLimiter>,
/// Rate limiter for our responses.
response_limiter: Option<ResponseLimiter<E>>,
Suggested change:
response_limiter: Option<ResponseLimiter<E>>,
inbound_rate_limiter: Option<ResponseLimiter<E>>,
I would keep the previous naming, to also still follow the naming of the configurations. But mostly what we are rate limiting are external requests, right? Not our responses.
@@ -606,6 +606,20 @@ pub enum ResponseTermination {
    LightClientUpdatesByRange,
}

impl ResponseTermination {
    pub fn protocol(&self) -> Protocol {
Following the naming conventions this could be as_protocol.
Suggested change:
pub fn protocol(&self) -> Protocol {
pub fn as_protocol(&self) -> Protocol {
I didn't know the naming conventions. 💡 Thanks!
/// Rate limiter for our own requests.
self_limiter: Option<SelfRateLimiter<Id, E>>,
outbound_request_limiter: SelfRateLimiter<Id, E>,
Suggested change:
outbound_request_limiter: SelfRateLimiter<Id, E>,
outbound_rate_limiter: SelfRateLimiter<Id, E>,
Here I would follow the same naming, so that it's clear that we are rate limiting requests both outbound and inbound, right?
    request.substream_id,
    event.clone(),
) {
    return;
Can we then leave a comment explaining that, Akihito, as we have on send_request?
Thanks!
// We need to insert the request regardless of whether it is allowed by the limiter,
// since we send an error response (RateLimited) if it is not allowed.
self.active_inbound_requests.insert(id, request.clone());
Hum, this is kinda fishy, it may expose us to DoS attacks, right? Even if we are rate limiting the requester client we are still incrementing active_inbound_requests unboundedly.
Can we rework send_response into an internal version that accepts the request_id, so that we only keep the requests that are not rate limited in active_inbound_requests?
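A minimal sketch (with stand-in types, not the real handler code) of the bounded-map behaviour being suggested, where rate-limited requests are answered immediately and never tracked:

use std::collections::HashMap;

type RequestId = usize;

struct InboundRequest {
    payload: String,
}

/// Only requests that pass the limiter are inserted into `active_inbound_requests`;
/// rejected ones get an immediate RateLimited error, so the map cannot grow from spam.
struct Rpc {
    active_inbound_requests: HashMap<RequestId, InboundRequest>,
}

impl Rpc {
    fn on_inbound_request(&mut self, id: RequestId, request: InboundRequest, allowed: bool) {
        if !allowed {
            // Respond with a RateLimited error without ever tracking the request.
            println!("request {id}: responding with RateLimited, not tracked");
            return;
        }
        // Track the request only until its response has been sent.
        println!("request {id}: accepted ({})", request.payload);
        self.active_inbound_requests.insert(id, request);
    }
}

fn main() {
    let mut rpc = Rpc { active_inbound_requests: HashMap::new() };
    rpc.on_inbound_request(1, InboundRequest { payload: "blocks_by_range".into() }, true);
    rpc.on_inbound_request(2, InboundRequest { payload: "blocks_by_range".into() }, false);
    // Only the allowed request is tracked.
    assert_eq!(rpc.active_inbound_requests.len(), 1);
}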
    }));
}
HandlerEvent::Ok(rpc) => {
    // Inform the limiter that a response has been received.
    match &rpc {
        RPCReceived::Request(_) => unreachable!(),
nit: can we move these sub match arms to the main match to avoid the unreachable?
I feel it's more elegant to repeat the self.events.push code than to have an unreachable here, wdyt Akihito?
Yeah, that's much simpler. Thanks!
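For readers skimming the thread, a tiny self-contained example of the refactor being agreed on here, using stand-in enums rather than the real RPCReceived / HandlerEvent definitions:

// Stand-in enums to illustrate the refactor (not the real Lighthouse types).
enum RpcReceived {
    Request(u32),
    Response(u32),
    EndOfStream,
}

enum HandlerEvent {
    Ok(RpcReceived),
    Err(String),
}

// Before: an inner match that needs an unreachable!() arm, because inbound
// requests are routed elsewhere and should never reach this branch.
fn handle_nested(event: HandlerEvent) {
    match event {
        HandlerEvent::Ok(rpc) => match rpc {
            RpcReceived::Request(_) => unreachable!(),
            RpcReceived::Response(id) => println!("response {id}"),
            RpcReceived::EndOfStream => println!("end of stream"),
        },
        HandlerEvent::Err(e) => println!("error: {e}"),
    }
}

// After: the arms are flattened into the main match, so every case is handled
// explicitly and the unreachable!() disappears, at the cost of repeating a
// little per-arm code.
fn handle_flat(event: HandlerEvent) {
    match event {
        HandlerEvent::Ok(RpcReceived::Request(id)) => println!("request {id}"),
        HandlerEvent::Ok(RpcReceived::Response(id)) => println!("response {id}"),
        HandlerEvent::Ok(RpcReceived::EndOfStream) => println!("end of stream"),
        HandlerEvent::Err(e) => println!("error: {e}"),
    }
}

fn main() {
    handle_nested(HandlerEvent::Ok(RpcReceived::Response(1)));
    handle_nested(HandlerEvent::Ok(RpcReceived::EndOfStream));
    handle_flat(HandlerEvent::Ok(RpcReceived::Request(2)));
    handle_flat(HandlerEvent::Err("oops".to_string()));
}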
Co-authored-by: João Oliveira <hello@jxs.pt>
Issue Addressed
closes #5785
Proposed Changes
The diagram below shows the differences in how the receiver (responder) behaves before and after this PR. The following sentences will detail the changes.
Is there already an active request with the same protocol?
This check is not performed in Before. This is taken from the PR in the consensus-specs, which proposes updates regarding rate limiting and response timeout: https://github.com/ethereum/consensus-specs/pull/3767/files
The PR mentions the requester side. In this PR, I introduced the ActiveRequestsLimiter for the responder side to restrict more than two requests from running simultaneously on the same protocol per peer. If the limiter disallows a request, the responder sends a rate-limited error and penalizes the requester.
Rate limit reached? and Wait until tokens are regenerated
UPDATE: I moved the limiter logic to the behaviour side. #5923 (comment)
The rate limiter is shared between the behaviour and the handler (Arc<Mutex<RateLimiter>>). The handler checks the rate limit and queues the response if the limit is reached. The behaviour handles pruning.
I considered not sharing the rate limiter between the behaviour and the handler, and performing all of this either within the behaviour or the handler. However, I decided against this for the following reasons:
Regarding performing everything within the behaviour: the behaviour is unable to recognize the response protocol when RPC::send_response() is called, especially when the response is RPCCodedResponse::Error. Therefore, the behaviour can't rate limit responses based on the response protocol.
Regarding performing everything within the handler: when multiple connections are established with a peer, there could be multiple handlers interacting with that peer. Thus, we cannot enforce rate limiting per peer solely within the handler. (Any ideas? 🤔)
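As a rough illustration of the shared-limiter arrangement described above, here is a toy token bucket behind Arc<Mutex<...>>; it is not Lighthouse's actual RateLimiter, and note the UPDATE above says the logic was later moved to the behaviour side.

use std::sync::{Arc, Mutex};

// Toy token-bucket limiter shared between a "behaviour" and a "handler".
struct RateLimiter {
    tokens: u32,
}

impl RateLimiter {
    fn allows(&mut self) -> bool {
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            false
        }
    }

    /// Called periodically by the behaviour side to regenerate tokens.
    fn refill(&mut self, tokens: u32) {
        self.tokens += tokens;
    }
}

struct Handler {
    limiter: Arc<Mutex<RateLimiter>>,
}

impl Handler {
    fn try_send_response(&self, response: &str) {
        if self.limiter.lock().unwrap().allows() {
            println!("sending response: {response}");
        } else {
            println!("limit reached, queueing response: {response}");
        }
    }
}

fn main() {
    let shared = Arc::new(Mutex::new(RateLimiter { tokens: 1 }));
    let handler = Handler { limiter: Arc::clone(&shared) };

    handler.try_send_response("blocks_by_range chunk");
    handler.try_send_response("blocks_by_range chunk"); // queued: no tokens left

    // The behaviour side holds the same Arc and tops the bucket up later.
    shared.lock().unwrap().refill(1);
    handler.try_send_response("blocks_by_range chunk");
}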
Additional Info
Naming
I have renamed the fields of the behaviour to make them more intuitive:
Testing
I have run a beacon node with these changes for 24 hours, and it looks to be working fine.
The rate-limited error has not occurred anymore while running this branch.
Metrics
These metrics have been added in this PR.
self_limiter_request_idling_seconds: The time our own request remained idle in the self-limiter.
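A std-only sketch of how that idling time can be measured, with stand-in types rather than the PR's actual queue: the request records when it was queued, and the elapsed time is observed when it is finally dequeued and sent.

use std::collections::VecDeque;
use std::thread::sleep;
use std::time::{Duration, Instant};

// Stand-in for a queued outbound request in the self-limiter.
struct QueuedRequest {
    request_id: usize,
    queued_at: Instant,
}

fn main() {
    let mut queue = VecDeque::new();
    queue.push_back(QueuedRequest { request_id: 1, queued_at: Instant::now() });

    // Simulate the request waiting in the self-limiter's queue.
    sleep(Duration::from_millis(50));

    if let Some(req) = queue.pop_front() {
        let idle_seconds = req.queued_at.elapsed().as_secs_f64();
        // In the PR this value would be observed into the
        // self_limiter_request_idling_seconds histogram; here we just print it.
        println!("request {} idled for {idle_seconds:.3}s", req.request_id);
    }
}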