Description
Problem statement
We have a system where indexers must manually add the aggregator endpoint for specific gateways and we'd like to facilitate the onboarding of new gateways.
But new gateways may be malicious, and we need to make sure that the cost of an attack is higher than the value an attacker can extract. We currently limit the amount an indexer can lose with the max_amount_willing_to_lose setting, but this value must be scaled according to the number of queries per second. If an attacker targets big indexers, max_amount_willing_to_lose alone isn't enough to make the attack unprofitable.
Expectation proposal
We propose an off-chain agreement protocol where gateways can send a registration request and indexers can verify if it's possible to aggregate receipts.
Gateway TAP state machine
graph TD;
Unregistered-->Verifying;
Unregistered-->Blocked;
Verifying-->Blocked;
Blocked-->Verifying;
Verifying-->Allowed;
Allowed-->Denied;
Denied-->Allowed;
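The graph above can be encoded directly as a type. What follows is a minimal sketch, assuming a hypothetical `SenderState` enum and a hypothetical `can_transition_to` helper; neither name comes from the existing codebase.

```rust
/// Hypothetical sender states, one per node of the graph above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SenderState {
    Unregistered,
    Verifying,
    Blocked,
    Allowed,
    Denied,
}

impl SenderState {
    /// Returns true if the graph has an edge from `self` to `next`.
    pub fn can_transition_to(self, next: SenderState) -> bool {
        use SenderState::*;
        matches!(
            (self, next),
            (Unregistered, Verifying)
                | (Unregistered, Blocked)
                | (Verifying, Blocked)
                | (Blocked, Verifying)
                | (Verifying, Allowed)
                | (Allowed, Denied)
                | (Denied, Allowed)
        )
    }
}
```

Keeping the edges in one place would make it easy to reject an unexpected transition, such as Blocked to Allowed, before it is written to the database.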
Behaviour
Currently, our system has only two states: Allowed and Denied. Every gateway that is not in the tap_aggregator_endpoints map is denied as soon as tap-agent tries to create a SenderAccount and can't find the aggregator value.
With this proposal, we update our system to behave differently depending on which state a sender is in.
Unregistered state
This state requires the sender to include a tap-aggregator header in its first request, so that we can register the sender and start aggregating. If a query sent by an unregistered sender doesn't have the header, we deny it right away, possibly with the error "Sender not registered".
We should aggregate the first receipt to verify that the tap-aggregator is working and that we can communicate with it. If it works, we move to the Verifying state; otherwise, we move to the Blocked state.
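The decision for an unregistered sender can be sketched as a small function. This is only an illustration: `Registration`, `register_sender`, and the `try_aggregate` callback (standing in for the real RAV request) are hypothetical names, and the example endpoint is made up.

```rust
/// Hypothetical outcome of handling a query from an unregistered sender.
#[derive(Debug, PartialEq, Eq)]
pub enum Registration {
    /// No tap-aggregator header was sent: reject the query right away.
    Rejected(&'static str),
    /// The first receipt was aggregated successfully.
    Verifying,
    /// The aggregator could not be reached or returned an error.
    Blocked,
}

/// `aggregator_header` is the value of the tap-aggregator header, if present.
/// `try_aggregate` is a stand-in for aggregating the first receipt against
/// that endpoint; it is an assumption of this sketch.
pub fn register_sender<F>(aggregator_header: Option<&str>, try_aggregate: F) -> Registration
where
    F: FnOnce(&str) -> Result<(), String>,
{
    match aggregator_header {
        None => Registration::Rejected("Sender not registered"),
        Some(endpoint) => match try_aggregate(endpoint) {
            Ok(()) => Registration::Verifying,
            Err(_) => Registration::Blocked,
        },
    }
}
```

Passing the aggregation step as a closure keeps the sketch independent of any particular client and makes both branches easy to exercise in tests.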
Verifying state
In this state, no RAV request may fail until we reach a certain (configurable) amount, at which point we can trust the sender and transition it to the Allowed state. If any RAV request fails, we transition to the Blocked state.
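As a rough sketch of this rule, the function below interprets the configurable "amount" as a number of consecutive successful RAV requests. That interpretation is an assumption; the threshold could equally be an aggregated fee value. `Verification`, `on_rav_result`, and `required_ravs` are hypothetical names.

```rust
/// Hypothetical result of processing one RAV response while Verifying.
#[derive(Debug, PartialEq, Eq)]
pub enum Verification {
    /// Still verifying; carries the updated success count.
    InProgress { successful_ravs: u32 },
    /// The threshold was reached: the sender can be trusted.
    Allowed,
    /// A RAV request failed: the sender is blocked.
    Blocked,
}

/// `successful_ravs` is the count before this request.
/// `required_ravs` is the configurable threshold (assumed to be a count).
pub fn on_rav_result(successful_ravs: u32, required_ravs: u32, rav_succeeded: bool) -> Verification {
    if !rav_succeeded {
        // Any single failure moves the sender to Blocked.
        return Verification::Blocked;
    }
    let successful_ravs = successful_ravs.saturating_add(1);
    if successful_ravs >= required_ravs {
        Verification::Allowed
    } else {
        Verification::InProgress { successful_ravs }
    }
}
```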
Blocked state
In this state, we run a backoff retry process in which we keep trying to aggregate the receipts that we have. If an aggregation succeeds, we transition to the Verifying state.
We should not spend many resources trying to aggregate, so after some time of backoff (configurable, but with a small default such as 1 day), we stop trying.
Blocked senders can request an update of their tap-aggregator by sending another query, but that query won't be processed. When an indexer receives a new receipt, the system should tentatively try to verify the gateway again, stopping after the same backoff period.
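One way to bound the retries is exponential backoff with a total time budget. The sketch below is an assumption about how this could look: the proposal only says "backoff" with a default limit around 1 day, so the doubling schedule, the `next_retry_delay` name, and the base delay used in the example are all hypothetical.

```rust
use std::time::Duration;

/// Returns the delay to wait before retry number `attempt` (0-based),
/// doubling `base` on every attempt. Returns `None` once the total time
/// spent waiting, including this delay, would exceed `give_up_after`.
/// `base` is assumed to be non-zero.
pub fn next_retry_delay(attempt: u32, base: Duration, give_up_after: Duration) -> Option<Duration> {
    let mut delay = base;
    let mut waited = Duration::ZERO;
    for _ in 0..attempt {
        // Checked arithmetic: an overflow also means "give up".
        waited = waited.checked_add(delay)?;
        delay = delay.checked_mul(2)?;
    }
    if waited.checked_add(delay)? > give_up_after {
        None
    } else {
        Some(delay)
    }
}
```

With a 30-second base and a 1-day budget, this schedule allows eleven retries (attempts 0 through 10) before giving up, so a blocked sender costs the indexer only a handful of RAV requests.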
Allowed state
This is the current normal operation: each sender has a maximum unaggregated fee that it is allowed to accumulate before being denied. If the pending fees exceed the escrow balance, or the unaggregated fees exceed max_amount_willing_to_lose, we transition to the Denied state.
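The deny condition is a simple disjunction, sketched below. The `should_deny` name and the use of `u128` for fee amounts are assumptions made for illustration.

```rust
/// Returns true if a sender in the Allowed state should be denied:
/// either its pending fees exceed the escrow balance, or its unaggregated
/// fees exceed the indexer's configured loss limit.
pub fn should_deny(
    pending_fees: u128,
    escrow_balance: u128,
    unaggregated_fees: u128,
    max_amount_willing_to_lose: u128,
) -> bool {
    pending_fees > escrow_balance || unaggregated_fees > max_amount_willing_to_lose
}
```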
Denied state
We already have this state in the current system. In this state, we wait for the escrow balance to update, or we retry a RAV request every 30 seconds to lower the unaggregated fees. We then transition back to the Allowed state and resume operation.
Tap Aggregator Header
For a gateway to update its tap-aggregator, it must send a receipt signed by one of its signers on the TAP contracts. The query handler already requires a receipt.
Database modification
- New table responsible for storing tap-aggregator endpoints.
- Upgrade the deny_list table to a sender_state table.
New error responses
- We should notify the sender that it should send the header updating the tap-aggregator in the next request.
- It would be nice to tell the sender whether it is denied or blocked, and for what reason:
  - Low escrow funds
  - Too many pending fees (Tap-Aggregator not working)
  - Verification failed
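The reasons listed above could be modelled as a single error type. This is a sketch only: the `SenderRejection` enum and the message wording are hypothetical, except for "Sender not registered", which is the error suggested for the Unregistered state.

```rust
use std::fmt;

/// Hypothetical reasons returned to a sender whose query is not served.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SenderRejection {
    /// The sender must send the tap-aggregator header in its next request.
    MissingAggregatorHeader,
    /// Denied: escrow funds are too low.
    LowEscrowFunds,
    /// Denied: too many pending fees (Tap-Aggregator not working).
    TooManyPendingFees,
    /// Blocked: verification of the tap-aggregator failed.
    VerificationFailed,
}

impl fmt::Display for SenderRejection {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Illustrative message texts; the final wording is open.
        let message = match self {
            SenderRejection::MissingAggregatorHeader => {
                "Sender not registered: send the tap-aggregator header in the next request"
            }
            SenderRejection::LowEscrowFunds => "Sender denied: low escrow funds",
            SenderRejection::TooManyPendingFees => "Sender denied: too many pending fees",
            SenderRejection::VerificationFailed => "Sender blocked: verification failed",
        };
        f.write_str(message)
    }
}
```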
Alternative considerations
Register route
Instead of sending the aggregator through a header, a dedicated /register route could be used. This would save traffic on the query handler, since the endpoint would arrive in a direct request rather than in a query header. The new request would need to carry a receipt that is aggregated and verified, moving the sender to the Verifying state.
Tap Registry
We also have #94 as another possible solution, but it requires a new contract and a new subgraph, which is not worth it at this time. We'd also need the same verification protocol to guarantee stability.
Synchronous communication between indexer-service and tap-agent
We use async communication by sharing the same database between these two components. If synchronous communication turns out to be necessary, we should consider using #84. This needs a deeper discussion, because it means indexer-service would need to know the tap-agent address.