
RFC-008: Variable member rewards based on uptime #6

Merged 1 commit into ibp-network:main on Nov 2, 2024

Conversation

@senseless (Contributor) commented Sep 22, 2024

You can view the RFC here

@dcolley commented Sep 24, 2024

I agree we should all be accountable for our services.
I would prefer decentralised monitoring, and that we agree on the metrics and technical measures.

the SLA is currently:

Rankings:
+1 rank for every day where service meets SLA guidelines
-7 rank for every 15 minutes where service is unresponsive.
-1 rank for every 15 minutes where response time is too high.

We are not using the 'rank' yet?

Service-Level-Agreement:
99.9% Uptime
Less than 100ms Response Time (Network Latency - Response Latency = Response Time)
Polled every 5 minutes
Node telemetry exported to endpoints

On SLA of 99.9%, 43 mins of downtime per month is still within SLA.

Marking a node as down for 1 hour may not reflect this accurately.

Some suggestions for tests / checks:

  • service availability - penalty: slash 1 hour/service

    • wss connection
    • boot nodes
  • Tests that check config - penalty: slash 1 hour/service

    • max rpc connections
    • offchain indexing
    • archive nodes
  • Tests that simulate user experience - penalty: depends on GeoDNS and external factors, perhaps test every 5 mins, and check for average ms over 1 hour?

    • connection time
  • performance tests - we have not agreed on how this should be checked

    • spiking rpc calls to simulate NFT drops etc
    • a ping test of 1000 per second

@senseless (Contributor, Author) commented

I would prefer decentralised monitoring, and that we agree on the metrics and technical measures.

It would be decentralized, multiple members would be operating a DNS authority for dotters.network that embeds the necessary functionality.

We are not using the 'rank' yet?

No, we've been super lax. The ranking system doesn't financially penalize for downtime.

Marking a node as down for 1 hour may not reflect this accurately.

If you don't provide service (for whatever reason) -- you should not be paid as if you did. It's that simple.

@miloskriz commented

Thanks Tom for this proposal,

I strongly agree with the spirit of it, and I am all for its implementation at a more mature time, but I will be rejecting it until the following issues are considered:

  1. the IBP-geodns is a great project but (subjective appreciation ahead) still lacks the refinement and stability required to be used as single source of truth for such a delicate decision as payouts.
  2. Ping storms to the public IPs of each member seem to be part of current approach of IBP-geodns to manage services under dotters.network domain only, so, in any case, this RFC should only be applicable to the estimated 50% of all IBP services supposedly going through that domain.
  3. Responses to ICMP Ping storms are unsuitable to ascertain the availability of individual endpoints and are known to be producing large numbers of both: false positives and false negatives. In conclusion they should be excluded completely as metric used to determine whether an endpoint is up/down.
  4. The ideal metric should be instead the TCP OK response to the /health/readiness call on such individual endpoints. I would expect that a minimum of 3x failed calls spaced 1 minute between each other should be considered a positive failure for that endpoint.
  5. This readiness endpoint is already available in relaychains but may take some time to be available in systemchains / parachains, so again, special / temporal alternatives should be needed.
  6. Only checks directed to the high availability - balanced RPC ports (i.e. 9944) should be considered binding to this variable payment policy (meaning no p2p checks on 30333-similar ports, which potentially report on availability of individual backends).
  7. The measuring of this metric must be taken from within the appointed territory, meaning that for example, services in Oceania must only be measured / enforced and binding within Oceania itself.
  8. This proposal still requires to consider maintenance time allowance for each member, penalising this type of planned downtime discourages good operating practices and may hinder excellence.
  9. Measuring of bootnode availability/functionality should be excluded of this payment policy, as these are provided for free. However members' ethical performance and/or a separate enforcement policy must be devised for such commitment to the community.
  10. In general terms, a sensible approach should be applied to balance out both: the need to measure availability of the members' endpoints, and our duty to clear bandwidth for real users to benefit of the IBP services, too much noise is already going on about the internal use of monitoring requests.

Hope this helps promote a healthy conversation around this proposal.

Thanks again!

Milos

@senseless (Contributor, Author) commented Sep 27, 2024

1. the `IBP-geodns` is a great project but (subjective appreciation ahead) still lacks the refinement and stability required to be used as single source of truth for such a delicate decision as payouts.

It's been active for 2-3 months now without any issues. How long is sufficient time to determine stability?

2. Ping storms to the public IPs of each member seem to be part of current approach of `IBP-geodns` to manage services under `dotters.network`  domain only, so, in any case, this RFC should only be applicable to the estimated 50% of all IBP services supposedly going through that domain.

I activated WSS and SSL checks as of last weekend. The SSL check makes sure there is a valid SSL certificate for the domain in question and that the certificate has more than 5 days until expiry. Flagging it at 5 days before expiry allows DNS to be updated and gives any client-side caches time to clear.
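For illustration only, a minimal sketch of what such an expiry check could look like (Go, using the standard crypto/tls package; the member IP and domain are placeholders, and this is not the actual ibp-geodns code):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"time"
)

// checkCert dials memberIP:443 while presenting `domain` via SNI and returns
// an error if the leaf certificate expires within minDays.
func checkCert(memberIP, domain string, minDays int) error {
	conn, err := tls.Dial("tcp", memberIP+":443", &tls.Config{ServerName: domain})
	if err != nil {
		return err
	}
	defer conn.Close()

	leaf := conn.ConnectionState().PeerCertificates[0]
	if time.Until(leaf.NotAfter) < time.Duration(minDays)*24*time.Hour {
		return fmt.Errorf("certificate for %s expires in less than %d days", domain, minDays)
	}
	return nil
}

func main() {
	// Hypothetical member IP and service domain, purely for illustration.
	if err := checkCert("192.0.2.10", "rpc.dotters.network", 5); err != nil {
		fmt.Println("SSL check failed:", err)
	}
}
```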

The WSS check here is quite extensive. It checks for...

  1. Is full archive node (checks for old block)
  2. Is on the correct network (asset hub vs coretime for example)
  3. Has more than 5 peers and is not syncing.

These checks use a method we could not use with the old ibp monitor: the checker connects to the member's IP address and simulates a connection to the domain, which allows every member to be checked for full compliance. (A rough sketch of that flow follows below.)

These checks are also applied to ibp.network domains, since the ibp.network domains are configured in the config repos and handled identically to dotters.network. The system is already checking for 100% compliance. (All you would need to do to make ibp.network live on the new DNS system is update two or three things and then point your registered name servers to the ones configured.)
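As a rough sketch of the WSS check described above (Go with the gorilla/websocket package; the endpoint URL, expected chain name, and thresholds are placeholders, and the IP-with-domain-simulation dialing is elided), the three compliance checks could look something like this:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/gorilla/websocket"
)

type rpcResult struct {
	Result json.RawMessage `json:"result"`
}

// call sends a single JSON-RPC request over the websocket and returns the raw result.
func call(conn *websocket.Conn, method, params string) (json.RawMessage, error) {
	req := fmt.Sprintf(`{"jsonrpc":"2.0","id":1,"method":"%s","params":%s}`, method, params)
	if err := conn.WriteMessage(websocket.TextMessage, []byte(req)); err != nil {
		return nil, err
	}
	_, msg, err := conn.ReadMessage()
	if err != nil {
		return nil, err
	}
	var r rpcResult
	if err := json.Unmarshal(msg, &r); err != nil {
		return nil, err
	}
	return r.Result, nil
}

func main() {
	// Hypothetical endpoint; the real check dials the member's IP while
	// presenting the configured domain, which is not shown here.
	conn, _, err := websocket.DefaultDialer.Dial("wss://rpc.example.org", nil)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// 1. Archive check: an archive node can return the hash of a very old block.
	oldBlock, _ := call(conn, "chain_getBlockHash", `[1]`)
	// 2. Network check: the reported chain name must match the expected chain.
	chain, _ := call(conn, "system_chain", `[]`)
	// 3. Health check: more than 5 peers and not syncing.
	health, _ := call(conn, "system_health", `[]`)

	var h struct {
		Peers     int  `json:"peers"`
		IsSyncing bool `json:"isSyncing"`
	}
	json.Unmarshal(health, &h)

	ok := string(oldBlock) != "null" && string(chain) == `"Polkadot"` && h.Peers > 5 && !h.IsSyncing
	fmt.Println("compliant:", ok)
}
```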

3. Responses to ICMP Ping storms are unsuitable to ascertain the availability of individual endpoints and are known to be producing large numbers of both: false positives and false negatives. In conclusion they should be excluded completely as metric used to determine whether an endpoint is up/down.

As I believe is mentioned in the RFC, a member is only considered offline during periods where all 3 servers see them offline. I should add that for the past few weeks I don't think anyone has ever been offline due to latency issues. The last one flapping up and down was Metaspan, and that went away after he migrated over to fiber uplinks.

The ICMP check is meant to be a quick catch if a member goes offline. The WSS and SSL checks only run once per 30 minutes, the ICMP check runs once per minute.

4. The ideal metric should be instead the TCP OK response to the [`/health/readiness` ](https://github.com/paritytech/polkadot-sdk/pull/4802) call on such individual endpoints. I would expect that a minimum of 3x failed calls spaced 1 minute between each other should be considered a positive failure for that endpoint.

I could work this into the WSS check or into another check. It's not a problem.
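A minimal sketch of that probe, under the "3 failed calls spaced 1 minute apart" rule Milos suggests (Go; the endpoint URL is hypothetical):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// readinessFailed returns true only after `attempts` consecutive failed
// probes, waiting `interval` between them. Any single successful probe
// clears the failure.
func readinessFailed(url string, attempts int, interval time.Duration) bool {
	client := &http.Client{Timeout: 10 * time.Second}
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			return false
		}
		if err == nil {
			resp.Body.Close()
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return true
}

func main() {
	// Hypothetical endpoint URL, for illustration only.
	if readinessFailed("https://rpc.example.org/health/readiness", 3, time.Minute) {
		fmt.Println("endpoint marked as failed")
	}
}
```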

5. This `readiness` endpoint is already available in relaychains but may take some time to be available in systemchains / parachains, so again, special / temporal alternatives should be needed.

6. Only checks directed to the high availability - balanced RPC ports (i.e. 9944) should be considered binding to this variable payment policy (meaning no p2p checks on 30333-similar ports, which potentially report on availability of individual backends).

This only checks RPC from the load balancer. The point is that services should be available and online.

7. The measuring of this metric must be taken from within the appointed territory, meaning that for example, services in Oceania must only be measured / enforced and binding within Oceania itself.

Latency is not a factor in this equation; I don't understand how distance or latency matters. Multiple members would operate DNS servers, it wouldn't be just me. In addition, we would need to come to an agreement on how many need to see a member offline for them to be considered offline.

8. This proposal still requires to consider maintenance time allowance for each member, penalising this type of planned downtime discourages good operating practices and may hinder excellence.

This would be good to discuss: an acceptable number of offline hours per month. Those could be worked into the calculations so members aren't penalized until after those hours are used.

9. Measuring of bootnode availability/functionality should be excluded of this payment policy, as these are provided for free. However members' ethical performance and/or a separate enforcement policy must be devised for such commitment to the community.

Boot nodes are not a part of this RFC, purely RPC.

10. In general terms, a sensible approach should be applied to balance out both: the need to measure availability of the members' endpoints, and our duty to clear bandwidth for real users to benefit of the IBP services, too much noise is already going on about the internal use of monitoring requests.

I'm not sure I understand what you're saying here. Can you clarify? The point of this is to get back on track to where we would have been if the old ibp monitor had worked out. For example, there are members who were just paid for services they didn't actually have online for the service period. We're still missing quite a few people-paseo and coretime-paseo nodes.

@tugytur (Contributor) commented Oct 1, 2024

I like this and I'm in favor of it.

In addition to the SLA (which determines down-ranking), we should use this method to pay members, and it should apply even if members are within the SLA. That way the SLA determines ranks and this determines how much we pay.

We already have quite strict down-ranking rules for rank 6 members, but this change would then also cover rank 5 members and so on.

We should have a rule that members with recurring problems are removed entirely until those problems are solved.

I would prefer decentralised monitoring, and that we agree on the metrics and technical measures.

We should use the DNS servers as the source of truth for this, since they're the ones taking members in and out of rotation.

On SLA of 99.9%, 43 mins of downtime per month is still within SLA.

A member's colocation (not a single service) has an SLA of 99.99%. If this is breached more than twice, a down-rank to 5 will occur.

@senseless, in addition to individual chain monitoring it would be great if you could ping a defined IP for each member, which would probably be the haproxy server's VIP.

Probes should be done at least every 1-5 minutes.
We should also have a 'maintenance' mode which would not be counted towards this. We should define a set of maintenance windows members can use to be completely offline. We can easily do this with a member just returning a 0 or 1 on a predefined subdomain/path.
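As a sketch of that last idea (the subdomain and path below are hypothetical, not an agreed convention), the maintenance-flag check could be as simple as:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// inMaintenance treats a body of "1" as "member is in a maintenance window";
// anything else (including fetch errors) counts as normal operation.
func inMaintenance(member string) bool {
	resp, err := http.Get("https://" + member + "/maintenance") // hypothetical path
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false
	}
	return strings.TrimSpace(string(body)) == "1"
}

func main() {
	if inMaintenance("status.example-member.net") { // hypothetical subdomain
		fmt.Println("downtime during this window would not be counted")
	}
}
```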

@dcolley commented Oct 2, 2024

I did some work on the DNS and I find the distributed storage frustrating. Could we at least consider a distributed data store?

We should also consider registering assertions that can be corroborated or refuted.
For example:

  • monitor 1 asserts that a ping failed and hence a service is down.
  • monitors 2, 3, ..., n observe the assertion and corroborate or refute it.
  • for some measurements, an average could be representative
  • for others, like ping, a simpler {n of m} rule among monitors could confirm that a server IP is up and reachable (see the sketch after this list).
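A small sketch of the {n of m} idea (monitor names and the quorum value are illustrative only):

```go
package main

import "fmt"

// memberDown returns true when at least `quorum` monitors report the member down.
func memberDown(reports map[string]bool, quorum int) bool {
	down := 0
	for _, isDown := range reports {
		if isDown {
			down++
		}
	}
	return down >= quorum
}

func main() {
	reports := map[string]bool{
		"dns01-us": true,  // asserted ping failure
		"dns02-eu": false, // refuted
		"dns03-sg": true,  // corroborated
	}
	// With a 2-of-3 quorum this member is treated as down.
	fmt.Println("down:", memberDown(reports, 2))
}
```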

@senseless (Contributor, Author) commented

@senseless, in addition to individual chain monitoring it would be great if you could ping a defined IP for each member, which would probably be the haproxy server's VIP.

Probes should be done at least every 1-5 minutes. We should also have a 'maintenance' mode which would not be counted towards this. We should define a set of maintenance windows members can use to be completely offline. We can easily do this with a member just returning a 0 or 1 on a predefined subdomain/path.

This is already done once per minute. Regarding maintenance windows, if we just define a number of acceptable offline hours "baked in", those would be applied first to scheduled or unscheduled downtime (treating both essentially the same). So, taking an average of 730 hours per month: 99.9% uptime would require 729.27 online hours per month, allowing 0.73 hours of downtime; 99.5% uptime would require 726.35 online hours, allowing 3.65 hours of downtime; and 99% uptime would require 722.7 online hours, allowing 7.3 hours of downtime.
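For reference, the arithmetic above as a small sketch (assuming a flat 730-hour month):

```go
package main

import "fmt"

func main() {
	const hoursPerMonth = 730.0
	for _, uptime := range []float64{0.999, 0.995, 0.99} {
		required := hoursPerMonth * uptime      // online hours required
		allowed := hoursPerMonth - required     // downtime hours allowed
		fmt.Printf("%.1f%% uptime: %.2f h online required, %.2f h downtime allowed\n",
			uptime*100, required, allowed)
	}
}
```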

Keep in mind that anyone permitted to join an anycast cluster should be required to have 99.99% uptime, because we can't easily remove a non-performing member from that cluster. That would allow for only about 5 minutes of downtime per month.

@senseless (Contributor, Author) commented

Vote with emoji reactions on this comment

@dcolley commented Oct 21, 2024

I am voting 'aye' for this in principle; however, I would like to see more details before we enact it for payment processing:

  • the implementation (e.g. what test will be done)
    • The GeoDNS monitor alerts have been pretty quiet (for my services), so if that is an indication that the ping strafe test is passing then it seems ok (to me)
  • how monitors communicate to decide if a node is up/down/needs attention (n/m have same result)
  • calculation of uptime/'slash' - does this vary based on the number of days in the month, or do we go with a 4-week cycle?
  • IBP terms allow for 99.9% uptime. This should be factored in, or we should agree to change the documentation to be clear on how things get calculated

@senseless (Contributor, Author) commented

* the implementation (e.g. what test will be done)

Everything gets tested on every release with manual verification of the data/outputs. For testing on this, we'd want at least a couple of weeks of testing to make sure everything is correct. Additionally, before taking it live we need multiple members to operate the nodes and provide DNS resolution, which would likely take another couple of weeks after everything is confirmed.

  * The GeoDNS monitor alerts have been pretty quiet (for my services), so if that is an indication that the ping strafe test is passing then it seems ok (to me)

Yes, the latency problems you were experiencing went away after you swapped out to fiber. I think you were just getting some packet loss on the copper lines.

* how monitors communicate to decide if a node is up/down/needs attention (n/m have same result)

This can be expanded if necessary.

* calculation of uptime/'slash' - does this vary based on the number of days in the month, or do we go with a 4-week cycle?

This is something I'm open to suggestions on, but we'd likely want to go with a 2/3 or 3/3 quorum initially and then expand the number of members involved later on.

* IBP terms allow for 99.9% uptime. This should be factored in, or we should agree to change the documentation to be clear on how things get calculated

Of course this is mentioned in the RFC.

@dcolley commented Oct 28, 2024

For testing on this, we'd want at least a couple of weeks of testing to make sure everything is correct

Apologies if I used the wrong term. By test, I meant what tests we apply to the service (like ping, RPC calls, etc.).

how monitors communicate to decide if a node is up/down/needs attention

How is this currently done in the GeoDNS implementation? Does each monitor control a slice of the globe and determine for itself whether the member/service should be included/excluded? E.g. if DNS01 (USA) finds that a member/service fails the ping, what will happen in the other zones (EU/SG)?

@CoinStudioDOT merged commit 9b8d7a7 into ibp-network:main on Nov 2, 2024
@dcolley commented Nov 20, 2024

For additional info & context:
Today my unique rpc service was removed from DNS due to low peers. I believe this was due to a chain fault.
The chain recovered within 10 mins but my service was removed for the pre-configured 1 hour.

We need to find a way that does not penalise the operator for false negative 'downtime'.

2024-11-20T16:50:18.393552+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:18 [Relaychain] Received imported block via RPC: #23496505 (0x04b4…80ac -> 0x20d6…84ab)
2024-11-20T16:50:18.443625+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:18 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-11-20T16:50:18.693970+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:18 [Relaychain] Received imported block via RPC: #23496505 (0x04b4…80ac -> 0x88f0…ab9d)
2024-11-20T16:50:18.744010+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:18 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-11-20T16:50:20.695123+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:20 Accepting new connection 3/100
2024-11-20T16:50:20.695200+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:20 [next_inner] Deserialization to "core::option::Option<sp_rpc::list::ListOrValue<sp_rpc::number::NumberOrHex>>" failed. Error: Error("data did not match any variant of untagged enum ListOrValue", line: 0, column: 0), input JSON: "\"latest\"]"
2024-11-20T16:50:20.695229+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:20 Error parsing optional "hash" as "Option < ListOrValue < NumberOrHex > >": InvalidParams(data did not match any variant of untagged enum ListOrValue)
2024-11-20T16:50:22.948970+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:22 [Parachain] 💤 Idle (0 peers), best: #6315883 (0x6efa…85d1), finalized #6315883 (0x6efa…8… …kiB/s ⬆ 8.8kiB/s
2024-11-20T16:50:24.000155+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:23 [Relaychain] Received finalized block via RPC: #23496503 (0xf52a…0470 -> 0xa41c…7b28)
2024-11-20T16:50:24.400390+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:24 [Relaychain] Received imported block via RPC: #23496506 (0x88f0…ab9d -> 0xb394…3561)
2024-11-20T16:50:24.450458+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:24 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-11-20T16:50:24.600662+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:24 [Relaychain] Received imported block via RPC: #23496506 (0x88f0…ab9d -> 0x3bfc…5512)
2024-11-20T16:50:24.650801+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:24 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-11-20T16:50:24.700833+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:24 [Relaychain] Received imported block via RPC: #23496506 (0x88f0…ab9d -> 0x9f98…946a)
2024-11-20T16:50:24.750900+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:24 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-11-20T16:50:25.752363+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:25 [Parachain] ✨ Imported #6315884 (0xde01…d22a)
2024-11-20T16:50:26.002382+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:25 [Parachain] ✨ Imported #6315884 (0x1630…a32d)
2024-11-20T16:50:26.853711+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:26 [Parachain] ✨ Imported #6315884 (0xd5f2…0461)
2024-11-20T16:50:27.954882+00:00 unique-rpc1 unique-collator[435]: 2024-11-20 16:50:27 [Parachain] 💤 Idle (6 peers), best: #6315883 (0x6efa…85d1), finalized #6315883 (0x6efa…8… …1kiB/s ⬆ 15.0kiB/s
