RFC-008: Variable member rewards based on uptime #6
Conversation
I agree we should all be accountable for our services. The SLA is currently:
We are not using the 'rank' yet?
With an SLA of 99.9%, 43 minutes of downtime per month is still within SLA. Marking a node as down for 1 hour may not reflect this accurately. Some suggestions of tests / checks:
It would be decentralized: multiple members would operate a DNS authority for dotters.network that embeds the necessary functionality.
No, we've been super lax. The ranking system doesn't financially penalize for downtime.
If you don't provide service (for whatever reason), you should not be paid as if you did. It's that simple.
Thanks Tom for this proposal. I strongly agree with the spirit of it, and I am all for its implementation at a more mature time, but I will be rejecting it until the following issues are considered:
Hope this helps promote healthy conversation around this proposal. Thanks again! Milos
It's been active for 2-3 months now without any issues. How long is sufficient time to determine stability?
I activated WSS and SSL checks as of last weekend. The SSL check makes sure there is a valid SSL certificate for the domain in question and that the SSL certificate has > 5 days till expiry. Setting the threshold at 5 days before expiry allows DNS to be updated and gives appropriate time for any client-side caches to clear. The WSS check here is quite extensive. It checks for...
These checks use the method we could not use with the old ibp monitor. It connects to the IP address and simulates a connection to the domain. It allows every member to be checked for full compliance. These checks are also applied to ibp.network domains, since the ibp.network domains are configured in the config repos and handled identically to dotters.network. The system is already checking for 100% compliance. (All you would need to do to make ibp.network live on the new DNS system is update 2/3 things and then point your registered name servers to the ones configured.)
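A minimal sketch of the connect-to-IP technique described above, assuming Python and the standard library (the monitor's real implementation may differ): dial the member's IP directly, present the dotters.network domain via SNI so hostname verification actually applies, and require more than 5 days until certificate expiry. The same direct-to-IP approach would apply to the WSS check.

```python
import socket
import ssl
import time

EXPIRY_MARGIN_DAYS = 5  # threshold taken from the comment above

def ssl_check(ip: str, domain: str, port: int = 443, timeout: float = 10.0) -> bool:
    """Connect to `ip` directly while presenting `domain` via SNI, and verify
    the served certificate is valid for the domain with > 5 days to expiry."""
    ctx = ssl.create_default_context()
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        # server_hostname sets SNI and enables hostname verification, so the
        # certificate must actually cover `domain`, not just be any valid cert.
        with ctx.wrap_socket(raw, server_hostname=domain) as tls:
            cert = tls.getpeercert()
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    days_left = (not_after - time.time()) / 86400
    return days_left > EXPIRY_MARGIN_DAYS
```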
I believe, as mentioned in the RFC, only periods where all 3 servers see them offline would be counted as offline. I should add that for the past few weeks I don't think anyone has been offline due to latency issues. The last one flapping up and down was Metaspan, and that went away after he migrated over to fiber uplinks. The ICMP check is meant to be a quick catch if a member goes offline. The WSS and SSL checks only run once per 30 minutes; the ICMP check runs once per minute.
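A sketch of that rule, with the monitor names, intervals, and quorum size as illustrative placeholders rather than anything the RFC has fixed:

```python
# Illustrative only: mark a member offline for a period only when enough
# monitoring servers agree; with REQUIRED_AGREEMENT == len(MONITORS) this is
# the "all 3 must see them offline" rule described above.
MONITORS = ["dns01", "dns02", "dns03"]   # hypothetical monitor identifiers
REQUIRED_AGREEMENT = 3                   # could later be relaxed to 2-of-3
ICMP_INTERVAL_SECONDS = 60               # quick liveness catch, once per minute
WSS_SSL_INTERVAL_SECONDS = 30 * 60       # heavier checks, once per 30 minutes

def member_offline(reports: dict[str, bool]) -> bool:
    """reports maps monitor id -> True if that monitor sees the member offline."""
    return sum(1 for m in MONITORS if reports.get(m, False)) >= REQUIRED_AGREEMENT
```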
I could work this into the WSS check or into another check. It's not a problem.
This only checks RPC from the load balancer. The point is that services should be available and online.
Latency is not a factor in this equation. I don't understand how distance or latency matters. Multiple members would operate DNS servers, it wouldn't be just me. In addition, we would need to come to agreement on how many need to see a member offline for them to be considered offline.
This would be good to discuss: an acceptable number of offline hours per month. Those could be worked into the calculations so members aren't penalized until after these hours are used up.
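One way that could look, purely as a sketch (the allowance, period length, and proration rule are assumptions, not something the RFC has settled): downtime is first charged against the agreed allowance, and only the excess reduces the payout.

```python
# Hypothetical proration: only downtime beyond the agreed allowance
# reduces the member's payment for the period.
def prorated_reward(base_reward: float, downtime_hours: float,
                    allowance_hours: float, hours_in_period: float = 730.0) -> float:
    excess = max(0.0, downtime_hours - allowance_hours)
    return base_reward * (1.0 - excess / hours_in_period)

# e.g. 2 hours down with a 0.73 h (99.9%) allowance on a base reward of 100 units:
# prorated_reward(100, 2.0, 0.73) -> ~99.83
```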
Boot nodes are not a part of this RFC, purely RPC.
I'm not sure I understand what you're saying here. Can you clarify? The point of this is to get back on track to where we would have been if the old ibp monitor had worked out. For example, there are members who were just paid for services they didn't actually have online for the service period. We're still missing quite a few people-paseo and coretime-paseo nodes.
I like this and I'm in favor of it. In addition to SLAs (which determine down-ranking) we should use this method to pay members. This should also apply if members are within SLA. This way we have the SLA to determine ranks and this to determine how much we pay. We already have quite strict down-ranking rules for rank 6 members, but this change would then also cover rank 5 members and so on. We should have a rule that members with recurring problems are removed until the problems are solved.
We should use the DNS servers as the source of truth for this, since they're the ones taking members in and out of rotation.
A member's colocation (not a single service) has an SLA of 99.99%. If this is breached more than twice, a down-rank to 5 will occur. @senseless in addition to individual chain monitoring it would be great if you could ping a defined IP of each member, which would probably be the HAProxy server's VIP. Probes should be done at least every 1-5 minutes.
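A minimal sketch of such a probe, assuming a plain ICMP ping from each monitor (the member-to-VIP mapping, interval, and use of the system ping binary are all placeholders):

```python
# Illustrative member-level probe: ping each member's HAProxy VIP once per
# minute and record whether it answered.
import subprocess
import time

MEMBER_VIPS = {"example-member": "192.0.2.10"}  # hypothetical member -> VIP map
PROBE_INTERVAL_SECONDS = 60

def vip_reachable(ip: str) -> bool:
    # One echo request with a 2 second timeout; return code 0 means a reply arrived.
    return subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                          capture_output=True).returncode == 0

while True:
    for member, vip in MEMBER_VIPS.items():
        print(member, "up" if vip_reachable(vip) else "down")
    time.sleep(PROBE_INTERVAL_SECONDS)
```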
I did some work on the DNS and I find the distributed storage frustrating. Could we at least consider a distributed data store? We should also consider registering
This is already done once per minute. Re: maintenance windows, if we just define a number of acceptable offline hours "baked in", then those would be applied first to scheduled or unscheduled downtime (treating both essentially the same). So, assuming an average of 730 hours per month: 99.9% uptime would require 729.27 online hours per month, allowing 0.73 hours of downtime per month; 99% uptime would require 722.7 online hours per month and allow 7.3 hours/month of downtime; 99.5% uptime would require 726.35 online hours per month, allowing 3.65 hours/month of downtime. Keep in mind that anyone permitted to join an anycast cluster should be required to have 99.99% uptime, because we can't easily remove a non-performing member from that cluster. That would allow for only about 5 minutes of downtime per month.
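The arithmetic above reduces to a single formula; a quick sketch, using the 730-hour average month assumed in the comment:

```python
# Allowed downtime is simply the part of the period not covered by the SLA.
HOURS_PER_MONTH = 730.0  # average month, as assumed above

def downtime_allowance_hours(sla: float) -> float:
    return HOURS_PER_MONTH * (1.0 - sla)

for sla in (0.999, 0.99, 0.995, 0.9999):
    allowed = downtime_allowance_hours(sla)
    print(f"{sla:.2%}: {HOURS_PER_MONTH - allowed:.2f} h online required, "
          f"{allowed:.2f} h ({allowed * 60:.0f} min) of downtime allowed")

# 99.90%: 729.27 h online required, 0.73 h (44 min) of downtime allowed
# 99.00%: 722.70 h online required, 7.30 h (438 min) of downtime allowed
# 99.50%: 726.35 h online required, 3.65 h (219 min) of downtime allowed
# 99.99%: 729.93 h online required, 0.07 h (4 min) of downtime allowed
```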
Vote as emojis on this comment
I am voting 'aye' for this in principle; however, I would like to see more details before we enact it for payment processing:
Everything gets tested on every release with manual verification of the data / outputs. For testing on this, we'd want at least a couple of weeks to make sure everything is correct. Additionally, before taking it live we need multiple members to operate the nodes and provide DNS resolution, which would likely take another couple of weeks after everything is confirmed.
Yes, the latency problems you were experiencing went away after you swapped out to fiber. I think you were just getting some packet loss on the copper lines.
This can be expanded if necessary.
This is something I'm open to suggestions on, but likely we'd want to go with 2/3 or 3/3 initially and then expand the number of members involved later on.
Of course this is mentioned in the RFC.
Apologies if I used the wrong term. By
How is this currently done in the GeoDNS implementation? Does each monitor control a slice of the globe and determine for itself whether the member/service should be included/excluded? E.g. if DNS01 (USA) finds a member/service ping failed, what will happen in the other zones (EU/SG)?
For additional info & context: We need to find a way that does not penalise the operator for false negative 'downtime'.
You can view the RFC here