Implement Server-Side Apply for GameServer updates #1907
Comments
Mind if I attempt to summarise, just so I can ensure I've understood the issue?
Did I understand correctly? I can dig into a lot of the reasons why there are conflicts, how they are very normal for a k8s controller system, how the informer cache works, and why optimistic locking is extremely important to a lot of the system, but I wanted to make sure I understood the ticket first before going deeper. That being said, there may be places where server-side apply will work better, but we should be careful with it and make sure we test appropriately at every stage.
Digging in some more: server-side apply is still in beta, and seems to be strongly in flux at the moment (even into 1.18.x). There also does not seem to be a way to do server-side apply from client-go.
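For what it's worth, newer client-go versions can express a server-side apply as a patch with `types.ApplyPatchType` plus a field manager. A minimal sketch using the dynamic client and the `agones.dev/v1` `gameservers` resource; the label key and field-manager name here are purely illustrative, not anything Agones actually sets:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// applyGameServerLabel server-side-applies a single label to a GameServer.
// Only the fields named in the patch become owned by this field manager;
// everything else on the object is left to its current owners.
func applyGameServerLabel(ctx context.Context, cfg *rest.Config, ns, name string) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	gvr := schema.GroupVersionResource{Group: "agones.dev", Version: "v1", Resource: "gameservers"}

	// An apply patch is a (partial) object, sent as application/apply-patch+yaml.
	patch := []byte(fmt.Sprintf(
		`{"apiVersion":"agones.dev/v1","kind":"GameServer","metadata":{"name":%q,"labels":{"example.com/managed-by":"sdkserver"}}}`,
		name))

	_, err = client.Resource(gvr).Namespace(ns).Patch(ctx, name, types.ApplyPatchType, patch,
		metav1.PatchOptions{FieldManager: "example-sdkserver"})
	return err
}
```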
The core observation was that for a single GameServer resource, we would see at least two different `UPDATE` calls carrying the same `resourceVersion`. The performance issue was that each of these conflicting `UPDATE`s cost the API Server a request only to return a 409 and force a retry. Nothing else was operating on that GameServer object at this point in its lifecycle, and the stats reported to Prometheus showed that almost all the load on the API Server was these GameServer `UPDATE` calls.
At later points in its lifecycle, we did see correct behaviour, with each `UPDATE` carrying a fresh `resourceVersion`. Perhaps instead we could use `PATCH` for the step that conflicts. So that's the first two dot-points. The third one is indeed a guess that Server-Side Apply will help in this situation. If we need more detailed numbers/graphs from the tests in June, I'll try and get them when I'm back in the office on Monday, but my understanding at the time was that a colleague of mine raised this issue on a Slack channel, and the request was to try and reproduce it with the existing Agones load-tests, as it wasn't being observed on GCP.
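For context on what those conflicts cost: the conventional client-go pattern is to absorb a 409 by re-reading the object and retrying, so every self-inflicted conflict turns into at least one extra GET plus another UPDATE against the API Server. A generic sketch of that pattern, not Agones's actual code; the client packages and state constant are from the public Agones API, but the exact wiring here is illustrative:

```go
package main

import (
	"context"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	versioned "agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"
)

// setStateWithRetry re-reads the GameServer on every attempt so the Update
// always carries the latest resourceVersion; a 409 just triggers another
// Get+Update round trip rather than surfacing as an error.
func setStateWithRetry(ctx context.Context, client versioned.Interface, ns, name string,
	state agonesv1.GameServerState) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		gs, err := client.AgonesV1().GameServers(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		gsCopy := gs.DeepCopy()
		gsCopy.Status.State = state
		_, err = client.AgonesV1().GameServers(ns).Update(ctx, gsCopy, metav1.UpdateOptions{})
		return err
	})
}
```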
Thanks for the write up! 👍 I feel like we need to take a step back here and be problem first, not solution first, because I'm seeing some good ideas in here, but also lots of suggestions that would cause significant breakages in Agones because of misunderstandings of how certain parts of the system work. If possible, can we get a clear breakdown of the underlying problem and a way to reproduce it?
From your write up above, it sounds like there might be a logic issue in `syncGameServer`, or other logic optimisations that could be made. Unfortunately this is hard to determine without breaking the problem down a bit further in a way that is reproducible, or doing some debugging to specifically confirm that the issue is occurring as described. (I'm wondering if we could confirm this with a new unit test?)
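One way such a unit test could look: drive the sync through the fake clientset and use a reactor to record the `resourceVersion` carried by every GameServer update, failing if two consecutive updates reuse the same value. A rough sketch, assuming the generated fake clientset package (names from memory) and eliding the actual controller/sync wiring:

```go
package gameservers

import (
	"testing"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	agonesfake "agones.dev/agones/pkg/client/clientset/versioned/fake"
	"k8s.io/apimachinery/pkg/runtime"
	k8stesting "k8s.io/client-go/testing"
)

// TestNoStaleResourceVersionOnUpdate records the resourceVersion of every
// GameServer update issued against the fake clientset and fails if the
// controller ever submits the same value twice in a row, which is the
// self-conflict pattern described in this issue.
func TestNoStaleResourceVersionOnUpdate(t *testing.T) {
	client := agonesfake.NewSimpleClientset()

	var seen []string
	client.PrependReactor("update", "gameservers",
		func(action k8stesting.Action) (bool, runtime.Object, error) {
			gs := action.(k8stesting.UpdateAction).GetObject().(*agonesv1.GameServer)
			if n := len(seen); n > 0 && seen[n-1] == gs.ResourceVersion {
				t.Errorf("consecutive updates reused resourceVersion %q", gs.ResourceVersion)
			}
			seen = append(seen, gs.ResourceVersion)
			return false, nil, nil // fall through to the default fake behaviour
		})

	// ... construct the controller with `client` and run syncGameServer here ...
}
```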
Something I've just recalled from June... We found that the scaling problems were much better (still happened, but with a higher ceiling) if we stopped adding data to the GameServer object from the game server side. GameServer data size is a scaling factor because every `UPDATE` sends the whole object back to the API Server. However, this highlights that actually using Server-Side Apply will require some refactoring, as only the Controller should own the Status field, and the SDKServer would be updating a different field that it owns.
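To illustrate the ownership split being suggested: with Server-Side Apply each writer applies only the fields it owns under its own field manager, and the apiserver tracks that ownership in `managedFields`. A sketch under assumptions, not a proposal for the actual schema: the field-manager names and the SDK-owned label are hypothetical, the status is applied on the main resource (if the CRD exposed a status subresource it would go through that instead), and a dynamic `ResourceInterface` for the GameServer resource is assumed:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// applySplitOwnership shows two writers applying disjoint fields. Because the
// two apply patches never touch the same field, neither apply conflicts with
// the other; if they overlapped, the second field manager would get a 409
// unless it set Force.
func applySplitOwnership(ctx context.Context, gs dynamic.ResourceInterface, name string) error {
	// Controller: owns .status and nothing else.
	controllerStatus := []byte(fmt.Sprintf(
		`{"apiVersion":"agones.dev/v1","kind":"GameServer","metadata":{"name":%q},"status":{"state":"Scheduled"}}`,
		name))
	if _, err := gs.Patch(ctx, name, types.ApplyPatchType, controllerStatus,
		metav1.PatchOptions{FieldManager: "agones-controller"}); err != nil {
		return err
	}

	// SDKServer: owns only its own (hypothetical) label, never .status.
	sdkLabel := []byte(fmt.Sprintf(
		`{"apiVersion":"agones.dev/v1","kind":"GameServer","metadata":{"name":%q,"labels":{"agones.example/sdk-state":"Ready"}}}`,
		name))
	_, err := gs.Patch(ctx, name, types.ApplyPatchType, sdkLabel,
		metav1.PatchOptions{FieldManager: "agones-sdkserver"})
	return err
}
```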
Server-side apply is stable (GA) as of Kubernetes 1.22. We are currently supporting 1.21, and I think it's time to reassess this request. We can consider moving now, or after we have updated our Kubernetes support to 1.22 so that we are only relying on stable APIs.
It's been a year and we haven't gotten to this yet. We have, however, updated Kubernetes a few more times and are now well into the range where we could rely on server-side apply being available in the clusters where Agones is running.
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
Is your feature request related to a problem? Please describe.
We have previously observed an issue where the Agones Controller and SDKServer generate a lot of 409-conflict load on the API Server with `UPDATE` commands, because the Controller would send sequential updates with the same `resourceVersion`. (I thought we lodged a bug report at the time, but I can't find it now... I guess the above comment plus a Slack conversation @pgilfillan mentioned having about this is all the follow-up we did.)
At the time we were unclear on why the Controller would be generating conflicts when it was also the most-recent updater of that object, but I noticed the following in the Server-Side Apply documentation:
I suspect that this is the problem, and that a faulty assumption has been made that the update sent from the Controller is updating the local cache, but from observation of the apiserver logs this is not guaranteed.
I suspect this also would help with #1856.
Describe the solution you'd like
Move the Controller and SDKServer to use Server-Side Apply at least for GameServers.
This feature has been beta since 1.16, a version whose upstream (k8s) support already expired in September 2020. GCP and AKS do not support 1.15 or older, and EKS will expire 1.15 in May 2021.
So this would either make Agones require k8s 1.16 or later, or would need to be optional/selected based on the k8s version running.
Describe alternatives you've considered
I was considering suggesting replacing the `UPDATE` with `PATCH`, but right now my understanding of the code-flow is that we rely on `UPDATE`'s optimistic-locking via `resourceVersion` to avoid the Controller and SDKServer conflicting over the `Status` field, specifically in the case where an SDKServer tries to mark itself Allocated at the same time as the Controller tries to mark that GameServer as Unhealthy (because allocation took too long to complete). One of those updates must fail, or the SDKServer and associated game server code may believe itself live and proceed to notify assigned players through a matchmaker, even though the GameServer and Pod will be deleted and cleaned up by the Controller soon after.

Simply replacing `UPDATE` with `PATCH` would move from optimistic locking to last-call-wins, so that's not a usable approach. We want a conflict in this case, but should not have conflicts in the usual case.
It's possible the `PATCH` workflow could be made to work safely, which would massively reduce the load on the API Server (the limiting factor in our scaling tests in May/June this year), but I suspect that would require some refactoring of the GameServer CRD to allow an SDKServer to separately identify its state from the desired state coming from the Controller.

Additional context
The relevant part of the linked comment:
Especially concerning, I recall we observed conflicts between the Controller and itself. It's been 6 months, but my recollection is that we traced a clear example through to the logic for dynamic port allocation, which sends a first `UPDATE` to sync port allocation state, and then immediately sends another `UPDATE`; the latter update flow is also shared with the static-port code-flow. We were seeing the second `UPDATE` with the same `resourceVersion` as the first `UPDATE`, so clearly the Controller had sent a new update but against an old state.

The rest of the flows in `syncGameServer` do not (from memory) trigger multiple `UPDATE` calls, as each of them depends on some external change, e.g., Pod creation or a status update from the SDKServer, to continue to the next step.

As mentioned, this was a while ago, but I suspect we had planned, but never followed up, to reproduce the conflict/performance issue using Agones' own scaling tests, which might be why we never turned the comment into a fully-fledged issue report.