XDS Management Server #360
Conversation
This adds a management server based on go-control-plane that fetches resources (clusters, filters) from some source and pushes that config to all connected proxies. The only source implemented so far is a file watcher that watches for changes to a config file on disk - probably most useful for testing; other types, like fetching from k8s, will be added later on.
LGTM, though obviously I have very little opinion on the Go code. 😄
need to figure out where this should live, either here or in a separate repo and have some CI for it too.
I think this should live here if possible; only if we want separate releases, or if it becomes difficult to build and release both, should we have a separate repo. Keeping it here reduces overhead and lets us do codegen more easily.
Would love some thoughts on how we want to handle generated proto code. Currently I checked in the generated code, but I'm not sure if there's a saner approach with Go - I'll port over the other filters when it's clearer how we want to do this.
I think this is related to #346 and I'll write up some more thoughts there.
I concur on this one. Let's keep things in a single repo until we hit such pain that it forces us to move it out 👍🏻
Agree on more discussion in #346, but no issue on my end with checking in generated Go code.
This is super cool!
Some quick feedback on Go:
Running goimports over the code results in a few sorting/formatting changes:
➜ xds-management-server git:(pr/iu/management-server) goimports -w .
➜ xds-management-server git:(pr/iu/management-server) ✗ gst
On branch pr/iu/management-server
Your branch is up to date with 'upstream/iu/management-server'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: filters/debug/v1alpha1/debug.pb.go
modified: pkg/resources/resources.go
Running my favourite mega-linter over the code turned up one issue:
➜ xds-management-server git:(pr/iu/management-server) golangci-lint run
pkg/resources/resources.go:17:2: SA1019: package github.com/golang/protobuf/jsonpb is deprecated: Use the "google.golang.org/protobuf/encoding/protojson" package instead. (staticcheck)
"github.com/golang/protobuf/jsonpb"
^
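For reference, the swap to protojson would look roughly like this - the message here is just a placeholder proto.Message, not a type from this PR:

package main

import (
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
)

// roundTrip sketches the protojson equivalents of the deprecated jsonpb calls.
func roundTrip(msg proto.Message) error {
	data, err := protojson.Marshal(msg) // replaces (&jsonpb.Marshaler{}).MarshalToString(msg)
	if err != nil {
		return err
	}
	return protojson.Unmarshal(data, msg) // replaces jsonpb.Unmarshal(...)
}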
Would also love to see some docstrings on the Go functions, especially the public ones 😄
But this is very cool to see in action 👍🏻
Okay, the PR should be good for a full review now. It has changed a bit since the initial description: it now includes support for watching Agones GameServers as upstream endpoints and pushing custom filter chains to proxies running as Kubernetes pods (configured via annotations).
Ooh, this is so neat. Apologies it took me a while to dig into it.
Note - each of these files needs an Apache header please.
I'm digging through it - I think we need a README to cover what it is doing and how to use it / use it as a base etc.
I expect this will need another pass or two, since it's a chunky PR, and I have to run to do some other things, but wanted to get you at least this feedback.
This is exciting stuff!
ctx, shutdown := context.WithCancel(context.Background())
defer shutdown()

k8sConfig, err := k8s.GetK8sConfig(logger, flags.KubeHost, flags.KubeConfigPath)
Fun question for this: Does this mean there will be one xDS server per Kubernetes cluster?
An interesting assumption 😁 (lots of pros and cons I'm sure) - we should probably make a note of this in the README though.
Not got any strong opinions, especially as this is an example, but just highlighting it so we make a note of it somewhere in some documentation.
It's not an assumption - rather, this is an MVP and what I think we need to test things out on our end. If it turns out we do need multiple clusters I'll add it, but otherwise I'll circle back to it eventually - or a PR is welcome if someone else wants to. We can create an issue to track it if we want.
Makes perfect sense 👍🏻 just want to make sure we cover this in a README.
p.config.GameServersNamespace,
fields.Everything())

gameServerInformer := cache.NewSharedInformer(gameServerListWatch, &agonesv1.GameServer{}, 0)
I would create a SharedInformerFactory (example), and then use the event system to track when GameServers are added and deleted - rather than polling every 100ms. You probably only need Update and Delete events (example).
Then you can also use a Lister to query your GameServers (even if you want to get the full list on each run), and you know it's backed by the local cache so you aren't hammering the K8s API.
You might want to create the SharedInformerFactory in controller.go and pass it into the constructor for the Agones Provider -- that way you can build out fakes and mock the operations in tests.
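To make that concrete, a rough sketch of the shape I mean - the Agones generated-client package paths are assumptions on my part, and the handler bodies are placeholders:

package main

import (
	"context"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	"agones.dev/agones/pkg/client/clientset/versioned"
	"agones.dev/agones/pkg/client/informers/externalversions"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/tools/cache"
)

// watchGameServers wires up a SharedInformerFactory, reacts to GameServer
// events, and uses the Lister (backed by the informer's local cache) for reads.
func watchGameServers(ctx context.Context, client versioned.Interface, namespace string) error {
	factory := externalversions.NewSharedInformerFactoryWithOptions(
		client, 0, externalversions.WithNamespace(namespace))

	gsInformer := factory.Agones().V1().GameServers().Informer()
	gsLister := factory.Agones().V1().GameServers().Lister()

	// React to changes rather than polling the store.
	gsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			if gs, ok := newObj.(*agonesv1.GameServer); ok {
				_ = gs // e.g. recompute endpoints here
			}
		},
		DeleteFunc: func(obj interface{}) {
			// e.g. drop the endpoint for the deleted GameServer
		},
	})

	factory.Start(ctx.Done())
	cache.WaitForCacheSync(ctx.Done(), gsInformer.HasSynced)

	// Lister reads are served from the local cache, not the K8s API.
	gameServers, err := gsLister.GameServers(namespace).List(labels.Everything())
	if err != nil {
		return err
	}
	_ = gameServers
	return nil
}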
I'm not following, why do we want to track the events ourselves? We'd still need to build up a cache and check it periodically, which the cache.Store already does for us. Also, I thought the SharedInformer store is a cache 😮 - where's the k8s api hammering occurring here?
I'm not following, why do we want to track the events ourselves?
My thought with using events, rather than polling, was that we'd end up with fewer no-change events going out to xDS subscriptions, and/or we wouldn't have to track ourselves whether changes have been made between each poll.
where's the k8s api hammering occurring here?
Sorry, I wasn't clear -- my concern was that we might somehow misconfigure the cache (although in retrospect that seems unlikely).
The other thing I like about a SharedInformerFactory is that you won't have to do all that work to set up the cache - it will do all that for you, so I think it'll make the code simpler.
Also, a SharedInformerFactory is a standard way of implementing K8s controllers / interacting with K8s, so it's an API surface that people are used to.
My thought with using events, rather than polling, was that we'd end up with fewer no-change events going out to xDS subscriptions, and/or we wouldn't have to track ourselves whether changes have been made between each poll.
How do we end up with no-change events going to XDS in the current impl? I'm still not entirely sure I understand the concern correctly, but the idea is that we send an XDS update only when the cluster changes (not every gameserver event translates to a cluster change). Not sure how we'd do this without remembering the previous cluster state?
Then by periodically checking for updates to send to XDS, we have control over how often we're broadcasting to the proxies (so we batch updates to them) rather than triggering a broadcast for every gameserver change (which we can't control).
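For illustration, the approach is roughly a loop like the one below - the Endpoint type and the get/push functions are stand-ins, not the PR's actual code:

package main

import (
	"context"
	"reflect"
	"time"
)

// Endpoint is a placeholder for the PR's cluster.Endpoint type.
type Endpoint struct {
	IP   string
	Port int
}

// runUpdateLoop polls the current endpoint set on an interval and only pushes
// an xDS snapshot when something changed, so GameServer events that happened
// between ticks are batched into one broadcast. get and push are hypothetical
// stand-ins for the provider and snapshot-cache code.
func runUpdateLoop(ctx context.Context, interval time.Duration, get func() []Endpoint, push func([]Endpoint)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var prev []Endpoint
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			curr := get() // e.g. read from the informer's local cache
			if reflect.DeepEqual(prev, curr) {
				continue // no cluster change, so nothing is broadcast
			}
			prev = curr
			push(curr) // at most one broadcast per interval
		}
	}
}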
Then by periodically checking for updates to send to XDS, we have control over how often we're broadcasting to the proxies (so we batch updates to them) rather than triggering a broadcast for every gameserver change (which we can't control).
Oh that is an interesting point! That being said, when would we not want it to be updated in as close to realtime as possible? 🤔
but the idea is that we send an XDS update only when the cluster changes
I didn't manage to catch this code on first pass. I'll go hunting for it again.
If we did use the event system, we could track GameServer add and delete events - which would always be a change in endpoints (we could ignore updates - as GameServers can't change their IP or ports). On each of those events you would then know that the GameServer has been added or removed, so one could send a delta event rather than a full snapshot, without having to do your own comparison of snapshots (or do a delta to the internal snapshot, and send at that time).
You wouldn't have to check that the stored snapshots had changed then, as you know that there is a definite change.
The downside here -- we may want to do a full update of all GameServers anyway on xDS server startup / just in case - so we may end up in the same place.
Not 100% wedded to this idea, but just wanted to explore to see if it made things easier / implementation simpler.
we could ignore updates - as GameServers can't change their IP or ports
Tokens could change too which would be update events.
On each of those events you would then know that the GameServer has been added or removed, so one could send a delta event rather than a full snapshot
The xds snapshot cache itself currently can only be updated with full snapshots rather than delta events. And with the xds delta protocol too, when a new node joins we need to send it a full snapshot before any deltas.
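For context, a full-snapshot update against go-control-plane looks roughly like the sketch below - the exact NewSnapshot/SetSnapshot signatures vary between go-control-plane releases, so treat this as illustrative only:

package main

import (
	"context"

	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

// pushClusters replaces the node's snapshot wholesale - the snapshot cache has
// no API here for applying an incremental diff on top of an existing snapshot.
// Signatures match recent go-control-plane releases and may differ from the
// version used in this PR.
func pushClusters(ctx context.Context, c cachev3.SnapshotCache, nodeID, version string, clusters []types.Resource) error {
	snapshot, err := cachev3.NewSnapshot(version, map[resourcev3.Type][]types.Resource{
		resourcev3.ClusterType: clusters,
	})
	if err != nil {
		return err
	}
	return c.SetSnapshot(ctx, nodeID, snapshot)
}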
Tokens could change too which would be update events.
Ah yep, I see that now.
The xds snapshot cache itself currently can only be updated with full snapshots rather than delta events.
Aaah, in which case there isn't much gain from using events. I rescind my suggestion to use events.
I do think we should use a SharedInformerFactory - it's less setup to create, and means you can also share the cache if that becomes a requirement in the future.
I found a nice SO post on breaking down the various layers of abstraction:
https://stackoverflow.com/a/59544592
This is looking great 😃
@markmandel I think I've addressed all the comments for this now, PTAL 🙏
podInformer := informerFactory.Core().V1().Pods().Informer()
go podInformer.Run(ctx.Done())
- podInformer := informerFactory.Core().V1().Pods().Informer()
- go podInformer.Run(ctx.Done())
+ informerFactory.Start(ctx.Done())
Rather than starting individual informers, start the InformerFactory as a whole - then people don't need to know which informers still need to be started.
https://pkg.go.dev/k8s.io/client-go@v0.22.2/informers/internalinterfaces#SharedInformerFactory
(Also, if you want to create this in controller.go and pass it down, this should make things much easier.)
I'm not really able to use Start, it seems - that function doesn't work in the unit tests with the fake client.
🤔 Not quite sure what issue is blocking you here. Can you expand?
I generally don't use NewSimpleClientset personally - I find it's usually far more simplistic than what I require when testing K8s controller code (for example, I saw the note about update not changing the object generation).
In Agones, here is where we create our fake Factories:
https://github.com/googleforgames/agones/blob/main/pkg/testing/controller.go#L49-L62
We have a generic StartInformers function, since they always get started at the top level of the code hierarchy: https://github.com/googleforgames/agones/blob/main/pkg/testing/controller.go#L49-L62 , and using cache.WaitForCacheSync ensures the informer/lister cache is populated.
(This is also a common pattern in controller startup, but I figured we wouldn't mind as much if we didn't wait for sync on startup here.)
Then, when wanting to populate it with data, I create a reactor and populate its "list" function with the requisite objects:
https://github.com/googleforgames/agones/blob/e048859c6ff3da0d4b299f915113993aa49c7865/pkg/fleets/fleets_test.go#L60-L62
Here's making things happen on update:
https://github.com/googleforgames/agones/blob/445d09d68a9a70772d8aa045f57fd09f2467eb03/pkg/fleetautoscalers/controller_test.go#L334-L340
If you want to test events (or use this inside an "update" event to maintain cache values), here's how to use a fakeWatch:
https://github.com/googleforgames/agones/blob/e048859c6ff3da0d4b299f915113993aa49c7865/pkg/gameservers/pernodecounter_test.go#L43-L44
You can see this pattern play out here as well:
https://github.com/kubernetes/sample-controller/blob/master/main.go
Interestingly - the sample-controller test uses a different pattern for population, but they are using a SimpleClientSet - so that might work better in this scenario:
https://github.com/kubernetes/sample-controller/blob/master/controller_test.go#L84
Does that help unblock you?
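To sketch how that could look here, in the spirit of the linked Agones helpers (package paths and names are my assumptions, not this PR's code):

package agones_test

import (
	"context"
	"testing"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	agonesfake "agones.dev/agones/pkg/client/clientset/versioned/fake"
	"agones.dev/agones/pkg/client/informers/externalversions"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	k8stesting "k8s.io/client-go/testing"
	"k8s.io/client-go/tools/cache"
)

func TestListsGameServersFromCache(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	client := agonesfake.NewSimpleClientset()
	// Reactor supplies the objects returned by the informer's initial List.
	client.PrependReactor("list", "gameservers",
		func(action k8stesting.Action) (bool, runtime.Object, error) {
			return true, &agonesv1.GameServerList{Items: []agonesv1.GameServer{
				{ObjectMeta: metav1.ObjectMeta{Name: "gs-1", Namespace: "default"}},
			}}, nil
		})

	factory := externalversions.NewSharedInformerFactory(client, 0)
	informer := factory.Agones().V1().GameServers().Informer()

	factory.Start(ctx.Done())
	// Make sure the lister cache is populated before exercising the provider.
	cache.WaitForCacheSync(ctx.Done(), informer.HasSynced)

	// ... construct the provider with the factory's lister and assert on it here ...
}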
Just touching base to see if this solved your issue, or maybe this is something I can help with over a video call?
Ah I haven't found time to try it out yet, will make some time soon to get back to this!
Thanks for the write-up! I've updated the PR now, PTAL.
0,
externalversions.WithNamespace(p.config.GameServersNamespace))
gameServerInformer := informerFactory.Agones().V1().GameServers().Informer()
gameServerStore := gameServerInformer.GetStore()
I go into detail on k8s.go below (noticed it there first):
- You can use a Lister rather than Store, and then you don't have to cast.
- Better to start the whole InformerFactory, rather than individual informers, as then you can create the SharedInformerFactory higher up in the module hierarchy, and not have to worry about which informers have been started or not.
All the suggestions I left are about using a Lister rather than an Informer inside the Agones controller code 👍🏻
0,
externalversions.WithNamespace(flags.GameServersNamespace))
informerFactory.Start(ctx.Done())
gameServerInformer := informerFactory.Agones().V1().GameServers().Informer()
- gameServerInformer := informerFactory.Agones().V1().GameServers().Informer()
+ gameServerLister := informerFactory.Agones().V1().GameServers().Lister()
Switching everything to use a lister is going to remove all the casting you do internally.
You will need to filter on each List(...) call, or use the same approach as you did below with WithTweakListOptions(...). Either way is fine as far as I am concerned.
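A rough sketch of the WithTweakListOptions(...) route - the label selector value is purely hypothetical, and the generated Agones package paths are assumptions:

package main

import (
	"context"

	"agones.dev/agones/pkg/client/clientset/versioned"
	"agones.dev/agones/pkg/client/informers/externalversions"
	listersv1 "agones.dev/agones/pkg/client/listers/agones/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newGameServerLister builds a lister whose backing cache only ever holds the
// GameServers matching the tweak; alternatively, skip the tweak and pass a
// label selector to each List(...) call instead.
func newGameServerLister(ctx context.Context, client versioned.Interface, namespace string) listersv1.GameServerLister {
	factory := externalversions.NewSharedInformerFactoryWithOptions(
		client,
		0,
		externalversions.WithNamespace(namespace),
		externalversions.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "example.com/role=gameserver" // hypothetical filter
		}),
	)
	lister := factory.Agones().V1().GameServers().Lister()
	factory.Start(ctx.Done())
	return lister
}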
informerFactory.Start(ctx.Done())
gameServerInformer := informerFactory.Agones().V1().GameServers().Informer()

return agonescluster.NewProvider(logger, gameServerInformer, agonescluster.Config{
- return agonescluster.NewProvider(logger, gameServerInformer, agonescluster.Config{
+ return agonescluster.NewProvider(logger, gameServerLister, agonescluster.Config{
return k8sfilterchain.NewProvider(
logger,
clock.RealClock{},
informerFactory.Core().V1().Pods().Lister(),
So you use the Lister here, but not above 😄 Would be good to be consistent between the two.
type Provider struct {
config Config
logger *log.Logger
gsInformer cache.SharedIndexInformer
- gsInformer cache.SharedIndexInformer
+ gsLister agonesv1.GameServerLister
ctx context.Context,
logger *log.Logger,
gameServersPollInterval time.Duration,
gameServerStore cache.Store,
- gameServerStore cache.Store,
+ gsLister agonesv1.GameServerLister,

func getEndpointsFromStore(
logger *log.Logger,
gameServerStore cache.Store,
- gameServerStore cache.Store,
+ gsLister agonesv1.GameServerLister,
logger *log.Logger,
gameServerStore cache.Store,
) []cluster.Endpoint {
gameServers := gameServerStore.List()
- gameServers := gameServerStore.List()
+ gameServers := gsLister.List()

var endpoints []cluster.Endpoint
for i := range gameServers {
gs := gameServers[i].(*agonesv1.GameServer)
- gs := gameServers[i].(*agonesv1.GameServer)
+ gs := gameServers[i]
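Putting the suggestions together, the lister-based version might look roughly like this (the lister import path is an assumption and the cluster.Endpoint conversion is elided); note that, unlike cache.Store.List, the lister's List takes a label selector and returns an error:

// Rough sketch only; listersv1 is assumed to be
// agones.dev/agones/pkg/client/listers/agones/v1 and labels is
// k8s.io/apimachinery/pkg/labels.
func getEndpointsFromLister(
	logger *log.Logger,
	gsLister listersv1.GameServerLister,
) ([]cluster.Endpoint, error) {
	gameServers, err := gsLister.List(labels.Everything())
	if err != nil {
		return nil, err
	}

	var endpoints []cluster.Endpoint
	for _, gs := range gameServers {
		// gs is already a *agonesv1.GameServer - no type assertion needed.
		_ = gs
		// endpoints = append(endpoints, endpointFromGameServer(gs)) // hypothetical helper
	}
	return endpoints, nil
}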
🔥 😄
This adds a management server based on go-control-plane that fetches resources (clusters, filters) from some source and pushes that config to all connected proxies.
The original plan was to write a tiny server that we could run tests against, where Quilkin takes in config from an actual server. I quickly realised the implementation is simple enough that it covers both test and actual scenarios, so I figure it'll be the same project (same for fulfilling #233).
The server itself is implemented and works; it can only fetch and live-reload resources from disk for now, but adding support for other sources like k8s is a matter of adding an interface implementation that can talk to that source.
Some things are left to do though, so I'm leaving this as a draft in the meantime, but figured I'd gather thoughts/feedback.
Need to figure out where this should live, either here or in a separate repo, and have some CI for it too.
Closes #131