XDS Management Server #360
Merged - 23 commits merged into main, Nov 10, 2021
Conversation

iffyio (Collaborator) commented Aug 10, 2021

This adds a management server based on go-control-plane that fetches
resources (clusters, filters) from some source and pushes that config
to all connected proxies.

The original plan was to write a tiny server that we could run tests against, where Quilkin takes in config from an actual server. I quickly realised the implementation is simple enough that it covers both test and real scenarios, so I figure it'll be the same project (same for fulfilling #233).

The server itself is implemented and works. It can currently only fetch and live-reload resources from disk, but adding support for other sources like k8s is a matter of adding an interface implementation that can talk to that source.

Some things are still left to do, so I'm leaving this as a draft in the meantime, but I figured I'd gather thoughts/feedback:

  • Need to figure out where this should live, either here or in a separate repo, and set up some CI for it too.
  • Only the debug filter is currently supported. I'd love some thoughts on how we want to handle generated proto code; currently I've checked in the generated code, but I'm not sure if there's a saner approach with Go. I'll port over the other filters when it's clearer how we want to do this.
  • No tests yet ^^ I've tested manually though, and will follow up with unit tests before taking this out of draft.

closes #131

This adds a management server based on go-control-plane that fetches
resources (clusters, filters) from some source and pushes that config
to all connected proxies.

The only currently implemented source is a file watcher that watches for changes to
a config file on disk - probably most useful for testing;
other types, like fetching from k8s, will be added later on.
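
For illustration, a minimal sketch of the per-source interface idea described above. The type and field names here are placeholders for this sketch, not the PR's actual code:

package provider

import "context"

// Cluster and FilterChain stand in for the xDS resources the server pushes
// to proxies (illustrative placeholders only).
type Cluster struct {
	Name      string
	Endpoints []string
}

type FilterChain struct {
	Filters []string
}

// Resources is one full view of the config to push to all connected proxies.
type Resources struct {
	Clusters     []Cluster
	FilterChains []FilterChain
}

// Provider is implemented once per source (file watcher, Kubernetes, ...).
// Each implementation emits a new Resources value whenever its source
// changes; the server turns that into an xDS snapshot for the proxies.
type Provider interface {
	Run(ctx context.Context) (<-chan Resources, error)
}
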
XAMPPRocky (Collaborator) left a comment:

LGTM, though obviously I have very little opinion on the Go code. 😄

need to figure out where this should live, either here or in a separate repo and have some CI for it too.

I think this should live here if possible; only if we want to have separate releases, or if it would be difficult to build and release both, should we have a separate repo. Keeping it here reduces overhead and lets us do codegen more easily.

Would love some thoughts on how we want to handle generated proto code, currently I checked in the generated code but not sure if there's a saner approach with go - will port over other filters when its clearer how we want to do this

I think this is related to #346 and I'll write up some more thoughts there.

examples/xds-management-server/cmd/controller.go (review thread: outdated, resolved)
markmandel (Member) commented Aug 12, 2021

I think this should live here if possible; only if we want to have separate releases, or if it would be difficult to build and release both, should we have a separate repo. Keeping it here reduces overhead and lets us do codegen more easily.

I concur on this one. Let's keep things in a single repo until we hit such pain that it forces us to move it out 👍🏻

Would love some thoughts on how we want to handle generated proto code

Agreed on more discussion in #346, but I have no issue on my end with checking in generated Go code. There's no build.rs in Go, so I've no concerns with vendoring this type of code (I've done this a bunch).

This is super cool!

markmandel (Member) left a comment:

Some quick feedback on Go:

Running goimports over the code results in a few sorting/formatting changes:

➜  xds-management-server git:(pr/iu/management-server) goimports -w .
➜  xds-management-server git:(pr/iu/management-server) ✗ gst
On branch pr/iu/management-server
Your branch is up to date with 'upstream/iu/management-server'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   filters/debug/v1alpha1/debug.pb.go
        modified:   pkg/resources/resources.go

Running my favourite mega-linter over the code results in one issue:

➜  xds-management-server git:(pr/iu/management-server) golangci-lint run
pkg/resources/resources.go:17:2: SA1019: package github.com/golang/protobuf/jsonpb is deprecated: Use the "google.golang.org/protobuf/encoding/protojson" package instead. (staticcheck)
        "github.com/golang/protobuf/jsonpb"
        ^

Would also love to see some docstrings on the Go functions, especially the public ones 😄

But this is very cool to see in action 👍🏻
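
For reference, the swap that staticcheck warning is asking for looks roughly like the following. This is a sketch assuming the code only needs JSON-to-proto unmarshalling, not the PR's actual code:

package resources

import (
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
)

// unmarshalJSON replaces the deprecated jsonpb.UnmarshalString call:
// protojson implements the same proto3 JSON mapping but lives in the
// maintained google.golang.org/protobuf module.
func unmarshalJSON(raw []byte, msg proto.Message) error {
	return protojson.Unmarshal(raw, msg)
}
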

examples/xds-management-server/pkg/resources/resources.go (two review threads: outdated, resolved)
iffyio marked this pull request as ready for review on August 26, 2021.
iffyio (Collaborator, Author) commented Aug 26, 2021

Okay, the PR should be good for a full review now. It has changed a bit since the initial description: it now includes support for watching Agones GameServers as upstream endpoints and for pushing custom filter chains to proxies running as Kubernetes pods (configured via annotations).

markmandel (Member) left a comment:

Ooh, this is so neat. Apologies it took me a while to dig into it.

Note - each of these files needs an Apache header, please.

I'm digging through it - I think we need a README to cover what it does, how to use it, and how to use it as a base, etc.

I expect this will need another pass or two, since it's a chunky PR, and I have to run to do some other things, but wanted to get you at least this feedback.

This is exciting stuff!

cloudbuild.yaml (review thread: resolved)
ctx, shutdown := context.WithCancel(context.Background())
defer shutdown()

k8sConfig, err := k8s.GetK8sConfig(logger, flags.KubeHost, flags.KubeConfigPath)
Member:

Fun question for this: Does this mean there will be one xDS server per Kubernetes cluster?

An interesting assumption 😁 (lots of pros and cons I'm sure) - we should probably make a note of this in the README though.

I don't have any strong opinions, especially as this is an example, but I'm just highlighting it so we make a note of it somewhere in the documentation.

Collaborator Author:

It's not an assumption; rather, this is an MVP and what I think we need to test things out on our end. If it turns out we do need multiple clusters, I'll add it, but otherwise I'll circle back to it eventually, or PRs are welcome if someone else wants to. We can create an issue to track it if we want.

Member:

Makes perfect sense 👍🏻 just want to make sure we cover this in a README.

examples/xds-management-server/pkg/snapshot/snapshot.go (two review threads: outdated, resolved)
p.config.GameServersNamespace,
fields.Everything())

gameServerInformer := cache.NewSharedInformer(gameServerListWatch, &agonesv1.GameServer{}, 0)
Member:

I would create a SharedInformerFactory (example), and then use the event system to track when GameServers are added and deleted, rather than polling every 100ms. We probably only need Update and Delete events (example).

Then you can also use a Lister to query your GameServers (even if you want to get the full list on each run), and you know it's backed by the local cache, so you aren't hammering the K8s API.

You might want to create the SharedInformerFactory in controller.go and pass it into the constructor for the agones Provider -- that way you can build out fakes and mock the operations in tests.
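
Roughly the shape being suggested here, as a sketch: the Agones package paths follow the standard generated-client layout, and the function and variable names are illustrative rather than the PR's exact code.

package controller

import (
	"context"

	"agones.dev/agones/pkg/client/clientset/versioned"
	"agones.dev/agones/pkg/client/informers/externalversions"
	listersv1 "agones.dev/agones/pkg/client/listers/agones/v1"
)

// newGameServerLister builds one SharedInformerFactory in controller.go,
// pulls a Lister off it, starts all registered informers, and waits for
// their caches to fill. The returned lister can then be injected into the
// provider constructor, so tests can swap in fakes.
func newGameServerLister(
	ctx context.Context,
	agonesClient versioned.Interface,
	namespace string,
) listersv1.GameServerLister {
	factory := externalversions.NewSharedInformerFactoryWithOptions(
		agonesClient,
		0, // no resync; the watch keeps the local cache current
		externalversions.WithNamespace(namespace),
	)
	lister := factory.Agones().V1().GameServers().Lister()

	// Start every informer registered on the factory and wait for sync,
	// so subsequent List calls are served from the local cache.
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	return lister
}
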

Collaborator Author:

I'm not following - why do we want to track the events ourselves? We'd still need to build up a cache and check it periodically, which the cache.Store already does for us. Also, I thought the SharedInformer store is a cache 😮 where's the k8s api hammering occurring here?

Member:

I'm not following, why do we want to track the events ourselves?

My thought with using events rather than polling was that we'd end up with fewer no-change events going out to xDS subscriptions, and/or we wouldn't have to track ourselves that changes had been made between each poll.

where's the k8s api hammering occurring here?

Sorry, I wasn't clear -- my concern was that we might somehow misconfigure the cache (although in retrospect that seems unlikely).

The other thing I like about a SharedInformerFactory is that you won't have to do all that work to set up the cache; it will do that for you, so I think it'll make the code simpler.

Also, a SharedInformerFactory is a standard way of implementing K8s controllers / interacting with K8s, so it's an API surface that people are used to.

Collaborator Author:

My thought with using events, rather than polling - was that then we end up with less no-change events going out to xDS subscriptions, and/or we don't have to track ourselves that changes have been made between each poll ourselves either.

How do we end up with no-change events going to XDS in the current impl? I'm still not entirely sure I understand the concern correctly, but the idea is that we send an XDS update only when the cluster changes (not every gameserver event translates to a cluster change) - I'm not sure how we'd do this without remembering the previous cluster state.
Then, by periodically checking for updates to send to XDS, we have control over how often we're broadcasting to the proxies (so we can batch updates to them), rather than triggering a broadcast for every gameserver change (which we can't control).

Member:

Then by periodically checking for updates to send to XDS, we have control over how often we're broadcasting to the proxies (so we batch updates to them) rather than triggering a broadcast for every gameserver change (which we can't control).

Oh that is an interesting point! That being said, when would we not want it to be updated in as close to realtime as possible? 🤔

but the idea is that we send an XDS update only when the cluster changes

I didn't manage to catch this code on first pass. I'll go hunting for it again.

If we did use the event system, we could track GameServer add and delete events - which would always mean a change in endpoints (we could ignore updates, as GameServers can't change their IP or ports). On each of those events you would then know that a GameServer has been added or removed, so one could send a delta event rather than a full snapshot, without having to do your own comparison of snapshots (or apply a delta to the internal snapshot, and send at that time).

You wouldn't have to check that the stored snapshots had changed then, as you know that there is a definite change.

The downside here -- we may want to do a full update of all GameServers anyway on xDS server startup, just in case - so we may end up in the same place.

Not 100% wedded to this idea, but just wanted to explore to see if it made things easier / implementation simpler.

Collaborator Author:

we could ignore updates - as GameServer's can't change their IP or ports

Tokens could change too, which would be update events.

On each of those events you would then know that the GameServer has been added or removed, so one could send a delta event rather than a full snapshot

The xds snapshot cache itself can currently only be updated with full snapshots rather than delta events. And with the xds delta protocol too, when a new node joins we need to send it a full snapshot first, before any deltas.

Member:

Tokens could change too which would be update events.

Ah yep, I see that now.

The xds snapshot cache itself currently can only be updated with full snapshots rather than delta events.

Aaah, in which case there isn't much gain from using events. I rescind my suggestion to use events, then.

I do think we should use a SharedInformerFactory - it's less setup to create, and means you can also share the cache if that becomes a requirement in the future.

I found a nice SO post on breaking down the various layers of abstraction:
https://stackoverflow.com/a/59544592

markmandel (Member) left a comment:

This is looking great 😃


iffyio (Collaborator, Author) commented Oct 8, 2021

@markmandel I think I've addressed all the comments for this now, PTAL 🙏

Comment on lines 93 to 94
podInformer := informerFactory.Core().V1().Pods().Informer()
go podInformer.Run(ctx.Done())
markmandel (Member) commented Oct 13, 2021:

Suggested change:
- podInformer := informerFactory.Core().V1().Pods().Informer()
- go podInformer.Run(ctx.Done())
+ informerFactory.Start(ctx.Done())

Rather than starting individual informers, start the InformerFactory as a whole; then people don't need to know which informers to start.

https://pkg.go.dev/k8s.io/client-go@v0.22.2/informers/internalinterfaces#SharedInformerFactory

(Also, if you want to create this in controller.go and pass it down, this should make things much easier)

Collaborator Author:

I'm not really able to use Start, it seems - that function doesn't work in the unit tests with the fake client.

Member:

🤔 I'm not quite sure what issue is blocking you here. Can you expand?

I generally don't use NewSimpleClientset personally; I find that it's always far simpler than I usually require when testing K8s controller code (for example, I saw the note about update not changing the object generation).

In Agones, here is where we create our fake Factories:
https://github.com/googleforgames/agones/blob/main/pkg/testing/controller.go#L49-L62

We have a generic StartInformers function, since they always get started at the top level of the code hierarchy:
https://github.com/googleforgames/agones/blob/main/pkg/testing/controller.go#L49-L62, and using cache.WaitForCacheSync ensures the informer/lister cache is populated.
(This is also a common pattern in a controller's startup, but I figured we wouldn't mind not waiting for sync on startup here as much.)

Then, when I want to populate it with data, I create a reactor and populate its "list" function with the requisite objects:
https://github.com/googleforgames/agones/blob/e048859c6ff3da0d4b299f915113993aa49c7865/pkg/fleets/fleets_test.go#L60-L62

Here's how to make things happen on update:
https://github.com/googleforgames/agones/blob/445d09d68a9a70772d8aa045f57fd09f2467eb03/pkg/fleetautoscalers/controller_test.go#L334-L340

If you want to test events (or use this inside an "update" event to maintain cache values), here's how to use a fakeWatch:
https://github.com/googleforgames/agones/blob/e048859c6ff3da0d4b299f915113993aa49c7865/pkg/gameservers/pernodecounter_test.go#L43-L44

You can see this pattern play out here as well:
https://github.com/kubernetes/sample-controller/blob/master/main.go

Interestingly, the sample-controller test uses a different pattern for population, but it uses a SimpleClientset, so that might work better in this scenario:
https://github.com/kubernetes/sample-controller/blob/master/controller_test.go#L84

Does that help unblock you?
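
A compressed version of that pattern, using the core fake clientset and Pods to keep it short; the pod name here is illustrative, and the generated Agones fake clientset follows the same approach. A sketch, not the PR's test code:

package controller_test

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
	"k8s.io/client-go/tools/cache"
)

func TestListerBackedByFakeClient(t *testing.T) {
	client := fake.NewSimpleClientset()

	// A reactor populates the informer's initial "list" from the fake API.
	client.PrependReactor("list", "pods", func(k8stesting.Action) (bool, runtime.Object, error) {
		return true, &corev1.PodList{Items: []corev1.Pod{
			{ObjectMeta: metav1.ObjectMeta{Name: "quilkin-proxy-0", Namespace: "default"}},
		}}, nil
	})

	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podLister := factory.Core().V1().Pods().Lister()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Start the whole factory, then block until the cache is populated.
	factory.Start(ctx.Done())
	if !cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced) {
		t.Fatal("informer cache never synced")
	}

	pods, err := podLister.List(labels.Everything())
	if err != nil {
		t.Fatal(err)
	}
	if len(pods) != 1 {
		t.Fatalf("want 1 pod from the lister, got %d", len(pods))
	}
}
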

Member:

Just touching base to see if this solved your issue, or maybe this is something I can help with over a video call?

Collaborator Author:

Ah, I haven't found time to try it out yet - I'll make some time soon to get back to this!

Collaborator Author:

Thanks for the write-up! I've updated the PR now, PTAL.

examples/xds-management-server/pkg/filterchain/k8s/k8s.go (review thread: outdated, resolved)
0,
externalversions.WithNamespace(p.config.GameServersNamespace))
gameServerInformer := informerFactory.Agones().V1().GameServers().Informer()
gameServerStore := gameServerInformer.GetStore()
Member:

I go into detail on k8s.go below (noticed it there first):

  • You can use a Lister rather than Store, and then you don't have to cast.
  • Better to start the whole InformerFactory, rather than individual informers, as then you can create the SharedInformerFactory higher up in the module hierarchy, and not have to worry about which informers have been started or not.

markmandel (Member) left a comment:

All the suggestions I left are about using a Lister rather than an Informer inside the Agones controller code 👍🏻

0,
externalversions.WithNamespace(flags.GameServersNamespace))
informerFactory.Start(ctx.Done())
gameServerInformer := informerFactory.Agones().V1().GameServers().Informer()
Member:

Suggested change:
- gameServerInformer := informerFactory.Agones().V1().GameServers().Informer()
+ gameServerLister := informerFactory.Agones().V1().GameServers().Lister()

Switching everything to use a lister is going to remove all the casting you do internally.

You will need to filter on each List(...) call, or use the same approach as you did below with WithTweakListOptions(...). Either way is fine as far as I am concerned.
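
The two filtering options side by side, as a sketch; the label selector and function names are illustrative, not the PR's code:

package example

import (
	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	"agones.dev/agones/pkg/client/informers/externalversions"
	listersv1 "agones.dev/agones/pkg/client/listers/agones/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// Option A: filter on each List(...) call inside the provider.
func listGameServers(lister listersv1.GameServerLister, namespace string) ([]*agonesv1.GameServer, error) {
	return lister.GameServers(namespace).List(
		labels.SelectorFromSet(labels.Set{"example.com/role": "gameserver"}))
}

// Option B: filter once at the factory with WithTweakListOptions, so the
// shared cache only ever holds matching objects.
func filteredFactoryOption() externalversions.SharedInformerOption {
	return externalversions.WithTweakListOptions(func(opts *metav1.ListOptions) {
		opts.LabelSelector = "example.com/role=gameserver"
	})
}
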

informerFactory.Start(ctx.Done())
gameServerInformer := informerFactory.Agones().V1().GameServers().Informer()

return agonescluster.NewProvider(logger, gameServerInformer, agonescluster.Config{
Member:

Suggested change:
- return agonescluster.NewProvider(logger, gameServerInformer, agonescluster.Config{
+ return agonescluster.NewProvider(logger, gameServerLister, agonescluster.Config{

return k8sfilterchain.NewProvider(
logger,
clock.RealClock{},
informerFactory.Core().V1().Pods().Lister(),
Member:

So you use the Lister here, but not above 😄 Would be good to be consistent between the two.

type Provider struct {
config Config
logger *log.Logger
gsInformer cache.SharedIndexInformer
Member:

Suggested change:
- gsInformer cache.SharedIndexInformer
+ gsLister agonesv1.GameServerLister

ctx context.Context,
logger *log.Logger,
gameServersPollInterval time.Duration,
gameServerStore cache.Store,
Member:

Suggested change:
- gameServerStore cache.Store,
+ gsLister agonesv1.GameServerLister,


func getEndpointsFromStore(
logger *log.Logger,
gameServerStore cache.Store,
Member:

Suggested change:
- gameServerStore cache.Store,
+ gsLister agonesv1.GameServerLister,

logger *log.Logger,
gameServerStore cache.Store,
) []cluster.Endpoint {
gameServers := gameServerStore.List()
Member:

Suggested change:
- gameServers := gameServerStore.List()
+ gameServers := gsLister.List()


var endpoints []cluster.Endpoint
for i := range gameServers {
gs := gameServers[i].(*agonesv1.GameServer)
Member:

Suggested change:
- gs := gameServers[i].(*agonesv1.GameServer)
+ gs := gameServers[i]

markmandel (Member) left a comment:

🔥 😄

markmandel added the area/examples and kind/feature labels on Nov 8, 2021.
XAMPPRocky merged commit a4031ae into main on Nov 10, 2021.
markmandel deleted the iu/management-server branch on November 10, 2021.
Labels: area/examples, cla: yes, kind/feature, size/xl
Linked issue this PR may close: Implement a control plane
3 participants