Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust xDS server #506

Closed
wants to merge 3 commits into from
Closed

Rust xDS server #506

wants to merge 3 commits into from

Conversation

XAMPPRocky
Copy link
Collaborator

The following is a minimal implementation of a State of The World (SoTW) Aggegated Discovery Service (aDS) server. This allows us to re-use types from Quilkin inside our xDS server without duplication, easily use our actual xDS server in tests, and simplify our deployments by allowing running the xDS server from the same binary as the client.

@markmandel
Copy link
Member

Curious - you thinking to replace the Go implementation, or is this just an extra way of doing a similar thing?

@XAMPPRocky
Copy link
Collaborator Author

you thinking to replace the Go implementation, or is this just an extra way of doing a similar thing?

The goal is to replace the Go implementation for a few reasons.

  • Consistency & code re-use — Most of the Go xDS server is currently recreating a lot of the concepts that exist in Quilkin, having the management server in Rust allows us to reuse a lot of functionality without any changes. This is particularly important for configuration consistency, for example as it stands the configuration for filters in the management server is slightly different from Quilkin because it relies entirely on the Protobuf definitions.

    Where as in Rust we can re-use the exact same configuration, this will also simplify a lot of user stories, as this will mean that users who have a static configuration, and want to move to using agones, can move that configuration into a configmap and have it work out of the box, rather than needing to migrate the configuration. We would also then need to have documentation teaching users how to write their configuration for xDS, or we could invest a bunch of time making it so that the configuration matches between the management server and Quilkin, but that's a lot of work and maintenance just to be consistent, our time would be better spend building new features IMO.

  • Testing Writing integration tests across different languages (particularly Go & Rust) is hard, because you're getting into distributed systems level behaviour of spawning two different servers and then needing a way to introspect their state at a given moment. We'd be able to test a lot more specific behaviour with a lot less overhead to writing tests if both of these services were in Rust, as we could easily run the request-response lifecycle procedurally, and we'd have access to their internals so we wouldn't need to add extra work to just to inspect state.

  • Better Integration I also think that having it in Quilkin provides a much better integration story for both us and users. Right now I've written the code so that integrates into the existing quilkin binary, and this provides a much better user experience, as now spawning a management server in agones is just quilkin manage agones, and spawning a management server that is watching a file and pushing updates is quilkin manage file ./path/to/config.yaml. This also simplifies a good bit of DevOps as it allows us to reuse one image deployment for both quilkin and the management server, so we can know that if you're using a given image version the management server and quilkin are guaranteed to be compatible, which today they are not.

    Again, we could invest a bunch of time and code into building a bazel/make build system that ensures that those things remain consistent, but that is investing a lot of work and maintenence just to maintain consistency, where I think that time would be better spent working on new features and improving Quilkin. This code already does most of what the current management server does and is a tenth of the size.

@github-actions github-actions bot added size/l and removed size/m labels Mar 28, 2022
@markmandel
Copy link
Member

This 100% sounds awesome. I believe the original intent to utilise Go was to take advantage of https://github.com/envoyproxy/go-control-plane - but if that's less useful here (you're digging far deeper into things than I have), that seems very reasonable.

Also - from first pass, the Rust K8s client looks really nice.

I think the whole idea of quilkin manage agones is straight awesome (this does assume they are in the same cluster?) - my only thought - I think it would be worth doing some design work in a ticket up front here, just so we can all reach consensus on the command surface and generally how it will work, and hopefully make sure no wasteful development work occurs in a direction we ultimately decide isn't going to work (or are we just mimicking what exists in Go right now?)

quilkin manage file ./path/to/config.yaml is brilliant as well. 🔥

I'm all in. Love it.

@XAMPPRocky
Copy link
Collaborator Author

I think it would be worth doing some design work in a ticket up front here, just so we can all reach consensus on the command surface and generally how it will work, and hopefully make sure no wasteful development work occurs in a direction we ultimately decide isn't going to work (or are we just mimicking what exists in Go right now?)

Yes, the current goal is only to replicate the existing behaviour of the Go xDS server so there should be nothing surprising in this implementation. Definitely open to more design discussions as we discuss more advanced features, and move past what is currently implemented in the Go xDS server, which is a very straightforward State Of The World aggregated xDS.

@XAMPPRocky XAMPPRocky marked this pull request as ready for review March 30, 2022 13:02
@XAMPPRocky
Copy link
Collaborator Author

XAMPPRocky commented Mar 30, 2022

This code is now ready for review. It implements the initial version of xDS with support for Agones to be able to run Quilkin as a cluster. There is still some more work to be done, but to not make this PR too big, we're going to do them as follow-ups, I'll mention some of the work below, which I'll turn into issues to track their progess.

The only new behaviour is that there's a new manage command to the CLI, which has one subcommand agones which runs a a Agones-based xDS server. This manage command can be extended with other different kinds of providers.

Future Work

  • Documentation There is some documentation for implementing and writing your own provider, but we need to add some new guide-level documentation for using the management server.
  • File provider — A simpler provider that allows you manage Quilkin clusters from a single configuration file. (currently being worked on by @rezvaneh)
  • Metrics & Admin — Currently this management server doesn't include metrics, ideally we'd like to have some, but it's not essential to running it, relatedly, there's no admin server because it wouldn't be doing very much without the metrics.
  • Resource specific responses Currently the management server assumes you're always asking for wildcard or "all" resources (this is also the case in the current go management server and the Quilkin xDS client.), but we should probably all you to have multiple clusters and request resources for that specific cluster or listener.

@markmandel The CI is currently failing because it cannot find OpenSSL, but AFAICT the build-image does install libssl-dev, so I'm not sure why it's not finding it.

apt-get install -y jq wget zip build-essential libssl-dev pkg-config python3-pip && \

Step #7 - "build":   run pkg_config fail: "`\"pkg-config\" \"--libs\" \"--cflags\" \"openssl\"` did not exit successfully: exit status: 1\nerror: could not find system library 'openssl' required by the 'openssl-sys' crate\n\n--- stderr\nPackage openssl was not found in the pkg-config search path.\nPerhaps you should add the directory containing `openssl.pc'\nto the PKG_CONFIG_PATH environment variable\nNo package 'openssl' found\n"
Step #7 - "build": 
Step #7 - "build":   --- stderr
Step #7 - "build":   thread 'main' panicked at '
Step #7 - "build": 
Step #7 - "build":   Could not find directory of OpenSSL installation, and this `-sys` crate cannot
Step #7 - "build":   proceed without this knowledge. If OpenSSL is installed and this crate had
Step #7 - "build":   trouble finding it,  you can set the `OPENSSL_DIR` environment variable for the
Step #7 - "build":   compilation process.
Step #7 - "build": 
Step #7 - "build":   Make sure you also have the development packages of openssl installed.
Step #7 - "build":   For example, `libssl-dev` on Ubuntu or `openssl-devel` on Fedora.
Step #7 - "build": 
Step #7 - "build":   If you're in a situation where you think the directory *should* be found
Step #7 - "build":   automatically, please open a bug at https://github.com/sfackler/rust-openssl
Step #7 - "build":   and include information about your system as well as this message.
Step #7 - "build": 
Step #7 - "build":   $HOST = x86_64-unknown-linux-gnu
Step #7 - "build":   $TARGET = x86_64-unknown-linux-gnu
Step #7 - "build":   openssl-sys = 0.9.72
Step #7 - "build": 
Step #7 - "build":   ', /cargo/registry/src/github.com-1ecc6299db9ec823/openssl-sys-0.9.72/build/find_normal.rs:180:5
Step #7 - "build":   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Step #7 - "build": warning: build failed, waiting for other jobs to finish...
Step #7 - "build": error: build failed
Step #7 - "build": make: *** [Makefile:94: build-linux-binary] Error 101
Finished Step #7 - "build"
ERROR

@markmandel
Copy link
Member

The CI failure is a hilarious amount of fun.

TL:DR - cross deliberately removed OpenSSL:

You can reproduce by using make build-linux-binary.

The mac build with rust-linux-darwin-builder also fails, but I think that might just be pointing to the wrong OpenSSL library.

I'll dig in, see what I can work out.

@markmandel
Copy link
Member

Just testing a solution with the openssl = { version = "0.10", features = ["vendored"] } feature, and so far, it's working for the linux build under cross. Attempting the rest now.

@markmandel
Copy link
Member

Okay, that worked with linux and windows - gotta work out what's going on with macos, that's still not working.

@markmandel
Copy link
Member

TIme to for me to finish up for the evening, but I think I've got all the platforms compiling 😄 . I'll do some extra testing tomorrow and tidy it up.

If you want to take it for a spin, you can see it all here: https://github.com/markmandel/quilkin/tree/pr/rf/xds-server-rust

@XAMPPRocky XAMPPRocky force-pushed the rf/xds-server-rust branch 2 times, most recently from 4e2ec87 to f69c0d4 Compare March 31, 2022 08:35
Copy link
Contributor

@bombsimon bombsimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Mostly curious of how we want to work with the Agones SDK and common types. Would be nice to not have to duplicate if we don't have to, even if it's not that much code.

src/xds.rs Outdated Show resolved Hide resolved
src/xds/provider/agones.rs Show resolved Hide resolved
@github-actions

This comment was marked as off-topic.

@github-actions github-actions bot added size/xl and removed size/l labels Mar 31, 2022
#[clap(
short,
long,
default_value = "gameservers",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we set these both to default for both. That namespace is the default, in a Kubernetes cluster - so if people are testing this out, it probably makes the most sense to point it there, since I expect people will dump everything in there sometimes anyway (or at least on first attempt).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me, these are just what's in the current xDS implementation.

src/main.rs Show resolved Hide resolved
src/xds/server.rs Outdated Show resolved Hide resolved
group = "agones.dev",
version = "v1",
kind = "GameServer",
status = "GameServerStatus",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we don't have an actual k8s/status value for GameServer (for our use case it reduces hits to the API server) - but since we're not updating a GameServer, it likely doesn't matter for our use case (I'm guessing it might be required for this build toolchain? 🤔)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, his is correct, we have a status value - but it's not registered a a /status subresource on GameServer, because reasons.

Details: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#subresources

(Basically it's a nice way of updating only a section of a CRD - but the way we treat GameServers in Agones, it's better if we update GameServer's entire value as a single transaction, so we don't do this for this resource)

If it's working for this, it doesn't actually matter - it might have been an issue if we were trying to update /status values on the GameSever as it might use the wrong API.

Mostly this is just a point of interest, and not a blocking issue in the PR.

src/xds/provider/agones.rs Show resolved Hide resolved
Comment on lines +35 to +36
config: kube::Api<k8s_openapi::api::core::v1::ConfigMap>,
gameservers: kube::Api<GameServer>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using kube::Api, we should use Reflectors instead, such that we have a local cache that is updated on changes, that we can poll for free with no performance hit to the K8s API control plane (In client-go we use Informers and Listeners, but the concept is the same).

Constant polling against the K8s API server is unfortunately a great way to make the API control plane (and etcd) really upset and will cause the performance of the entire cluster to plummet.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

1 similar comment
@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

XAMPPRocky and others added 3 commits April 26, 2022 08:03
Co-Authored-By: rezvaneh <rezvaneh@users.noreply.github.com>
Co-Authored-By: markmandel <markmandel@users.noreply.github.com>
@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@quilkin-bot
Copy link
Collaborator

Build Failed 😭

Build Id: eabd1c35-9e7c-4077-94e2-890c6f6593d0

Status: FAILURE

To get permission to view the Cloud Build view, join the quilkin-discuss Google Group.

@XAMPPRocky
Copy link
Collaborator Author

Closing in favour of #527

@XAMPPRocky XAMPPRocky closed this May 9, 2022
@markmandel markmandel deleted the rf/xds-server-rust branch August 4, 2022 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants