Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4963: Kube-proxy Services Acceleration #4964

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

aojea
Copy link
Member

@aojea aojea commented Nov 15, 2024

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 15, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Nov 15, 2024
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Nov 15, 2024
@aojea aojea changed the title KEP-4963: use flowtables to accelerate kube-proxy WIP - KEP-4963: use flowtables to accelerate kube-proxy Nov 15, 2024
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 15, 2024
@aojea aojea marked this pull request as draft November 15, 2024 05:01
@aojea aojea mentioned this pull request Nov 15, 2024
4 tasks
@aojea aojea changed the title WIP - KEP-4963: use flowtables to accelerate kube-proxy KEP-4963: Kube-proxy Services Acceleration Nov 28, 2024
@aojea aojea marked this pull request as ready for review November 28, 2024 12:10
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 28, 2024
@aojea
Copy link
Member Author

aojea commented Nov 28, 2024

/assign @thockin @danwinship

I leave up to you if is worth to go through the feature gate process if this is an opt-in option

@adrianmoisey
Copy link
Member

I like this proposal. Giving users the choice to opt in makes a lot of sense and reduces risk.

Copy link
Contributor

@danwinship danwinship left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I do like the idea of using flowtables in Kubernetes, though I'm not sure we've figured out all of the details of how it makes the most sense yet...


Users will be able to opt-in to Service traffic acceleration by passing a CEL expression using the flag `--accelerated-interface-expression` or the configuration option `AcceleratedInterfaceExpression` to match the network interfaces in the node that are subjet to Service traffic acceleration. The absence of a CEL expression disables the feature.

Kube-proxy will create a `flowtable` in the kube-proxy table with the name `kube-proxy-flowtable` and will monitor the network interfaces in the node to populate the `flowtable` with the interfaces that match the configured CEL expression.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we need to investigate the performance implications of this, since it seems clear that this isn't really how flowtables were meant to be used, and there could potentially be things that are O(n^2) in the length of devices.

(Also, I wouldn't be surprised if it turned out that recreating the flowtable with an updated devices value discards all existing flow offload entries. This isn't horrible since the offload will just get recreated on the next packet, but it's not ideal and suggests maybe we should be working with the netfilter team to optimize this all a bit more...)

(This is maybe a topic for "Risks and Mitigations".)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I also need to consult this with the netfilter folks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we need to investigate the performance implications of this, since it seems clear that this isn't really how flowtables were meant to be used, and there could potentially be things that are O(n^2) in the length of devices.

Regarding "since it seems clear that this isn't really how flowtables were meant to be used", could you expand on that? I'm unsure what the intended use of flowtables is, and how this use case is unintended.

I'm assuming this may be written down somewhere, so happy if you point me at a readme or similar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really written down, and I don't know exactly what the model is. But if they expected people to set up flows between every pair of devices on the node, then they wouldn't have required you to enumerate specific devices, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if they expected people to set up flows between every pair of devices on the node, then they wouldn't have required you to enumerate specific devices, right?

AFAIK the flowtable (cache table) is per device, so this may be linking the specific flow to the specific device


### Upgrade / Downgrade Strategy

kube-proxy reconcile the nftables rules so the rules will be reconciled during startup and added or removed depending on how kube-proxy is configured.
Copy link
Contributor

@danwinship danwinship Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW when downgrading, any existing flow table entries will be preserved, so any Service connections that remain up across a kube-proxy downgrade would still be offloaded and there'd be no way to un-offload them with the older kube-proxy.

(That could be mitigated by recommending that people restart kube-proxy with the feature disabled before downgrading.)

(Except that wouldn't work either; you need to delete the flowtable at startup if the feature is disabled.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I need to document and understand this better, if there is a way to cleanup we already have a kube-proxy cleanup command so I wonder if we can reuse it for this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking about downgrading to a release that doesn't know anything about flowtables, but I guess that's only a problem for people using it during Alpha.

Anyway, you just need to make it so that if you start nftables kube-proxy in a version that knows about flowtables, but with flowtables disabled, then it should ensure the flow table doesn't exist when it does its first sync rather than ensuring that it does exist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship I see now in the code that we flush chain by chain instead of the whole table when starting ... is there any reason for doing that? flushing the whole table allow to handle downgrade/upgrade of the table schema without further considerations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first wrote the code, I thought nft flush table ip kube-proxy would flush the contents of sets too, which would mean that restarting kube-proxy would break any active affinity. But it turns out that it doesn't actually do that anyway; flushing a table just flushes (but does not delete) every chain within the table.

Copy link
Member Author

@aojea aojea Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then, should we change kube-proxy to flush the entire table?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flush still keeps the flowtable , so it is not valid

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think if is possible to have an integrity check 🤔 and do a full resync

Use the kernel flowtables infrastructure to allow kube-proxy users to
accelerate service traffic.

Change-Id: Iee638c8e86a4d17ddbdb30901b4fb4fd20e7dbda

The [elephant flow detection is a complex topic with a considerable number of literature about it](https://scholar.google.pt/scholar?q=elephant+flow+detection&hl=es&as_sdt=0&as_vis=1&oi=scholart). For our use case we proposa a more simplistic approach based on the number of packets, so it can give us good trade off between performance improvement and safety, we want to avoid complex heuristics and have a more predictible and easy to think about behavior based on the lessones learned from [KEP-2433 Topology aware hints](https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2433-topology-aware-hints#proportional-cpu-heuristic).

We chose 20 as the number of packets as the threshold to offload, based on existing thresold used by Cisco systems in [Cisco Application Centric Infrastructure](https://scholar.google.com/scholar_lookup?hl=en&publication_year=2018&author=G.+Tam&title=Cisco+Application+Centric+Infrastructure), Cisco referred to an elephant flow if the flow contains more than 15 packet sizes, i.e., short flow is less than 15 packets, we add 5 packets of buffer to be on the safe side. This means that, using TCP as an example, and assuming an MTU of 1500 bytes and removing the overhead of the TCP headers (that can vary from 20-60 bytes, use 40 for this example), offloading will benefit workloads that transmit more than: TCP Payload * Number of packets = 1460 bytes/packet * 20 = 29200 bytes.
Copy link
Member

@adrianmoisey adrianmoisey Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems sensible, and I like that it uses the lessons of users to inform how Kubernetes could operate.
What I'm wondering, is there a disadvantage to stream of 21 packets in length?

I assume this implementation will mean that packets 1-20 will go via the normal path, then packet 21 will be offloaded to the fast path.

I'm just wondering what happens, to performance, if the stream is offloaded to the fast path just before it ends.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering what happens, to performance, if the stream is offloaded to the fast path just before it ends.

The great thing of IP is that is independent of the path, same as you go to google.com through a lot of devices and links, from one pod or the other that the packet takes does not matter unless there are operations on the packet ... but that should be extraordinary and that is why we proposal a knob to let users to completely disable this behavior

If you see the netperf test below, they are exercising:

  1. no offload
  2. offload after connection established
  3. offload after packet number 20

and the performance improvement is still considerable, since only a max of 29200 bytes or 20 packets are hitting the slow path ... of course the larger your stream of data the closer to the asymptotic limit

Co-authored-by: Adrian Moisey <adrian@changeover.za.net>
@adrianmoisey
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants