Logical optimization of provider kubernetes #920

zhshw · 2023-01-16T05:56:26Z

I have forked the latest code on GitHub, and I will use my experience to work on this problem...

Problems:

Processing all related resources in a single Reconcile function is hard to control.
1. very slow startup, large amount of repeated processing #789
The resource processing logic is not split, so endpoint updates are difficult to implement, which blocks the slowStart and lbPolicy features.
1. Resolve Service to Endpoints IPs #35
2. EDS + endpoint slow start Support #787

The root cause is that the SDK provided by controller-runtime is not well suited to complex K8S controller projects. Using a single Reconcile for multiple resources is terrible logic: it is bloated and difficult to handle.

Todo List:

Replace the single Reconcile and dynamic client with Informer event handlers
- Different resources, different work queues
- Split the processing logic (EDS, CDS, RDS, LDS, VHDS, HDS) to avoid repeated processing
- Use the lister cache instead of remote requests
Support EDS and K8S endpoints updates, and resolve Service to endpoints (easy to implement)
Support K8S Event recording
Support more data validation: Necessary data check of envoy listener #854
Use proto message Equal to skip duplicate data
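
As a rough illustration of the "different resources, different work queues" item, the Go sketch below keeps one deduplicating queue per resource kind using only the standard library. The `Queue` type and its `NewQueue`, `Add`, `Get`, and `Len` functions are hypothetical names made up for this example; they are not part of Envoy Gateway, and a real controller would more likely build on client-go informers and its workqueue package.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is a hypothetical, minimal FIFO of object keys that drops
// duplicates while a key is still waiting to be processed.
type Queue struct {
	mu      sync.Mutex
	pending map[string]bool
	order   []string
}

// NewQueue returns an empty queue.
func NewQueue() *Queue {
	return &Queue{pending: make(map[string]bool)}
}

// Add enqueues key unless the same key is already waiting.
func (q *Queue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key; ok is false when the queue is empty.
func (q *Queue) Get() (key string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many keys are waiting.
func (q *Queue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.order)
}

func main() {
	// One queue per resource kind (the kind names are only examples).
	queues := map[string]*Queue{
		"Endpoints": NewQueue(),
		"HTTPRoute": NewQueue(),
	}
	queues["Endpoints"].Add("default/backend")
	queues["Endpoints"].Add("default/backend") // duplicate, collapsed
	queues["HTTPRoute"].Add("default/route-a")

	fmt.Println(queues["Endpoints"].Len(), queues["HTTPRoute"].Len()) // prints "1 1"
}
```

Each kind could then get its own worker that drains its queue, so an endpoint change would not have to wait behind route processing.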

Reference :

youngnick · 2023-01-17T22:55:50Z

Thanks for this issue, @zhshw.

The maintainers of this project have all built different Kubernetes controllers before, so I think that you're missing some context here.

The single reconciler pattern is very important for a complex, interrelated set of resources like Gateway API, as changes in one resource can mean that other resources also need to be reprocessed. Doing this with separate reconcilers actually creates lots of individual reconcile events (it's quite common for a Route update to require a Gateway re-reconcile, which can then trigger re-reconciles for other Route objects, for example).

We actually did start out with separate reconcilers, but have ended up folding them back into one because of issues like this.
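
A minimal sketch of the idea, assuming hypothetical `Event` and `coalesce` names that are not taken from the Envoy Gateway code: every change, whatever its kind, is mapped to the Gateway it belongs to, so a burst of related changes collapses into one pass per Gateway. In controller-runtime, this kind of mapping is typically wired up with `handler.EnqueueRequestsFromMapFunc`.

```go
package main

import "fmt"

// Event is a hypothetical change notification. Gateway names the Gateway
// that the changed object is attached to (an assumption for this sketch).
type Event struct {
	Kind    string
	Name    string
	Gateway string
}

// coalesce returns each affected Gateway key once, in first-seen order,
// so that a single reconcile pass can cover all related changes.
func coalesce(events []Event) []string {
	seen := make(map[string]bool)
	var keys []string
	for _, e := range events {
		if !seen[e.Gateway] {
			seen[e.Gateway] = true
			keys = append(keys, e.Gateway)
		}
	}
	return keys
}

func main() {
	events := []Event{
		{Kind: "HTTPRoute", Name: "route-a", Gateway: "default/eg"},
		{Kind: "Gateway", Name: "eg", Gateway: "default/eg"},
		{Kind: "HTTPRoute", Name: "route-b", Gateway: "default/eg"},
	}
	fmt.Println(coalesce(events)) // prints "[default/eg]"
}
```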

I agree that speeding up endpoint reconciliation is an important goal for operating EG at scale, but to this point, we have been concentrating on getting the basic functionality working, rather than scale testing. If you have numbers about scale testing you can share, that would be excellent, and a great place to start this conversation.

Finally, I guess you didn't intend this, but the way that this issue is written, it is implying something like "the current maintainers are all stupid, we should do this the right way". As I said earlier, all the maintainers have built controllers before, and are solving the problems that they have had before. If you've had a different experience, that's valuable information, but an approach centered around asking why things are the way they are first might be better received in the future.

arkodg · 2023-01-18T02:28:20Z

Adding to what @youngnick said, here's the GH issue that introduced merging the controllers, #413, which has more info on the WHY.

zhshw · 2023-01-18T02:47:12Z

@youngnick @arkodg

There is a problem, and we should solve it instead of worrying about why things were designed this way in the past. The data structure can always be optimized, and there is always a better way to solve problems. The starting point of everything is to solve problems.

Of course, we can see that the test results are not ideal. My test is based on only a small amount of data, and it already causes problems (a lot of repeated processing). In fact, many resource changes do not need to be propagated to the top-level resources, such as changes to route weight, endpoints, lbPolicy, and route timeout...

I will test at a larger data scale in the future. A design that has not been verified against large-scale data is like a tall platform built on drifting sand.

Looking at other open source Envoy control plane projects (Istio, Contour, Gloo) and Kubernetes controllers, none of them adopts a single Reconcile.

Another way:

Parent-child relationships can be split by assembling data bottom-up: assemble the bottom-level data through multiple queues in parallel, then push the top-level data and assign a new version.
In an xDS snapshot, a resource type with an empty version will not push data.
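
A minimal sketch of the "push only when something changed" part, using only the Go standard library. It swaps in a content hash as the version instead of comparing proto messages, and the `Pusher` type and `versionOf` function are hypothetical names for this example, not go-control-plane or Envoy Gateway APIs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// versionOf derives a version string from the content itself, so the same
// set of endpoints always yields the same version regardless of order.
func versionOf(endpoints []string) string {
	sorted := append([]string(nil), endpoints...) // copy, do not mutate the input
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, "\n")))
	return hex.EncodeToString(sum[:8])
}

// Pusher is a hypothetical stand-in for the component that sends a new
// snapshot to Envoy. It remembers the last version it pushed.
type Pusher struct {
	last   string
	Pushes int
}

// Update pushes only when the content-derived version differs from the
// last pushed one, and reports whether a push happened.
func (p *Pusher) Update(endpoints []string) bool {
	v := versionOf(endpoints)
	if v == p.last {
		return false // same content, skip the duplicate push
	}
	p.last = v
	p.Pushes++
	return true
}

func main() {
	p := &Pusher{}
	fmt.Println(p.Update([]string{"10.0.0.1:80", "10.0.0.2:80"})) // prints "true"
	fmt.Println(p.Update([]string{"10.0.0.2:80", "10.0.0.1:80"})) // prints "false"
	fmt.Println(p.Update([]string{"10.0.0.1:80"}))                // prints "true"
}
```

With a scheme like this, a change that leaves the assembled bottom-level data identical never reaches the top-level push.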

Years of experience have taught me that only simpler logic can guarantee the maintainability and performance of a project.

One resource, one queue
Independent logic, unit processing

This way, there is less code and each logical unit is simpler. The new data structure can support larger-scale tests.

zhshw added the kind/enhancement New feature or request label Jan 16, 2023

zhshw closed this as completed Feb 10, 2023
