KEP 4595: CEL for CRD AdditionalPrinterColumns #4602
Conversation
sreeram-venkitesh
commented
Apr 27, 2024
- One-line PR description: Provide a more powerful way to express human readable printer columns exposed by CRDs than simple JSON paths.
- Issue link: https://kep.k8s.io/4595
- Other comments:
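To make the one-line description concrete, here is a hypothetical sketch of what a CEL-based printer column could look like next to today's JSONPath form. The `expression` field name and the exact API shape are illustrative assumptions only, not the KEP's final design:

```yaml
# Hypothetical sketch only – field names are illustrative, not the final API.
additionalPrinterColumns:
  - name: Ready
    type: string
    # Today: a plain JSON path into the object
    jsonPath: .status.readyReplicas
  - name: Replicas
    type: string
    # With this KEP: a CEL expression that can combine and format fields,
    # which a single JSON path cannot express
    expression: "string(self.status.readyReplicas) + '/' + string(self.spec.replicas)"
```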
It's a nice idea. Thanks for suggesting it.
Here's some feedback on the draft.
Resolved review threads (now outdated) on:
- keps/sig-api-machinery/4595-cel-crd-additionalprintercolumns/README.md
- keps/sig-api-machinery/4595-cel-crd-additionalprintercolumns/kep.yaml
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
/remove-lifecycle stale
Force-pushed from 4643db3 to 1af582b.
/milestone v1.34
> When running the benchmark tests individually, we observed that CEL was consistently ~20-50x slower than JSONPath.
>
> Based on these current findings, the end users should not find a noticeable difference in the performance when working with CEL for additionalPrinterColumns.
>
> <!--
> Hence, the end users might start seeing a non-negligible increase in the total time and CPU usage for apiserver when they are working with a large number of Custom Resources.
> -->
@jpbetz, @cici37: Our KEP PR is ready for review!
We've done a performance analysis by running benchmark tests to measure the time taken for both CEL and JSONPath additionalPrinterColumns for the following:
- Time taken for `TableConvertor.New()`, which includes JSONPath parsing and CEL compilation
- Time taken for `convertor.ConvertToTable()`, which includes `FindResults()` and `PrintResults()`

Based on the numbers we are seeing, we assume that overall compilation times for CEL-based additionalPrinterColumns might become noticeably longer than for JSONPath if there are a large number of CEL-based additionalPrinterColumns. We haven't run our benchmark tests with CRDs that have a large number of columns or a large number of custom resources to confirm what the impact would be. What can be considered a normal number of custom resources and additionalPrinterColumns? Also, is it common for thousands of custom resources to be in use? (This might impact the performance of FindResults and PrintResults.)
Also, could you help us come up with a value for perCallLimit for the CEL compilation here? We've compared the compilation and evaluation times for both CEL and JSONPath, but we couldn't conclusively decide on a value for the perCallLimit.
Thanks in advance!
The CEL expressions can be compiled once when the CRD is first registered (to validate them), and then again when the CRD is loaded.
When serving read requests, the already compiled CEL expressions can then be used.
For benchmarks, I only care about the read request latency.
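The compile-once, evaluate-per-row split described above can be sketched as follows. This is a stand-in illustration, not the actual apiextensions-apiserver code: the "compilation" step is faked with a closure, and all names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// compiledColumn is a stand-in for a CEL program compiled once when the CRD
// is registered/loaded; evaluate is then called per row on the read path.
// (All names here are hypothetical, not the real apiextensions-apiserver API.)
type compiledColumn struct {
	name     string
	evaluate func(obj map[string]any) string
}

// newTableConvertor mirrors the idea behind TableConvertor.New(): do all
// expression "compilation" up front, outside the per-request hot loop.
func newTableConvertor(columns map[string]string) []compiledColumn {
	out := make([]compiledColumn, 0, len(columns))
	for name, expr := range columns {
		expr := expr // capture the loop variable for the closure
		// Stand-in "compilation": a real implementation would compile the
		// CEL expression here and reuse the program for every row.
		out = append(out, compiledColumn{
			name: name,
			evaluate: func(obj map[string]any) string {
				v, _ := obj[strings.TrimPrefix(expr, "self.")].(string)
				return v
			},
		})
	}
	return out
}

func main() {
	// Compile once...
	cols := newTableConvertor(map[string]string{"Phase": "self.phase"})
	// ...evaluate many times, once per listed object.
	for _, c := range cols {
		fmt.Println(c.name+":", c.evaluate(map[string]any{"phase": "Running"}))
	}
}
```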
I think the next step is to:
- Benchmark an expensive JSON Path additionalPrinterColumns operation (just the part that finds a value using the JSON Path library)
- Assume CEL operations take about 50ns each, and divide to figure out how many CEL operations we could do in no more time than the JSON Path case would have taken
- Sanity benchmark of some CEL evaluations (not compilations) to make sure our latency numbers are good
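The division in the second step is simple arithmetic; as a sketch, using the ~0.8µs per-row JSONPath latency mentioned elsewhere in this thread and the assumed 50ns per CEL operation:

```go
package main

import "fmt"

func main() {
	// Illustrative inputs from the thread: ~0.8µs per-row JSONPath
	// evaluation, and an assumed cost of ~50ns per CEL operation.
	const jsonPathPerRowNs = 800.0 // ~0.8µs for FindResults+PrintResults
	const celOpCostNs = 50.0       // assumed per-operation CEL cost

	// How many CEL operations fit in the time one JSONPath lookup takes.
	budget := jsonPathPerRowNs / celOpCostNs
	fmt.Printf("CEL operations per column within the JSONPath budget: ~%.0f\n", budget)
	// → CEL operations per column within the JSONPath budget: ~16
}
```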
I don't think the CEL compilation time would be a big concern for this feature either. The concern is focused on using CEL on the read path vs. the existing JSONPath, and whether significant latency would be added.
Yup, we're compiling the expression first at CRD validation and then again inside `TableConvertor.New()`, which I believe happens as part of CRD loading.
We have run benchmarks for the `FindResults()` and `PrintResults()` calls, which give us the read request latencies for both CEL and JSONPath. We'll update the conclusion of the benchmarking in the KEP to highlight this. Both are in the μs range, with CEL being around 15 times higher (0.8μs vs 13.5μs).
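A sketch of how such an evaluation-only (read-path) benchmark can be run in isolation with Go's `testing.Benchmark`; the evaluated function here is a trivial stand-in, not the real compiled printer-column logic:

```go
package main

import (
	"fmt"
	"testing"
)

func main() {
	// evalRow stands in for one already-compiled printer-column
	// evaluation (conceptually, FindResults followed by PrintResults).
	evalRow := func(obj map[string]any) any { return obj["phase"] }
	row := map[string]any{"phase": "Running"}

	// testing.Benchmark times the closure without the compilation cost,
	// since evalRow is built before the benchmark loop starts.
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = evalRow(row)
		}
	})
	// Timing varies by machine, so no expected value is shown here.
	fmt.Printf("%d iterations, %d ns/op\n", r.N, r.NsPerOp())
}
```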
Thanks for the data!
Regarding the limit, we could always begin with a stricter limit/smaller number and loosen it later if absolutely necessary. From the benchmark results, it looks like on average CEL is ~20-50 times slower on the provided expressions. Do we have an estimate of the latency added with different numbers of expressions? Just want to mention that users would be able to write more expensive expressions with CEL after it's introduced. Since it would only run while converting to a table, I think 10ms could be a starting point. However, we may want to limit the number of expressions, the estimated cost, and the overall cost there as well, based on the analysis.
> Do we have an estimate of the latency added with different numbers of expressions?
@cici37 – do you mean to benchmark a scenario where there are multiple additional printer columns defined using CEL expressions, and record how much latency that adds?
At the moment we don't have these benchmark stats, but we can add these additional scenarios and share the data – just confirming that I understood the requirement correctly. Thanks!
Because printer columns run on every item in a list, and there can be thousands of items in a list response, 20-50x performance costs do not seem reasonable to me.
Thanks for pointing this out! Large list reads are common, and I agree that adding over 20x latency does not make sense. If we add a timeout and fall back to not showing columns with CEL expressions when evaluation takes too long, would that be worth considering?
Failure modes of non-deterministic list responses seem pretty questionable...
I'd expect getting CEL performance at least close to parity with the JSONPath cost to be a prerequisite for this.
> Considering the following 10 iterations of the `Benchmark_CEL` and `Benchmark_JSONPath` tests:
>
> ```
> CEL Benchmark Results (10 runs):
> ```
We should be clear that this also includes compiling CEL, which shouldn't happen in the hot loop in production.
I see the data below that splits these out, we should be clear about that.
> We should be clear that this also includes compiling CEL, which shouldn't happen in the hot loop in production.
Ack. Will clarify that in further commits.
> -->
>
> <!-- - <test>: <link to test coverage> -->
> We will test all cases in integration test and unit test. If needed, we can add e2e tests before beta graduation. We are planning to extend the existing [e2e tests for CRDs](https://github.com/kubernetes/kubernetes/blob/3df3b83226530bda69ffcb7b4450026139b2cd11/test/e2e/apimachinery/custom_resource_definition.go).
SGTM, this should be fully testable with integration and unit, but we might as well extend the existing CRD tests while we're at it.
Ack.
> This feature will not impact rollouts or already-running workloads.
>
> ###### What specific metrics should inform a rollback?
Possibly apiserver load? On the off-chance this misbehaves and increases resource usage, since that's pretty much the only plausible failure mode here.
Ack. Will update the above PRR question with a pointer to api-server load metrics. Thanks!
@BenTheElder – 6cf3aa1 addresses the metrics-related PRR question. Please review. Thanks!
> The following table provides an average performance analysis across CEL and JSONPath based additionalPrinterColumns:
>
> | | CEL (Benchmark_CEL) | JSONPath (Benchmark_JSONPath) |
Instead of post-processing the data, it might be better to have separate benchmarks for compile versus runtime.
Also, what hardware / architecture was this run on?
> Instead of post-processing the data, it might be better to have separate benchmarks for compile versus runtime.
Ack. We are working on benchmarking more expensive use cases per above thread's discussion.
Will address adding separate compilation and runtime benchmark stats as part of the same update.
> Also, what hardware / architecture was this run on?
The current benchmark stats are from an arm64 machine (Apple M3 Pro with 12 cores, 18 GB RAM).
We also have access to an x86_64 machine (11th Gen Intel Core i7-11800H × 16 cores, 48 GB RAM) if further stats are needed.
/approve
@cici37 Leaving LGTM with you on this one. Once you're satisfied with the details please tag to merge for 1.34. I'm OK with the high level.
(I was out the whole of last week for family reasons. Just getting back to work today.) Thank you so much, everyone, for helping with the review.
Folks, thank you all for the discussion during the last SIG API Machinery call (June 11, 2025). @cici37 @jpbetz @deads2k – following up, since the KEP documentation can't be merged as
> #### Alpha
>
> - Feature implemented behind a feature flag
> - Initial benchmarks to compare performance of JSONPath with CEL columns and set an appropriate CEL cost
Update this to reflect the multiplier agreed on with the SIG (was it 2.5x?)
> - "@jpbetz"
> creation-date: "2024-04-26"
> last-updated: "v1.34"
> status: implementable
We COULD merge this as provisional. If we do that, the best thing to do would be to include or link to the analysis done so far so this KEP serves as a good reference for the performance challenges encountered so far.
> to include or link to the analysis done
do you mean that we include links to the following gists, containing the benchmark test definition and results?
- https://gist.github.com/sreeram-venkitesh/f4aff1ae7957a5a3b9c6c53e869b7403
- https://gist.github.com/Priyankasaggu11929/43cc9ece4d6215ee4cfe0d1523a919d6
Or a link to a fork of k/k which contains the changes from our PoC of this KEP?
@jpbetz, we have addressed all open review feedback comments in the latest 2 commits, for merging the KEP in provisional state.
Please review. Thanks!
> @@ -0,0 +1,3 @@
> kep-number: 4595
> alpha:
>   approver: "@jpbetz"
Let's remove this PRR approval file from this PR since we're in provisional status.
Addressed in a recent commit.
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, sreeram-venkitesh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing