Skip to content

Conversation

sreeram-venkitesh
Copy link
Member

  • One-line PR description: Provide a more powerful way to express human readable printer columns exposed by CRDs than simple JSON paths.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Apr 27, 2024
@k8s-ci-robot k8s-ci-robot requested review from apelisse and sttts April 27, 2024 18:39
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 27, 2024
Copy link

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a nice idea. Thanks for suggesting it.

Here's some feedback on the draft.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 3, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 1, 2024
@sreeram-venkitesh
Copy link
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 1, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 30, 2024
@sreeram-venkitesh
Copy link
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 2, 2025
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2025
@sreeram-venkitesh
Copy link
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2025
@Priyankasaggu11929 Priyankasaggu11929 force-pushed the 4595-cel-crd-additionalprintercolumns branch from 4643db3 to 1af582b Compare May 26, 2025 14:00
@Priyankasaggu11929
Copy link
Member

@cici37 @jpbetz - we have refreshed the KEP documentation with the current content. There're a few TODO blocks - we will push them by this week itself, but overall it's complete for first review.

Could you please add the lead-opted-in label for this enhancement to be tracked for 1.34 cycle. Thanks!

@jpbetz
Copy link
Contributor

jpbetz commented May 28, 2025

/milestone v1.34
/label lead-opted-in

@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone May 28, 2025
@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label May 28, 2025
@sreeram-venkitesh
Copy link
Member Author

Thank you @jpbetz! I think we added the lead-opted-in in the PR instead of the KEP issue though.

@jpbetz
Copy link
Contributor

jpbetz commented May 28, 2025

Thank you @jpbetz! I think we added the lead-opted-in in the PR instead of the KEP issue though.

Oops. I've tagged the issue now too.

Comment on lines 780 to 788
When running the benchmark tests individually, we observed that CEL was consistently ~20-50x slower than JSONPath.

Based on these current findings, the end users should not find a noticeable difference in the performance when working with CEL for additionalPrinterColumns.

<!--

Hence, the end users might start seeing a non-negligible increase in the total time and CPU usage for apiserver when they are working with a large number of Custom Resources.

-->
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpbetz, @cici37: Our KEP PR is ready for review!

We've done a performance analysis by running benchmark tests to measure the time taken for both CEL and JSONPath additionalPrinterColumns for the following:

  • Time taken for TableConvertor.New(), which has JSONPath parsing and CEL compilation
  • Time taken for convertor.ConvertToTable(), which has FindResults() and PrintResults()

Based on the numbers we are seeing, we assume that overall compilation times for CEL based additionalPrinterColumns might become noticeably longer than JSONPath if there are a large number of CEL based additionalPrinterColumns. We haven't ran our benchmark tests with CRDs with a large number of columns or large number of custom resources to confirm what the impact would be. What can be considered as a normal number of custom resources and additionalPrinterColumns? Also are 1000s of custom resources being in use common? (As this might impact the performance of FindResults and PrintResults)

Also, could you help us with coming up with a value for perCallLimit for the CEL compilation here? We've compared the time of compilation and evaluation for both CEL and JSONPath but we couldn't conclusively decide a value for the perCallLimit.

Thanks in advance!

Copy link
Contributor

@jpbetz jpbetz May 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cel expressions can be compiled once when the CRD is first registered (to validate them), and then again when the CRD is loaded.

When serving read requests, the already compiled cel expressions can then be used.

For benchmarks, I only care about the read request latency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the next step is to:

  • Benchmark an expensive JSON Path additionalPrinterColumns operation (just the part that finds a value using the JSON Path library)
  • Assume CEL operations take about 50ns each, divide to figure out how many CEL operations we could do in no more time than the JSON Path case would have taken
  • Sanity benchmark of some CEL evaluations (not compilations) to make sure our latency numbers are good

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the cel compilation time would be a big concern in this feature either. The concern is focused on using cel in read path VS the existing JSON path and if there would be a significant latency added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, we're compiling the expression first at CRD validation and then again inside TableConvertor.New(), which I believe is happening as part of CRD loading.

We have ran benchmarks for the FindResults() and PrintResults() calls, which would give us the read request latencies for both CEL and JSONPath. We'll update the conclusion of the benchmarking in the KEP to highlight this. Both are in the μs range with CEL being around 15 times higher (0.8μs vs 13.5μs).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the data!
Regarding with the limit, we could always begin with a more strict limit/smaller number and loose it later if it's absolutely necessary. For the benchmark result, looks like on average cel is ~20-50 times slower on provided expression. Do we have a estimation on the latency it's added with different numbers of expressions? Just wanna mention that user would be able to write more expensive expression with CEL after it's introduced. Since it would be only run while converting to table, I think 10ms could be a start point. However, we may want to limit the number of expressions, estimation cost and overall cost there as well based on the analysis.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a estimation on the latency it's added with different numbers of expressions?

@cici37 – do you mean to benchmark a scenario where there are multiple additional printer columns defined using CEL expressions, and record how much latency it adds?

At the moment, no, we don't have these benchmark stats, but we can add these additional scenarios and share data, but confirming if I understood the requirement correctly. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because printer columns runs on every item in a list, and there can be thousands of items in a list response, 20-50x performance costs do not seem reasonable to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out! The large list read is common and I agree that having the latency added over 20 times does not make sense. If we add timeout and fallback to not showing columns with cel expression when it takes too long, would it worth considering?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failure modes of non-deterministic list responses seems pretty questionable...

I'd expect to prereq getting CEL performance at least close to parity with jsonPath cost for this

Considering the following 10 iterations of the `Benchmark_CEL` and `Benchmark_JSONPath` tests:

```
CEL Benchmark Results (10 runs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be clear that this also includes compiling CEL, which shouldn't happen in the hot loop in production.

I see the data below that splits these out, we should be clear about that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be clear that this also includes compiling CEL, which shouldn't happen in the hot loop in production.

Ack. will clarify above in the further commits.

-->

<!-- - <test>: <link to test coverage> -->
We will test all cases in integration test and unit test. If needed, we can add e2e tests before beta graduation. We are planning to extend the existing [e2e tests for CRDs](https://github.com/kubernetes/kubernetes/blob/3df3b83226530bda69ffcb7b4450026139b2cd11/test/e2e/apimachinery/custom_resource_definition.go).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, this should be fully testable with integration and unit, but we might as well extend the existing CRD tests while we're at it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.


This feature will not impact rollouts or already-running workloads.

###### What specific metrics should inform a rollback?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly api-server load? On the off-chance this is misbehaving and increases resource usage, since that's pretty much the only plausible failure mode here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Will update the above PRR question with a pointer to api-server load metrics. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenTheElder6cf3aa1 address the metrics related PRR question. Please review. Thanks!

The following table provides an average performance analysis across CEL and JSONPath based additionalPrinterColumns:


| | CEL (Benchmark_CEL) | JSONPath (Benchmark_JSONPath) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of post-processing the data, it might be better to have separate benchmarks for compile versus runtime.

Also, what hardware / architecture was this run on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of post-processing the data, it might be better to have separate benchmarks for compile versus runtime.

Ack. We are working on benchmarking more expensive use cases per above thread's discussion.
Will address adding separate compilation and runtime benchmark stats as part of the same update.

Also, what hardware / architecture was this run on?

The current benchmarks stats are from an arm64 machine (Apple M3 Pro with 12 cores, 18 GB RAM).

We also have access to an x86_64 machine (11th Gen Intel Core i7-11800H × 16 cores, 48 GB RAM) if further stats needed.

@jpbetz
Copy link
Contributor

jpbetz commented Jun 5, 2025

/approve
For PRR for alpha
For beta we will need to fully nail down the CEL cost limits, but for alpha, that's not a blocking PRR concern. Please continue the benchmark discussions though.

@cici37 Leaving LGTM with you on this one. Once you're satisfied with the details please tag to merge for 1.34. I'm OK with the high level.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2025
@Priyankasaggu11929
Copy link
Member

(I was out the whole of last week for family reasons. Just getting back to work today)

Thank you so much everyone, for helping with the review.
We are working on addressing the remaining items, and will update the PR soon.

@Priyankasaggu11929
Copy link
Member

Folks, thank you all for the discussion during the last SIG API Machinery call (June 11, 2025).

@cici37 @jpbetz @deads2k – following up, since the KEP documentation can't be merged as implementable in its current state until we get improved benchmark results for CEL (equivalent or at most 2x to the JSONPath cost).
Are there any further improvements we can make to the KEP documentation to merge as provisional for the time being while we experiment optimizing the performance of our PoC further? Thanks!

#### Alpha

- Feature implemented behind a feature flag
- Initial benchmarks to compare performance of JSONPath with CEL columns and set an appropriate CEL cost
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this to reflect the multiplier agreed on with the SIG (was it 2.5x ?)

- "@jpbetz"
creation-date: "2024-04-26"
last-updated: "v1.34"
status: implementable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We COULD merge this as provisional. If we do that, the best thing to do would be to include or link to the analysis done so far so this KEP serves as a good reference for the performance challenges encountered so far.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to include or link to the analysis done

do you mean that we include links to the following gists, containing the benchmark test definition and results?

Or a link to a fork of k/k which contains the changes from our PoC of this KEP?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpbetz, we have addressed all open review feedback comments in the latest 2 commits, for merging the KEP in provisional state.

Please review. Thanks!

@@ -0,0 +1,3 @@
kep-number: 4595
alpha:
approver: "@jpbetz"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this PRR approval file from this PR since we're in provisional status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in recent commit.

@jpbetz
Copy link
Contributor

jpbetz commented Aug 4, 2025

/approve
/lgtm
As provisional

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 4, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, sreeram-venkitesh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 3f84dc1 into kubernetes:master Aug 4, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lead-opted-in Denotes that an issue has been opted in to a release lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants