Skip to content

Conversation

@jukie
Copy link
Contributor

@jukie jukie commented Sep 10, 2025

What this PR does / why we need it:

  • Optimize controller cache to only cache EnvoyProxy pods. There's a race condition between cache sync and the topology injector running and in large clusters ~5% of the time the caches aren't synced in time.
  • If cach.Get() fails, run get against API server directly.

Which issue(s) this PR fixes:

Fixes #6932

Release Notes: Yes

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie jukie marked this pull request as ready for review September 10, 2025 21:22
@jukie jukie requested a review from a team as a code owner September 10, 2025 21:22
@jukie jukie requested a review from a team September 10, 2025 21:23
@jukie jukie added this to the v1.6.0-rc.1 Release milestone Sep 10, 2025
@codecov
Copy link

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 38.88889% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.03%. Comparing base (80d00a0) to head (f6ea9d5).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
internal/provider/kubernetes/topology_injector.go 0.00% 7 Missing ⚠️
internal/provider/kubernetes/kubernetes.go 63.63% 4 Missing ⚠️

❌ Your patch status has failed because the patch coverage (38.88%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6936   +/-   ##
=======================================
  Coverage   71.03%   71.03%           
=======================================
  Files         226      226           
  Lines       40026    40037   +11     
=======================================
+ Hits        28431    28442   +11     
- Misses       9911     9914    +3     
+ Partials     1684     1681    -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie jukie force-pushed the topology-webhook-cache-bug branch from de74d1c to 0233919 Compare September 10, 2025 22:06
@jukie jukie requested a review from arkodg September 10, 2025 22:07
jukie and others added 2 commits September 10, 2025 16:17
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
@jukie jukie force-pushed the topology-webhook-cache-bug branch from 5b85f3f to af47036 Compare September 10, 2025 23:15
arkodg
arkodg previously approved these changes Sep 10, 2025
Copy link
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jukie can you share any data on mem reduction

@arkodg arkodg requested review from a team September 10, 2025 23:57
@jukie
Copy link
Contributor Author

jukie commented Sep 11, 2025

The cache can probably be optimized even further and drop other unused resources completely or use similar filtering. In the case of pods the memory savings would in theory scale based on the size of the cluster.

I can share more data tomorrow once I run this in a cluster actually hitting the issue but in a staging cluster running this patch it's down from ~570Mi --> 540Mi.

Edit: that doesn't seem accurate so will probably need to let it run longer.

@jukie jukie requested a review from arkodg September 11, 2025 01:12
@arkodg arkodg requested review from a team September 11, 2025 02:27
@zirain zirain enabled auto-merge (squash) September 11, 2025 02:51
@jukie
Copy link
Contributor Author

jukie commented Sep 11, 2025

/retest

@zirain zirain merged commit da0d913 into envoyproxy:main Sep 11, 2025
50 of 53 checks passed
@jukie jukie deleted the topology-webhook-cache-bug branch September 11, 2025 06:50
@jukie
Copy link
Contributor Author

jukie commented Sep 11, 2025

Could the main pipeline for da0d913 be retried for the docker push please?

zirain added a commit to zirain/gateway that referenced this pull request Sep 16, 2025
* Optimize pod cache

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* release note

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* Remove retry

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* cleanup

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

---------

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Signed-off-by: Isaac <10012479+jukie@users.noreply.github.com>
Co-authored-by: zirain <zirain2009@gmail.com>
zirain added a commit to zirain/gateway that referenced this pull request Sep 16, 2025
* Optimize pod cache

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* release note

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* Remove retry

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* cleanup

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

---------

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Signed-off-by: Isaac <10012479+jukie@users.noreply.github.com>
Co-authored-by: zirain <zirain2009@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>
zirain added a commit to zirain/gateway that referenced this pull request Sep 16, 2025
* Optimize pod cache

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* release note

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* Remove retry

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* cleanup

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

---------

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Signed-off-by: Isaac <10012479+jukie@users.noreply.github.com>
Co-authored-by: zirain <zirain2009@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>
zirain added a commit that referenced this pull request Sep 16, 2025
* fix: cluster stat name: lowercase Kind (#6780)

cluster stat name: lowercase Kind

Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: envoy service cluster name for zone-aware routing (#6763)

* fix!: fix envoy service cluster name for zone-aware routing

Signed-off-by: y-rabie <youssef.rabie@procore.com>

* extend e2e tests for zone aware routing

Signed-off-by: y-rabie <youssef.rabie@procore.com>

* extend unit tests for zone aware routing

Signed-off-by: y-rabie <youssef.rabie@procore.com>

---------

Signed-off-by: y-rabie <youssef.rabie@procore.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* conformance: update experimental test report (#6782)

* conformance: update experimental test report

Signed-off-by: zirain <zirain2009@gmail.com>

* fix version

Signed-off-by: zirain <zirain2009@gmail.com>

* fix(api): image validation regex, support port in repository (#6819)

fix: match repository in image with port

Signed-off-by: Windfarer <windfarer@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: Actually update xdsIR with maxAcceptPerSocketEvent (#6834)

* Actually update xdsIR with maxAcceptPerSocketEvent

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* release note

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* newline lint

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

---------

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* bugfix: fix the topologyInjectorDisabled and the local cluster was not defined (#6847)

* bugfix: fix the topologyInjectorDisabled and the local cluster was not defined.

Signed-off-by: qicz <qiczzhu@gmail.com>

* fix ut

Signed-off-by: qicz <qiczzhu@gmail.com>

* add topology-injector-enabled ut

Signed-off-by: qicz <qiczzhu@gmail.com>

* add release note

Signed-off-by: qi <qiczzhu@gmail.com>

---------

Signed-off-by: qicz <qiczzhu@gmail.com>
Signed-off-by: qi <qiczzhu@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix(logging): correct log formatting to avoid DPANIC in controller-runtime logger (#6846)

* Update filters.go

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

* add release notes

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

---------

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: handle context errors as transient errors (#6850)

* handle context errors as transient errors

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

* add test cases

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

* no need the new line

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

* add release notes

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

* Return the error as is

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

* revert redundant changes

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

* revert unrelated changes

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

* revert more changes...

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>

---------

Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>
Co-authored-by: zirain <zirain2009@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* bugfix: the controller cannot read the EnvoyProxy attached gatewayclass only. (#6838)

Signed-off-by: zirain <zirain2009@gmail.com>

* chore: fix CVE (#6903)

Signed-off-by: Shahar Harari <shahar.harari@sap.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: nil pointer dereference in btp configmap indexer (#6921)

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* improve targetRef selection for targetSelectors (#6917)

* improve targetRef selection for targetSelectors

* only select refs in the same namespace as the policy

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* fix lint

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: suppress lua validation logs (#6929)

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: rm incorrectly set exclusiveMaximum field in CRD (#6926)

* fix: rm incorrectly set exclusiveMaximum field in CRD

* Also fix maximum value to 599 which includes 599 as a valid num

Fixes: #6925

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: rm Strict SameSite default (#6941)

* by default it should be unset which implies `Lax`

Relates to #6347

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Signed-off-by: zirain <zirain2009@gmail.com>

* Optimize pod cache (#6936)

* Optimize pod cache

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* release note

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* Remove retry

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

* cleanup

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

---------

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Signed-off-by: Isaac <10012479+jukie@users.noreply.github.com>
Co-authored-by: zirain <zirain2009@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* reduce DeepCopy in gateway-api layer (#6940)

* reduce deep copy in gateway-api layer

* also fixed the DeepCopy implementation for ControllerResources
which was performing a Shallow Copy resulting it lack of isolation
b/w provider and gateway-api layer

Relates to #6919

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* fix lint

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: validation for grpc routes with extension ref filters (#6949)

Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: cleanup dangling route status conditions (#6812)

Signed-off-by: y-rabie <youssef.rabie@procore.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* Fix: Add missing patch annotations to Compression struct for proper Merge (#6951)

* fix: merge compression annotation

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>

* test: add more compression merge test cases

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>

---------

Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* fix: update distroless image to resolve glibc CVEs (#6953)

Signed-off-by: Shahar Harari <shahar.harari@sap.com>
Signed-off-by: zirain <zirain2009@gmail.com>

* chore: bump golang to 1.24.7 (#6959)

chore: bump golang

Signed-off-by: zirain <zirain2009@gmail.com>

* fix: Make sure proxy protocol filter is the first listener filter (#6972)

Fixes: #6873

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Co-authored-by: Jacob Neil Taylor <me@jacobtaylor.id.au>
Signed-off-by: zirain <zirain2009@gmail.com>

* release notes

Signed-off-by: zirain <zirain2009@gmail.com>

* Removes reflection from RouteContext to reduce allocations (#6820)

* bench: adds APIToXDS bench & small opt

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>

* no refect

goos: darwin
goarch: arm64
pkg: github.com/envoyproxy/gateway/test/gobench
cpu: Apple M1 Pro
                          │   old.txt    │               new.txt               │
                          │    sec/op    │   sec/op     vs base                │
GatewayAPItoXDS/small-10    881.2µ ±  7%   803.4µ ± 1%   -8.82% (p=0.000 n=10)
GatewayAPItoXDS/medium-10   4.130m ± 26%   2.959m ± 4%  -28.36% (p=0.000 n=10)
GatewayAPItoXDS/large-10     5.375 ±  2%    4.553 ± 1%  -15.28% (p=0.000 n=10)
geomean                     26.94m         22.12m       -17.90%

                          │   old.txt    │               new.txt                │
                          │     B/op     │     B/op      vs base                │
GatewayAPItoXDS/small-10    507.2Ki ± 0%   492.9Ki ± 0%   -2.83% (p=0.000 n=10)
GatewayAPItoXDS/medium-10   2.545Mi ± 7%   1.954Mi ± 2%  -23.21% (p=0.000 n=10)
GatewayAPItoXDS/large-10    2.832Gi ± 0%   2.831Gi ± 0%        ~ (p=0.529 n=10)
geomean                     15.40Mi        13.97Mi        -9.31%

                          │   old.txt   │               new.txt               │
                          │  allocs/op  │  allocs/op   vs base                │
GatewayAPItoXDS/small-10    8.328k ± 0%   8.017k ± 0%   -3.73% (p=0.000 n=10)
GatewayAPItoXDS/medium-10   39.45k ± 6%   29.74k ± 2%  -24.60% (p=0.000 n=10)
GatewayAPItoXDS/large-10    38.75M ± 0%   38.71M ± 0%   -0.11% (p=0.000 n=10)
geomean                     233.5k        209.8k       -10.16%

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>

* removes garbage

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>

* more tests

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>

* more tests

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>

---------

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>

---------

Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: y-rabie <youssef.rabie@procore.com>
Signed-off-by: Windfarer <windfarer@gmail.com>
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Signed-off-by: qicz <qiczzhu@gmail.com>
Signed-off-by: qi <qiczzhu@gmail.com>
Signed-off-by: TomerJLevy <TomerJlevy@gmail.com>
Signed-off-by: Shahar Harari <shahar.harari@sap.com>
Signed-off-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Signed-off-by: Isaac <10012479+jukie@users.noreply.github.com>
Signed-off-by: sudipto baral <sudiptobaral.me@gmail.com>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Co-authored-by: Guy Daich <guy.daich@sap.com>
Co-authored-by: Youssef Rabie <youssef.rabie@procore.com>
Co-authored-by: Windfarer <windfarer@gmail.com>
Co-authored-by: Isaac <10012479+jukie@users.noreply.github.com>
Co-authored-by: qi <qiczzhu@gmail.com>
Co-authored-by: TomerJLevy <TomerJlevy@gmail.com>
Co-authored-by: shahar-h <shahar.harari@sap.com>
Co-authored-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Co-authored-by: Arko Dasgupta <arkodg@users.noreply.github.com>
Co-authored-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Co-authored-by: Jacob Neil Taylor <me@jacobtaylor.id.au>
Co-authored-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: topologyInjector race condition

3 participants