feat: add health check for epp cluster #966
Hi @danehans, can you take a look? I have no idea what went wrong. I tried many times to fix it but couldn't. Here is what I observed: Health checking failed:
A TLS connection error occurred,
NOTE: I changed the default value of
Full logs follow: epp logs
envoy logs:
|
Can you remove the TLS transport socket and retest? |
Yes, it works.
But there is also a new issue:
To avoid this, should we change the default value of
|
Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
/ok-to-test |
The current gRPC health server (port 9003) is used for kubelet liveness/readiness probes. Since the kubelet does not support TLS-based gRPC probes (xref), the current health server must remain as-is. You will need to update the ext-proc endpoint (9002) to use gRPC health checking. Here is a skeleton of what the implementation would look like:
// pkg/epp/server/runserver.go
import (
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)
func (r *ExtProcServerRunner) AsRunnable(logger logr.Logger) manager.Runnable {
    ...
    extProcPb.RegisterExternalProcessorServer(
        srv,
        extProcServer,
    )
    // Register the gRPC health server with the existing ext-proc server.
    healthServer := health.NewServer()
    healthpb.RegisterHealthServer(srv, healthServer)
    // Mark the ext-proc service as SERVING.
    svcName := extProcPb.ExternalProcessor_ServiceDesc.ServiceName
    healthServer.SetServingStatus(svcName, healthpb.HealthCheckResponse_SERVING)
    // Forward to the gRPC runnable.
    return runnable.GRPCServer("ext-proc", srv, r.GrpcPort).Start(ctx)
}
With the above changes, a health endpoint is added to the EPP ext-proc server (TCP 9002). Next, update the Envoy e2e test configmap as follows:
static_resources:
  clusters:
  - name: ext_proc
    type: STRICT_DNS
    connect_timeout: 1s
    lb_policy: LEAST_REQUEST
    # Enable active health checking
    health_checks:
    - timeout: 2s # wait this long for a reply
      interval: 10s # probe every 10s
      unhealthy_threshold: 3 # 3 consecutive failures → unhealthy
      healthy_threshold: 2 # 2 successes → healthy again
      reuse_connection: true # keep HTTP2 conn open
      grpc_health_check: # invoke gRPC Health/Check
        service_name: "envoy.service.ext_proc.v3.ExternalProcessor" # Must match the service name registered by EPP
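To make the two thresholds in the config above concrete, here is a minimal sketch of how consecutive probe results could flip a host between healthy and unhealthy. It is a simplified model, not Envoy's implementation: the `hostHealth` type and its methods are hypothetical names, the host is assumed to start out healthy, and Envoy's real health checker has extra cases (for example during initial startup) that are ignored here.

```go
package main

import "fmt"

// hostHealth is a hypothetical, simplified model of threshold-based
// active health checking for a single upstream host.
type hostHealth struct {
	unhealthyThreshold int  // consecutive failures needed to mark unhealthy
	healthyThreshold   int  // consecutive successes needed to mark healthy
	healthy            bool // current state
	failures           int  // current run of consecutive failures
	successes          int  // current run of consecutive successes
}

// newHostHealth returns a tracker that assumes the host starts healthy.
func newHostHealth(unhealthyThreshold, healthyThreshold int) *hostHealth {
	return &hostHealth{
		unhealthyThreshold: unhealthyThreshold,
		healthyThreshold:   healthyThreshold,
		healthy:            true,
	}
}

// observe records one probe result and returns the resulting state.
func (h *hostHealth) observe(ok bool) bool {
	if ok {
		h.failures = 0
		h.successes++
		if !h.healthy && h.successes >= h.healthyThreshold {
			h.healthy = true
		}
	} else {
		h.successes = 0
		h.failures++
		if h.healthy && h.failures >= h.unhealthyThreshold {
			h.healthy = false
		}
	}
	return h.healthy
}

func main() {
	// Same thresholds as the config: 3 failures, then 2 successes.
	h := newHostHealth(3, 2)
	for _, ok := range []bool{false, false, false, true, true} {
		fmt.Println(h.observe(ok))
	}
}
```

With a 10s interval, this model would take roughly three probe intervals to eject a failing EPP and two more to readmit it once it recovers.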
Since this is considered a breaking change, the new code should be gated by a CLI flag with it disabled by default. |
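The opt-in gating suggested above could look roughly like the following sketch, which uses only the standard `flag` package. The flag name `health-checking` and the helper `parseHealthChecking` are assumptions for illustration and are not taken from the PR; the real EPP would call the gRPC health registration where this sketch only prints a message.

```go
package main

import (
	"flag"
	"fmt"
)

// parseHealthChecking parses args and reports whether the opt-in flag is set.
// The flag name "health-checking" is hypothetical; it defaults to false so
// existing deployments keep their current behavior.
func parseHealthChecking(args []string) (bool, error) {
	fs := flag.NewFlagSet("epp", flag.ContinueOnError)
	enabled := fs.Bool("health-checking", false,
		"register the gRPC health service on the ext-proc port")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	for _, args := range [][]string{{}, {"--health-checking"}} {
		enabled, err := parseHealthChecking(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if enabled {
			// In the real server this is where the health service
			// would be registered on the ext-proc gRPC server.
			fmt.Println("would register health service")
		} else {
			fmt.Println("health service not registered")
		}
	}
}
```

Defaulting to false means an Envoy config without `health_checks` keeps working unchanged, and operators enable both sides together.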
tls_options:
  alpn_protocols: ["h2"]
After adding the TLS options, it works now. 😅
[2025-06-15 11:08:22.735][1][debug][hc] [source/extensions/health_checkers/grpc/health_checker_impl.cc:390] [Tags: "ConnectionId":"0"] hc grpc_status=0 service_status=serving health_flags=healthy
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danehans, zhengkezhou1 The full list of commands accepted by this bot can be found here. The pull request process is described here
PTAL @ahg-g since you commented on the related issue. |
We need to keep this setting as-is. |
/hold cancel |
* feat: add health check for epp cluster Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * remove tls Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * don't use tls Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * health checking flag Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * fix import Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * add tls options Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> --------- Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
Add health checking for upstream: epp.
Fix #240