Keepalive bugfixes and unify timers strategies between client and server #2760
Conversation
- Bugfix: Ensure that if `session.ping` returns false, we correctly fail the keepalive and the connection.
- Bugfix: Ensure that if the keepalive interval fires faster than the prior keepalive's timeout, we do not overwrite the reference to the prior timeout. The prior implementation could, in theory, have prevented a valid keepalive timeout from clearing itself. This rewrite keeps every timeout as a local (rather than shared state per session); see the sketch after this list. Even if a timeout outlives the lifetime of its session, we still guard against errors by checking that the parent interval is not falsy. I reckon this could produce a short-term memory leak per session, bounded by keepaliveTimeoutMs at most. Even with that potential for a short reference hold, I think the implementation proposed here is more correct. One alternative would be to keep a list of pending timeouts, which is complex for a rare situation that will resolve itself anyway once keepaliveTimeoutMs is reached.
- Bugfix: Keepalive intervals were previously being cleared with an incorrect `clearTimeout`. Not sure whether this was causing interval leaks in some Node.js implementations (v20.13.1 seems to accept the mismatch without issue).
- Renamed variables for clarity, to prevent future bugs like swapping `clearInterval` vs `clearTimeout`.
- The implementation is repeated in two places, per the warning from grpc#2756 (comment).
- This PR supersedes the prior PR, grpc#2756, which was based on an out-of-date master branch.
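To make the timer strategy concrete, here is a minimal TypeScript sketch of the pattern described above. Names like `handleKeepaliveFailure`, `keepaliveTimeMs`, and `keepaliveTimeoutMs` are illustrative placeholders, not the actual grpc-js internals:

```typescript
import type * as http2 from 'http2';

declare const session: http2.ServerHttp2Session;
declare const keepaliveTimeMs: number;
declare const keepaliveTimeoutMs: number;
declare function handleKeepaliveFailure(reason: string): void;

let keepaliveInterval: NodeJS.Timeout | null = setInterval(() => {
  // Each ping gets its OWN timeout handle, held as a local in this
  // closure, so a fast interval can never overwrite (and thus orphan)
  // the previous ping's pending timeout.
  const pingTimeout: NodeJS.Timeout = setTimeout(() => {
    // Guard: if the parent interval was already torn down, the session
    // is gone and this late-firing timeout should be a no-op.
    if (keepaliveInterval) {
      handleKeepaliveFailure('keepalive timeout');
    }
  }, keepaliveTimeoutMs);

  const pingSucceeded = session.ping(err => {
    clearTimeout(pingTimeout);
    if (err) {
      handleKeepaliveFailure(err.message);
    }
  });
  // ping() returning false is itself a failure (e.g. session destroyed
  // or too many outstanding pings), not something to silently ignore.
  if (!pingSucceeded) {
    clearTimeout(pingTimeout);
    handleKeepaliveFailure('session.ping returned false');
  }
}, keepaliveTimeMs);

function stopKeepalive() {
  if (keepaliveInterval) {
    clearInterval(keepaliveInterval); // clearInterval, not clearTimeout
    keepaliveInterval = null;
  }
}
```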
I realize that a second solution to my reference conundrum is to use a WeakRef to hold the reference to the session and channelz tracer in the timeout lambda, which might reduce the size of the short-term memory leak. I'm not sure whether the rest of the code base is this cautious about such things. If you'd prefer I track all outstanding timeouts or use a WeakRef, please let me know. Happy to adjust.
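For illustration, a minimal sketch of that WeakRef alternative (not what the PR implements; the `session` shape here is assumed):

```typescript
declare const session: { destroy(): void };
declare const keepaliveTimeoutMs: number;

// The timeout closure holds only a weak reference, so a late-firing
// timeout cannot extend the session's lifetime past garbage collection.
const sessionRef = new WeakRef(session);
const pingTimeout = setTimeout(() => {
  const liveSession = sessionRef.deref();
  if (liveSession) {
    // Session is still alive: treat the missed ping as a failure.
    liveSession.destroy();
  }
  // If deref() returned undefined, the session was already collected
  // and there is nothing left to clean up.
}, keepaliveTimeoutMs);
```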
All per-line comments apply to both implementations.

`channelzTrace.addTrace` isn't a replacement for `trace`. `channelzTrace.addTrace` only traces to channelz, so its output will only be seen by a channelz client. You still want to call `trace` to log to the console or whatever log file.

Also, keepalive logs may be verbose compared to other server logs, so I think it would be a good idea to have a separate tracer, as the client does. It's probably fine to use the same `keepalive` tracer name that the client uses.
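A hedged sketch of what the reviewer describes, assuming internal helpers shaped like grpc-js's `logging.trace` and `ChannelzTrace.addTrace` (signatures approximated from the surrounding discussion, not authoritative):

```typescript
// Assumed internal imports within the grpc-js source tree.
import { LogVerbosity } from './constants';
import * as logging from './logging';
import type { ChannelzTrace } from './channelz';

// Reuse the tracer name the client already uses, per the review.
const TRACER_NAME = 'keepalive';

function traceKeepalive(channelzTrace: ChannelzTrace, text: string) {
  // Console / log-file output, surfaced via GRPC_TRACE=keepalive.
  logging.trace(LogVerbosity.DEBUG, TRACER_NAME, text);
  // Channelz output, visible only to a channelz client.
  channelzTrace.addTrace('CT_INFO', text);
}
```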
… first round of review from grpc#2760
Thanks for your prompt help and code review! I appreciate all the pointers, especially given that this is my first pass through this part of your code base.
Question:
Thanks again for the help.
I think the channelz traces are fine the way they are.
Thank you for the review. I've rewritten the server code to handle the timers more like the client code does. I think the only dangling issue for us to align on is the …
…epend on whether the session is destroyed
Per request, keepaliveDisabled state management is removed.
…ger. This is a necessary change to fit with having removed the keepaliveDisabled boolean. Manually inspected test logs for both server.ts and transport.ts to verify that both types of keepalives are operating correctly.
I've been successfully running the PR branch for many days now in a staging environment with bidi streaming RPCs that sit idle for long periods. By tweaking the keepalive values on both client and server to be either under or over 10 minutes, I've been able to reproduce the keepalive both (a) working to keep connections alive without errors, and (b) catching connections that were dropped by intermediate firewalls after a period of idle timeout, specifically including GCP's 10-minute idle-limit firewall ;) I've used …
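For reference, that kind of testing can be configured with the standard gRPC keepalive channel arguments; the values below are illustrative, chosen to ping well inside a 10-minute idle window, and the target address is hypothetical:

```typescript
import * as grpc from '@grpc/grpc-js';

const keepaliveOptions: grpc.ChannelOptions = {
  'grpc.keepalive_time_ms': 5 * 60 * 1000,  // ping every 5 minutes
  'grpc.keepalive_timeout_ms': 20 * 1000,   // fail if no ack within 20s
  'grpc.keepalive_permit_without_calls': 1, // ping even on idle connections
};

// The same options are accepted on both sides.
const server = new grpc.Server(keepaliveOptions);
const client = new grpc.Client(
  'example.invalid:443', // hypothetical target
  grpc.credentials.createSsl(),
  keepaliveOptions
);
```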
Hi gRPC team: Gentle bump. Let me know if I can provide anything else.
Sorry about the delay, I had to focus on GHSA-7v5v-9h63-cj86. I can get this out soon. |
This is now out in version 1.10.10. |
Improve server-side keepalives; possibly resolve a bug where keepalive errors were not being treated as errors.
Aims to help resolve #2734