-
Notifications
You must be signed in to change notification settings - Fork 553
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
util: log slow GRPC calls #4847
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gman0 Thanks for the PR, some nits.
charts/ceph-csi-cephfs/README.md
Outdated
@@ -118,6 +118,7 @@ charts and their default values. | |||
| `commonLabels` | Labels to apply to all resources | `{}` | | |||
| `logLevel` | Set logging level for csi containers. Supported values from 0 to 5. 0 for general useful logs, 5 for trace level verbosity. | `5` | | |||
| `sidecarLogLevel` | Set logging level for csi sidecar containers. Supported values from 0 to 5. 0 for general useful logs, 5 for trace level verbosity. | `1` | | |||
| `logSlowOperationsAfter` | Operations running longer than the specified time duration will be logged as slow. Setting the value to zero disables the feature. | `15s` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep this value as same as GRPC timeout which we have at 60s
seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the same timeout probably defeats the purpose? It would be useful to know if an operation is still running before the timeout hits.
cmd/cephcsi.go
Outdated
flag.DurationVar( | ||
&conf.LogSlowOpsAfter, | ||
"logslowopsafter", | ||
time.Second*15, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to enable it by default? how about disable it by default and if the user requires they can enable it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with enabling by default, it can help with troubleshooting issues.
internal/csi-common/utils.go
Outdated
info *grpc.UnaryServerInfo, | ||
handler grpc.UnaryHandler, | ||
) (interface{}, error) { | ||
ticker := time.NewTicker(timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's have a defer for the ticker.Stop() to avoid resource leak
internal/csi-common/utils.go
Outdated
for { | ||
select { | ||
case <-callFinished: | ||
break waitForCallFinished |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we return from here we dont need to use break right?
charts/ceph-csi-cephfs/values.yaml
Outdated
@@ -40,6 +40,9 @@ commonLabels: {} | |||
logLevel: 5 | |||
# sidecarLogLevel is the variable for Kubernetes sidecar container's log level | |||
sidecarLogLevel: 1 | |||
# Operations running longer than the specified time duration will be logged | |||
# as slow. Setting the value to zero disables the feature. | |||
logSlowOperationsAfter: 15s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we comment it out and not enable it by default?
@nixpanic, actually I like @Madhu-1's idea about waiting for context timeout. It's what's happening after, when a call is retried while the older call is still in flight -- this is what I wanted to log and to have a better visibility over.
Here in the log you can see this exact situation:
Normally 27 would be blocked by a try-lock but in any case it's great that we can immediately see when a call is stuck. I propose we change this PR so that the "slow call" logs start only after a call exceeds its context deadline, and we keep this enabled by default. This way we can keep an eye on runaway GRPCs. |
I've made the changes described in the post above. I wonder if it makes sense to make the log interval configurable or just have it hardcoded to some sane value (30s might be ok)? |
internal/util/util.go
Outdated
MetricsPort int // TCP port for liveness/grpc metrics requests | ||
PollTime time.Duration // time interval in seconds between each poll | ||
PoolTimeout time.Duration // probe timeout in seconds | ||
LogSlowOpInterval time.Duration // GRPC calls running longer than this will be logged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment here needs an update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
You probably should consider adding a note to |
Pull request has been modified.
Two minor issues:
|
CI is passing now, thanks! PTAL |
@Mergifyio queue |
🛑 The pull request has been removed from the queue
|
@Mergifyio rebase |
This commit adds a gRPC middleware that logs calls that keep running after their deadline. Adds --logslowopinterval cmdline argument to pass the log rate. Signed-off-by: Robert Vasek <robert.vasek@clyso.com>
Signed-off-by: Robert Vasek <robert.vasek@clyso.com>
Signed-off-by: Robert Vasek <robert.vasek@clyso.com>
Signed-off-by: Robert Vasek <robert.vasek@clyso.com>
✅ Branch has been successfully rebased |
@Mergifyio refresh |
✅ Pull request refreshed |
@Mergifyio queue |
✅ The pull request has been merged automaticallyThe pull request has been merged automatically at c76338c |
/test ci/centos/upgrade-tests-cephfs |
/test ci/centos/upgrade-tests-rbd |
/test ci/centos/k8s-e2e-external-storage/1.29 |
/test ci/centos/k8s-e2e-external-storage/1.30 |
/test ci/centos/mini-e2e-helm/k8s-1.29 |
/test ci/centos/mini-e2e-helm/k8s-1.30 |
/test ci/centos/mini-e2e/k8s-1.29 |
/test ci/centos/mini-e2e/k8s-1.30 |
/test ci/centos/k8s-e2e-external-storage/1.31 |
/test ci/centos/mini-e2e-helm/k8s-1.31 |
/test ci/centos/mini-e2e/k8s-1.31 |
Describe what this PR does
Fixes #4839
This PR adds logs for slow GRPC calls. It does so by implementing a grpc middleware where it checks if the call's handler finishes after its context deadline, in which case a "slow call" message is logged.
The time duration is passed in through a new
MiddlewareServerOptionConfig struct
-- a couple of functions needed to be modified to accept a new arg.User-facing changes:
--logslowopinterval
logSlowOperationInterval
Logs with this PR included:
Is there anything that requires special attention
Do you have any questions?
Is the change backward compatible?
Are there concerns around backward compatibility?
Provide any external context for the change, if any.
For example:
Related issues
Mention any github issues relevant to this PR. Adding below line
will help to auto close the issue once the PR is merged.
Fixes: #issue_number
Future concerns
List items that are not part of the PR and do not impact it's
functionality, but are work items that can be taken up subsequently.
Checklist:
guidelines in the developer
guide.
Request
notes
updated with breaking and/or notable changes for the next major release.
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>
: retest the<job-name>
after unrelatedfailure (please report the failure too!)