roachtest: backup-restore/mixed-version failed #112373
Comments
If it is taking a whole minute for one intent resolution batch request to respond, arguably failing the backup is correct, since something is busted and the backup just hanging on retries is only going to hide that something.
I don't remember adding a timeout. How do we know this timeout was due to AC queueing? Am I missing something?
Apologies @sumeerbhola. Git blame fooled me. I no longer think the failure relates to #109932, as the limiter you added there wasn't even enabled. The timeout instead relates to #91815. I'll put this on kv's board to assess why intent resolution timed out.
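For context on the shape of this timeout, here is a minimal Go sketch, with hypothetical names rather than the actual intent resolver code, of how a per-batch deadline turns a slow intent resolution into a hard failure instead of an indefinite retry loop:

```go
// Hypothetical sketch: a per-batch deadline on intent resolution. The
// resolveIntentBatch name and durations are illustrative only; the timeout
// discussed above is on the order of a minute per batch.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// resolveIntentBatch stands in for sending one intent resolution batch; here
// it simply blocks until the work "finishes" or the context expires.
func resolveIntentBatch(ctx context.Context, workDuration time.Duration) error {
	select {
	case <-time.After(workDuration): // batch completed
		return nil
	case <-ctx.Done(): // deadline hit while the batch was still waiting
		return ctx.Err()
	}
}

func main() {
	// Short durations so the sketch runs quickly; imagine one minute here.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err := resolveIntentBatch(ctx, 2*time.Second) // simulate a stuck batch
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("intent resolution batch timed out; failing the export request")
	}
}
```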
Sending this failure to kv to dig into why an export request failed due to:
Some context on this test: this backup timed out after several successful backups of the foreground workload. Note that this timeout occurred while all nodes in the cluster were running on binary version release-23.2, but had yet to finalize to 23.2.
Seeing a bunch of these right before the failure:
It's not yet a clear signal of anything to me, but I'm posting it here since it seems like it could be related.
Moving to Storage, now that AC has moved over. Please move back over if this appears to be KV related.
@sumeerbhola thanks for taking a look!
There is a tsdump in the artifacts that Michael uploaded (last line of his latest comment).
@nvanbenschoten how can I tell which (node,store) pair was slow to respond from these log statements?
Those log statements don't include that information, as they're coming from a level where we know which range to send to, but not which replica in that range to send to. We may end up trying multiple. We see that range 171 contains:
What would it take to improve the logging here? Knowing which node/store one was waiting on when the deadline expired would be helpful. Is there an existing issue?
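To illustrate the kind of improvement being asked about, here is a rough Go sketch, using hypothetical types and helpers rather than the real DistSender code, of recording which (node, store) replica a request was waiting on when its deadline expired:

```go
// Hypothetical sketch: remember which replica a batch was last sent to so a
// deadline error can name it. The types and helpers are illustrative only.
package main

import (
	"context"
	"fmt"
	"time"
)

type replica struct {
	NodeID, StoreID int
}

// sendToReplica stands in for one per-replica RPC attempt; here it just
// pretends the replica is slow.
func sendToReplica(ctx context.Context, r replica) error {
	select {
	case <-time.After(10 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// sendToRange tries replicas in turn; if the context expires, the error names
// the replica we were waiting on, which is the information missing from the
// current log statements.
func sendToRange(ctx context.Context, rangeID int, replicas []replica) error {
	for _, r := range replicas {
		err := sendToReplica(ctx, r)
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			return fmt.Errorf("r%d: waiting on n%d,s%d: %w",
				rangeID, r.NodeID, r.StoreID, err)
		}
		// Otherwise fall through and try the next replica.
	}
	return fmt.Errorf("r%d: all replicas failed", rangeID)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	// Placeholder replicas, not range 171's real descriptor.
	err := sendToRange(ctx, 171, []replica{{NodeID: 1, StoreID: 1}, {NodeID: 2, StoreID: 2}})
	fmt.Println(err) // e.g. "r171: waiting on n1,s1: context deadline exceeded"
}
```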
Summary: this is mostly working as intended, since intent resolution is subject to AC. The two follow-ups for the AC team are:
Details: n1's IO tokens are completely exhausted.
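As a conceptual Go sketch of why exhausted IO tokens translate into hung requests (a toy token bucket, not the actual admission control implementation): writes must acquire a token before proceeding, and when a store's bucket is empty they queue until they time out.

```go
// Toy token bucket illustrating IO admission queueing. Not the real AC code:
// names, capacities, and refill behavior are all assumptions for the sketch.
package main

import (
	"context"
	"fmt"
	"time"
)

type ioTokenBucket struct {
	tokens chan struct{}
}

func newIOTokenBucket(capacity int, refill time.Duration) *ioTokenBucket {
	b := &ioTokenBucket{tokens: make(chan struct{}, capacity)}
	go func() {
		for range time.Tick(refill) {
			select {
			case b.tokens <- struct{}{}: // replenish one token
			default: // bucket already full
			}
		}
	}()
	return b
}

// admit blocks until a token is available or the context expires.
func (b *ioTokenBucket) admit(ctx context.Context) error {
	select {
	case <-b.tokens:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	bucket := newIOTokenBucket(0, time.Hour) // simulate fully exhausted tokens
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	if err := bucket.admit(ctx); err != nil {
		fmt.Println("write queued behind exhausted IO tokens:", err)
	}
}
```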
I'm removing the release and GA blocker labels. |
I had forgotten that we bump up the priority when doing intent resolution. This is documented in cockroach/pkg/kv/kvserver/intentresolver/admission.go, lines 79 to 87 (at a96a6f6).
So ExportRequest-triggered intent resolution running as regular writes (and not elastic writes) is expected.
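A rough illustration of that priority bump, using hypothetical names rather than the code in admission.go: intent resolution triggered on behalf of another request is admitted at no lower than regular priority, even when the triggering request (such as a backup's ExportRequest) is elastic.

```go
// Illustrative sketch of the priority bump described above. The Priority type
// and helper are hypothetical, not the actual admission API.
package main

import "fmt"

type Priority int

const (
	ElasticLow Priority = iota // background/elastic work
	Regular                    // foreground traffic
)

// admissionPriorityForIntentResolution returns the priority at which intent
// resolution should be admitted, given the priority of the request that
// encountered the intents.
func admissionPriorityForIntentResolution(trigger Priority) Priority {
	// Bump to at least Regular: the request waiting on the intents is blocked
	// until they are resolved, so resolving them at elastic priority could
	// stall it behind elastic throttling.
	if trigger < Regular {
		return Regular
	}
	return trigger
}

func main() {
	fmt.Println(admissionPriorityForIntentResolution(ElasticLow)) // prints 1 (Regular)
}
```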
This is now being tracked in #114431, which will help us connect hanging requests to the specific replica that the request is hanging on.
roachtest.backup-restore/mixed-version failed with artifacts on release-23.2 @ abd56c6d6e9dbf131fc1fa0d3f8d1fbcfb7e4901:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Same failure on other branches
This test on roachdash | Improve this report!
Jira issue: CRDB-32378