
roachtest: backup-restore/mixed-version failed #112373

Closed
cockroach-teamcity opened this issue Oct 14, 2023 · 23 comments
Labels
A-admission-control A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team X-noreuse Prevent automatic commenting from CI test failures
Milestone
23.2
Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 14, 2023

roachtest.backup-restore/mixed-version failed with artifacts on release-23.2 @ abd56c6d6e9dbf131fc1fa0d3f8d1fbcfb7e4901:

(mixedversion.go:545).Run: mixed-version test failure while running step 29 (run "plan and run backups"): error waiting for job to finish: job 908356280488263681 failed with error: failed to run backup: exporting 230 ranges: export request timeout: operation "intent_resolver_ir_batcher.sendBatch" timed out after 1m0.041s (given timeout 1m0s): work kv deadline expired while waiting: deadline: 2023-10-14 10:20:57.098424558 +0000 UTC m=+377.360809249, start: 2023-10-14 10:19:57.098548 +0000 UTC m=+317.360932704, dur: 1m0.039413971s
test artifacts and logs in: /artifacts/backup-restore/mixed-version/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-32378

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Oct 14, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Oct 14, 2023
@msbutler

This comment was marked as outdated.

@dt
Member

dt commented Oct 16, 2023

A whole backup should not fail on this error

If it is taking a whole minute for one intent resolution batch request to respond, arguably failing the backup is correct: something is busted, and the backup just hanging on retries would only hide whatever that something is.

@sumeerbhola
Collaborator

A backup export request failed due to an admission control timeout recently introduced in #109932 by @sumeerbhola

I don't remember adding a timeout. How do we know this timeout was due to AC queueing? Am I missing something?
This seems like another place where we need the observability mentioned in #112126 to understand the causes of the latency incurred by the request.
@adityamaru

@msbutler
Collaborator

msbutler commented Oct 17, 2023

Apologies @sumeerbhola. Git blame fooled me. I no longer think the failure relates to #109932 as the limiter you added here wasn't even enabled. The timeout instead relates to #91815. I'll put this on kv's board to assess why intent resolution timed out.

@msbutler
Collaborator

Sending this failure to kv to dig into why an export request failed due to:

operation "intent_resolver_ir_batcher.sendBatch" timed out after 1m0.041s (given timeout 1m0s): work kv deadline expired while waiting: deadline: 2023-10-14 10:20:57.098424558 +0000 UTC m=+377.360809249, start: 2023-10-14 10:19:57.098548 +0000 UTC m=+317.360932704, dur: 1m0.039413971s

Some context on this test: this backup timed out after several successful backups of the foreground workload. Note that this timeout occurred while all nodes in the cluster were running on binary version release-23.2, but had yet to finalize to 23.2.
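
For anyone trying to picture the failure mode, here is a minimal sketch (not CockroachDB's actual admission code; admitWithDeadline and the slot channel are made up) of how a caller-imposed timeout becomes a "work kv deadline expired while waiting" error when the work spends its whole budget sitting in an admission queue:

package main

import (
    "context"
    "fmt"
    "time"
)

// admitWithDeadline is a hypothetical stand-in for queueing work behind
// admission control: block until a slot frees up or the context's deadline
// expires, whichever comes first.
func admitWithDeadline(ctx context.Context, slots chan struct{}) error {
    start := time.Now()
    select {
    case slots <- struct{}{}: // admitted; a real queue would also order by priority
        return nil
    case <-ctx.Done():
        deadline, _ := ctx.Deadline()
        return fmt.Errorf("work kv deadline expired while waiting: deadline: %s, dur: %s",
            deadline, time.Since(start))
    }
}

func main() {
    // The ExportRequest in this failure carried a ~1m timeout; the batched
    // intent resolution it triggered inherited that deadline and spent the
    // whole minute queued. A 2s timeout keeps the example fast.
    slots := make(chan struct{}) // zero capacity: admission never succeeds here
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    if err := admitWithDeadline(ctx, slots); err != nil {
        fmt.Println("export request failed:", err)
    }
}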

@msbutler msbutler added T-kv KV Team and removed T-admission-control Admission Control labels Oct 17, 2023
@sumeerbhola
Collaborator

I no longer think the failure relates to #109932 as the limiter you added here wasn't even enabled.

kv.intent_resolver.batch.bypass_admission_control.enabled defaults to false, so AC is enabled for most intent resolution.
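
To make that gating concrete, a rough sketch with hypothetical names (bypassAdmissionControl, admit, sendIntentResolutionBatch) of what the setting implies: with the default of false, every intent-resolution batch waits for admission before it is sent.

package main

import (
    "context"
    "fmt"
)

// bypassAdmissionControl models the cluster setting
// kv.intent_resolver.batch.bypass_admission_control.enabled; false by default.
var bypassAdmissionControl = false

// admit is a placeholder for waiting in the KV admission work queue. In the
// real system this can block for a long time on an overloaded store; here it
// returns immediately.
func admit(ctx context.Context) error {
    return ctx.Err()
}

// sendIntentResolutionBatch is a hypothetical wrapper around sending a batch
// of ResolveIntent requests.
func sendIntentResolutionBatch(ctx context.Context, intents []string) error {
    if !bypassAdmissionControl {
        if err := admit(ctx); err != nil {
            return fmt.Errorf("intent resolution not admitted: %w", err)
        }
    }
    fmt.Printf("resolving %d intents\n", len(intents))
    return nil
}

func main() {
    _ = sendIntentResolutionBatch(context.Background(), []string{"/Table/114/1/90/2/-3068/2/0"})
}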

@andrewbaptist andrewbaptist added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 19, 2023
@aadityasondhi
Collaborator

Seeing a bunch of these right before the failure:

I231014 10:21:05.508193 671447 server/node.go:1368 ⋮ [T1,Vsystem,n3] 1491 +     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch
W231014 10:21:05.711642 600806 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1492  pusher ‹3b1b0f28›: have been waiting 60.01s for pushee ‹fdfe2850›
W231014 10:21:05.920440 601174 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1493  pusher ‹ff90cb52›: have been waiting 60.00s for pushee ‹fdfe2850›
W231014 10:21:06.045302 601168 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1494  pusher ‹6bc668a4›: have been waiting 60.00s for pushee ‹639140cd›
W231014 10:21:06.262348 601265 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1495  pusher ‹c8addbf1›: have been waiting 60.14s for pushee ‹7499a0b9›
W231014 10:21:06.262425 601318 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1496  pusher ‹2c283425›: have been waiting 60.12s for pushee ‹3555f153›
W231014 10:21:06.287687 601531 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1497  pusher ‹bddece9c›: have been waiting 60.00s for pushee ‹4f16652a›
W231014 10:21:06.478808 601750 kv/kvserver/txnwait/queue.go:619 ⋮ [T1,Vsystem,n3,s3,r67/3:‹/Table/10{6/1-7}›] 1498  pusher ‹d3d77518›: have been waiting 60.00s for pushee ‹4f16652a›

It's not yet a clear signal of anything to me, but I'm posting it here since it seems like it could be related.

@blathers-crl blathers-crl bot added the T-storage Storage Team label Oct 25, 2023
@nvanbenschoten
Member

Moving to Storage, now that AC has moved over. Please move back over if this appears to be KV related.

@miraradeva
Contributor

@sumeerbhola thanks for taking a look!

If we have a tsdump we could probably make an educated guess.

There is a tsdump in the artifacts that Michael uploaded (last line of his latest comment).

@blathers-crl blathers-crl bot added the T-storage Storage Team label Nov 8, 2023
@sumeerbhola
Collaborator

@nvanbenschoten how can I tell which (node,store) pair was slow to respond from these log statements?

W231014 10:20:57.138825 310484 kv/kvclient/kvcoord/dist_sender.go:1887 ⋮ [T1,Vsystem,n4,s4] 321  slow range RPC: have been waiting 60.04s (1 attempts) for RPC ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹2›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹3›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹4›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹5›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹6›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹7›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹8›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹9›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹10›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹11›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹12›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹13›/‹0›,/Min), ResolveIntent [/Table/114/1/‹90›/‹2›/‹-3068›/‹14›/‹0›,/Min) to r171:‹/Table/11{4/1/79/5/-2431/8-7}› [(n1,s1):1, (n4,s4):5, (n3,s3):3, next=6, gen=21, sticky=1697278789.191696346,0]; resp: ‹(err: work kv deadline expired while waiting: deadline: 2023-10-14 10:20:57.098424558 +0000 UTC m=+377.360809249, start: 2023-10-14 10:19:57.098548 +0000 UTC m=+317.360932704, dur: 1m0.039413971s)›
W231014 10:20:57.138898 310484 kv/kvclient/kvcoord/dist_sender.go:1897 ⋮ [T1,Vsystem,n4,s4] 322  slow RPC response: slow RPC finished after 60.04s (1 attempts)
W231014 10:20:57.143646 248440 kv/kvclient/kvcoord/dist_sender.go:1887 ⋮ [T1,Vsystem,n4,f‹9c324142›,job=908356280488263681,distsql.gateway=3,distsql.appname=‹$ internal-resume-job-908356280488263681›] 323  slow range RPC: have been waiting 61.87s (1 attempts) for RPC Export [/Table/114/1/‹79›/‹5›/‹-2431›/‹8›,/Table/114/2), [wait-policy: Error] to r171:‹/Table/11{4/1/79/5/-2431/8-7}› [(n1,s1):1, (n4,s4):5, (n3,s3):3, next=6, gen=21, sticky=1697278789.191696346,0]; resp: ‹(err: operation "intent_resolver_ir_batcher.sendBatch" timed out after 1m0.041s (given timeout 1m0s): work kv deadline expired while waiting: deadline: 2023-10-14 10:20:57.098424558 +0000 UTC m=+377.360809249, start: 2023-10-14 10:19:57.098548 +0000 UTC m=+317.360932704, dur: 1m0.039413971s)›

@nvanbenschoten
Member

Those log statements don't include that information, as they're coming from a level where we know which range to send to, but not which replica in that range to send to. We may end up trying multiple. We see that range 171 contains: (n1,s1):1, (n4,s4):5, (n3,s3):3.

@sumeerbhola
Collaborator

Those log statements don't include that information, as they're coming from a level where we know which range to send to, but not which replica in that range to send to. We may end up trying multiple.

What would it take to improve the logging here? Knowing which node/store the request was waiting on when the deadline expired would be helpful. Is there an existing issue?
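
To illustrate the kind of annotation being asked for (hypothetical types, not an actual DistSender patch): wrap the deadline-expired error with the replica that was being tried, so the slow-RPC warning can name the (node, store) pair.

package main

import (
    "errors"
    "fmt"
)

// replicaInfo identifies the replica a request was sent to.
type replicaInfo struct {
    NodeID, StoreID int
}

// annotateWithReplica wraps an admission/deadline error with the replica the
// request was waiting on so callers (and log lines) can surface it.
func annotateWithReplica(err error, r replicaInfo) error {
    if err == nil {
        return nil
    }
    return fmt.Errorf("waiting on (n%d,s%d): %w", r.NodeID, r.StoreID, err)
}

func main() {
    base := errors.New("work kv deadline expired while waiting")
    err := annotateWithReplica(base, replicaInfo{NodeID: 1, StoreID: 1})
    // A slow-RPC warning could then read something like:
    //   slow range RPC ... resp: (err: waiting on (n1,s1): work kv deadline expired while waiting)
    fmt.Println(err)
}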

@sumeerbhola
Collaborator

Summary: this is mostly working as intended, since intent resolution is subject to AC.

The two followups for the AC team are:
- Improve the logging/observability so we can tell which node/store a request was waiting on when the admission deadline expired (see #112373 (comment)).
- Investigate why intent resolution on behalf of ExportRequest seems to be running as regular instead of elastic write.

Details:
As discussed in #112373 (comment) we can't be sure which node (among (n1,s1):1, (n4,s4):5, (n3,s3):3) was responsible for the AC queueing.
CPU utilization was 75% during the test and p99 goroutine scheduling latency was > 1ms, but runnable goroutine counts were low, so we expect regular CPU (which is what this intent resolution should run as) to not be a bottleneck and elastic CPU to be a bottleneck.
[screenshot: CPU utilization and goroutine scheduling latency]
Which is corroborated by slots exhausted (regular requests) being low
[screenshot: slots exhausted (regular requests)]
And elastic CPU utilization is running at its limit of 5% (which is the floor of elastic CPU utilization, so it can't be reduced further even when goroutine scheduling latency p99 is > 1ms)
[screenshot: elastic CPU utilization at its 5% limit]

n1's IO tokens are completely exhausted
[screenshot: n1 IO tokens exhausted]
And it is not able to keep its sub-level count at 10
[screenshot: n1 L0 sub-level count]
There are various reasons it could be having difficulty controlling the sub-level count, including lack of replication admission control for regular writes it receives as a follower. There are no logs for n1, so we can't investigate further. It doesn't matter -- even if it was able to exercise more control, there would be queueing and the intent resolution would possibly wait (even more so if intent resolution was running as an elastic write).
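
As a rough illustration of the IO-token mechanism described above (a simplified sketch, not CockroachDB's actual token computation; computeIOTokens and the numbers are made up): when a store's L0 sub-level count sits above its target, the store grants fewer byte tokens per interval, admitted work queues behind the token bucket, and requests with deadlines, like the intent resolution here, can time out while waiting.

package main

import "fmt"

// computeIOTokens is a hypothetical token calculation: grant fewer byte
// tokens as the observed sub-level count climbs past the target.
func computeIOTokens(baseTokens int64, subLevels, targetSubLevels int) int64 {
    if subLevels <= targetSubLevels {
        return baseTokens // healthy LSM: no throttling
    }
    // Scale tokens down in proportion to how far past the target we are.
    scaled := baseTokens * int64(targetSubLevels) / int64(subLevels)
    if scaled < 1 {
        scaled = 1
    }
    return scaled
}

func main() {
    const base = int64(16 << 20) // 16 MiB of byte tokens per interval (illustrative)
    for _, subLevels := range []int{5, 10, 20, 40} {
        fmt.Printf("sub-levels=%2d -> tokens=%d bytes\n", subLevels, computeIOTokens(base, subLevels, 10))
    }
}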

@sumeerbhola
Collaborator

I'm removing the release and GA blocker labels.
Assigning this to @nvanbenschoten only for creating an issue for #112373 (comment). Once that is done, I think we can close this.

@sumeerbhola sumeerbhola removed GA-blocker branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Nov 13, 2023
@sumeerbhola
Collaborator

investigate why intent resolution on behalf of ExportRequest seems to be running as regular instead of elastic write

I had forgotten that we bump up the priority when doing intent resolution. This is documented in

// - Txns that already hold locks/intents use an AC priority that is logically
// +1 of the original txn priority. Since there are gaps between the
// different user priorities (UserLowPri=-50, NormalPri=0, UserHighPri=50),
// work in one part of the key space that executes with NormalPri+1 will not
// get prioritized over another part of the key space executing at
// UserHighPri. And when different tenantIDs are being used for performance
// isolation, the priority is irrelevant. The actual implementation uses a
// step that is bigger than +1. See
// admissionpb.AdjustedPriorityWhenHoldingLocks.

So ExportRequest-triggered intent resolution running as regular writes (and not elastic writes) is expected.
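
A small worked illustration of that bump (the step value below is made up; the real constant is admissionpb.AdjustedPriorityWhenHoldingLocks): NormalPri work holding locks is admitted above NormalPri, so it runs as regular rather than elastic work, while still staying below UserHighPri.

package main

import "fmt"

const (
    UserLowPri  = -50
    NormalPri   = 0
    UserHighPri = 50

    // Hypothetical step; the actual value lives in the admissionpb package.
    holdingLocksStep = 10
)

// adjustedPriorityWhenHoldingLocks sketches the bump applied to intent
// resolution done on behalf of a txn that already holds locks/intents.
func adjustedPriorityWhenHoldingLocks(pri int) int {
    adjusted := pri + holdingLocksStep
    if adjusted >= UserHighPri {
        adjusted = UserHighPri - 1 // never leapfrog high-priority user work
    }
    return adjusted
}

func main() {
    // Intent resolution for a NormalPri txn runs above NormalPri (so it is
    // classified as regular, not elastic, work) yet below UserHighPri.
    fmt.Println(adjustedPriorityWhenHoldingLocks(NormalPri)) // prints 10
}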

@nvanbenschoten
Member

Assigning this to @nvanbenschoten only for creating an issue for #112373 (comment). Once that is done, I think we can close this.

This is now being tracked in #114431, which will help us connect hanging requests to the specific replica that the request is hanging on.

@nvanbenschoten nvanbenschoten removed their assignment Nov 14, 2023
@nvanbenschoten nvanbenschoten closed this as not planned Nov 14, 2023
@jbowens jbowens added this to Storage Jun 4, 2024
@jbowens jbowens moved this to Done in Storage Jun 4, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024