STORAGE-4262: Automatically Recover From Failed PRS Operation #5

Closed
wants to merge 21 commits

Conversation


@ghost ghost commented Aug 17, 2020

In Vitess v4, a PlannedReparentShard operation could fail for a variety of reasons; a common one for us was long-running statements (which are single-statement transactions) causing the operation to time out. For example (from here):

running sudo -u vitess vtctl_etsy -alsologtostderr PlannedReparentShard -keyspace_shard=etsy_risk/- -new_master us_central1_c-1423301520 -wait_slave_timeout 6.0s

stderr: I0720 16:36:48.099867   18335 trace.go:151] successfully started tracing with [noop]
I0720 16:36:48.100554   18335 locks.go:359] Locking shard etsy_risk/- for action PlannedReparentShard(us_central1_c-1423301520, avoid_master=)
I0720 16:36:48.106144   18335 logutil.go:31] log: Connected to 10.248.1.152:2181
I0720 16:36:48.108640   18335 logutil.go:31] log: authenticated: id=0x3000008dd8a79f1, timeout=30s
I0720 16:36:48.108745   18335 zk_conn.go:336] zk conn: session for addr vttopo05.c.etsy-vitess-prod.internal:2181,vttopo04.c.etsy-vitess-prod.internal:2181,vttopo02.c.etsy-vitess-prod.internal:2181,vttopo01.c.etsy-vitess-prod.internal:2181,vttopo03.c.etsy-vitess-prod.internal:2181 event: {EventSession StateHasSession   10.248.1.152:2181}
I0720 16:36:48.120467   18335 logutil.go:31] log: Connected to 10.248.4.65:2181
I0720 16:36:48.122578   18335 logutil.go:31] log: authenticated: id=0x20000002ae4787e, timeout=30s
I0720 16:36:48.122658   18335 zk_conn.go:336] zk conn: session for addr vttopo05.c.etsy-vitess-prod.internal:2181,vttopo04.c.etsy-vitess-prod.internal:2181,vttopo02.c.etsy-vitess-prod.internal:2181,vttopo01.c.etsy-vitess-prod.internal:2181,vttopo03.c.etsy-vitess-prod.internal:2181 event: {EventSession StateHasSession   10.248.4.65:2181}
I0720 16:36:48.122692   18335 logutil.go:31] log: Connected to 10.248.4.72:2181
I0720 16:36:48.124254   18335 logutil.go:31] log: authenticated: id=0x400001638827961, timeout=30s
I0720 16:36:48.124311   18335 zk_conn.go:336] zk conn: session for addr vttopo05.c.etsy-vitess-prod.internal:2181,vttopo04.c.etsy-vitess-prod.internal:2181,vttopo02.c.etsy-vitess-prod.internal:2181,vttopo01.c.etsy-vitess-prod.internal:2181,vttopo03.c.etsy-vitess-prod.internal:2181 event: {EventSession StateHasSession   10.248.4.72:2181}
I0720 16:36:48.126227   18335 reparent.go:551] Checking replication on master-elect us_central1_c-1423301520
I0720 16:36:48.181597   18335 reparent.go:586] demote current master cell:"us_central1_a" uid:443280992 
I0720 16:37:18.182228   18335 locks.go:396] Unlocking shard etsy_risk/- for action PlannedReparentShard(us_central1_c-1423301520, avoid_master=) with error old master tablet us_central1_a-0443280992 DemoteMaster failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0720 16:37:18.189053   18335 vtctl.go:115] action failed: PlannedReparentShard old master tablet us_central1_a-0443280992 DemoteMaster failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

This left the shard in a broken state: both sides were read-only, and the tablet on which the demotion failed would not even serve read-only traffic (it returned an error about being in state NOT_SERVING).

I was able to reproduce this behavior using an Etsy Vitess Sandbox container (see details here).

I then backported some related fixes that landed after v4 (see vitessio/vitess#5376), specifically this code block, which undoes a failed DemoteMaster operation.
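
The recovery pattern being backported looks roughly like the following. This is a minimal sketch, not the actual wrangler code from vitessio/vitess#5376: the `tabletClient` interface and the `demoteWithRecovery` helper are hypothetical stand-ins for the real tablet-manager RPCs.

```go
package reparent

import (
	"context"
	"fmt"
	"time"
)

// tabletClient is a hypothetical stand-in for the tablet-manager RPC client
// used by PlannedReparentShard; the real code lives in Vitess' wrangler package.
type tabletClient interface {
	DemoteMaster(ctx context.Context, tablet string) error
	UndoDemoteMaster(ctx context.Context, tablet string) error
}

// demoteWithRecovery demotes the current master and, if the demotion fails,
// immediately tries to undo it so the shard is not left read-only on both
// sides. This mirrors the behavior backported in this PR.
func demoteWithRecovery(ctx context.Context, tmc tabletClient, oldMaster string) error {
	demoteCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	if err := tmc.DemoteMaster(demoteCtx, oldMaster); err != nil {
		// Best-effort undo with a fresh timeout: without this, a timed-out
		// DemoteMaster leaves the old master read-only and NOT_SERVING.
		undoCtx, undoCancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer undoCancel()
		if undoErr := tmc.UndoDemoteMaster(undoCtx, oldMaster); undoErr != nil {
			return fmt.Errorf("DemoteMaster failed (%v) and could not be undone: %v", err, undoErr)
		}
		return fmt.Errorf("DemoteMaster failed and was rolled back: %v", err)
	}
	return nil
}
```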

After that, I could no longer reproduce the problem where a failed DemoteMaster left things in a broken state (see here for details).

This improved behavior will allow us to safely use vtctl for failovers. The command-line client we create to orchestrate the larger operation around it can add additional safety mechanisms, such as examining the current replica lag and any long-running statements/transactions on the master, and skipping the failover attempt for that host unless an additional flag is passed (e.g. --force), thus avoiding the window where there is no RW instance in the A/B host pair. And with this new vtctl behavior, we are protected from broken states persisting when an edge case occurs (e.g. a new long-running statement arrives between our check and the PlannedReparentShard).
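
As a rough sketch of the kind of pre-flight check described above (the threshold, DSN handling, and helper name are hypothetical, and the real client might query the tablets through Vitess rather than MySQL directly):

```go
package preflight

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// longRunningStatements returns the number of statements on the master that
// have been executing longer than maxSeconds. The orchestration client would
// skip the failover for this host (unless --force is passed) when this is > 0.
// Hypothetical helper for illustration only.
func longRunningStatements(dsn string, maxSeconds int) (int, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return 0, err
	}
	defer db.Close()

	var count int
	err = db.QueryRow(
		`SELECT COUNT(*) FROM information_schema.processlist
		 WHERE command NOT IN ('Sleep', 'Binlog Dump') AND time > ?`,
		maxSeconds,
	).Scan(&count)
	if err != nil {
		return 0, fmt.Errorf("checking processlist: %v", err)
	}
	return count, nil
}
```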

morgo and others added 21 commits October 31, 2019 17:06
It fails after rebootstrapping docker image

Signed-off-by: Morgan Tocker <tocker@gmail.com>
Signed-off-by: Morgan Tocker <tocker@gmail.com>
Signed-off-by: Harshit Gangal <harshit.gangal@gmail.com>
Signed-off-by: Morgan Tocker <tocker@gmail.com>
…t-protection

Back port stronger root protection
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
…dpoint (#3)

The `/debug/liveness` endpoint now returns a 503 when `/etc/etsy/depool` is present on the filesystem.

Note: the included "unit" test currently relies on `/etc/etsy` existing and being writable. Run it with `go test go/vt/servenv/*.go`

Signed-off-by: Mackenzie Starr <mstarr@etsy.com>
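
The behavior described in that commit can be sketched roughly as follows. This standalone version only illustrates the 503-on-depool-file idea; the real handler is registered inside go/vt/servenv, and the port below is arbitrary.

```go
package main

import (
	"net/http"
	"os"
)

// depoolFile is the sentinel file whose presence marks this host as depooled.
const depoolFile = "/etc/etsy/depool"

// livenessHandler mimics the backported /debug/liveness behavior: return 503
// when the depool file exists, 200 otherwise, so load balancers stop sending
// traffic to a depooled host.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	if _, err := os.Stat(depoolFile); err == nil {
		http.Error(w, "depooled", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/debug/liveness", livenessHandler)
	http.ListenAndServe(":8080", nil) // port chosen arbitrarily for this sketch
}
```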
… pool timeout to Vitess 4.x (#4)

This should prevent downstream clients from queueing indefinitely to acquire a connection from the stream pool, which we have seen exhaust downstream httpd workers in production.

When clients hit the stream pool timeout, the error message is:
> stream pool wait time exceeded: resource pool timed out

This is a 4.x-only patch and can be removed when we upgrade to a Vitess version >= 6.x

Signed-off-by: Mackenzie Starr <mstarr@etsy.com>
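
The stream-pool timeout behaves roughly like the bounded acquire below. Vitess' resource pool implementation differs; this toy pool only illustrates why waiters no longer queue indefinitely.

```go
package streampool

import (
	"context"
	"errors"
	"time"
)

// ErrPoolTimeout mirrors the error surfaced to clients:
// "stream pool wait time exceeded: resource pool timed out".
var ErrPoolTimeout = errors.New("stream pool wait time exceeded: resource pool timed out")

// Pool is a toy connection pool backed by a buffered channel of slots.
type Pool struct {
	slots   chan struct{}
	timeout time.Duration
}

// NewPool creates a pool with the given number of slots and wait timeout.
func NewPool(size int, timeout time.Duration) *Pool {
	p := &Pool{slots: make(chan struct{}, size), timeout: timeout}
	for i := 0; i < size; i++ {
		p.slots <- struct{}{}
	}
	return p
}

// Get blocks for at most p.timeout waiting for a free slot. Without the
// timeout, a saturated pool would make callers (e.g. httpd workers) queue
// indefinitely, which is the failure mode this patch prevents.
func (p *Pool) Get(ctx context.Context) error {
	timer := time.NewTimer(p.timeout)
	defer timer.Stop()
	select {
	case <-p.slots:
		return nil
	case <-timer.C:
		return ErrPoolTimeout
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Put returns a slot to the pool.
func (p *Pool) Put() { p.slots <- struct{}{} }
```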
Repository owner requested review from dasl- and mackenziestarr August 17, 2020 19:54
Repository owner added the backport label Aug 17, 2020
Repository owner self-assigned this Aug 17, 2020

ghost commented Aug 17, 2020

Gah, crap. I will close this and create another :-|. Have to see how best to create a PR against a non-master branch.

Repository owner closed this Aug 17, 2020
Repository owner deleted the STORAGE-4262 branch August 17, 2020 20:01
Repository owner restored the STORAGE-4262 branch August 17, 2020 20:01
Repository owner deleted the STORAGE-4262 branch August 18, 2020 19:22
jmchen28 pushed a commit that referenced this pull request Jun 13, 2023
* decouple olap tx timeout from oltp tx timeout

Since workload=olap bypasses the query timeouts
(--queryserver-config-query-timeout) and also row limits, the natural
assumption is that it also bypasses the transaction timeout.

This is not the case, e.g. for a tablet where the
--queryserver-config-transaction-timeout is 10.

This commit:

 * Adds new CLI flag and YAML field to independently configure TX
   timeouts for OLAP workloads (--queryserver-config-olap-transaction-timeout).
 * Decouples TX kill interval from OLTP TX timeout via new CLI flag and
   YAML field (--queryserver-config-transaction-killer-interval).

Signed-off-by: Max Englander <max@planetscale.com>
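
A minimal sketch of the decoupling described in that commit message: the transaction timeout is chosen per workload, and the killer interval is its own setting rather than being derived from the OLTP timeout. The names and types here are illustrative, not the actual tabletserver code.

```go
package tabletconfig

import "time"

// Config mirrors (in spirit) the three settings named above:
// --queryserver-config-transaction-timeout,
// --queryserver-config-olap-transaction-timeout, and
// --queryserver-config-transaction-killer-interval.
type Config struct {
	OltpTxTimeout    time.Duration
	OlapTxTimeout    time.Duration
	TxKillerInterval time.Duration
}

// TxTimeout returns the transaction timeout for the given workload. Before
// this change, OLAP transactions were subject to the OLTP timeout.
func (c Config) TxTimeout(workload string) time.Duration {
	if workload == "olap" {
		return c.OlapTxTimeout
	}
	return c.OltpTxTimeout
}

// KillerInterval returns how often the transaction killer wakes up; it is now
// configured independently instead of being tied to the OLTP timeout.
func (c Config) KillerInterval() time.Duration {
	return c.TxKillerInterval
}
```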

* decouple ol{a,t}p tx timeouts: pr comments #1

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #2 consolidate timeout logic in sc

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: remove unused tx killer flag

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: update 15_0_0_summary.md

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: fix race cond

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #3 -txProps.timeout, +sc.expiryTime

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #4 -atomic.Value for expiryTime

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: fix race cond (without atomic.Value)

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #5 -unused funcs, fix comments, set ticks interval once

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #5 +txkill tests

Signed-off-by: Max Englander <max@planetscale.com>

* revert fmt changes

Signed-off-by: Max Englander <max@planetscale.com>

* implement pr review suggestion

Signed-off-by: Max Englander <max@planetscale.com>

Signed-off-by: Max Englander <max@planetscale.com>