STORAGE-4262: Automatically Recover From Failed PRS Operation #5

Closed
wants to merge 21 commits

Conversation


@ghost ghost commented Aug 17, 2020

In Vitess v4, a PlannedReparentShard operation could fail for a variety of reasons; a common one for us was long-running statements (which are single-statement transactions) causing the operation to time out. For example (from here):

running sudo -u vitess vtctl_etsy -alsologtostderr PlannedReparentShard -keyspace_shard=etsy_risk/- -new_master us_central1_c-1423301520 -wait_slave_timeout 6.0s

stderr: I0720 16:36:48.099867   18335 trace.go:151] successfully started tracing with [noop]
I0720 16:36:48.100554   18335 locks.go:359] Locking shard etsy_risk/- for action PlannedReparentShard(us_central1_c-1423301520, avoid_master=)
I0720 16:36:48.106144   18335 logutil.go:31] log: Connected to 10.248.1.152:2181
I0720 16:36:48.108640   18335 logutil.go:31] log: authenticated: id=0x3000008dd8a79f1, timeout=30s
I0720 16:36:48.108745   18335 zk_conn.go:336] zk conn: session for addr vttopo05.c.etsy-vitess-prod.internal:2181,vttopo04.c.etsy-vitess-prod.internal:2181,vttopo02.c.etsy-vitess-prod.internal:2181,vttopo01.c.etsy-vitess-prod.internal:2181,vttopo03.c.etsy-vitess-prod.internal:2181 event: {EventSession StateHasSession   10.248.1.152:2181}
I0720 16:36:48.120467   18335 logutil.go:31] log: Connected to 10.248.4.65:2181
I0720 16:36:48.122578   18335 logutil.go:31] log: authenticated: id=0x20000002ae4787e, timeout=30s
I0720 16:36:48.122658   18335 zk_conn.go:336] zk conn: session for addr vttopo05.c.etsy-vitess-prod.internal:2181,vttopo04.c.etsy-vitess-prod.internal:2181,vttopo02.c.etsy-vitess-prod.internal:2181,vttopo01.c.etsy-vitess-prod.internal:2181,vttopo03.c.etsy-vitess-prod.internal:2181 event: {EventSession StateHasSession   10.248.4.65:2181}
I0720 16:36:48.122692   18335 logutil.go:31] log: Connected to 10.248.4.72:2181
I0720 16:36:48.124254   18335 logutil.go:31] log: authenticated: id=0x400001638827961, timeout=30s
I0720 16:36:48.124311   18335 zk_conn.go:336] zk conn: session for addr vttopo05.c.etsy-vitess-prod.internal:2181,vttopo04.c.etsy-vitess-prod.internal:2181,vttopo02.c.etsy-vitess-prod.internal:2181,vttopo01.c.etsy-vitess-prod.internal:2181,vttopo03.c.etsy-vitess-prod.internal:2181 event: {EventSession StateHasSession   10.248.4.72:2181}
I0720 16:36:48.126227   18335 reparent.go:551] Checking replication on master-elect us_central1_c-1423301520
I0720 16:36:48.181597   18335 reparent.go:586] demote current master cell:"us_central1_a" uid:443280992 
I0720 16:37:18.182228   18335 locks.go:396] Unlocking shard etsy_risk/- for action PlannedReparentShard(us_central1_c-1423301520, avoid_master=) with error old master tablet us_central1_a-0443280992 DemoteMaster failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0720 16:37:18.189053   18335 vtctl.go:115] action failed: PlannedReparentShard old master tablet us_central1_a-0443280992 DemoteMaster failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

This left the shard in a broken state: both sides were read-only, and the tablet on which the demotion failed would not even serve read-only traffic (it returned an error about being in state NOT_SERVING).

I was able to reproduce this behavior using an Etsy Vitess Sandbox container (see details here).

I then backported some related fixes that landed after v4 (see vitessio/vitess#5376), specifically this code block, which undoes a failed DemoteMaster operation.
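
The recovery pattern being backported looks roughly like the following. This is a minimal sketch, not the actual wrangler code from vitessio/vitess#5376: the `tabletClient` interface and the `demoteWithRecovery` helper are hypothetical stand-ins for the real tablet-manager RPCs.

```go
package reparent

import (
	"context"
	"fmt"
	"time"
)

// tabletClient is a hypothetical stand-in for the tablet-manager RPC client
// used by PlannedReparentShard; the real code lives in Vitess' wrangler package.
type tabletClient interface {
	DemoteMaster(ctx context.Context, tablet string) error
	UndoDemoteMaster(ctx context.Context, tablet string) error
}

// demoteWithRecovery demotes the current master and, if the demotion fails,
// immediately tries to undo it so the shard is not left read-only on both
// sides. This mirrors the behavior backported in this PR.
func demoteWithRecovery(ctx context.Context, tmc tabletClient, oldMaster string) error {
	demoteCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	if err := tmc.DemoteMaster(demoteCtx, oldMaster); err != nil {
		// Best-effort undo with a fresh timeout: without this, a timed-out
		// DemoteMaster leaves the old master read-only and NOT_SERVING.
		undoCtx, undoCancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer undoCancel()
		if undoErr := tmc.UndoDemoteMaster(undoCtx, oldMaster); undoErr != nil {
			return fmt.Errorf("DemoteMaster failed (%v) and could not be undone: %v", err, undoErr)
		}
		return fmt.Errorf("DemoteMaster failed and was rolled back: %v", err)
	}
	return nil
}
```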

After that, I could no longer reproduce the problem where a failed DemoteMaster left things in a broken state (see here for details).

This improved behavior will allow us to safely use vtctl for failovers. The command-line client we create to orchestrate the larger operation around it can add additional safety mechanisms, such as examining the current replica lag and any long-running statements/transactions on the master, and skipping the failover attempt for that host unless an additional flag is passed (e.g. --force), thus avoiding the window where there is no RW instance in the A/B host pair. And with this new vtctl behavior, we are protected from broken states persisting when an edge case occurs (e.g. a new long-running statement arrives between our check and the PlannedReparentShard).
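
As a rough sketch of the kind of pre-flight check described above (the threshold, DSN handling, and helper name are hypothetical, and the real client might query the tablets through Vitess rather than MySQL directly):

```go
package preflight

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// longRunningStatements returns the number of statements on the master that
// have been executing longer than maxSeconds. The orchestration client would
// skip the failover for this host (unless --force is passed) when this is > 0.
// Hypothetical helper for illustration only.
func longRunningStatements(dsn string, maxSeconds int) (int, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return 0, err
	}
	defer db.Close()

	var count int
	err = db.QueryRow(
		`SELECT COUNT(*) FROM information_schema.processlist
		 WHERE command NOT IN ('Sleep', 'Binlog Dump') AND time > ?`,
		maxSeconds,
	).Scan(&count)
	if err != nil {
		return 0, fmt.Errorf("checking processlist: %v", err)
	}
	return count, nil
}
```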

morgo and others added 21 commits October 31, 2019 17:06
It fails after rebootstrapping docker image

Signed-off-by: Morgan Tocker <tocker@gmail.com>
Signed-off-by: Morgan Tocker <tocker@gmail.com>
Signed-off-by: Harshit Gangal <harshit.gangal@gmail.com>
Signed-off-by: Morgan Tocker <tocker@gmail.com>
…t-protection

Back port stronger root protection
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
Signed-off-by: Adam Saponara <as@php.net>
…dpoint (#3)

The `/debug/liveness` endpoint now returns a 503 when `/etc/etsy/depool` is present on the filesystem.

Note: the included "unit" test currently relies on `/etc/etsy` existing and being writable. Run it with `go test go/vt/servenv/*.go`

Signed-off-by: Mackenzie Starr <mstarr@etsy.com>
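
The behavior described in that commit can be sketched roughly as follows. This standalone version only illustrates the 503-on-depool-file idea; the real handler is registered inside go/vt/servenv, and the port below is arbitrary.

```go
package main

import (
	"net/http"
	"os"
)

// depoolFile is the sentinel file whose presence marks this host as depooled.
const depoolFile = "/etc/etsy/depool"

// livenessHandler mimics the backported /debug/liveness behavior: return 503
// when the depool file exists, 200 otherwise, so load balancers stop sending
// traffic to a depooled host.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	if _, err := os.Stat(depoolFile); err == nil {
		http.Error(w, "depooled", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/debug/liveness", livenessHandler)
	http.ListenAndServe(":8080", nil) // port chosen arbitrarily for this sketch
}
```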
… pool timeout to Vitess 4.x (#4)

This should prevent downstream clients from queueing indefinitely to acquire a connection from the stream pool, which we have seen exhaust downstream httpd workers in production.

When clients hit the stream pool timeout, the error message is:
> stream pool wait time exceeded: resource pool timed out

This is a 4.x-only patch and can be removed when we upgrade to a Vitess version >= 6.x

Signed-off-by: Mackenzie Starr <mstarr@etsy.com>
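
The stream-pool timeout behaves roughly like the bounded acquire below. Vitess' resource pool implementation differs; this toy pool only illustrates why waiters no longer queue indefinitely.

```go
package streampool

import (
	"context"
	"errors"
	"time"
)

// ErrPoolTimeout mirrors the error surfaced to clients:
// "stream pool wait time exceeded: resource pool timed out".
var ErrPoolTimeout = errors.New("stream pool wait time exceeded: resource pool timed out")

// Pool is a toy connection pool backed by a buffered channel of slots.
type Pool struct {
	slots   chan struct{}
	timeout time.Duration
}

// NewPool creates a pool with the given number of slots and wait timeout.
func NewPool(size int, timeout time.Duration) *Pool {
	p := &Pool{slots: make(chan struct{}, size), timeout: timeout}
	for i := 0; i < size; i++ {
		p.slots <- struct{}{}
	}
	return p
}

// Get blocks for at most p.timeout waiting for a free slot. Without the
// timeout, a saturated pool would make callers (e.g. httpd workers) queue
// indefinitely, which is the failure mode this patch prevents.
func (p *Pool) Get(ctx context.Context) error {
	timer := time.NewTimer(p.timeout)
	defer timer.Stop()
	select {
	case <-p.slots:
		return nil
	case <-timer.C:
		return ErrPoolTimeout
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Put returns a slot to the pool.
func (p *Pool) Put() { p.slots <- struct{}{} }
```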
Repository owner requested review from dasl- and mackenziestarr August 17, 2020 19:54
Repository owner added the backport label Aug 17, 2020
Repository owner self-assigned this Aug 17, 2020

ghost commented Aug 17, 2020

Gah, crap. I will close this and create another :-|. Have to see how best to create a PR against a non-master branch.

Repository owner closed this Aug 17, 2020
Repository owner deleted the STORAGE-4262 branch August 17, 2020 20:01
Repository owner restored the STORAGE-4262 branch August 17, 2020 20:01
Repository owner deleted the STORAGE-4262 branch August 18, 2020 19:22
jmchen28 pushed a commit that referenced this pull request Jun 13, 2023
* decouple olap tx timeout from oltp tx timeout

Since workload=olap bypasses the query timeouts
(--queryserver-config-query-timeout) and also row limits, the natural
assumption is that it also bypasses the transaction timeout.

This is not the case, e.g. for a tablet where the
--queryserver-config-transaction-timeout is 10.

This commit:

 * Adds new CLI flag and YAML field to independently configure TX
   timeouts for OLAP workloads (--queryserver-config-olap-transaction-timeout).
 * Decouples TX kill interval from OLTP TX timeout via new CLI flag and
   YAML field (--queryserver-config-transaction-killer-interval).

Signed-off-by: Max Englander <max@planetscale.com>
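
A minimal sketch of the decoupling described in that commit message: the transaction timeout is chosen per workload, and the killer interval is its own setting rather than being derived from the OLTP timeout. The names and types here are illustrative, not the actual tabletserver code.

```go
package tabletconfig

import "time"

// Config mirrors (in spirit) the three settings named above:
// --queryserver-config-transaction-timeout,
// --queryserver-config-olap-transaction-timeout, and
// --queryserver-config-transaction-killer-interval.
type Config struct {
	OltpTxTimeout    time.Duration
	OlapTxTimeout    time.Duration
	TxKillerInterval time.Duration
}

// TxTimeout returns the transaction timeout for the given workload. Before
// this change, OLAP transactions were subject to the OLTP timeout.
func (c Config) TxTimeout(workload string) time.Duration {
	if workload == "olap" {
		return c.OlapTxTimeout
	}
	return c.OltpTxTimeout
}

// KillerInterval returns how often the transaction killer wakes up; it is now
// configured independently instead of being tied to the OLTP timeout.
func (c Config) KillerInterval() time.Duration {
	return c.TxKillerInterval
}
```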

* decouple ol{a,t}p tx timeouts: pr comments #1

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #2 consolidate timeout logic in sc

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: remove unused tx killer flag

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: update 15_0_0_summary.md

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: fix race cond

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #3 -txProps.timeout, +sc.expiryTime

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #4 -atomic.Value for expiryTime

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: fix race cond (without atomic.Value)

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #5 -unused funcs, fix comments, set ticks interval once

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #5 +txkill tests

Signed-off-by: Max Englander <max@planetscale.com>

* revert fmt changes

Signed-off-by: Max Englander <max@planetscale.com>

* implement pr review suggestion

Signed-off-by: Max Englander <max@planetscale.com>

Signed-off-by: Max Englander <max@planetscale.com>