Bug#34240269: gr_clone_applier_stop is failing on PB2
Scenario: The applier is applying the data and the clone is stopping
the applier thread.
Issue:
The applier stop initiated by the clone makes the applier fail with the
following error:
Slave SQL for channel 'group_replication_applier': Coordinator thread of
multi-threaded slave is being stopped in the middle of assigning a group
of events; deferring to exit until the group completion ... ,
Error_code: MY-000000
The error causes the member to go into ERROR state.

Resolution:
When the applier stop is initiated by the clone, the applier errors should be
ignored.
Additionally, the test case has been corrected to test the clone failure.
Asserts have been added to make sure CLONE failed and the applier was OFF
during clone.
Asserts have also been added to make sure the recovery channel was started and
the applier was restarted after the clone failure.
Additionally, the DEBUG scope has been reduced and INC files have been used
wherever required.

Change-Id: Id99da177a7d3ef34d4be660b0eb65a4a61fa741c
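
The resolution above comes down to a boolean guard on the error path. The following is a minimal standalone sketch of that idea, not the plugin code: `ApplierModuleSketch` and `kApplierChannel` are hypothetical names, and the real check also requires the applier thread to be alive, which is left out here.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical constant; the plugin uses its own channel-name variable.
static const char kApplierChannel[] = "group_replication_applier";

// Hypothetical, simplified stand-in for the plugin's Applier_module.
class ApplierModuleSketch {
 public:
  // Mirrors the setter added by this commit.
  void ignore_errors_during_stop(bool ignore_errors) {
    m_ignore_applier_errors_during_stop = ignore_errors;
  }

  // Returns true when the stop is recorded as an applier error.
  // An aborted stop of the applier channel counts as an error only
  // while the ignore flag is not set.
  bool inform_of_applier_stop(const char *channel_name, bool aborted) {
    if (!std::strcmp(channel_name, kApplierChannel) && aborted &&
        !m_ignore_applier_errors_during_stop) {
      applier_error = 1;
      return true;
    }
    return false;
  }

  int applier_error{0};

 private:
  bool m_ignore_applier_errors_during_stop{false};
};
```

With the flag set, an aborted stop of the applier channel leaves `applier_error` untouched, so the member would not be moved to ERROR state on that path.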
Jaideep Karande committed Oct 20, 2022
1 parent 955cdac commit dc831c9
Showing 5 changed files with 97 additions and 29 deletions.
30 changes: 20 additions & 10 deletions mysql-test/suite/group_replication/r/gr_clone_applier_stop.result
@@ -30,16 +30,16 @@ SET @@GLOBAL.DEBUG='+d,gr_clone_before_applier_stop';
START GROUP_REPLICATION;;
[connection server_2]
SET DEBUG_SYNC = 'now SIGNAL applier_stopped';
SET @@GLOBAL.DEBUG='-d,gr_clone_before_applier_stop';
[connection server2]
SET @@GLOBAL.DEBUG= '-d,force_sql_thread_error';
SET DEBUG_SYNC= 'RESET';
include/assert.inc [Clone must not start.]

# 4. Reset debug points for applier failures.
# Restart GR on M2.
# Assert clone starts and group_replication_applier SQL thread is OFF.

SET @@GLOBAL.DEBUG= '-d,force_sql_thread_error';
SET @@GLOBAL.DEBUG='-d,gr_clone_before_applier_stop';
SET DEBUG_SYNC= 'RESET';
SET @@GLOBAL.DEBUG='+d,gr_clone_wait';
START GROUP_REPLICATION;
SET DEBUG_SYNC = 'now WAIT_FOR gr_clone_paused';
@@ -70,30 +70,39 @@ SET @@GLOBAL.DEBUG='+d,force_sql_thread_error';
SET DEBUG_SYNC = "now SIGNAL resume_applier_read";
include/gr_wait_for_member_state.inc
SET @@GLOBAL.DEBUG='-d,force_sql_thread_error';
STOP GROUP_REPLICATION;
include/stop_group_replication.inc
[connection server1]
INSERT INTO t1 values (5);
INSERT INTO t1 values (6);

# 7. Block applier and start GR on M2.
# Unblock applier when clone is started.
# Assert pending transactions are applied when applier is restarted.
# 7. Start GR on M2.
# Clone will fail and incremental recovery will start.
# Applier will be OFF till clone failure is detected.

[connection server1]
SET @@GLOBAL.DEBUG='+d,block_applier_updates';
[connection server2]
SET @@GLOBAL.DEBUG='+d,gr_run_clone_query_fail_once';
SET GLOBAL group_replication_clone_threshold= 1;
SET @@GLOBAL.DEBUG='+d,block_applier_updates';
START GROUP_REPLICATION;
SET DEBUG_SYNC = 'now WAIT_FOR signal.run_clone_query_waiting';
SET @@GLOBAL.DEBUG='-d,gr_run_clone_query_fail_once';
include/assert.inc ["Clone is executing"]
include/assert.inc [group_replication_applier SQL Thread will be OFF.]
SET DEBUG_SYNC = 'now SIGNAL signal.run_clone_query_continue';
include/assert.inc [group_replication_applier SQL Thread will be ON.]
[connection server1]
SET DEBUG_SYNC = "now WAIT_FOR applier_read_blocked";
SET @@GLOBAL.DEBUG='-d,block_applier_updates';
SET DEBUG_SYNC = "now SIGNAL resume_applier_read";
[connection server2]
include/gr_wait_for_member_state.inc
SET DEBUG_SYNC= 'RESET';
include/diff_tables.inc [server1:test.t1, server2:test.t1]

# 8. Cleanup.

DROP TABLE t1;
[connection server1]
DROP TABLE t1;
set session sql_log_bin=0;
call mtr.add_suppression("Timeout while waiting for the group communication engine to exit!");
call mtr.add_suppression("The member has failed to gracefully leave the group.");
@@ -117,6 +126,7 @@ call mtr.add_suppression("Due to some issue on the previous step distributed rec
call mtr.add_suppression("Timeout while waiting for the group communication engine to be ready!");
call mtr.add_suppression("The group communication engine is not ready for the member to join. .*");
call mtr.add_suppression("The member was unable to join the group.*");
call mtr.add_suppression("There was an issue when configuring the remote cloning process: The plugin was not able to stop the group_replication_applier channel.");
set session sql_log_bin=1;
RESET PERSIST group_replication_group_name;
RESET PERSIST group_replication_local_address;
67 changes: 49 additions & 18 deletions mysql-test/suite/group_replication/t/gr_clone_applier_stop.test
@@ -2,8 +2,7 @@
# This test confirms that:
# 1: On applier failure clone does not start.
# 2: During clone group_replication_applier SQL thread is OFF.
# 3: If clone fails group_replication_applier is restarted and pending
# transactions(relay logs) are applied.
# 3: If clone fails group_replication_applier is restarted.
#
# Test:
# 0. The test requires two servers: M1 and M2
@@ -20,8 +19,9 @@
# Create some transactions on M1 to create applier backlog on M2.
# 6. Stop GR on M2 without committing the received transactions.
# Create transactions on M1 for M2 to clone.
# 7. Start GR on M2 and unblock applier when clone is started.
# Assert pending transactions are applied when applier is restarted.
# 7. Start GR on M2.
# Clone will fail and incremental recovery will start.
# Applier will be OFF till clone failure is detected.
# 8. Cleanup.
################################################################################

@@ -98,6 +98,7 @@ SET @@GLOBAL.DEBUG='+d,gr_clone_before_applier_stop';
--source include/wait_condition_or_abort.inc

SET DEBUG_SYNC = 'now SIGNAL applier_stopped';
SET @@GLOBAL.DEBUG='-d,gr_clone_before_applier_stop';

--let $rpl_connection_name= server2
--source include/rpl_connection.inc
@@ -115,6 +116,9 @@ SET DEBUG_SYNC = 'now SIGNAL applier_stopped';
--error 0, ER_GROUP_REPLICATION_CONFIGURATION, ER_GROUP_REPLICATION_APPLIER_INIT_ERROR
--reap

SET @@GLOBAL.DEBUG= '-d,force_sql_thread_error';
SET DEBUG_SYNC= 'RESET';

--let $assert_text= Clone must not start.
--let $assert_cond= [SELECT COUNT(*) FROM performance_schema.clone_status] = 0;
--source include/assert.inc
@@ -125,9 +129,6 @@ SET DEBUG_SYNC = 'now SIGNAL applier_stopped';
--echo # Assert clone starts and group_replication_applier SQL thread is OFF.
--echo

SET @@GLOBAL.DEBUG= '-d,force_sql_thread_error';
SET @@GLOBAL.DEBUG='-d,gr_clone_before_applier_stop';
SET DEBUG_SYNC= 'RESET';
SET @@GLOBAL.DEBUG='+d,gr_clone_wait';

START GROUP_REPLICATION;
@@ -186,7 +187,7 @@ SET DEBUG_SYNC = "now SIGNAL resume_applier_read";
--source include/gr_wait_for_member_state.inc
SET @@GLOBAL.DEBUG='-d,force_sql_thread_error';

STOP GROUP_REPLICATION;
--source include/stop_group_replication.inc

--let $rpl_connection_name= server1
--source include/rpl_connection.inc
@@ -195,41 +196,70 @@ INSERT INTO t1 values (5);
INSERT INTO t1 values (6);

--echo
--echo # 7. Block applier and start GR on M2.
--echo # Unblock applier when clone is started.
--echo # Assert pending transactions are applied when applier is restarted.
--echo # 7. Start GR on M2.
--echo # Clone will fail and incremental recovery will start.
--echo # Applier will be OFF till clone failure is detected.
--echo

--let $rpl_connection_name= server1
--source include/rpl_connection.inc
SET @@GLOBAL.DEBUG='+d,block_applier_updates';

--let $rpl_connection_name= server2
--source include/rpl_connection.inc

SET @@GLOBAL.DEBUG='+d,gr_run_clone_query_fail_once';
SET GLOBAL group_replication_clone_threshold= 1;
SET @@GLOBAL.DEBUG='+d,block_applier_updates';
START GROUP_REPLICATION;

SET DEBUG_SYNC = "now WAIT_FOR applier_read_blocked";
SET DEBUG_SYNC = 'now WAIT_FOR signal.run_clone_query_waiting';
SET @@GLOBAL.DEBUG='-d,gr_run_clone_query_fail_once';

# Clone is executing
--let $assert_text= "Clone is executing"
--let $assert_cond= [SELECT COUNT(*) FROM performance_schema.events_stages_current WHERE event_name LIKE "%stage/group_rpl/Group Replication Cloning%"] = 1
--source include/assert.inc

--let $assert_text= group_replication_applier SQL Thread will be OFF.
--let $assert_cond= [SELECT COUNT(*) as count FROM performance_schema.replication_applier_status WHERE CHANNEL_NAME="group_replication_applier" AND SERVICE_STATE = "OFF",count, 1] = 1
--source include/assert.inc

SET DEBUG_SYNC = 'now SIGNAL signal.run_clone_query_continue';

--let $wait_condition= SELECT COUNT(*)=1 FROM performance_schema.threads WHERE PROCESSLIST_STATE="Group Replication Cloning process: Preparing";
# Clone will fail and will start channel group_replication_recovery
--let $wait_condition=SELECT COUNT(*)=1 FROM performance_schema.replication_connection_status WHERE CHANNEL_NAME="group_replication_recovery" AND SERVICE_STATE='ON'
--source include/wait_condition.inc

--let $assert_text= group_replication_applier SQL Thread will be ON.
--let $assert_cond= [SELECT COUNT(*) as count FROM performance_schema.replication_applier_status WHERE CHANNEL_NAME="group_replication_applier" AND SERVICE_STATE = "ON",count, 1] = 1
--source include/assert.inc

# Allow the recovery to continue
--let $rpl_connection_name= server1
--source include/rpl_connection.inc
SET DEBUG_SYNC = "now WAIT_FOR applier_read_blocked";
SET @@GLOBAL.DEBUG='-d,block_applier_updates';
SET DEBUG_SYNC = "now SIGNAL resume_applier_read";

--let $rpl_connection_name= server2
--source include/rpl_connection.inc

--let $group_replication_member_state=ONLINE
--source include/gr_wait_for_member_state.inc

SET DEBUG_SYNC= 'RESET';

--let $diff_tables= server1:test.t1, server2:test.t1
--source include/diff_tables.inc


--echo
--echo # 8. Cleanup.
--echo

DROP TABLE t1;

--let $rpl_connection_name= server1
--source include/rpl_connection.inc

DROP TABLE t1;

set session sql_log_bin=0;
call mtr.add_suppression("Timeout while waiting for the group communication engine to exit!");
call mtr.add_suppression("The member has failed to gracefully leave the group.");
@@ -255,6 +285,7 @@ call mtr.add_suppression("Due to some issue on the previous step distributed rec
call mtr.add_suppression("Timeout while waiting for the group communication engine to be ready!");
call mtr.add_suppression("The group communication engine is not ready for the member to join. .*");
call mtr.add_suppression("The member was unable to join the group.*");
call mtr.add_suppression("There was an issue when configuring the remote cloning process: The plugin was not able to stop the group_replication_applier channel.");
set session sql_log_bin=1;

RESET PERSIST group_replication_group_name;
14 changes: 14 additions & 0 deletions plugin/group_replication/include/applier.h
@@ -454,6 +454,18 @@ class Applier_module : public Applier_module_interface {
*/
void inform_of_applier_stop(char *channel_name, bool aborted);

/**
Check whether to ignore applier errors during stop or not.
Errors put the members into ERROR state.
If errors are ignored member will stay in ONLINE state.
During clone, applier errors are ignored, since data will come from clone.
@param[in] ignore_errors if true ignore applier errors during stop
*/
void ignore_errors_during_stop(bool ignore_errors) {
m_ignore_applier_errors_during_stop = ignore_errors;
}

// Packet based interface methods

/**
@@ -873,6 +885,8 @@ class Applier_module : public Applier_module_interface {
int applier_error;
/* Applier killed status */
bool applier_killed_status;
/* Ignore applier errors during stop. */
bool m_ignore_applier_errors_during_stop{false};

// condition and lock used to suspend/awake the applier module
/* The lock for suspending/wait for the awake of the applier module */
13 changes: 12 additions & 1 deletion plugin/group_replication/src/applier.cc
@@ -783,10 +783,21 @@ int Applier_module::terminate_applier_thread() {
void Applier_module::inform_of_applier_stop(char *channel_name, bool aborted) {
DBUG_TRACE;

/*
This function is called when async replication applier thread is stopped.
The stop of async replication applier thread is not an issue, however when
async replication applier thread stops because of some errors, GR applier
pipeline is also stopped and member goes in the ERROR state.
The function parameter 'aborted' informs about the async replication
applier thread errors.
When the async replication applier thread stop is initiated by Clone GR
(m_ignore_applier_errors_during_stop=true), GR applier pipeline should
ignore async replication applier thread errors.
*/
if (!strcmp(channel_name, applier_module_channel_name) && aborted &&
!m_ignore_applier_errors_during_stop &&
applier_thd_state.is_thread_alive()) {
LogPluginErr(ERROR_LEVEL, ER_GRP_RPL_APPLIER_THD_EXECUTION_ABORTED);

applier_error = 1;

// before waiting for termination, signal the queue to unlock.
@@ -763,6 +763,7 @@ void Remote_clone_handler::gr_clone_debug_point() {
#endif /* NDEBUG */
// Ignore any channel stop error and confirm channel is stopped or not.
// Since we will clone next.
applier_module->ignore_errors_during_stop(true);
applier_channel.stop_threads(false, true);
if (applier_channel.is_applier_thread_running()) {
/* purecov: begin inspected */
@@ -872,6 +873,7 @@ void Remote_clone_handler::gr_clone_debug_point() {
thd_end:

declare_plugin_cloning(false);
applier_module->ignore_errors_during_stop(false);

if (error && !m_being_terminated) {
fallback_to_recovery_or_leave(critical_error);
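
The clone handler hunks above set the ignore flag before stopping the applier channel and clear it again at `thd_end`. One way to keep such a set/reset pair balanced on every exit path is a scope guard. The sketch below is purely illustrative and is not part of this commit: `ApplierFlagSketch` and `IgnoreApplierErrorsGuard` are hypothetical names, and whether a guard fits the handler's label-based cleanup is not something the diff shows.

```cpp
#include <cassert>

// Hypothetical minimal interface standing in for Applier_module.
struct ApplierFlagSketch {
  bool ignore{false};
  void ignore_errors_during_stop(bool ignore_errors) { ignore = ignore_errors; }
};

// Hypothetical scope guard: sets the flag on construction and
// clears it on destruction, so the reset cannot be skipped.
class IgnoreApplierErrorsGuard {
 public:
  explicit IgnoreApplierErrorsGuard(ApplierFlagSketch &applier)
      : m_applier(applier) {
    m_applier.ignore_errors_during_stop(true);
  }
  ~IgnoreApplierErrorsGuard() { m_applier.ignore_errors_during_stop(false); }

  // Non-copyable: exactly one reset per guard.
  IgnoreApplierErrorsGuard(const IgnoreApplierErrorsGuard &) = delete;
  IgnoreApplierErrorsGuard &operator=(const IgnoreApplierErrorsGuard &) = delete;

 private:
  ApplierFlagSketch &m_applier;
};
```

The commit itself uses two explicit calls, which keeps the change small and matches the surrounding code; the guard only illustrates the invariant those two calls maintain.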