graceful shutdown failed under load #218

Open
@temeo

First seen after the 5.6.27 merge. The node was shut down under sysbench load. According to the log, it left the group gracefully:

2015-10-26 11:23:26 7158 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2015-10-26 11:23:26 7158 [Note] WSREP: evs::proto(5111b753, LEAVING, view_id(REG,5111b753,105)): delivering view view((empty))
2015-10-26 11:23:26 7158 [Note] WSREP: view((empty))
2015-10-26 11:23:26 7158 [Note] WSREP: gcomm: closed
2015-10-26 11:23:26 7158 [Note] WSREP: Flow-control interval: [16, 16]
2015-10-26 11:23:26 7158 [Note] WSREP: Received NON-PRIMARY.
2015-10-26 11:23:26 7158 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 2247843)
2015-10-26 11:23:26 7158 [Note] WSREP: Received self-leave message.
2015-10-26 11:23:26 7158 [Note] WSREP: Flow-control interval: [0, 0]
2015-10-26 11:23:26 7158 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2015-10-26 11:23:26 7158 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 2247843)
2015-10-26 11:23:26 7158 [Note] WSREP: RECV thread exiting 0: Success
2015-10-26 11:23:26 7158 [Note] WSREP: recv_thread() joined.
2015-10-26 11:23:26 7158 [Note] WSREP: Closing replication queue.
2015-10-26 11:23:26 7158 [Note] WSREP: Closing slave action queue.
2015-10-26 11:23:26 7158 [Note] WSREP: New cluster view: global state: 026ce007-7bd1-11e5-a121-c7e1faf6ae05:2247843, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2015-10-26 11:23:26 7158 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-10-26 11:23:26 7158 [Note] WSREP: New cluster view: global state: 026ce007-7bd1-11e5-a121-c7e1faf6ae05:2247843, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3
2015-10-26 11:23:26 7158 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-10-26 11:23:26 7158 [Note] WSREP: applier thread exiting (code:0)
2015-10-26 11:23:28 7158 [Note] WSREP: killing local connection: 28
2015-10-26 11:23:28 7158 [Note] WSREP: killing local connection: 27
2015-10-26 11:23:28 7158 [Note] WSREP: killing local connection: 25
2015-10-26 11:23:28 7158 [Note] WSREP: killing local connection: 34
2015-10-26 11:23:28 7158 [Note] WSREP: killing local connection: 35
2015-10-26 11:23:28 7158 [Note] WSREP: killing local connection: 36

However, the process is stuck waiting for client connections to close. Relevant threads:

Thread 4 (Thread 0x7f71845db700 (LWP 7488)):
#0  0x00007f7192c2f0d1 in do_sigwait (sig=0x7f71845dada0, set=<optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:60
#1  __sigwait (set=0x7f71845dae00, sig=0x7f71845dada0)
    at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:97
#2  0x00000000005a9183 in signal_hand ()
#3  0x00007f7192c27182 in start_thread (arg=0x7f71845db700)
    at pthread_create.c:312
#4  0x00007f719213447d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7f7184518700 (LWP 7622)):
#0  0x00007f719212712d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000b66bf8 in vio_io_wait ()
#2  0x0000000000b66c68 in vio_socket_io_wait ()
#3  0x0000000000b66d36 in vio_read ()
#4  0x000000000067671f in net_read_raw_loop(st_net*, unsigned long) ()
#5  0x00000000006769ce in net_read_packet(st_net*, unsigned long*) ()
#6  0x000000000067777d in my_net_read ()
#7  0x0000000000703d8d in do_command(THD*) ()
#8  0x00000000006ce9e2 in do_handle_one_connection(THD*) ()
#9  0x00000000006ceb77 in handle_one_connection ()
#10 0x00007f7192c27182 in start_thread (arg=0x7f7184518700)
    at pthread_create.c:312
#11 0x00007f719213447d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f7184310700 (LWP 7739)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00000000005abb2f in wsrep_close_client_connections(char) ()
#2  0x00000000005bc45b in wsrep_stop_replication(THD*) ()
#3  0x00000000005b452f in kill_server(void*) [clone .constprop.215] ()
#4  0x00000000005b481e in kill_server_thread ()
#5  0x00007f7192c27182 in start_thread (arg=0x7f7184310700)
    at pthread_create.c:312
#6  0x00007f719213447d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
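
For readers less familiar with this kind of hang, here is a minimal Python sketch of the general pattern the backtraces suggest: one thread waits on a condition variable for the connection count to reach zero while a connection thread is still blocked in a socket read. This is an illustration under assumptions, not the server's code. The names `demo`, `active`, and `connection_thread` are hypothetical, and a local socket pair stands in for a client connection.

```python
import socket
import threading


def demo(wait_timeout=0.5):
    """Return (done_before_unblock, done_after_unblock)."""
    server_side, client_side = socket.socketpair()
    active = 1  # hypothetical count of live connection threads
    cond = threading.Condition()

    def connection_thread():
        nonlocal active
        # Blocks until data arrives or the socket is shut down,
        # similar to a thread sitting in a blocking network read.
        server_side.recv(1)
        with cond:
            active -= 1
            cond.notify_all()

    worker = threading.Thread(target=connection_thread, daemon=True)
    worker.start()

    # Shutdown side: wait for every connection thread to finish.
    # A timeout is used here only so the sketch cannot hang forever.
    with cond:
        done_before = cond.wait_for(lambda: active == 0, timeout=wait_timeout)

    # Shutting down the socket wakes the blocked reader.
    server_side.shutdown(socket.SHUT_RDWR)
    with cond:
        done_after = cond.wait_for(lambda: active == 0, timeout=5)

    worker.join(timeout=5)
    server_side.close()
    client_side.close()
    return done_before, done_after
```

In this sketch the first wait times out because nothing wakes the reader, and the second succeeds once the socket is shut down. Whether the same mechanism is what keeps Thread 3 in `poll()` here is not established by the backtraces alone.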

netstat shows three connections in the CLOSE_WAIT state:

tcp6       1      0 172.17.0.76:3303        172.17.0.77:52714       CLOSE_WAIT 
tcp6       1      0 172.17.0.76:3303        172.17.0.77:52717       CLOSE_WAIT 
tcp6       1      0 172.17.0.76:3303        172.17.0.77:52709       CLOSE_WAIT 
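
As background on what CLOSE_WAIT means: the remote peer has already closed its end, and the local process has not yet closed its own socket. The Python sketch below reproduces that situation on loopback. It is only an illustration of the TCP state, not of the server's connection handling, and the function name `peer_close_seen_as_eof` is hypothetical.

```python
import socket


def peer_close_seen_as_eof():
    """Return what the accepting side reads after the peer has closed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()

    # The peer closes first. From here until conn.close() is called,
    # the accepted socket sits in CLOSE_WAIT on the local side.
    client.close()

    conn.settimeout(5)
    data = conn.recv(1)  # an empty result signals end of stream

    conn.close()  # only this call lets the socket leave CLOSE_WAIT
    listener.close()
    return data
```

A socket that stays in CLOSE_WAIT therefore indicates that the local process still holds the descriptor open after the client has gone away.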
