PadoGrid | Catalogs | Manual | FAQ | Releases | Templates | Pods | Kubernetes | Docker | Apps | Quick Start
This bundle provides scripts, configuration files, and apps for creating a Geode/GemFire network split-brain environment where you can test Geode/GemFire split-brain capabilities. Although it focuses on split-brain issues, the provided scripts also cover general system failures.
- Installing Bundle
- Use Case
- Required Software
- Bundle Contents
- Installation Steps
- Startup Sequence
- Network Partition
- Test Cases
- Test Results
- GemFire Network Partition
- Teardown
- References
install_bundle -download -workspace bundle-geode-1-app-perf_test_sb-cluster-sb
To prepare you for cluster split-brain situations, this use case provides step-by-step instructions for creating and diagnosing six (6) types of network partition covering all possibilities.
- Type 0: Single data node isolated, locators unreachable.
- Type 1: Data nodes isolated, locators unreachable.
- Type 2: Locators divided.
- Type 3: Quorum without locators.
- Type 4: Locators isolated, data nodes unreachable.
- Type 5: Data nodes isolated, locators reachable.
The following diagram shows Network Partition Type 2.
- Vagrant (1)
- VirtualBox (1)
- Geode or GemFire (2)
- Linux JDK (2)
- This bundle uses PadoGrid pods which require Vagrant and VirtualBox. If you have not installed them, then please download and install them now by following the links above. For details on PadoGrid pods, see Understanding PadoGrid Pods.
- We need Geode or GemFire and JDK for Linux in the VirtualBox VMs. We will install them later.
apps
└── perf_test_sb
clusters
└── sb
This bundle includes the following components.
- Cluster sb. The sb cluster is configured with five (5) VM members running in the pod_sb pod. It includes scripts that use iptables to drop TCP packets to split the sb cluster into two (2). It also includes scripts that scrape log files to display condensed timelines of key events.
- App perf_test_sb. The perf_test_sb app is configured to run on a split cluster.

Note that the sb cluster is configured to run in the pod_sb pod with its members running as VM hosts and not Vagrant pod hosts.
We will be taking the following steps as we set up and run the environment.
- Install Linux products
- Install Geode or GemFire on host OS
- Create pod
- Build pod
Follow the instructions in the subsequent sections.
By default, PadoGrid installs all the products in the ~/Padogrid/products
directory. If you have not installed Geode or GemFire in that directory, then do so now by following the instructions in one of the following subsections.
# Geode: Install Geode on Host OS if not already installed
install_padogrid -product geode
update_product -product geode
Download GemFire into the ~/Downloads
directory and execute the following.
# GemFire: Install GemFire on Host OS if not already installed
tar -C ~/Padogrid/products -xzf ~/Downloads/pivotal-gemfire-9.10.16.tgz
update_product -product gemfire
We also need Geode/GemFire and Java installed for Linux before we can set up the Vagrant VMs. Download a Linux JDK tarball from the following link.
If your host OS is Linux, then install the downloaded JDK in the ~/Padogrid/products
directory as follows.
tar -C ~/Padogrid/products -xzf ~/Downloads/jdk-8u333-linux-x64.tar.gz
If your host OS is NOT Linux, then install the downloaded JDK and Geode/GemFire in the ~/Padogrid/products/linux
directory as follows.
mkdir ~/Padogrid/products/linux
# JDK
tar -C ~/Padogrid/products/linux -xzf ~/Downloads/jdk-8u333-linux-x64.tar.gz
# Geode - 'install_padogrid' previously downloaded the Geode tarball in ~/Padogrid/downloads/
tar -C ~/Padogrid/products/linux -xzf ~/Padogrid/downloads/apache-geode-1.14.4.tgz
# GemFire
tar -C ~/Padogrid/products/linux -xzf ~/Downloads/pivotal-gemfire-9.10.16.tgz
Inflating the tarballs creates the following directories.
Software for VMs | Host OS Path |
---|---|
JDK | ~/Padogrid/products/linux/jdk-8u333-linux-x64 |
Geode | ~/Padogrid/products/linux/apache-geode-1.14.4 |
GemFire | ~/Padogrid/products/linux/pivotal-gemfire-9.10.16 |
First, set the products directory path required by create_pod
.
# If your host OS is Linux and you have installed JDK in the `~/Padogrid/products`:
LINUX_PRODUCTS_DIR="$HOME/Padogrid/products"
# If your host OS is NOT Linux and you have installed JDK in the `~/Padogrid/products/linux`:
LINUX_PRODUCTS_DIR="$HOME/Padogrid/products/linux"
Create a pod named pod_sb
with seven (7) nodes. The pod name must be pod_sb
since the included cluster, sb
, has been paired with that pod name. Each VM should have at least 1024 MiB of memory.
# Directory specified by '-dir' is the host OS directory where the JDK and IMDG are installed.
# Configure 1536 MiB for primary and data nodes. Enable Avahi to allow hostname lookup of
# *.local hostnames via mDNS, i.e., pnode.local, node-01.local, etc.
create_pod -quiet \
-ip 192.168.56.10 \
-avahi \
-pod pod_sb \
-pm 1536 \
-nm 1536 \
-count 7 \
-dir $LINUX_PRODUCTS_DIR
Build the pod you just created.
# Build and start the pod
build_pod -pod pod_sb
The sb
cluster has been preconfigured as follows. Please see etc/cluster.properties for details.
Property | Value |
---|---|
pod.name | pod_sb |
locator.heap.max | 512m |
heap.max | 512m |
vm.enabled | true |
vm.locator.hosts | node-06.local,node-07.local |
vm.hosts | node-01.local,node-02.local,node-03.local,node-04.local,node-05.local |
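For reference, the corresponding entries in etc/cluster.properties look like the following sketch based on the table above (the actual file contains additional properties).

pod.name=pod_sb
locator.heap.max=512m
heap.max=512m
vm.enabled=true
vm.locator.hosts=node-06.local,node-07.local
vm.hosts=node-01.local,node-02.local,node-03.local,node-04.local,node-05.local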
We have allocated seven (7) nodes: 2 for locators and 5 for data nodes. The primary node, pnode.local
, is reserved for running scripts to split the network.
Login to pnode.local
and run install_ntpd
as follows.
# Login to pnode.local
ssh vagrant@pnode.local
# Install ntpd
switch_cluster sb/bin_sh
./install_ntpd
Due to Vagrant security restrictions, we need to edit each node's /etc/hosts
and add a respective host name entry. We can do this manually one file at a time or run the included update_etc_hosts
to automatically update each node's /etc/hosts
file.
From pnode.local
, run update_etc_hosts
as follows.
cd_cluster sb/bin_sh
./update_etc_hosts
This bundle also includes the list_etc_hosts
script which lists all the VM host names defined in the etc/cluster.properties
. If the update_etc_hosts
script fails to update /etc/hosts
then you can run this script to list the node entries and copy/paste them into each node's /etc/hosts
file. Make sure to comment out the 127.0.1.1
entry from each node's /etc/hosts
.
From pnode.local
, run list_etc_hosts
as follows.
cd_cluster sb/bin_sh
./list_etc_hosts
Output:
192.168.56.10 pnode pnode.local
192.168.56.11 node-01 node-01.local
192.168.56.12 node-02 node-02.local
192.168.56.13 node-03 node-03.local
192.168.56.14 node-04 node-04.local
192.168.56.15 node-05 node-05.local
192.168.56.16 node-06 node-06.local
192.168.56.17 node-07 node-07.local
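For reference, after the update each node's /etc/hosts should resemble the following sketch, with the node's own 127.0.1.1 entry commented out and the cluster entries appended (shown here for node-01).

127.0.0.1   localhost
#127.0.1.1  node-01

192.168.56.10 pnode pnode.local
192.168.56.11 node-01 node-01.local
192.168.56.12 node-02 node-02.local
...
192.168.56.17 node-07 node-07.local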
From pnode.local
, start the sb
cluster as follows:
# To make sure node-01 gets selected as the lead member, run 'start_cluster_lead_1' in 'bin_sh/'
switch_cluster sb/bin_sh
./start_cluster_lead_1
# Verify 2 locators and 5 members running
show_cluster
Enter the following URL in your browser to monitor the cluster from the Geode Pulse.
http://node-06.local:7070/pulse
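You can also connect gfsh to the cluster through a locator's HTTP management endpoint. The same connect command is used in the test cases later in this bundle.

gfsh
gfsh>connect --url=http://node-07.local:7070/geode-mgmt/v1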
From your host OS, build perf_test_sb
and run test_group
as follows:
cd_app perf_test_sb/bin_sh
./test_group -run
To check the region entry counts, run test_group
as follows.
./test_group -list
Output:
...
group: g1
- name: replicate_persistent_overflow
data: Region
size: 10000
cleared: false
- name: partition_persistent
data: Region
size: 10000
cleared: false
- name: partition
data: Region
size: 10000
cleared: false
- name: replicate
data: Region
size: 10000
cleared: false
- name: replicate_persistent
data: Region
size: 10000
cleared: false
- name: partition_overflow
data: Region
size: 10000
cleared: false
- name: partition_persistent_overflow
data: Region
size: 10000
cleared: false
- name: replicate_overflow
data: Region
size: 10000
cleared: false
...
We closely follow the guidelines in the GemFire documentation on Network Partition.
- How Network Partitioning Management Works
- Failure Detection and Membership Views
- Membership Coordinators, Lead Members and Member Weighting
- Log Messages and Solutions
By default, individual members are assigned weights as shown below. The membership coordinator determines quorum by summing up the weights. Please see Membership Coordinators, Lead Members and Member Weighting for details.
Member Role | Weight |
---|---|
lead | 15 |
member | 10 |
locator | 3 |
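For example, the sb cluster in this bundle has one lead member, four ordinary members, and two locators, so its total weight is 61. The split percentages in the tables below are computed against this total; roughly, the side retaining the majority of the total weight keeps the quorum. A quick sanity check in bash:

# Total membership weight of the sb cluster:
#   1 lead (15) + 4 members (4 x 10) + 2 locators (2 x 3) = 61
echo "Total weight: $((15 + 4*10 + 2*3))"
# Prints: Total weight: 61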
We carry out six (6) network partition scenarios using the included scripts. We refer to each scenario by type so that we can quickly identify them. For each type, we split the network into two (2) and refer to the halves as Split A and Split B.
Type 0: Split A with one (1) node
Split | Weight | VM Hosts |
---|---|---|
A | 15 (25%) | node-01 (lead) |
B | 46 (75%) | node-02, node-03, node-04, node-05, node-06, node-07 (locators) |
Type 1: Split A without locators
Split | Weight | VM Hosts |
---|---|---|
A | 25 (41%) | node-01 (lead), node-02 |
B | 36 (59%) | node-03, node-04, node-05, node-06, node-07 (locators) |
Type 2: Split A and B each with a locator
Split | Weight | VM Hosts |
---|---|---|
A | 28 (46%) | node-01 (lead), node-02, node-06 (locator) |
B | 33 (54%) | node-03, node-04, node-05, node-07 (locator) |
Type 3: Split B quorum without locators
Split | Weight | VM Hosts |
---|---|---|
A | 26 (43%) | node-01, node-02, node-06, node-07 (locators) |
B | 35 (57%) | node-03 (lead), node-04, node-05 |
Type 4: Split A with locators but without data nodes
Split | Weight | VM Hosts |
---|---|---|
A | 6 (10%) | node-06, node-07 (locators) |
B | 55 (90%) | node-01 (lead), node-02, node-03, node-04, node-05 |
Type 5: Split A and B with both locators
Split | Weight | VM Hosts |
---|---|---|
A | 31 (51%) | node-01 (lead), node-02, node-06, node-07 (locators) |
B | 36 (59%) | node-03, node-04, node-05, node-06, node-07 (locators) |
Before we split the network, we ingest data into eight (8) regions by running test_group
as shown below.
Region | Data Policy (refid) |
---|---|
partition | PARTITION_REDUNDANT |
partition_overflow | PARTITION_REDUNDANT_OVERFLOW |
partition_persistent | PARTITION_REDUNDANT_PERSISTENT |
partition_persistent_overflow | PARTITION_REDUNDANT_PERSISTENT_OVERFLOW |
replicate | REPLICATE |
replicate_overflow | REPLICATE_OVERFLOW |
replicate_persistent | REPLICATE_PERSISTENT |
replicate_persistent_overflow | REPLICATE_PERSISTENT_OVERFLOW |
We expect persistent regions to recover data and non-persistent regions to lose data upon network recovery.
All test cases must be conducted on pnode.local
. Login to pnode.local
and switch cluster as follows.
# Login to pnode.local
ssh vagrant@pnode.local
# Switch to sb/bin_sh which contains network partitioning scripts
switch_cluster sb/bin_sh
The sb/bin_sh
folder contains the following network partitioning scripts. The -?
option displays the command usage.
Script | Description |
---|---|
list_rules | List all rules in iptables for all VM hosts (nodes) in the cluster sb |
merge_cluster | Merge the split clusters by resetting iptables |
split_cluster | Split the cluster sb into two (2) |
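Under the hood, split_cluster inserts iptables DROP rules on the nodes and merge_cluster resets them, as can also be seen in the list_rules output later in this bundle. Conceptually, the rules amount to the following sketch; the actual scripts iterate over all nodes in each split and take the network interface (e.g., enp0s8) as an argument.

# On node-01, drop all inbound packets from node-06 (192.168.56.16), a Split B host
sudo iptables -A INPUT -s 192.168.56.16 -j DROP

# Undoing the split on this node amounts to flushing the INPUT chain
sudo iptables -F INPUT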
PadoGrid provides the following Geode/GemFire log scraping scripts. These scripts will help us analyze the state of the cluster. The -?
option displays the command usage.
Script | Description |
---|---|
t_revoke_all_missing_disk_stores | Iteratively revoke all missing disk stores |
t_show_all_suspect_node_pairs | Find the specified suspect node from all log files |
t_show_all_unexpectedly_left_members | Display unexpectedly left members in chronological order |
t_show_all_unexpectedly_shutdown_removal | Find the members that unexpectedly shut down for removal from the cluster |
t_show_cluster_views | Display cluster views in chronological order |
t_show_member_join_requests | Display member join requests received by the locators in chronological order |
t_show_membership_service_failure | Display membership service failure and restarted messages from locators |
t_show_missing_disk_stores | Display missing disk stores |
t_show_offline_members | Display offline regions per member |
t_show_quorum_check | Display quorum check status if any |
t_show_recovery_steps | Display recovery steps for the specified type |
t_show_stuck_threads | Find stuck threads |
t_show_suspect_node_pair | Find the specified suspect node pair from the log files |
t_show_type | Determine the network partition type |
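For example, a first diagnosis pass after a suspected failure can run a few of these scripts in sequence from pnode.local. The first three are invoked this way throughout the test cases below, and t_show_type reports the detected network partition type along with the suggested recovery steps.

# Timeline of cluster views, membership service failures, and suspects
t_show_cluster_views
t_show_membership_service_failure
t_show_all_suspect_node_pairs

# Determine the network partition type and the suggested recovery steps
t_show_type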
To display the sb
cluster status, run the show_cluster
command as follows.
show_cluster -long
Output:
----------------------------------------------------------------
CLUSTER: sb
CLUSTER_DIR: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb
PRODUCT: geode
Deployment: VM
01 Locator: sb-locator-node-06-01
STATE: Running
PID: 13570
LOCATOR_PORT: 10334
PULSE_URL: http://node-06.local:7070/pulse
JMX_URL: service:jmx:rmi:///jndi/rmi://node-06.local:12001/jmxrmi
Working Dir: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-locator-node-06-01
LOG_FILE: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-locator-node-06-01.log
02 Locator: sb-locator-node-07-01
STATE: Running
PID: 22968
LOCATOR_PORT: 10334
PULSE_URL: http://node-07.local:7070/pulse
JMX_URL: service:jmx:rmi:///jndi/rmi://node-07.local:12001/jmxrmi
Working Dir: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-locator-node-07-01
LOG_FILE: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-locator-node-07-01.log
01 Member: sb-node-01
STATE: Running
PID: 26513
MEMBER_PORT: 40404
MEMBER_HTTP_URL: http://node-01.local:7080/geode/swagger-ui.html
DEBUG_PORT: 9101
PROMETHEUS_PORT: 8091
JMX_URL: service:jmx:rmi:///jndi/rmi://node-01.local:12001/jmxrmi
Working Dir: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-node-01
LOG_FILE: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-01.log
...
We can see from the above output that the sb
cluster has been configured as follows.
Nodes | VM Hosts |
---|---|
Locators | node-06.local, node-07.local |
Members | node-01.local, node-02.local,node-03.local, node-04.local, node-05.local |
We are now ready to conduct split-brain tests. The subsequent sections present one test case per network partition type. Each test case can be carried out independently. In each test case, we restart the cluster in a clean state.
In this section, we carry out six (6) test cases to understand how Geode/GemFire behaves under each network partition type. By observing how the locators and members (cache servers) interact when a network partition occurs, we can deduce a pattern that identifies the network partition type, which helps us quickly diagnose a system failure in real-life systems. Once we have determined the network partition type, we can then systematically recover from the failure.
This bundle includes the t_show_type
script that can be run at any time to check the cluster status. It scrapes the log files to determine the network partition type and provides the respective system recovery procedure. Our ultimate goal is to have this script tell us what to do in case of any Geode/GemFire failure.
Single data node isolated, locators unreachable. In this test case, we split the network with a single data node unreachable by the locators, i.e., Split A with a single data node and Split B with the locators, as shown below.
Technically, this type does not represent a split-brain scenario. Since we have configured Geode/GemFire to store two (2) copies of data, if a single data node fails, Geode/GemFire will simply continue to serve the backup copy of data as if nothing happened. Nonetheless, we include this type for completeness.
✏️ Note that if we have configured Geode/GemFire with a single copy of data, then a single data node failure leads to Type 1.
Split | Weight | VM Hosts |
---|---|---|
A | 15 (25%) | node-01 (lead) |
B | 46 (75%) | node-02, node-03, node-04, node-05, node-06, node-07 (locators) |
Restart the cluster and ingest data.
# Kill the cluster and clean all log and persistent files
kill_cluster -all
clean_cluster -all
# To make sure node-01 gets selected as the lead member, run 'start_cluster_lead_1' in 'bin_sh/'
cd_cluster sb/bin_sh
./start_cluster_lead_1
# Once the cluster is up, ingest data.
cd_app perf_test_sb/bin_sh
./test_group -run
# Check the entry count in each region to which we just ingested data
./test_group -list
From pnode.local
, run split_cluster
as follows:
switch_cluster sb/bin_sh
# Replace 'enp0s8' with the correct interface. Run './split_cluster -?' to see the usage.
./split_cluster -type 0 enp0s8
Running t_show_membership_service_failure
shows no failures.
t_show_membership_service_failure
Member service failure
----------------------
Now: 2022/08/11 16:41:20 UTC
Running t_show_cluster_views
shows node-01 has crashed.
t_show_cluster_views
Output:
Received Views
--------------
sb-locator-node-06.log: 2022/08/11 16:33:24.444 UTC
└── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000|0
├── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000
sb-locator-node-06.log: 2022/08/11 16:33:24.774 UTC
└── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
sb-locator-node-06.log: 2022/08/11 16:34:41.478 UTC
└── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000|2
├── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
├── 192.168.56.11(sb-node-01:14786)<v2>:41000{lead}
├── 192.168.56.14(sb-node-04:8813)<v2>:41000
├── 192.168.56.13(sb-node-03:9180)<v2>:41000
├── 192.168.56.15(sb-node-05:9652)<v2>:41000
├── 192.168.56.12(sb-node-02:10403)<v2>:41000
sb-locator-node-06.log: 2022/08/11 16:37:43.901 UTC
└── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000|3
├── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
├── 192.168.56.14(sb-node-04:8813)<v2>:41000{lead}
├── 192.168.56.13(sb-node-03:9180)<v2>:41000
├── 192.168.56.15(sb-node-05:9652)<v2>:41000
├── 192.168.56.12(sb-node-02:10403)<v2>:41000
crashed:
├── 192.168.56.11(sb-node-01:14786)<v2>:41000
sb-locator-node-07.log: 2022/08/11 16:33:24.805 UTC
└── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
sb-locator-node-07.log: 2022/08/11 16:34:41.510 UTC
└── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000|2
├── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
├── 192.168.56.11(sb-node-01:14786)<v2>:41000{lead}
├── 192.168.56.14(sb-node-04:8813)<v2>:41000
├── 192.168.56.13(sb-node-03:9180)<v2>:41000
├── 192.168.56.15(sb-node-05:9652)<v2>:41000
├── 192.168.56.12(sb-node-02:10403)<v2>:41000
sb-locator-node-07.log: 2022/08/11 16:37:43.924 UTC
└── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000|3
├── 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
├── 192.168.56.14(sb-node-04:8813)<v2>:41000{lead}
├── 192.168.56.13(sb-node-03:9180)<v2>:41000
├── 192.168.56.15(sb-node-05:9652)<v2>:41000
├── 192.168.56.12(sb-node-02:10403)<v2>:41000
crashed:
├── 192.168.56.11(sb-node-01:14786)<v2>:41000
Now: 2022/08/11 16:40:40 UTC
Running t_show_all_suspect_node_pairs
shows that the node-07 locator has flagged node-01 as a suspect. We know from the previous step that node-01 has crashed.
t_show_all_suspect_node_pairs
Suspect: node-01
---------------------------------------
Not found.
Suspect: node-02
---------------------------------------
Not found.
Suspect: node-03
---------------------------------------
Not found.
Suspect: node-04
---------------------------------------
Not found.
Suspect: node-05
---------------------------------------
Not found.
Suspect: node-06
---------------------------------------
Not found.
Suspect: node-07
---------------------------------------
For:
└── 192.168.56.11(sb-node-01:14786)
First logged:
2022/08/11 16:37:38.587 UTC
├── log: sb-locator-node-06.log
├── from: 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
├── for: 192.168.56.11(sb-node-01:14786)<v2>:41000
└── msg: Member isn't responding to heartbeat requests
Last logged:
2022-08-11 16:37:38
├── log: sb-node-05.log
├── from: 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
├── for: 192.168.56.11(sb-node-01:14786)<v2>:41000
└── msg: Member isn't responding to heartbeat requests
Suspect: myself
---------------------------------------
For:
└── 192.168.56.11(sb-node-01:14786)
First logged:
2022/08/11 16:37:38.602 UTC
├── log: sb-locator-node-07.log
├── from: myself
├── for: 192.168.56.11(sb-node-01:14786)<v2>:41000
└── msg: Member isn't responding to heartbeat requests
Last logged:
2022/08/11 16:37:38.602 UTC
├── log: sb-locator-node-07.log
├── from: myself
├── for: 192.168.56.11(sb-node-01:14786)<v2>:41000
└── msg: Member isn't responding to heartbeat requests
Now: 2022/08/11 16:42:09 UTC
Running t_show_all_unexpectedly_left_members shows that node-01 unexpectedly left the cluster.

t_show_all_unexpectedly_left_members
Output:
Member unexpectedly left:
-------------------------
sb-locator-node-06.log
├── 2022/08/11 16:37:43.908 UTC
│ └── 192.168.56.11(sb-node-01:14786)<v2>:41000
sb-locator-node-07.log
├── 2022/08/11 16:37:43.934 UTC
│ └── 192.168.56.11(sb-node-01:14786)<v2>:41000
sb-node-02.log
├── 2022-08-11 16:37:43
│ └── 192.168.56.11(sb-node-01:14786)<v2>:41000
sb-node-03.log
├── 2022-08-11 16:37:43
│ └── 192.168.56.11(sb-node-01:14786)<v2>:41000
sb-node-04.log
├── 2022-08-11 16:37:43
│ └── 192.168.56.11(sb-node-01:14786)<v2>:41000
sb-node-05.log
├── 2022-08-11 16:37:43
│ └── 192.168.56.11(sb-node-01:14786)<v2>:41000
Now: 2022/08/11 16:43:50 UTC
Running test_group -list
shows all the regions have the correct entry counts.
cd_app perf_test_sb/bin_sh
./test_group -list
...
group: g1
- name: partition_persistent_overflow
data: Region
size: 10000
cleared: false
- name: replicate_overflow
data: Region
size: 10000
cleared: false
- name: partition_persistent
data: Region
size: 10000
cleared: false
- name: partition
data: Region
size: 10000
cleared: false
- name: partition_overflow
data: Region
size: 10000
cleared: false
- name: replicate_persistent_overflow
data: Region
size: 10000
cleared: false
- name: replicate
data: Region
size: 10000
cleared: false
- name: replicate_persistent
data: Region
size: 10000
cleared: false
...
Let's now merge Split A and Split B by running merge_cluster
.
cd_cluster sb/bin_sh
./merge_cluster
Running t_show_member_join_requests
does not show a join request from node-01.
t_show_member_join_requests
Output:
Member join requests:
---------------------
sb-locator-node-06.log
├── 2022/08/11 16:33:23.671 UTC
│ └── 192.168.56.17(sb-locator-node-07:14864:locator)<ec>:41000
├── 2022/08/11 16:34:41.170 UTC
│ └── 192.168.56.11(sb-node-01:14786):41000
├── 2022/08/11 16:34:41.175 UTC
│ └── 192.168.56.14(sb-node-04:8813):41000
├── 2022/08/11 16:34:41.199 UTC
│ └── 192.168.56.13(sb-node-03:9180):41000
├── 2022/08/11 16:34:41.218 UTC
│ └── 192.168.56.15(sb-node-05:9652):41000
├── 2022/08/11 16:34:41.234 UTC
│ └── 192.168.56.12(sb-node-02:10403):41000
Now: 2022/08/11 16:42:53 UTC
We now check the node-01 log file for its status. It is repeatedly sending join requests but is not receiving responses. You can also look at the locator log files to see if they are logging node-01 join requests, but you will find nothing pertaining to node-01.
show_log -num 1
Output:
...
2022-08-11 16:47:44 INFO Services:330 - Discovery state after looking for membership coordinator is locatorsContacted=2; findInViewResponses=0; alreadyTried=[192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000]; registrants=[192.168.56.11(sb-node-01:14786):41000, 192.168.56.11(sb-node-01:14786):41000, 192.168.56.11(sb-node-01:14786):41000, 192.168.56.11(sb-node-01:14786):41000, 192.168.56.11(sb-node-01:14786):41000]; possibleCoordinator=192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000; viewId=3; hasContactedAJoinedLocator=true; view=View[192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000|3] members: [192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000, 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000, 192.168.56.14(sb-node-04:8813)<v2>:41000{lead}, 192.168.56.13(sb-node-03:9180)<v2>:41000, 192.168.56.15(sb-node-05:9652)<v2>:41000, 192.168.56.12(sb-node-02:10403)<v2>:41000] crashed: [192.168.56.11(sb-node-01:14786)<v2>:41000]; responses=[]
2022-08-11 16:47:44 INFO Services:333 - found possible coordinator 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000
2022-08-11 16:47:44 INFO Services:431 - Probable coordinator is still 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000 - waiting for a join-response
2022-08-11 16:47:56 INFO Services:287 - Stopping membership services
2022-08-11 16:47:56 INFO Services:733 - GMSHealthMonitor server thread exiting
2022-08-11 16:47:56 WARN InternalDistributedSystem:2616 - Caught SystemConnectException in reconnect
org.apache.geode.SystemConnectException: Unable to join the distributed system in 60400ms
at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:184) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:220) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:756) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:132) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3013) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:282) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2584) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2403) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1255) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2323) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1296) ~[geode-membership-9.10.13.jar:?]
at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:2039) ~[geode-membership-9.10.13.jar:?]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_333]
2022-08-11 16:47:56 INFO InternalDistributedSystem:2537 - Disconnecting old DistributedSystem to prepare for a reconnect attempt
Since we have not lost data, we can safely restart node-01 as follows.
stop_member -num 1
show_cluster
start_member -num 1
Running t_show_member_join_requests
shows node-01 is now able to rejoin the cluster.
t_show_member_join_requests
Output:
Member join requests:
---------------------
sb-locator-node-06.log
├── 2022/08/11 16:33:23.671 UTC
│ └── 192.168.56.17(sb-locator-node-07:14864:locator)<ec>:41000
├── 2022/08/11 16:34:41.170 UTC
│ └── 192.168.56.11(sb-node-01:14786):41000
├── 2022/08/11 16:34:41.175 UTC
│ └── 192.168.56.14(sb-node-04:8813):41000
├── 2022/08/11 16:34:41.199 UTC
│ └── 192.168.56.13(sb-node-03:9180):41000
├── 2022/08/11 16:34:41.218 UTC
│ └── 192.168.56.15(sb-node-05:9652):41000
├── 2022/08/11 16:34:41.234 UTC
│ └── 192.168.56.12(sb-node-02:10403):41000
├── 2022/08/11 16:58:30.237 UTC
│ └── 192.168.56.11(sb-node-01:16784):41000
Now: 2022/08/11 17:01:37 UTC
gfsh
also shows node-01 in the cluster.
gfsh>list members
Member Count : 7
Name | Id
------------------ | ---------------------------------------------------------------------------
sb-node-01 | 192.168.56.11(sb-node-01:16784)<v4>:41000
sb-node-02 | 192.168.56.12(sb-node-02:10403)<v2>:41000
sb-node-03 | 192.168.56.13(sb-node-03:9180)<v2>:41000
sb-node-04 | 192.168.56.14(sb-node-04:8813)<v2>:41000
sb-node-05 | 192.168.56.15(sb-node-05:9652)<v2>:41000
sb-locator-node-06 | 192.168.56.16(sb-locator-node-06:17462:locator)<ec><v0>:41000 [Coordinator]
sb-locator-node-07 | 192.168.56.17(sb-locator-node-07:14864:locator)<ec><v1>:41000
- This test case split the network to isolate one (1) member (in Split A) from reaching the locators and the remaining members.
- We found that the coordinator quickly detects the unreachable member.
- We saw no data loss and the cluster continued to run without any problems.
- When we merged the splits, we saw the isolated member repeatedly sending join requests to the coordinator, but they were ignored. The isolated member was not able to rejoin the cluster.
- We had to restart the isolated member for it to rejoin the cluster.
Data nodes isolated, locators unreachable. In this test case, we split the network with some data nodes unreachable by the locators, i.e., Split A without locators and Split B with the locators, as shown below.
Split | Weight | VM Hosts |
---|---|---|
A | 25 (41%) | node-01 (lead), node-02 |
B | 36 (59%) | node-03, node-04, node-05, node-06, node-07 (locators) |
Restart the cluster and ingest data.
# Kill the cluster and clean all log and persistent files
kill_cluster -all
clean_cluster -all
# To make sure node-01 gets selected as the lead member, run 'start_cluster_lead_1' in 'bin_sh/'
cd_cluster sb/bin_sh
./start_cluster_lead_1
# Once the cluster is up, ingest data.
cd_app perf_test_sb/bin_sh
./test_group -run
# Check the entry count in each region to which we just ingested data
./test_group -list
From pnode.local
, run split_cluster
as follows:
switch_cluster sb/bin_sh
# Replace 'enp0s8' with the correct interface. Run './split_cluster -?' to see the usage.
./split_cluster -type 1 enp0s8
To see the iptables
rules set by split_cluster
, run list_rules
as follows:
./list_rules
Output:
...
node-01.local
-------------
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP all -- 192.168.56.14 anywhere
DROP all -- 192.168.56.15 anywhere
DROP all -- 192.168.56.16 anywhere
DROP all -- 192.168.56.17 anywhere
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
node-02.local
-------------
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP all -- 192.168.56.14 anywhere
DROP all -- 192.168.56.15 anywhere
DROP all -- 192.168.56.16 anywhere
DROP all -- 192.168.56.17 anywhere
...
From pnode.local
, monitor the log files to see the cluster being split into two (2). The t_show_cluster_views
script scrapes the log files to build a complete timeline of the cluster views.
t_show_cluster_views -long
Output:
...
sb-locator-node-07-01.log:[info 2022/07/18 23:25:54.892 UTC sb-locator-node-07-01 <unicast receiver,node-07-45188> tid=0x1e]
└── 192.168.56.16(sb-locator-node-06-01:3588:locator)<ec><v0>:41000|2
├── 192.168.56.16(sb-locator-node-06-01:3588:locator)<ec><v0>:41000,
├── 192.168.56.17(sb-locator-node-07-01:8554:locator)<ec><v1>:41000,
├── 192.168.56.14(sb-node-04:5707)<v2>:41000{lead},
├── 192.168.56.12(sb-node-02:9901)<v2>:41000,
├── 192.168.56.13(sb-node-03:5278)<v2>:41000,
├── 192.168.56.15(sb-node-05:5940)<v2>:41000,
└── 192.168.56.11(sb-node-01:21411)<v2>:41000
To see the suspect nodes that are not responding, we can run t_show_all_suspect_node_pairs
, which displays the last log messages of the suspects.
t_show_all_suspect_node_pairs
Output:
Suspect: node-01
---------------------------------------
Not found.
Suspect: node-02
---------------------------------------
For:
192.168.56.13(sb-node-03:26954)
192.168.56.14(sb-node-04:27432)
192.168.56.15(sb-node-05:27669)
192.168.56.16(sb-locator-node-06-01:27290:locator)
192.168.56.17(sb-locator-node-07-01:31149:locator)
First logged:
sb-locator-node-06-01.log:[info 2022/07/23 17:52:50.288 UTC sb-locator-node-06-01 <unicast receiver,node-06-18485> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.13(sb-node-03:26954)<v2>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/23 17:52:52.745 UTC sb-locator-node-06-01 <unicast receiver,node-06-18485> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.15(sb-node-05:27669)<v2>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/23 17:52:55.273 UTC sb-locator-node-06-01 <unicast receiver,node-06-18485> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.14(sb-node-04:27432)<v2>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/23 17:53:02.746 UTC sb-locator-node-06-01 <unicast receiver,node-06-18485> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/23 17:53:10.247 UTC sb-locator-node-06-01 <unicast receiver,node-06-18485> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000: Member isn't responding to heartbeat requests
Last logged:
sb-node-05.log:2022-07-23 17:52:50 INFO Services:1196 - received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.13(sb-node-03:26954)<v2>:41000: Member isn't responding to heartbeat requests
sb-node-05.log:2022-07-23 17:52:57 INFO Services:1196 - received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.15(sb-node-05:27669)<v2>:41000: Member isn't responding to heartbeat requests
sb-node-05.log:2022-07-23 17:53:00 INFO Services:1196 - received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.14(sb-node-04:27432)<v2>:41000: Member isn't responding to heartbeat requests
sb-node-05.log:2022-07-23 17:53:07 INFO Services:1196 - received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000: Member isn't responding to heartbeat requests
sb-node-05.log:2022-07-23 17:53:15 INFO Services:1196 - received suspect message from 192.168.56.12(sb-node-02:32593)<v2>:41000 for 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000: Member isn't responding to heartbeat requests
Suspect: node-03
---------------------------------------
Not found.
Suspect: node-04
---------------------------------------
Not found.
Suspect: node-05
---------------------------------------
Not found.
Suspect: node-06
---------------------------------------
Not found.
Suspect: node-07
---------------------------------------
Not found.
You can see from the above output that node-01 and node-02 are suspects. They are not responding to the remaining data nodes and locators. You can also see Pulse getting inundated with "Member Suspected" messages.
After a few minutes, the locators log "Membership service failure" and proceed to restart themselves. You can monitor these messages by running t_show_membership_service_failure
.
The output below shows that the locator on node-06 detects the network partition in 26 seconds. It logged the network partition message at 17:53:16.569
and the first suspect message was logged at 17:52:50.288
as shown in the previous output.
t_show_membership_service_failure
Output:
Member service failure
──────────────────────
sb-locator-node-06-01.log
├── [fatal 2022/07/23 17:53:16.569 UTC
│ └── Membership service failure: Exiting due to possible network partition event due to loss of 4 cache processes:
│ ├── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000,
│ ├── 192.168.56.13(sb-node-03:26954)<v2>:41000,
│ ├── 192.168.56.14(sb-node-04:27432)<v2>:41000,
│ ├── 192.168.56.15(sb-node-05:27669)<v2>:41000
├── 2022/07/23 17:55:49.779 UTC
│ └── system restarted
sb-locator-node-07-01.log
├── [fatal 2022/07/23 17:53:15.390 UTC
│ └── Membership service failure: Member isn't responding to heartbeat requests
├── 2022/07/23 17:55:50.205 UTC
│ └── system restarted
When the locators restart, they publish a new cluster view. We can see a complete timeline of cluster views by running t_show_cluster_views
as shown below. The output indicates that the cluster view has changed from five (5) data nodes at 2022/07/23 17:50:45.721 UTC
to three (3) at 2022/07/23 17:55:34.231 UTC
. It took 4:49 minutes to resolve the network partition.
We can also see from the above output that most of the time is spent restarting the system. The locator on node-06 detects a network partition at 17:53:16.569
and logs "system restarted" at 17:55:49.779
. It took 2:33 minutes to restart the entire system (or cluster).
t_show_cluster_views -long
Output:
sb-locator-node-06-01.log:[info 2022/07/23 17:49:30.893 UTC sb-locator-node-06-01 <main> tid=0x1]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|0
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000
sb-locator-node-06-01.log:[info 2022/07/23 17:49:32.154 UTC sb-locator-node-06-01 <Geode Membership View Creator> tid=0x25]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000,
└── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000
sb-locator-node-06-01.log:[info 2022/07/23 17:50:45.567 UTC sb-locator-node-06-01 <Geode Membership View Creator> tid=0x25]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|2
├── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000,
├── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000,
├── 192.168.56.11(sb-node-01:12242)<v2>:41000{lead},
├── 192.168.56.12(sb-node-02:32593)<v2>:41000,
├── 192.168.56.13(sb-node-03:26954)<v2>:41000,
├── 192.168.56.15(sb-node-05:27669)<v2>:41000,
└── 192.168.56.14(sb-node-04:27432)<v2>:41000
sb-locator-node-06-01.log:[info 2022/07/23 17:55:33.803 UTC sb-locator-node-06-01 <ReconnectThread> tid=0xa3]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|0
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000
sb-locator-node-06-01.log:[info 2022/07/23 17:55:34.114 UTC sb-locator-node-06-01 <Geode Membership View Creator> tid=0xba]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000,
├── 192.168.56.13(sb-node-03:26954)<v1>:41000{lead},
├── 192.168.56.15(sb-node-05:27669)<v1>:41000,
├── 192.168.56.14(sb-node-04:27432)<v1>:41000,
└── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000
sb-locator-node-07-01.log:[info 2022/07/23 17:49:32.271 UTC sb-locator-node-07-01 <unicast receiver,node-07-3671> tid=0x1e]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000,
└── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000
sb-locator-node-07-01.log:[info 2022/07/23 17:50:45.721 UTC sb-locator-node-07-01 <unicast receiver,node-07-3671> tid=0x1e]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|2
├── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000,
├── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000,
├── 192.168.56.11(sb-node-01:12242)<v2>:41000{lead},
├── 192.168.56.12(sb-node-02:32593)<v2>:41000,
├── 192.168.56.13(sb-node-03:26954)<v2>:41000,
├── 192.168.56.15(sb-node-05:27669)<v2>:41000,
└── 192.168.56.14(sb-node-04:27432)<v2>:41000
sb-locator-node-07-01.log:[info 2022/07/23 17:55:34.231 UTC sb-locator-node-07-01 <unicast receiver,node-07-3671> tid=0x1e]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000,
├── 192.168.56.13(sb-node-03:26954)<v1>:41000{lead},
├── 192.168.56.15(sb-node-05:27669)<v1>:41000,
├── 192.168.56.14(sb-node-04:27432)<v1>:41000,
└── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000
Pulse also shows the cluster with three (3) data nodes and two (2) locators. Note that the "Suspect" messages no longer show in the right pane since the cluster has restarted.
Running the test_group
client throws an exception. It is unable to connect to Split B.
cd_app perf_test_sb/bin_sh
./test_group -list
Looking at the Split B server logs, we see they are waiting for other members to recover.
show_log -num 3
Output:
...
22-07-23 17:56:30 INFO StartupStatus:53 - Region /partition_persistent (and any colocated sub-regions) has potentially stale data. Buckets [0] are waiting for another offline member to recover the latest data. My persistent id is:
DiskStore ID: c994aa34-f5c9-4806-8b1f-acced13a7f69
Name: sb-node-04
Location: /192.168.56.14:/home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-node-04/.
Offline members with potentially new data:[
DiskStore ID: b003f335-1ff2-4e8b-9916-950b7821ba7c
Location: /192.168.56.11:/home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-node-01/.
Buckets: [0]
]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
Looking at the Split A server logs, we see different messages. They are unable to connect to the locators.
...
2022-07-23 17:57:34 INFO Services:201 - Starting membership services
2022-07-23 17:57:34 INFO Services:575 - Established local address 192.168.56.11(sb-node-01:7202):41000
2022-07-23 17:57:34 INFO Services:406 - JGroups channel reinitialized (took 1ms)
2022-07-23 17:57:34 INFO DirectChannel:148 - GemFire P2P Listener started on /192.168.56.11:44405
2022-07-23 17:57:34 INFO Services:432 - Delivering 0 messages queued by quorum checker
2022-07-23 17:57:34 INFO Services:714 - Started failure detection server thread on /192.168.56.11:42838.
2022-07-23 17:57:34 INFO Services:1226 - Unable to contact locator HostAndPort[node-06.local:10334]: java.net.UnknownHostException: node-06.local
2022-07-23 17:57:34 INFO Services:1226 - Unable to contact locator HostAndPort[node-07.local:10334]: java.net.UnknownHostException: node-07.local
2022-07-23 17:57:34 INFO Services:345 - Discovery state after looking for membership coordinator is locatorsContacted=0; findInViewResponses=0; alreadyTried=[]; registrants=[]; possibleCoordinator=null; viewId=-100; hasContactedAJoinedLocator=false; view=null; responses=[]
2022-07-23 17:57:34 INFO Services:287 - Stopping membership services
2022-07-23 17:57:34 INFO Services:742 - GMSHealthMonitor server thread exiting
2022-07-23 17:57:34 WARN InternalDistributedSystem:2617 - Caught GemFireConfigException in reconnect
org.apache.geode.GemFireConfigException: Problem configuring membership services
...
Caused by: java.net.UnknownHostException: node-07.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:196) ~[?:1.8.0_333]
...
2022-07-23 17:57:34 INFO InternalDistributedSystem:2535 - Disconnecting old DistributedSystem to prepare for a reconnect attempt
Running gfsh
shows the members in Split B are intact, but we are unable to check the persistent status.
✏️ You won't be able to connect to the cluster (Split B) via gfsh
until a quorum (Split B) is established. This will take a few minutes. You can run t_show_cluster_views
or t_show_membership_service_failure
to check the cluster status in the meantime.
gfsh
gfsh>connect --url=http://node-07.local:7070/geode-mgmt/v1
Successfully connected to: GemFire Manager HTTP service @ http://node-07.local:7070/geode-mgmt/v1
You are connected to a cluster of version: 1.14.4
gfsh>list members
Member Count : 5
Name | Id
--------------------- | ------------------------------------------------------------------------------
sb-node-03 | 192.168.56.13(sb-node-03:8488)<v1>:41000
sb-node-04 | 192.168.56.14(sb-node-04:8590)<v1>:41000
sb-node-05 | 192.168.56.15(sb-node-05:8603)<v1>:41000
sb-locator-node-06-01 | 192.168.56.16(sb-locator-node-06-01:10025:locator)<ec><v0>:41000 [Coordinator]
sb-locator-node-07-01 | 192.168.56.17(sb-locator-node-07-01:8826:locator)<ec><v1>:41000
gfsh>describe region --name=/partition
Name : partition
Data Policy : partition
Hosting Members : sb-node-03
sb-node-04
sb-node-05
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
------ | ----------- | ---------
Region | size | 0
| data-policy | PARTITION
gfsh>describe region --name=/partition_overflow
Name : partition_overflow
Data Policy : partition
Hosting Members : sb-node-03
sb-node-04
sb-node-05
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
-------- | ------------------ | -------------------
Region | size | 0
| data-policy | PARTITION
Eviction | eviction-action | overflow-to-disk
| eviction-algorithm | lru-heap-percentage
gfsh>describe region --name=/partition_persistent
The last line above hangs indefinitely. Gfsh is unable to access the persistent regions. Note that unlike the non-persistent regions, the persistent regions are not empty, which can be verified from Pulse.
Running test_group -list
shows that it is also unable to connect to the cluster. Split B is inaccessible.
cd_app perf_test_sb/bin_sh
./test_group -list
Output:
...
Exception in thread "main" org.apache.geode.cache.client.NoAvailableServersException
...
We now need to decide whether to recover the cluster with only Split B and abandon Split A, or to fix the network and recover from both Split A and B. Let's look at both cases.
Let's now merge Split A and Split B by running merge_cluster
.
cd_cluster sb/bin_sh
./merge_cluster
If you have gfsh
still running in the previous step, then it finally returns with the correct entry count in the persistent region.
We still have node-01 and node-02 (in Split A) running. Now that the network has merged, these nodes are able to reach the locators. They immediately send "join requests" to one of the locators, which can be seen in the locator log file. Run t_show_member_join_requests
to see the messages.
t_show_member_join_requests
Output:
Member join requests:
─────────────────────
sb-locator-node-06-01.log
├── 2022/07/23 17:49:31.840 UTC
│ └── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec>:41000
├── 2022/07/23 17:50:45.256 UTC
│ └── 192.168.56.11(sb-node-01:12242):41000
├── 2022/07/23 17:50:45.346 UTC
│ └── 192.168.56.12(sb-node-02:32593):41000
├── 2022/07/23 17:50:45.352 UTC
│ └── 192.168.56.13(sb-node-03:26954):41000
├── 2022/07/23 17:50:45.392 UTC
│ └── 192.168.56.15(sb-node-05:27669):41000
├── 2022/07/23 17:50:45.436 UTC
│ └── 192.168.56.14(sb-node-04:27432):41000
├── 2022/07/23 17:54:21.690 UTC
│ └── 192.168.56.13(sb-node-03:26954):41000
├── 2022/07/23 17:54:21.691 UTC
│ └── 192.168.56.15(sb-node-05:27669):41000
├── 2022/07/23 17:54:21.692 UTC
│ └── 192.168.56.14(sb-node-04:27432):41000
├── 2022/07/23 17:54:21.693 UTC
│ └── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec>:41000
├── 2022/07/23 18:27:00.996 UTC
│ └── 192.168.56.12(sb-node-02:32593):41000
├── 2022/07/23 18:27:03.096 UTC
│ └── 192.168.56.11(sb-node-01:12242):41000
If you run t_show_cluster_views
, it should display the latest cluster view with node-01 and node-02 now belonging to the cluster.
t_show_cluster_views -long
Output:
...
sb-locator-node-07-01.log:[info 2022/07/23 18:27:03.580 UTC sb-locator-node-07-01 <unicast receiver,node-07-3671> tid=0x1e]
└── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000|3
├── 192.168.56.16(sb-locator-node-06-01:27290:locator)<ec><v0>:41000,
├── 192.168.56.13(sb-node-03:26954)<v1>:41000{lead},
├── 192.168.56.15(sb-node-05:27669)<v1>:41000,
├── 192.168.56.14(sb-node-04:27432)<v1>:41000,
├── 192.168.56.17(sb-locator-node-07-01:31149:locator)<ec><v1>:41000,
├── 192.168.56.12(sb-node-02:32593)<v2>:41000,
└── 192.168.56.11(sb-node-01:12242)<v3>:41000
Running test_group -list
shows that we lost data in the non-persistent regions.
./test_group -list
Output:
...
group: g1
- name: replicate_persistent_overflow
data: Region
size: 10000
cleared: false
- name: partition_persistent
data: Region
size: 10000
cleared: false
- name: partition
data: Region
size: 0
cleared: false
- name: replicate
data: Region
size: 0
cleared: false
- name: replicate_persistent
data: Region
size: 10000
cleared: false
- name: partition_overflow
data: Region
size: 0
cleared: false
- name: partition_persistent_overflow
data: Region
size: 10000
cleared: false
- name: replicate_overflow
data: Region
size: 0
cleared: false
...
✏️ To run this test, you will need to restart the cluster if you have already merged the splits. Please follow the restart steps shown in the beginning of this test case.
If we choose not to merge the splits, then we can revoke the missing disk stores so that the members in Split B can recover the persistent regions without waiting for the Split A members. We would lose the data held by the members in Split A, however. The following gfsh
output shows the outcome of revoking disk stores.
✏️ You may not be able to connect via gfsh
until the cluster has been auto-restarted, which may take a few minutes. You can run t_show_membership_service_failure
to check the restart status.
gfsh
gfsh>connect --url=http://node-07.local:7070/geode-mgmt/v1
Successfully connected to: GemFire Manager HTTP service @ http://node-07.local:7070/geode-mgmt/v1
You are connected to a cluster of version: 1.14.4
gfsh>show missing-disk-stores
Missing Disk Stores
Disk Store ID | Host | Directory
------------------------------------ | ------------- | -------------------------------------------------------------------------------------------------
0645d21f-9b9c-4d46-a0ef-7280907f4e17 | 192.168.56.12 | /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-node-02/.
No missing colocated region found
gfsh>revoke missing-disk-store --id=0645d21f-9b9c-4d46-a0ef-7280907f4e17
Missing disk store successfully revoked
gfsh>show missing-disk-stores
Missing Disk Stores
Disk Store ID | Host | Directory
------------------------------------ | ------------- | -------------------------------------------------------------------------------------------------
045b20b7-cc15-4cd3-97c3-3b7f79042eb4 | 192.168.56.11 | /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-node-01/.
No missing colocated region found
gfsh>revoke missing-disk-store --id=045b20b7-cc15-4cd3-97c3-3b7f79042eb4
Missing disk store successfully revoked
gfsh>show missing-disk-stores
No missing disk store found
No missing colocated region found
gfsh>describe region --name=/partition_persistent
Name : partition_persistent
Data Policy : persistent partition
Hosting Members : sb-node-03
sb-node-04
sb-node-05
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
------ | ----------- | --------------------
Region | size | 6018
| data-policy | PERSISTENT_PARTITION
gfsh>describe region --name=/partition_persistent_overflow
Name : partition_persistent_overflow
Data Policy : persistent partition
Hosting Members : sb-node-03
sb-node-04
sb-node-05
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
-------- | ------------------ | --------------------
Region | size | 5946
| data-policy | PERSISTENT_PARTITION
Eviction | eviction-action | overflow-to-disk
| eviction-algorithm | lru-heap-percentage
Let's try merging the splits with the Split A members still running.
cd_cluster sb/bin_sh
./merge_cluster
Now take a look at the Split A member log files. We can see from the output below that these members are not allowed to rejoin the cluster with their revoked disk stores. In order for them to rejoin the cluster, their disk stores must be removed.
show_log
Output:
...
2022-07-31 19:56:08 ERROR InternalCacheBuilder:203 - org.apache.geode.cache.persistence.RevokedPersistentDataException: The persistent member id 0645d21f-9b9c-4d46-a0ef-7280907f4e17 has been revoked in this distributed system. You cannot recover from disk files which have been revoked.
...
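If we want to bring a Split A member back at this point, one approach is to stop it, remove its disk store files, and restart it. The following is a minimal sketch for node-02, assuming the default disk store file naming (BACKUP*) in the member's run directory shown by gfsh above and that member number 2 maps to node-02, just as member number 1 maps to node-01 earlier.

# From pnode.local
stop_member -num 2
# Remove node-02's revoked disk store files (this discards its persisted data)
ssh vagrant@node-02.local "rm -f /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-node-02/BACKUP*"
start_member -num 2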
- This test case split the network to isolate two (2) members (in Split A) from reaching the locators and the remaining members.
- We found that the coordinator quickly detects the unreachable members and labels them as suspects.
- After 26 seconds, the coordinator logs a possible network partition. It then proceeds to restart the cluster.
- It took nearly five (5) minutes for Geode/GemFire to resolve the network partition by restarting the entire cluster.
- With the cluster restarted, there are now only three (3) members as expected. All non-persistent data is lost and clients are unable to connect to the split cluster (Split B).
- gfsh is able to connect to Split B, but the partition persistent regions are not accessible. The describe region command hangs indefinitely on a partition persistent region and eventually returns when the splits are merged.
- Upon merging the splits, we saw the network-isolated members (in Split A) automatically rejoin the cluster, but without non-persistent data as expected.
- The cluster has completely recovered the persistent data.
- If we were to reinstate the cluster only with the members in the quorum, i.e., the Split B members, then we needed to revoke the missing disk stores, which restored the cluster without the data held by the Split A members.
- Merging the splits after we had revoked the disk stores did not allow the Split A members to rejoin the cluster. We needed to remove their disk stores before they could rejoin the cluster, losing the data held by the Split A members in that case.
Locators divided. In this test case, we split the network with the locators unreachable by each other, i.e., Split A and B each with a locator, as shown below.
Split | Weight | VM Hosts |
---|---|---|
A | 28 (46%) | node-01 (lead), node-02, node-06 (locator) |
B | 33 (54%) | node-03, node-04, node-05, node-07 (locator) |
Restart the cluster and ingest data.
# Kill the cluster and clean all log and persistent files
kill_cluster -all
clean_cluster -all
# To make sure node-01 gets selected as the lead member, run 'start_cluster_lead_1' in 'bin_sh/'
cd_cluster sb/bin_sh
./start_cluster_lead_1
# Once the cluster is up, ingest data.
cd_app perf_test_sb/bin_sh
./test_group -run
# Check the entry count in each region to which we just ingested data
./test_group -list
From pnode.local
, run split_cluster
as follows:
switch_cluster sb/bin_sh
# Replace 'enp0s8' with the correct interface. Run './split_cluster -?' to see the usage.
./split_cluster -type 2 enp0s8
Pulse immediately reports suspects and freezes. It is no longer usable.
Running t_show_all_suspect_node_pairs
shows that the node-06 locator is a suspect for node-03, node-04, node-05, and node-07. We can derive from this log information that the node-07 locator has a quorum to become the new coordinator.
t_show_all_suspect_node_pairs
Output:
Suspect: node-01
---------------------------------------
Not found.
Suspect: node-02
---------------------------------------
For:
192.168.56.13(sb-node-03:1067)
192.168.56.14(sb-node-04:1152)
192.168.56.15(sb-node-05:1715)
First logged:
sb-locator-node-06-01.log:[info 2022/07/24 15:22:44.298 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/24 15:22:46.794 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.14(sb-node-04:1152)<v5>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/24 15:22:49.294 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.15(sb-node-05:1715)<v6>:41000: Member isn't responding to heartbeat requests
Last logged:
sb-node-05.log:2022-07-24 15:22:44 INFO Services:1196 - received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: Member isn't responding to heartbeat requests
sb-node-05.log:2022-07-24 15:22:52 INFO Services:1196 - received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.14(sb-node-04:1152)<v5>:41000: Member isn't responding to heartbeat requests
sb-node-05.log:2022-07-24 15:22:54 INFO Services:1196 - received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.15(sb-node-05:1715)<v6>:41000: Member isn't responding to heartbeat requests
Suspect: node-03
---------------------------------------
Not found.
Suspect: node-04
---------------------------------------
For:
192.168.56.13(sb-node-03:1067)
192.168.56.14(sb-node-04:1152)
192.168.56.15(sb-node-05:1715)
First logged:
sb-locator-node-06-01.log:[info 2022/07/24 15:22:44.298 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/24 15:22:46.794 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.14(sb-node-04:1152)<v5>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/24 15:22:49.294 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.15(sb-node-05:1715)<v6>:41000: Member isn't responding to heartbeat requests
sb-locator-node-07-01.log:[info 2022/07/24 15:22:53.358 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f] received suspect message from 192.168.56.14(sb-node-04:1152)<v5>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: member unexpectedly shut down shared, unordered connection
Last logged:
sb-node-05.log:2022-07-24 15:22:53 INFO Services:1196 - received suspect message from 192.168.56.14(sb-node-04:1152)<v5>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: member unexpectedly shut down shared, unordered connection
Suspect: node-05
---------------------------------------
For:
192.168.56.13(sb-node-03:1067)
192.168.56.14(sb-node-04:1152)
192.168.56.15(sb-node-05:1715)
First logged:
sb-locator-node-06-01.log:[info 2022/07/24 15:22:44.298 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/24 15:22:46.794 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.14(sb-node-04:1152)<v5>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/24 15:22:49.294 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.15(sb-node-05:1715)<v6>:41000: Member isn't responding to heartbeat requests
sb-locator-node-07-01.log:[info 2022/07/24 15:22:53.358 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f] received suspect message from 192.168.56.14(sb-node-04:1152)<v5>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: member unexpectedly shut down shared, unordered connection
sb-locator-node-07-01.log:[info 2022/07/24 15:22:53.870 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f] received suspect message from 192.168.56.15(sb-node-05:1715)<v6>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: member unexpectedly shut down shared, unordered connection
Last logged:
sb-node-04.log:2022-07-24 15:22:53 INFO Services:1196 - received suspect message from 192.168.56.15(sb-node-05:1715)<v6>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: member unexpectedly shut down shared, unordered connection
Suspect: node-06
---------------------------------------
For:
192.168.56.13(sb-node-03:1067)
192.168.56.14(sb-node-04:1152)
192.168.56.15(sb-node-05:1715)
192.168.56.17(sb-locator-node-07-01:6963:locator)
First logged:
sb-locator-node-06-01.log:[info 2022/07/24 15:22:44.298 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/24 15:22:46.794 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.14(sb-node-04:1152)<v5>:41000: Member isn't responding to heartbeat requests
sb-locator-node-06-01.log:[info 2022/07/24 15:22:49.294 UTC sb-locator-node-06-01 <unicast receiver,node-06-24628> tid=0x1e] received suspect message from 192.168.56.12(sb-node-02:8275)<v3>:41000 for 192.168.56.15(sb-node-05:1715)<v6>:41000: Member isn't responding to heartbeat requests
sb-locator-node-07-01.log:[info 2022/07/24 15:22:45.780 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f] received suspect message from 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000 for 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v1>:41000: Member isn't responding to heartbeat requests
sb-locator-node-07-01.log:[info 2022/07/24 15:22:50.782 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f] received suspect message from 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000 for 192.168.56.14(sb-node-04:1152)<v5>:41000: Member isn't responding to heartbeat requests
sb-locator-node-07-01.log:[info 2022/07/24 15:22:53.284 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f] received suspect message from 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000 for 192.168.56.15(sb-node-05:1715)<v6>:41000: Member isn't responding to heartbeat requests
sb-locator-node-07-01.log:[info 2022/07/24 15:22:53.358 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f] received suspect message from 192.168.56.14(sb-node-04:1152)<v5>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: member unexpectedly shut down shared, unordered connection
sb-locator-node-07-01.log:[info 2022/07/24 15:22:53.870 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f] received suspect message from 192.168.56.15(sb-node-05:1715)<v6>:41000 for 192.168.56.13(sb-node-03:1067)<v4>:41000: member unexpectedly shut down shared, unordered connection
Last logged:
sb-node-05.log:2022-07-24 15:22:45 INFO Services:1196 - received suspect message from 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000 for 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v1>:41000: Member isn't responding to heartbeat requests
sb-node-05.log:2022-07-24 15:22:50 INFO Services:1196 - received suspect message from 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000 for 192.168.56.14(sb-node-04:1152)<v5>:41000: Member isn't responding to heartbeat requests
sb-node-05.log:2022-07-24 15:22:53 INFO Services:1196 - received suspect message from 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000 for 192.168.56.15(sb-node-05:1715)<v6>:41000: Member isn't responding to heartbeat requests
Suspect: node-07
---------------------------------------
Not found.
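The `t_show_*` scripts condense this information from the cluster log files. If you want to poke at the raw messages yourself, a rough hand-rolled equivalent is a `grep` over the log directory (a sketch; adjust the path to your workspace layout).

```bash
# Hand-rolled equivalent of the scrape above: list every suspect message and
# group identical (from, for, reason) combinations, most frequent first.
LOG_DIR=/home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log
grep -h "received suspect message" "$LOG_DIR"/*.log |
  sed 's/.*received suspect message from /from /' |
  sort | uniq -c | sort -rn | head
```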
After a few minutes, running `t_show_cluster_views` shows that the node-07 locator became the new coordinator, as predicted in the previous step.
t_show_cluster_views -long
Output:
...
sb-locator-node-07-01.log:[info 2022/07/24 15:21:09.634 UTC sb-locator-node-07-01 <unicast receiver,node-07-41754> tid=0x1f]
└── 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000|6
├── 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000,
├── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v1>:41000,
├── 192.168.56.11(sb-node-01:21418)<v2>:41000{lead},
├── 192.168.56.12(sb-node-02:8275)<v3>:41000,
├── 192.168.56.13(sb-node-03:1067)<v4>:41000,
├── 192.168.56.14(sb-node-04:1152)<v5>:41000,
└── 192.168.56.15(sb-node-05:1715)<v6>:41000
sb-locator-node-07-01.log:[info 2022/07/24 15:26:55.151 UTC sb-locator-node-07-01 <ReconnectThread> tid=0x83]
└── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v0>:41000|0
└── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v0>:41000
sb-locator-node-07-01.log:[info 2022/07/24 15:26:55.465 UTC sb-locator-node-07-01 <Geode Membership View Creator> tid=0xa4]
└── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v0>:41000|1
├── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v0>:41000,
├── 192.168.56.13(sb-node-03:1067)<v1>:41000{lead},
├── 192.168.56.15(sb-node-05:1715)<v1>:41000,
└── 192.168.56.14(sb-node-04:1152)<v1>:41000
Running `t_show_membership_service_failure` after a few minutes shows the fatal membership service failure error and that the system has restarted. As in Type 1, the cluster restarted on its own and we lost the non-persistent data.
t_show_membership_service_failure
Output:
Member service failure
----------------------
sb-locator-node-06-01.log
├── [fatal 2022/07/24 15:22:55.307 UTC
│ └── Membership service failure: Exiting due to possible network partition event due to loss of 4 cache processes:
│ ├── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v1>:41000,
│ ├── 192.168.56.13(sb-node-03:1067)<v4>:41000,
│ ├── 192.168.56.14(sb-node-04:1152)<v5>:41000,
│ ├── 192.168.56.15(sb-node-05:1715)<v6>:41000
sb-locator-node-07-01.log
├── [fatal 2022/07/24 15:22:55.021 UTC
│ └── Membership service failure: Failed to acknowledge a new membership view and then failed tcp/ip connection attempt
├── 2022/07/24 15:27:18.511 UTC
│ └── system restarted
Even though the system has restarted, trying to log in to Pulse using the following node-07 URL fails.
http://node-07.local:7070/pulse
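To confirm from the command line that Pulse is not serving requests, a quick `curl` probe against the same URL is enough (a sketch; it assumes `curl` is available on `pnode.local`).

```bash
# Probe the Pulse URL on node-07. '-m 5' caps the wait at five seconds;
# a healthy Pulse returns an HTTP status code instead of timing out.
curl -m 5 -s -o /dev/null -w "%{http_code}\n" http://node-07.local:7070/pulse
```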
Looking at the node-07 log file, we see that the locator recorded that "Region /_ConfigurationRegion has potentially stale data" and "it is waiting for another member to recover the latest data." This is a private metadata region maintained by Geode/GemFire internals.
`gfsh` shows there are only three (3) data members in the cluster, as expected.
gfsh
gfsh>connect --locator=node-07.local[10334]
Connecting to Locator at [host=node-07.local, port=10334] ..
Connecting to Manager at [host=192.168.56.17, port=9051] ..
Successfully connected to: [host=192.168.56.17, port=9051]
You are connected to a cluster of version: 9.10.13
gfsh>list members
Member Count : 4
Name | Id
------------------ | ---------------------------------------------------------------------------
sb-node-03 | 192.168.56.13(sb-node-03:13785)<v1>:41000
sb-node-04 | 192.168.56.14(sb-node-04:13767)<v1>:41000
sb-node-05 | 192.168.56.15(sb-node-05:13780)<v1>:41000
sb-locator-node-07 | 192.168.56.17(sb-locator-node-07:15214:locator)<ec><v0>:41000 [Coordinator]
We also note that the log file is inundated with the last message shown in the output below.
# View the second locator, node-07, log file
show_log -log locator -num 2 -full
Output:
...
[warn 2022/07/24 15:27:13.558 UTC sb-locator-node-07-01 <Function Execution Processor3> tid=0xbc] Unable to retrieve the cluster configuration
org.apache.geode.distributed.LockServiceDestroyedException: <DLockService@260ab948 named __CLUSTER_CONFIG_LS destroyed=true grantorId=[LockGrantorId: lockGrantorMember=192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v0>:41000, lockGrantorVersion=1, lockGrantorSerialNumber=4] grantor=null> has been destroyed
...
[info 2022/07/24 15:27:18.511 UTC sb-locator-node-07-01 <main> tid=0x1] system restarted
...
[info 2022/07/24 15:27:28.785 UTC sb-locator-node-07-01 <Location services restart thread> tid=0x84] Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data.
My persistent id:
DiskStore ID: b52ca472-5ccd-46cb-a54a-4cd6c90b7591
Name: sb-locator-node-07-01
Location: /192.168.56.17:/home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-locator-node-07-01/ConfigDiskDir_sb-locator-node-07-01
Members with potentially new data:
[
DiskStore ID: c0f764b0-552a-4181-b1c9-411bbe504255
Name: sb-locator-node-06-01
Location: /192.168.56.16:/home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/run/sb-locator-node-06-01/ConfigDiskDir_sb-locator-node-06-01
]
Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
...
[info 2022/07/24 15:27:43.568 UTC sb-locator-node-07-01 <Function Execution Processor2> tid=0xbb] Received request for configuration: []
...
We now attempt to merge the splits.
cd_cluster sb/bin_sh
./merge_cluster
Running `t_show_member_join_requests` shows that the node-07 locator received join requests from all the nodes in Split A.
Member join requests:
---------------------
sb-locator-node-06-01.log
├── 2022/07/24 15:19:52.122 UTC
│ └── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec>:41000
├── 2022/07/24 15:21:02.929 UTC
│ └── 192.168.56.11(sb-node-01:21418):41000
├── 2022/07/24 15:21:04.419 UTC
│ └── 192.168.56.12(sb-node-02:8275):41000
├── 2022/07/24 15:21:05.637 UTC
│ └── 192.168.56.13(sb-node-03:1067):41000
├── 2022/07/24 15:21:07.105 UTC
│ └── 192.168.56.14(sb-node-04:1152):41000
├── 2022/07/24 15:21:08.627 UTC
│ └── 192.168.56.15(sb-node-05:1715):41000
sb-locator-node-07-01.log
├── 2022/07/24 15:26:17.424 UTC
│ └── 192.168.56.13(sb-node-03:1067):41000
├── 2022/07/24 15:26:20.874 UTC
│ └── 192.168.56.15(sb-node-05:1715):41000
├── 2022/07/24 15:26:21.495 UTC
│ └── 192.168.56.14(sb-node-04:1152):41000
├── 2022/07/24 15:34:50.708 UTC
│ └── 192.168.56.11(sb-node-01:21418):41000
├── 2022/07/24 15:34:51.086 UTC
│ └── 192.168.56.12(sb-node-02:8275):41000
Running `t_show_cluster_views` shows that the node-07 locator published a new cluster view with all the nodes intact.
t_show_cluster_views -long
Output:
...
sb-locator-node-07-01.log:[info 2022/07/24 15:34:56.847 UTC sb-locator-node-07-01 <Geode Membership View Creator> tid=0xa4]
└── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v0>:41000|4
├── 192.168.56.17(sb-locator-node-07-01:6963:locator)<ec><v0>:41000,
├── 192.168.56.13(sb-node-03:1067)<v1>:41000{lead},
├── 192.168.56.15(sb-node-05:1715)<v1>:41000,
├── 192.168.56.14(sb-node-04:1152)<v1>:41000,
├── 192.168.56.11(sb-node-01:21418)<v2>:41000,
├── 192.168.56.12(sb-node-02:8275)<v3>:41000,
└── 192.168.56.16(sb-locator-node-06-01:3075:locator)<ec><v4>:41000
Pulse also shows that all the members are back in the cluster but with only persistent data. Non-persistent data is lost.
Running `test_group` shows the persistent regions with data and the non-persistent regions without data.
./test_group -list
Output:
...
group: g1
- name: replicate_persistent_overflow
data: Region
size: 10000
cleared: false
- name: partition_persistent
data: Region
size: 10000
cleared: false
- name: partition
data: Region
size: 0
cleared: false
- name: replicate
data: Region
size: 0
cleared: false
- name: replicate_persistent
data: Region
size: 10000
cleared: false
- name: partition_overflow
data: Region
size: 0
cleared: false
- name: partition_persistent_overflow
data: Region
size: 10000
cleared: false
- name: replicate_overflow
data: Region
size: 0
cleared: false
...
- This test case split the network to isolate two (2) members and a locator from the cluster.
- We saw Pulse immediately go out of commission.
- We saw the split with more weight (Split B) win the quorum and the second locator in Split B become the new coordinator.
- The new cluster restarted by itself but got stuck with the "stale data" error, and we were unable to log in to Pulse on the new cluster.
- We then merged the splits and saw the Split A members immediately join the new cluster coordinated by the second locator.
- We were also able to connect to Pulse which showed all the members intact.
- However, we lost the non-persistent data. The auto-restarted cluster has only persistent data; the non-persistent regions can be re-ingested as shown in the sketch below.
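Since only the persistent regions survive the auto-restart, the simplest way to restore the non-persistent regions after the merge is to re-ingest with the bundle's own scripts, for example:

```bash
# Reload the non-persistent regions after the merge using the bundle's
# ingestion script, then verify the entry counts.
cd_app perf_test_sb/bin_sh
./test_group -run
./test_group -list
```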
Quorum without locators. In this test case, we split the network with a quorum that does not include the locators, i.e., Split A with the locators and Split B with a quorum but without locators, as shown below.
Split B quorum without locators.
| Split | Weight | VM Hosts |
| --- | --- | --- |
| A | 26 (43%) | node-01, node-02, node-06, node-07 (locators) |
| B | 35 (57%) | node-03 (lead), node-04, node-05 |
Restart the cluster and ingest data.
❗ Make sure to run `start_cluster_lead_3`, not `start_cluster_lead_1`, for this test case. In this test case, node-03 is the lead.
# Kill the cluster and clean all log and persistent files
kill_cluster -all
clean_cluster -all
# To make sure node-03 gets selected as the lead member, run 'start_cluster_lead_3' in 'bin_sh/'
cd_cluster sb/bin_sh
./start_cluster_lead_3
# Once the cluster is up, ingest data.
cd_app perf_test_sb/bin_sh
./test_group -run
# Check the entry count in each region to which we just ingested data
./test_group -list
From `pnode.local`, run `split_cluster` as follows:
switch_cluster sb/bin_sh
# Replace 'enp0s8' with the correct interface. Run './split_cluster -?' to see the usage.
./split_cluster -type 3 enp0s8
Pulse immediately reports suspects and freezes. It is no longer usable.
Running `t_show_cluster_views` shows that node-07 is the coordinator.
t_show_cluster_views
Output:
...
sb-locator-node-06.log: [warn 2022/08/10 17:56:30.307 UTC
└── these members failed to respond to the view change
├── 192.168.56.13(sb-node-03:18315)<v2>:41000
├── 192.168.56.15(sb-node-05:18318)<v6>:41000
sb-locator-node-07.log: 2022/08/10 17:53:31.185 UTC
└── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:21027:locator)<ec><v1>:41000
sb-locator-node-07.log: 2022/08/10 17:54:10.712 UTC
└── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000|2
├── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:21027:locator)<ec><v1>:41000
├── 192.168.56.13(sb-node-03:18315)<v2>:41000{lead}
sb-locator-node-07.log: 2022/08/10 17:54:11.877 UTC
└── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000|3
├── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:21027:locator)<ec><v1>:41000
├── 192.168.56.13(sb-node-03:18315)<v2>:41000{lead}
├── 192.168.56.11(sb-node-01:19826)<v3>:41000
sb-locator-node-07.log: 2022/08/10 17:54:13.304 UTC
└── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000|4
├── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:21027:locator)<ec><v1>:41000
├── 192.168.56.13(sb-node-03:18315)<v2>:41000{lead}
├── 192.168.56.11(sb-node-01:19826)<v3>:41000
├── 192.168.56.12(sb-node-02:18819)<v4>:41000
sb-locator-node-07.log: 2022/08/10 17:54:14.711 UTC
└── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000|5
├── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:21027:locator)<ec><v1>:41000
├── 192.168.56.13(sb-node-03:18315)<v2>:41000{lead}
├── 192.168.56.11(sb-node-01:19826)<v3>:41000
├── 192.168.56.12(sb-node-02:18819)<v4>:41000
├── 192.168.56.14(sb-node-04:18309)<v5>:41000
sb-locator-node-07.log: 2022/08/10 17:54:16.111 UTC
└── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000|6
├── 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:21027:locator)<ec><v1>:41000
├── 192.168.56.13(sb-node-03:18315)<v2>:41000{lead}
├── 192.168.56.11(sb-node-01:19826)<v3>:41000
├── 192.168.56.12(sb-node-02:18819)<v4>:41000
├── 192.168.56.14(sb-node-04:18309)<v5>:41000
├── 192.168.56.15(sb-node-05:18318)<v6>:41000
Now: 2022/08/10 18:01:30 UTC
Running `t_show_all_suspect_node_pairs` shows that node-02, node-03, and node-07 are suspects. These suspects do not line up with either Split A or Split B, so we cannot rely on this information alone to determine the network partition state.
t_show_all_suspect_node_pairs
Output:
Suspect: node-01
---------------------------------------
Not found.
Suspect: node-02
---------------------------------------
For:
├── 192.168.56.14(sb-node-04:18309)
└── 192.168.56.15(sb-node-05:18318)
First logged:
Last logged:
Suspect: node-03
---------------------------------------
For:
└── 192.168.56.14(sb-node-04:18309)
First logged:
Last logged:
2022-08-10 17:56:28
├── from: 192.168.56.13(sb-node-03:18315)<v2>:41000
├── for: 192.168.56.14(sb-node-04:18309)<v5>:41000
└── member unexpectedly shut down shared, unordered connection
Suspect: node-04
---------------------------------------
Not found.
Suspect: node-05
---------------------------------------
Not found.
Suspect: node-06
---------------------------------------
Not found.
Suspect: node-07
---------------------------------------
For:
└── 192.168.56.13(sb-node-03:18315)
First logged:
Last logged:
Now: 2022/08/10 18:04:53 UTC
Running `t_show_membership_service_failure` clearly shows the state of the cluster. The node-06 locator coordinator has declared a network partition with the Split B members listed. Unlike `t_show_all_suspect_node_pairs`, it accurately identified all the unreachable members. The `t_show_all_suspect_node_pairs` discrepancy is due to incomplete log messages being recorded while the cluster is experiencing the network failure. It is important to use all the available diagnostic tools before arriving at a conclusion.
t_show_membership_service_failure
Output:
Member service failure
----------------------
sb-locator-node-06.log
├── [fatal 2022/08/10 17:56:31.315 UTC
│ └── Membership service failure: Exiting due to possible network partition event due to loss of 3 cache processes:
│ ├── 192.168.56.13(sb-node-03:18315)<v2>:41000
│ ├── 192.168.56.14(sb-node-04:18309)<v5>:41000
│ └── 192.168.56.15(sb-node-05:18318)<v6>:41000
sb-locator-node-07.log
├── [fatal 2022/08/10 17:56:30.363 UTC
│ └── Membership service failure: Membership coordinator 192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000 has declared that a network partition has occurred
Now: 2022/08/10 18:07:10 UTC
`gfsh` is unable to connect to any of the locators.
gfsh
gfsh>connect --url=http://node-06.local:7070/geode-mgmt/v1
I/O error on GET request for "http://node-07.local:7070/geode-mgmt/v1/ping": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused)
`test_group -list` is also unable to connect to the cluster.
cd_app perf_test_sb/bin_sh
./test_group -list
Output:
...
Exception in thread "main" org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [LocatorAddress [socketInetAddress=node-06.local/192.168.56.16:10334, hostname=node-06.local, isIpString=false], LocatorAddress [socketInetAddress=node-07.local/192.168.56.17:10334, hostname=node-07.local, isIpString=false]]
...
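The `NoAvailableLocatorsException` alone does not tell us whether the locator ports are network-unreachable or the locators are up but refusing to serve. A quick TCP probe from `pnode.local` helps distinguish the two; the following is a sketch using bash's built-in `/dev/tcp`.

```bash
# Check whether the locator ports still accept TCP connections.
# Each probe times out after 3 seconds.
for host in node-06.local node-07.local; do
  if timeout 3 bash -c ": </dev/tcp/$host/10334" 2>/dev/null; then
    echo "$host:10334 reachable (TCP)"
  else
    echo "$host:10334 not reachable"
  fi
done
```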
Running `show_cluster` shows all the members are still running.
show_cluster
Output:
CLUSTER: sb
CLUSTER_DIR: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb
PRODUCT: geode
CLUSTER_TYPE: gemfire
Run Type: default
Deployment: VM
Locators Running: 2/2
Members Running: 5/5
Version: 1.14.4
Switch Cluster: switch_rwe rwe-bundles bundle-geode-1-app-perf_test_sb-cluster-sb; switch_cluster sb
We now attempt to merge the splits.
cd_cluster sb/bin_sh
./merge_cluster
After a few minutes, running `t_show_membership_service_failure` shows that the locators are not able to recover the cluster.
Looking at the node-07 locator log, we see it is trying to establish a quorum but fails because the total weight of the four (4) responding members is only 26, which is less than the quorum threshold of 31. Unfortunately, the log does not reveal the responding member names. We can guess they are 2 data members and 2 locators (2*10 + 2*3 = 26). They should be the members in Split A, which would include node-01 as the lead with a weight of 15. However, the `t_show_cluster_views` output above shows that the lead has changed to node-03, which explains why the contacted weight is only 26 and falls short of the quorum.
# View node-07 locator log
show_log -log locator -num 2
Output:
...
[info 2022/08/10 18:27:32.866 UTC sb-locator-node-07 <Location services restart thread> tid=0x5d] quorum check: timeout waiting for responses. 4 responses received
[info 2022/08/10 18:27:32.867 UTC sb-locator-node-07 <Location services restart thread> tid=0x5d] quorum check: contacted 4 processes with 26 member weight units. Threshold for a quorum is 31
[info 2022/08/10 18:28:02.893 UTC sb-locator-node-07 <Location services restart thread> tid=0x5d] beginning quorum check with GMSQuorumChecker on view View[192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000|-100] members: [192.168.56.16(sb-locator-node-06:21653:locator)<ec><v0>:41000, 192.168.56.17(sb-locator-node-07:21027:locator)<ec><v1>:41000, 192.168.56.13(sb-node-03:18315)<v2>:41000{lead}, 192.168.56.11(sb-node-01:19826)<v3>:41000, 192.168.56.12(sb-node-02:18819)<v4>:41000, 192.168.56.14(sb-node-04:18309)<v5>:41000, 192.168.56.15(sb-node-05:18318)<v6>:41000]
...
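For reference, the quorum arithmetic can be reconstructed from the weights observed in these tests (10 per data node, plus 5 for the lead member, 3 per locator); the sketch below reproduces the contacted weight of 26 from the log and the Split B weight of 35 from the table above.

```bash
# Arithmetic behind the quorum-check numbers. With the lead on node-03
# (Split B), Split A can only muster 26, short of the logged threshold of 31
# (just over half of the total weight of 61).
MEMBER=10; LEAD_BONUS=5; LOCATOR=3
total=$(( 5*MEMBER + LEAD_BONUS + 2*LOCATOR ))   # 61
split_a=$(( 2*MEMBER + 2*LOCATOR ))              # node-01, node-02 + both locators = 26
split_b=$(( 3*MEMBER + LEAD_BONUS ))             # node-03 (lead), node-04, node-05 = 35
echo "total=$total split_a=$split_a split_b=$split_b"
```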
We can try to add a new node to the cluster to increase the total weight, but the locators will refuse connections.
The cluster has no way of restarting on its own; it is stuck in a quorum check loop indefinitely.
- This test case split the network to isolate three (3) members from the cluster.
- We saw Pulse immediately go out of commission.
- We saw the second locator become the coordinator and continuously fail to restart the cluster due to an insufficient quorum.
- We saw the lead member change from Split A to Split B, taking away the weight needed to satisfy a quorum.
- Because of the insufficient quorum, the coordinator is unable to restart the cluster and is stuck in a retry loop indefinitely.
- The cluster must be manually restarted; see the sketch below.
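A manual restart using the bundle's own scripts might look like the following sketch. Note that `clean_cluster` is deliberately omitted so the persistent stores are preserved.

```bash
# Stop every locator and member, then bring the cluster back up with the
# bundle's start script (node-03 as the lead for this test case).
kill_cluster -all
cd_cluster sb/bin_sh
./start_cluster_lead_3
```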
Locators isolated, data nodes unreachable. In this test case, we isolate locators, i.e., Split A with locators but without data nodes, as shown below.
| Split | Weight | VM Hosts |
| --- | --- | --- |
| A | 6 (10%) | node-06, node-07 (locators) |
| B | 55 (90%) | node-01 (lead), node-02, node-03, node-04, node-05 |
Restart the cluster and ingest data.
# Kill the cluster and clean all log and persistent files
kill_cluster -all
clean_cluster -all
# To make sure node-01 gets selected as the lead member, run 'start_cluster_lead_1' in 'bin_sh/'
cd_cluster sb/bin_sh
./start_cluster_lead_1
# Once the cluster is up, ingest data.
cd_app perf_test_sb/bin_sh
./test_group -run
# Check the entry count in each region to which we just ingested data
./test_group -list
From `pnode.local`, run `split_cluster` as follows:
switch_cluster sb/bin_sh
# Replace 'enp0s8' with the correct interface. Run './split_cluster -?' to see the usage.
./split_cluster -type 4 enp0s8
Pulse immediately reports suspects and freezes. It is no longer usable.
Running `t_show_all_suspect_node_pairs` shows that node-06 and node-07 are suspects for all the data nodes. This is expected since both locators are unreachable by the data nodes.
t_show_all_suspect_node_pairs
Output:
Suspect: node-01
---------------------------------------
For:
└── 192.168.56.12(sb-node-02:20666)
First logged:
Last logged:
Suspect: node-02
---------------------------------------
Not found.
Suspect: node-03
---------------------------------------
Not found.
Suspect: node-04
---------------------------------------
For:
└── 192.168.56.12(sb-node-02:20666)
First logged:
Last logged:
2022-08-10 18:33:32
├── from: 192.168.56.14(sb-node-04:19970)<v2>:41000
├── for: 192.168.56.12(sb-node-02:20666)<v2>:41000
└── member unexpectedly shut down shared, unordered connection
Suspect: node-05
---------------------------------------
For:
└── 192.168.56.12(sb-node-02:20666)
First logged:
2022-08-10 18:33:32
├── from: 192.168.56.15(sb-node-05:19975)<v2>:41000
├── for: 192.168.56.12(sb-node-02:20666)<v2>:41000
└── member unexpectedly shut down shared, unordered connection
Last logged:
Suspect: node-06
---------------------------------------
For:
├── 192.168.56.11(sb-node-01:21684)
├── 192.168.56.13(sb-node-03:20025)
├── 192.168.56.14(sb-node-04:19970)
├── 192.168.56.14(sb-node-04:19970)
└── 192.168.56.15(sb-node-05:19975)
First logged:
Last logged:
Suspect: node-07
---------------------------------------
For:
├── 192.168.56.11(sb-node-01:21684)
├── 192.168.56.12(sb-node-02:20666)
├── 192.168.56.13(sb-node-03:20025)
├── 192.168.56.14(sb-node-04:19970)
└── 192.168.56.15(sb-node-05:19975)
First logged:
Last logged:
Now: 2022/08/10 18:35:10 UTC
Running `t_show_cluster_views` shows that initially both locators (node-06 and node-07) could only see themselves, but at `18:31:25` both recorded a complete view of all the data nodes. We also see that after 5:31 minutes, at `18:33:45.230`, four (4) data nodes failed to respond to the view change. We expect all five (5) data nodes to fail to respond, however. The cluster views do not represent the current state of the cluster, so we need to investigate the log files further.
t_show_cluster_views -long
Output:
Received Views
--------------
sb-locator-node-06.log:[info 2022/08/10 18:30:14.797 UTC sb-locator-node-06 <main> tid=0x1]
└── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000|0
├── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000
sb-locator-node-06.log:[info 2022/08/10 18:30:15.129 UTC sb-locator-node-06 <Geode Membership View Creator> tid=0x25]
└── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:23497:locator)<ec><v1>:41000
sb-locator-node-06.log:[info 2022/08/10 18:31:25.488 UTC sb-locator-node-06 <Geode Membership View Creator> tid=0x25]
└── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000|2
├── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:23497:locator)<ec><v1>:41000
├── 192.168.56.12(sb-node-02:20666)<v2>:41000{lead}
├── 192.168.56.11(sb-node-01:21684)<v2>:41000
├── 192.168.56.15(sb-node-05:19975)<v2>:41000
├── 192.168.56.13(sb-node-03:20025)<v2>:41000
├── 192.168.56.14(sb-node-04:19970)<v2>:41000
sb-locator-node-06.log:[warn 2022/08/10 18:33:45.230 UTC sb-locator-node-06 <Geode Membership View Creator> tid=0x25]
└── these members failed to respond to the view change
├── 192.168.56.11(sb-node-01:21684)<v2>:41000
├── 192.168.56.13(sb-node-03:20025)<v2>:41000
├── 192.168.56.14(sb-node-04:19970)<v2>:41000
├── 192.168.56.15(sb-node-05:19975)<v2>:41000
sb-locator-node-07.log:[info 2022/08/10 18:30:15.087 UTC sb-locator-node-07 <unicast receiver,node-07-39292> tid=0x1e]
└── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:23497:locator)<ec><v1>:41000
sb-locator-node-07.log:[info 2022/08/10 18:31:25.444 UTC sb-locator-node-07 <unicast receiver,node-07-39292> tid=0x1e]
└── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000|2
├── 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:23497:locator)<ec><v1>:41000
├── 192.168.56.12(sb-node-02:20666)<v2>:41000{lead}
├── 192.168.56.11(sb-node-01:21684)<v2>:41000
├── 192.168.56.15(sb-node-05:19975)<v2>:41000
├── 192.168.56.13(sb-node-03:20025)<v2>:41000
├── 192.168.56.14(sb-node-04:19970)<v2>:41000
Now: 2022/08/10 18:35:38 UTC
Since the cluster views do not reflect the current state of the cluster, we turn to `t_show_membership_service_failure`, which shows what we are expecting. We can see from the output below that a network partition has occurred and all five (5) data nodes are unreachable by the node-06 locator.
t_show_membership_service_failure
Output:
Member service failure
----------------------
sb-locator-node-06.log
├── [fatal 2022/08/10 18:33:56.253 UTC
│ └── Membership service failure: Exiting due to possible network partition event due to loss of 5 cache processes:
│ ├── 192.168.56.11(sb-node-01:21684)<v2>:41000
│ ├── 192.168.56.12(sb-node-02:20666)<v2>:41000
│ ├── 192.168.56.13(sb-node-03:20025)<v2>:41000
│ ├── 192.168.56.14(sb-node-04:19970)<v2>:41000
│ └── 192.168.56.15(sb-node-05:19975)<v2>:41000
sb-locator-node-07.log
├── [fatal 2022/08/10 18:33:55.202 UTC
│ └── Membership service failure: Membership coordinator 192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000 has declared that a network partition has occurred
Now: 2022/08/10 18:38:34 UTC
Attempting to connect to the cluster using `gfsh` fails for both locators, as shown below.
gfsh>connect --url=http://node-06.local:7070/geode-mgmt/v1
I/O error on GET request for "http://node-06.local:7070/geode-mgmt/v1/ping": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused)
We now attempt to merge the splits.
cd_cluster sb/bin_sh
./merge_cluster
The node-06 locator log shows that both locators are sending "quorum check" requests to all the members, but none of the data nodes are responding, as shown below. The locator is able to contact two (2) members at `18:42:12.286` with a total weight of 6, indicating that they are the two locators (each locator has a weight of 3). The node-06 locator continuously tries to contact the data nodes but fails to get responses, and therefore the cluster cannot be reinstated.
# View node-06 locator log
show_log -log locator
Output:
...
[info 2022/08/10 18:42:11.785 UTC sb-locator-node-06 <Location services restart thread> tid=0x87] quorum check: waiting up to 456ms to receive a quorum of responses
[info 2022/08/10 18:42:12.286 UTC sb-locator-node-06 <Location services restart thread> tid=0x87] quorum check: timeout waiting for responses. 2 responses received
[info 2022/08/10 18:42:12.287 UTC sb-locator-node-06 <Location services restart thread> tid=0x87] quorum check: contacted 2 processes with 6 member weight units. Threshold for a quorum is 31
[info 2022/08/10 18:42:42.324 UTC sb-locator-node-06 <Location services restart thread> tid=0x87] beginning quorum check with GMSQuorumChecker on view View[192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000|-
100] members: [192.168.56.16(sb-locator-node-06:23888:locator)<ec><v0>:41000, 192.168.56.17(sb-locator-node-07:23497:locator)<ec><v1>:41000, 192.168.56.12(sb-node-02:20666)<v2>:41000{lead}, 192.168.56.11(sb-node-01:2168
4)<v2>:41000, 192.168.56.15(sb-node-05:19975)<v2>:41000, 192.168.56.13(sb-node-03:20025)<v2>:41000, 192.168.56.14(sb-node-04:19970)<v2>:41000]
[info 2022/08/10 18:42:42.325 UTC sb-locator-node-06 <Location services restart thread> tid=0x87] quorum check: sending request to 192.168.56.12(sb-node-02:20666)<v2>:41000
...
[info 2022/08/10 18:43:02.877 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] Serial Queue info : THROTTLE_PERCENT: 0.75 SERIAL_QUEUE_BYTE_LIMIT :41943040 SERIAL_QUEUE_THROTTLE :31457280 TOTAL_SERIAL_QUEUE_BYTE_LIMIT
:83886080 TOTAL_SERIAL_QUEUE_THROTTLE :31457280 SERIAL_QUEUE_SIZE_LIMIT :20000 SERIAL_QUEUE_SIZE_THROTTLE :15000
[info 2022/08/10 18:43:02.878 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] Starting membership services
[info 2022/08/10 18:43:02.879 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] Established local address 192.168.56.16(sb-locator-node-06:23888:locator)<ec>:41000
[info 2022/08/10 18:43:02.879 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] This member is hosting a locator will be preferred as a membership coordinator
[info 2022/08/10 18:43:02.879 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] JGroups channel reinitialized (took 0ms)
[info 2022/08/10 18:43:02.880 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] GemFire P2P Listener started on /192.168.56.16:57185
[info 2022/08/10 18:43:02.880 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] Delivering 0 messages queued by quorum checker
[info 2022/08/10 18:43:02.880 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] Peer locator is connecting to local membership services with ID 192.168.56.16(sb-locator-node-06:23888:locator)<ec>:41000
[info 2022/08/10 18:43:02.880 UTC sb-locator-node-06 <Geode Failure Detection Server thread 1> tid=0x11a] Started failure detection server thread on /192.168.56.16:41982.
[info 2022/08/10 18:43:02.883 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] Discovery state after looking for membership coordinator is locatorsContacted=0; findInViewResponses=0; alreadyTried=[]; registrants=[]; p
ossibleCoordinator=null; viewId=-100; hasContactedAJoinedLocator=false; view=null; responses=[]
[info 2022/08/10 18:43:02.883 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] Stopping membership services
[info 2022/08/10 18:43:02.883 UTC sb-locator-node-06 <Geode Failure Detection Server thread 1> tid=0x11a] GMSHealthMonitor server thread exiting
[warn 2022/08/10 18:43:02.884 UTC sb-locator-node-06 <ReconnectThread> tid=0x86] Caught GemFireConfigException in reconnect
org.apache.geode.GemFireConfigException: Unable to join the distributed system. Operation either timed out, was stopped or Locator does not exist.
at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:182)
at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:220)
...
Running `show_cluster` shows all the members are still running.
show_cluster
Output:
CLUSTER: sb
CLUSTER_DIR: /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb
PRODUCT: geode
CLUSTER_TYPE: gemfire
Run Type: default
Deployment: VM
Locators Running: 2/2
Members Running: 5/5
Version: 9.10.13
Switch Cluster: switch_rwe rwe-bundles bundle-geode-1-app-perf_test_sb-cluster-sb; switch_cluster sb
Checking the member log files, we see that their repeated attempts to connect to the node-06 locator are failing. The log content below shows that the membership services fail to start. The member logs the message, "Delivering 0 messages queued by quorum checker", which indicates that it is unable to deliver any quorum information. In the meantime, as we saw in the previous step, the node-06 locator (coordinator) has a weight sum of only 6 and is waiting indefinitely for member responses to reach the quorum threshold of 31.
# Show member #1 log
show_log -num 1
Output:
...
2022-08-10 18:46:17 INFO InternalDistributedSystem:2537 - Disconnecting old DistributedSystem to prepare for a reconnect attempt
2022-08-10 18:47:17 INFO InternalDistributedSystem:2559 - Attempting to reconnect to the distributed system. This is attempt #12.
2022-08-10 18:47:18 INFO LoggingProviderLoader:69 - Using org.apache.geode.logging.log4j.internal.impl.Log4jLoggingProvider from ServiceLoader for service org.apache.geode.logging.internal.spi.LoggingProvider
2022-08-10 18:47:18 INFO InternalDataSerializer:416 - initializing InternalDataSerializer with 11 services
2022-08-10 18:47:18 INFO ClusterOperationExecutors:187 - Serial Queue info : THROTTLE_PERCENT: 0.75 SERIAL_QUEUE_BYTE_LIMIT :41943040 SERIAL_QUEUE_THROTTLE :31457280 TOTAL_SERIAL_QUEUE_BYTE_LIMIT :83886080 TOTAL_SERIAL_QUEUE_THROTTLE :31457280 SERIAL_QUEUE_SIZE_LIMIT :20000 SERIAL_QUEUE_SIZE_THROTTLE :15000
2022-08-10 18:47:18 INFO Services:201 - Starting membership services
2022-08-10 18:47:18 INFO Services:562 - Established local address 192.168.56.11(sb-node-01:21684):41000
2022-08-10 18:47:18 INFO Services:398 - JGroups channel reinitialized (took 0ms)
2022-08-10 18:47:18 INFO DirectChannel:143 - GemFire P2P Listener started on /192.168.56.11:60172
2022-08-10 18:47:18 INFO Services:421 - Delivering 0 messages queued by quorum checker
2022-08-10 18:47:18 INFO Services:705 - Started failure detection server thread on /192.168.56.11:60633.
2022-08-10 18:47:18 INFO Services:330 - Discovery state after looking for membership coordinator is locatorsContacted=0; findInViewResponses=0; alreadyTried=[]; registrants=[]; possibleCoordinator=null; viewId=-100; hasContactedAJoinedLocator=false; view=null; responses=[]
2022-08-10 18:47:18 INFO Services:287 - Stopping membership services
2022-08-10 18:47:18 INFO Services:733 - GMSHealthMonitor server thread exiting
2022-08-10 18:47:18 WARN InternalDistributedSystem:2619 - Caught GemFireConfigException in reconnect
org.apache.geode.GemFireConfigException: Unable to join the distributed system. Operation either timed out, was stopped or Locator does not exist.
at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:182) ~[geode-core-9.10.13.jar:?]
at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:220) ~[geode-core-9.10.13.jar:?]
...
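A quick way to gauge how long the members have been retrying is to count the reconnect attempts directly in the member logs (a sketch; adjust the log path to your workspace).

```bash
# Each "Attempting to reconnect" line corresponds to one reconnect attempt
# (attempt #12 is visible in the excerpt above).
LOG_DIR=/home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log
grep -c "Attempting to reconnect to the distributed system" "$LOG_DIR"/sb-node-*.log
```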
We can conclude from the log messages that the cluster will not be reinstated. We now manually restart one of the members to see if it is able to connect to the locators.
# Stop the first member
stop_member
# Check if it is down
show_cluster
# Restart the first member
start_member
# View the first member's log
show_log
Output:
...
2022-08-10 18:51:55 INFO InternalDataSerializer:416 - initializing InternalDataSerializer with 11 services
2022-08-10 18:51:55 INFO ClusterOperationExecutors:187 - Serial Queue info : THROTTLE_PERCENT: 0.75 SERIAL_QUEUE_BYTE_LIMIT :41943040 SERIAL_QUEUE_THROTTLE :31457280 TOTAL_SERIAL_QUEUE_BYTE_LIMIT :83886080 TOTAL_SERIAL_QUEUE_THROTTLE :31457280 SERIAL_QUEUE_SIZE_LIMIT :20000 SERIAL_QUEUE_SIZE_THROTTLE :15000
2022-08-10 18:51:55 INFO Services:201 - Starting membership services
2022-08-10 18:51:55 INFO UNICAST3:71 - both the regular and OOB thread pools are disabled; UNICAST3 could be removed (JGRP-2069)
2022-08-10 18:51:55 INFO Services:562 - Established local address 192.168.56.11(sb-node-01:23571):41000
2022-08-10 18:51:55 INFO Services:398 - JGroups channel created (took 236ms)
2022-08-10 18:51:55 INFO DirectChannel:143 - GemFire P2P Listener started on /192.168.56.11:46604
2022-08-10 18:51:55 INFO Services:705 - Started failure detection server thread on /192.168.56.11:59374.
2022-08-10 18:51:55 INFO Services:330 - Discovery state after looking for membership coordinator is locatorsContacted=0; findInViewResponses=0; alreadyTried=[]; registrants=[]; possibleCoordinator=null; viewId=-100; hasContactedAJoinedLocator=false; view=null; responses=[]
2022-08-10 18:51:55 INFO Services:287 - Stopping membership services
2022-08-10 18:51:55 INFO Services:733 - GMSHealthMonitor server thread exiting
As expected, the output above shows that we are not able to restart any of the members. They fail to connect to the node-06 locator. The only choice we have is to restart the entire cluster including the locators.
- This test case split the network to isolate both locators from the cluster.
- We saw Pulse immediately go out of commission.
- We saw both locators as suspects for all members.
- We saw the cluster views recorded by the locators do not represent the current state of the cluster.
- But the locators clearly declared a network partition by recording that all members are not reachable.
- Merging partitioned clusters did not reinstate the cluster. Both locators and members are unable to connect to each other.
- The cluster must be manually restarted.
Data nodes isolated, locators reachable. In this test case, we split the network to isolate data nodes but the locators are still reachable by all data nodes, i.e., both Split A and Split B with locators, as shown below.
| Split | Weight | VM Hosts |
| --- | --- | --- |
| A | 31 (51%) | node-01 (lead), node-02, node-06, node-07 (locators) |
| B | 36 (59%) | node-03, node-04, node-05, node-06, node-07 (locators) |
Restart the cluster and ingest data.
# Kill the cluster and clean all log and persistent files
kill_cluster -all
clean_cluster -all
# To make sure node-01 gets selected as the lead member, run 'start_cluster_lead_1' in 'bin_sh/'
cd_cluster sb/bin_sh
./start_cluster_lead_1
# Once the cluster is up, ingest data.
cd_app perf_test_sb/bin_sh
./test_group -run
# Check the entry count in each region to which we just ingested data
./test_group -list
From `pnode.local`, run `split_cluster` as follows:
switch_cluster sb/bin_sh
# Replace 'enp0s8' with the correct interface. Run './split_cluster -?' to see the usage.
./split_cluster -type 5 enp0s8
Pulse continues to run, showing all members intact.
Running `t_show_cluster_views` shows that all members are running.
t_show_cluster_views
Output:
Received Views
--------------
sb-locator-node-06.log:[info 2022/08/10 22:19:27.729 UTC sb-locator-node-06 <main> tid=0x1]
└── 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000|0
├── 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000
...
sb-locator-node-06.log: 2022/08/10 22:21:14.006 UTC
└── 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000|6
├── 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:28906:locator)<ec><v1>:41000
├── 192.168.56.11(sb-node-01:27941)<v2>:41000{lead}
├── 192.168.56.12(sb-node-02:26124)<v3>:41000
├── 192.168.56.13(sb-node-03:25356)<v4>:41000
├── 192.168.56.14(sb-node-04:25190)<v5>:41000
├── 192.168.56.15(sb-node-05:25149)<v6>:41000
sb-locator-node-07.log: 2022/08/10 22:20:27.570 UTC
└── 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000|1
├── 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:28906:locator)<ec><v1>:41000
...
sb-locator-node-07.log: 2022/08/10 22:21:14.008 UTC
└── 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000|6
├── 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000
├── 192.168.56.17(sb-locator-node-07:28906:locator)<ec><v1>:41000
├── 192.168.56.11(sb-node-01:27941)<v2>:41000{lead}
├── 192.168.56.12(sb-node-02:26124)<v3>:41000
├── 192.168.56.13(sb-node-03:25356)<v4>:41000
├── 192.168.56.14(sb-node-04:25190)<v5>:41000
├── 192.168.56.15(sb-node-05:25149)<v6>:41000
Now: 2022/08/10 22:23:21 UTC
Running `t_show_all_suspect_node_pairs` shows node-02 is a suspect for node-03, node-04, and node-05. The "First logged" and "Last logged" messages consistently show that node-02 is a suspect (from) for node-03, node-04, and node-05.
t_show_all_suspect_node_pairs
Output:
Suspect: node-01
---------------------------------------
Not found.
Suspect: node-02
---------------------------------------
For:
├── 192.168.56.13(sb-node-03:25356)
├── 192.168.56.14(sb-node-04:25190)
└── 192.168.56.15(sb-node-05:25149)
First logged:
2022/08/10 22:22:44.623 UTC
├── log: sb-locator-node-06.log
├── from: 192.168.56.12(sb-node-02:26124)<v3>:41000
├── for: 192.168.56.13(sb-node-03:25356)<v4>:41000
└── msg: Member isn't responding to heartbeat requests
2022/08/10 22:22:47.104 UTC
├── log: sb-locator-node-06.log
├── from: 192.168.56.12(sb-node-02:26124)<v3>:41000
├── for: 192.168.56.14(sb-node-04:25190)<v5>:41000
└── msg: Member isn't responding to heartbeat requests
2022/08/10 22:22:49.602 UTC
├── log: sb-locator-node-06.log
├── from: 192.168.56.12(sb-node-02:26124)<v3>:41000
├── for: 192.168.56.15(sb-node-05:25149)<v6>:41000
└── msg: Member isn't responding to heartbeat requests
Last logged:
2022-08-10 22:22:49
├── log: sb-node-05.log
├── from: 192.168.56.12(sb-node-02:26124)<v3>:41000
├── for: 192.168.56.15(sb-node-05:25149)<v6>:41000
└── msg: Member isn't responding to heartbeat requests
2022-08-10 23:00:34
├── log: sb-node-05.log
├── from: 192.168.56.12(sb-node-02:26124)<v3>:41000
├── for: 192.168.56.13(sb-node-03:25356)<v4>:41000
└── msg: Member isn't responding to heartbeat requests
2022-08-10 23:00:39
├── log: sb-node-05.log
├── from: 192.168.56.12(sb-node-02:26124)<v3>:41000
├── for: 192.168.56.14(sb-node-04:25190)<v5>:41000
└── msg: Member isn't responding to heartbeat requests
Suspect: node-03
---------------------------------------
Not found.
Suspect: node-04
---------------------------------------
Not found.
Suspect: node-05
---------------------------------------
Not found.
Suspect: node-06
---------------------------------------
Not found.
Suspect: node-07
---------------------------------------
Not found.
Suspect: myself
---------------------------------------
For:
├── 192.168.56.13(sb-node-03:25356)
├── 192.168.56.14(sb-node-04:25190)
└── 192.168.56.15(sb-node-05:25149)
First logged:
2022-08-10 22:22:44
├── log: sb-node-02.log
├── from: myself
├── for: 192.168.56.13(sb-node-03:25356)<v4>:41000
└── msg: Member isn't responding to heartbeat requests
2022-08-10 22:22:47
├── log: sb-node-02.log
├── from: myself
├── for: 192.168.56.14(sb-node-04:25190)<v5>:41000
└── msg: Member isn't responding to heartbeat requests
2022-08-10 22:22:49
├── log: sb-node-02.log
├── from: myself
├── for: 192.168.56.15(sb-node-05:25149)<v6>:41000
└── msg: Member isn't responding to heartbeat requests
Last logged:
2022-08-10 22:22:49
├── log: sb-node-02.log
├── from: myself
├── for: 192.168.56.15(sb-node-05:25149)<v6>:41000
└── msg: Member isn't responding to heartbeat requests
2022-08-10 23:01:02
├── log: sb-node-02.log
├── from: myself
├── for: 192.168.56.13(sb-node-03:25356)<v4>:41000
└── msg: Member isn't responding to heartbeat requests
2022-08-10 23:01:07
├── log: sb-node-02.log
├── from: myself
├── for: 192.168.56.14(sb-node-04:25190)<v5>:41000
└── msg: Member isn't responding to heartbeat requests
Now: 2022/08/10 23:01:09 UTC
Running `show_log -all` shows that every member is logging suspect messages.
show_log -all
...
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-01.log <==
2022-08-10 22:50:14 INFO Services:1187 - received suspect message from 192.168.56.12(sb-node-02:26124)<v3>:41000 for 192.168.56.13(sb-node-03:25356)<v4>:41000: Member isn't responding to heartbeat requests
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-02.log <==
2022-08-10 22:50:14 INFO Services:1187 - received suspect message from myself for 192.168.56.13(sb-node-03:25356)<v4>:41000: Member isn't responding to heartbeat requests
2022-08-10 22:50:14 INFO Services:974 - No longer suspecting 192.168.56.13(sb-node-03:25356)<v4>:41000
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-03.log <==
2022-08-10 22:50:14 INFO Services:1187 - received suspect message from 192.168.56.12(sb-node-02:26124)<v3>:41000 for 192.168.56.13(sb-node-03:25356)<v4>:41000: Member isn't responding to heartbeat requests
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-04.log <==
2022-08-10 22:50:14 INFO Services:1187 - received suspect message from 192.168.56.12(sb-node-02:26124)<v3>:41000 for 192.168.56.13(sb-node-03:25356)<v4>:41000: Member isn't responding to heartbeat requests
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-05.log <==
2022-08-10 22:50:14 INFO Services:1187 - received suspect message from 192.168.56.12(sb-node-02:26124)<v3>:41000 for 192.168.56.13(sb-node-03:25356)<v4>:41000: Member isn't responding to heartbeat requests
Running `gfsh` shows all the members are still running in the cluster.
gfsh
gfsh>connect --url=http://node-06.local:7070/geode-mgmt/v1
Successfully connected to: GemFire Manager HTTP service @ http://node-06.local:7070/geode-mgmt/v1
You are connected to a cluster of version: 1.14.4
gfsh>list members
Member Count : 7
Name | Id
--------------------- | ------------------------------------------------------------------------------
sb-node-01 | 192.168.56.11(sb-node-01:20903)<v2>:41000
sb-node-02 | 192.168.56.12(sb-node-02:26025)<v3>:41000
sb-node-03 | 192.168.56.13(sb-node-03:24619)<v4>:41000
sb-node-04 | 192.168.56.14(sb-node-04:24124)<v5>:41000
sb-node-05 | 192.168.56.15(sb-node-05:23113)<v6>:41000
sb-locator-node-06-01 | 192.168.56.16(sb-locator-node-06-01:26294:locator)<ec><v0>:41000 [Coordinator]
sb-locator-node-07-01 | 192.168.56.17(sb-locator-node-07-01:24013:locator)<ec><v1>:41000
Running a region command from `gfsh` hangs indefinitely, however.
```bash
gfsh>describe region --name=/partition_persistent
```
Running `t_show_stuck_threads` shows there are stuck threads.
t_show_stuck_threads
Output:
Stuck Threads
-------------
Stuck thread(s)
├── first: sb-node-01.log:2022-08-10 23:15:07 WARN ThreadsMonitoringProcess:81 - There is 1 stuck thread in this node
├── last: sb-node-05.log:2022-08-10 23:16:12 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
└── reply:
Now: 2022/08/10 23:16:35 UTC
Looking at member logs shows there are stuck threads.
show_log -all
Output:
...
2022-08-10 23:19:12 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-05.log <==
2022-08-10 23:19:12 WARN ThreadsMonitoringProcess:63 - Thread 90 (0x5a) is stuck
2022-08-10 23:19:12 WARN AbstractExecutor:51 - Thread <90> (0x5a) that was executed at <10 Aug 2022 23:13:53 UTC> has been stuck for <319.649 seconds> and number of thread monitor iteration <5>
Thread Name <Function Execution Processor2> state <TIMED_WAITING>
Waiting on <java.util.concurrent.CountDownLatch$Sync@32654086>
Executor Group <FunctionExecutionPooledExecutor>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
...
2022-08-10 23:19:12 WARN ThreadsMonitoringProcess:63 - Thread 124 (0x7c) is stuck
2022-08-10 23:19:12 WARN AbstractExecutor:51 - Thread <124> (0x7c) that was executed at <10 Aug 2022 23:15:10 UTC> has been stuck for <242.47 seconds> and number of thread monitor iteration <4>
Thread Name <Function Execution Processor3> state <TIMED_WAITING>
Waiting on <java.util.concurrent.CountDownLatch$Sync@389d2bf6>
Executor Group <FunctionExecutionPooledExecutor>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:733)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:804)
...
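To see exactly what the stuck Function Execution Processor threads are blocked on, a full thread dump of the member JVM is more informative than the periodic warnings. The following is a sketch run on one of the data node VMs; it assumes the JDK's `jps` and `jstack` are on the PATH, and the grep pattern may need adjusting to how the member was launched.

```bash
# Find the Geode member JVM on this VM and capture a thread dump. The main
# class reported by jps may differ depending on how the member was started.
pid=$(jps -l | grep -i geode | awk '{print $1}' | head -1)
jstack "$pid" > /tmp/sb-member-threads.txt

# Show the stuck function-execution threads and their stacks.
grep -A 20 "Function Execution Processor" /tmp/sb-member-threads.txt
```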
We now attempt to merge the splits.
cd_cluster sb/bin_sh
./merge_cluster
A few seconds after merging the splits, the stuck threads finally receive their replies, as shown below.
show_log -all
Output:
...
2022-08-10 23:23:12 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-01.log <==
2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-03.log <==
2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-05.log <==
2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-02.log <==
2022-08-10 23:23:27 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
2022-08-10 23:23:27 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
==> /home/vagrant/rwe-bundles/bundle-geode-1-app-perf_test_sb-cluster-sb/clusters/sb/log/sb-node-04.log <==
2022-08-10 23:23:27 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
2022-08-10 23:23:27 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
`gfsh` also returns with output.
gfsh>describe region --name=/partition_persistent
Name : partition_persistent
Data Policy : persistent partition
Hosting Members : sb-node-01
sb-node-02
sb-node-03
sb-node-04
sb-node-05
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
--------- | ---------------- | --------------------
Region | size | 10000
| data-policy | PERSISTENT_PARTITION
Partition | redundant-copies | 1
Running `test_group -list` shows that both persistent and non-persistent data are fully intact.
./test_group -list
Output:
...
group: g1
- name: partition_persistent
data: Region
size: 10000
cleared: false
- name: replicate
data: Region
size: 10000
cleared: false
- name: replicate_persistent_overflow
data: Region
size: 10000
cleared: false
- name: partition_overflow
data: Region
size: 10000
cleared: false
- name: partition
data: Region
size: 10000
cleared: false
- name: replicate_persistent
data: Region
size: 10000
cleared: false
- name: partition_persistent_overflow
data: Region
size: 10000
cleared: false
- name: replicate_overflow
data: Region
size: 10000
cleared: false
...
The node-06 locator log shows a `FunctionStreamingResultCollector` warning message of "waiting for 5 replies". Immediately after the merger, `FunctionStreamingResultCollector` logs "wait for replies completed".
show_log -log locator
Output:
...
[warn 2022/08/10 23:15:26.308 UTC sb-locator-node-06 <qtp853972908-57> tid=0x39] 15 seconds have elapsed while waiting for replies: <FunctionStreamingResultCollector 34643 waiting for 5 replies from [192.168.56.11(sb-node-01:27941)<v2>:41000, 192.168.56.12(sb-node-02:26124)<v3>:41000, 192.168.56.13(sb-node-03:25356)<v4>:41000, 192.168.56.14(sb-node-04:25190)<v5>:41000, 192.168.56.15(sb-node-05:25149)<v6>:41000]> on 192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000 whose current membership list is: [[192.168.56.16(sb-locator-node-06:29882:locator)<ec><v0>:41000, 192.168.56.17(sb-locator-node-07:28906:locator)<ec><v1>:41000, 192.168.56.11(sb-node-01:27941)<v2>:41000, 192.168.56.12(sb-node-02:26124)<v3>:41000, 192.168.56.13(sb-node-03:25356)<v4>:41000, 192.168.56.14(sb-node-04:25190)<v5>:41000, 192.168.56.15(sb-node-05:25149)<v6>:41000]]
...
[info 2022/08/10 23:23:27.169 UTC sb-locator-node-06 <qtp853972908-57> tid=0x39] FunctionStreamingResultCollector wait for replies completed
...
Running `t_show_stuck_threads` shows that the stuck thread replies are completed.
t_show_stuck_threads
Output:
Stuck Threads
-------------
sb-locator-node-06.log
├── first:
├── last:
└── reply: sb-locator-node-06.log:[info 2022/08/10 23:23:27.169 UTC sb-locator-node-06 <qtp853972908-58> tid=0x3a] FunctionStreamingResultCollector wait for replies completed
sb-locator-node-06.log
├── first:
├── last:
└── reply: sb-locator-node-06.log:[info 2022/08/10 23:23:27.169 UTC sb-locator-node-06 <qtp853972908-57> tid=0x39] FunctionStreamingResultCollector wait for replies completed
sb-node-01.log
├── first: sb-node-01.log:2022-08-10 23:16:07 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
├── last: sb-node-01.log:2022-08-10 23:23:07 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
└── reply: sb-node-01.log:2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-01.log
├── first:
├── last:
└── reply: sb-node-01.log:2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-02.log
├── first: sb-node-02.log:2022-08-10 23:16:08 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
├── last: sb-node-02.log:2022-08-10 23:23:08 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
└── reply: sb-node-02.log:2022-08-10 23:23:27 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-02.log
├── first:
├── last:
└── reply: sb-node-02.log:2022-08-10 23:23:27 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-03.log
├── first: sb-node-03.log:2022-08-10 23:16:10 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
├── last: sb-node-03.log:2022-08-10 23:23:10 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
└── reply: sb-node-03.log:2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-03.log
├── first:
├── last:
└── reply: sb-node-03.log:2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-04.log
├── first: sb-node-04.log:2022-08-10 23:16:12 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
├── last: sb-node-04.log:2022-08-10 23:23:12 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
└── reply: sb-node-04.log:2022-08-10 23:23:27 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-04.log
├── first:
├── last:
└── reply: sb-node-04.log:2022-08-10 23:23:27 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-05.log
├── first: sb-node-05.log:2022-08-10 23:16:12 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
├── last: sb-node-05.log:2022-08-10 23:23:12 WARN ThreadsMonitoringProcess:79 - There are 2 stuck threads in this node
└── reply: sb-node-05.log:2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
sb-node-05.log
├── first:
├── last:
└── reply: sb-node-05.log:2022-08-10 23:23:23 INFO ReplyProcessor21:736 - SizeMessage$SizeResponse wait for replies completed
Now: 2022/08/10 23:36:40 UTC
In the meantime, Pulse continues to see all the members as if nothing happened.
- This test case split the network to isolate only members. The locators are reachable from all members.
- We saw that Pulse behaves as if nothing happened. It continues to show that all the members are intact and normal.
- We saw nothing in the log files indicating a network partition.
- We then used `gfsh` to execute a `describe region` command, which blocked indefinitely.
- All member log files show that one (1) thread is stuck. The locator also recorded that it is waiting for replies from all members.
- After waiting for over 11 minutes, we merged the splits.
- A few seconds later, `gfsh` finally returned with results. The member log files recorded the replies, and the stuck threads are no longer reported. The locator log file also showed that the replies completed.
- The cluster continues to run without any issues. All data is recovered.
- In this test case, we found that Geode/GemFire relies solely on the locators to detect network partitions. If the data members split but the locators remain reachable by all members, Geode/GemFire never declares a network partition. Warning messages appear only when a client performs Geode/GemFire operations.
- Troubleshooting Tip: Look for "stuck thread" warning messages and hanging clients. A log-scraping sketch follows this list.
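As a minimal illustration of this tip, the following sketch scrapes the member and locator logs for the warnings captured in this test case. It assumes the default bundle log location (`clusters/sb/log`) seen in the outputs above; the search strings are taken from the log excerpts and may differ across Geode/GemFire versions.

```bash
# Scrape member and locator logs for stuck-thread warnings and pending replies.
# Log location assumes the default bundle layout shown in the outputs above.
cd_cluster sb
grep -n "stuck threads in this node\|has been stuck for" log/*.log
grep -n "seconds have elapsed while waiting for replies" log/*.log
```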
Recovery | Type 0 | Type 1 | Type 2 | Type 3 | Type 4 | Type 5 |
---|---|---|---|---|---|---|
Persistent data recovered? | yes | yes | yes | no | no | yes |
Non-persistent data recovered? | yes | no | no | no | no | yes |
Cluster auto-restarted after merger? | no | yes | yes | no | no | yes |
Network partition declared? | yes | yes | yes | yes | yes | no |
Cluster quorum determined properly? | yes | yes | yes | no | no | no |
Clients able to connect during network partition? | yes | no | no | no | no | no |
`gfsh` able to connect during network partition? | n/a | no | yes | no | no | yes |
`gfsh` able to connect after auto-restart? | n/a | yes | yes | no | no | yes |
Clients able to connect to any of the splits? | yes | no | no | no | no | yes (1) |
Clients able to connect after merger? | yes | yes | yes | no | no | yes |
Manual restart required? | no | no | no | yes | yes | no |
(1) Clients are able to connect but region operations get stuck indefinitely until the merger completes.
Based on our findings, we can put together a flow chart showing the steps involved in determining the network partition type.
We can use the Test Results to create a prognosis for each network partition type as follows.
Recovery | Type 0 | Type 1 | Type 2 | Type 3 | Type 4 | Type 5 |
---|---|---|---|---|---|---|
Persistent data recoverable? | yes | yes | yes | no | no | yes |
Non-persistent data recoverable? | yes | no | no | no | no | yes |
Cluster auto-restart upon resolving network issues? | no | yes | yes | no | no | yes |
Will GemFire declare a network partition? | no | yes | yes | yes | yes | no |
Will cluster quorum be determined properly? | yes | yes | yes | no | no | no |
Clients able to connect during network partition? | yes | no | no | no | no | no |
`gfsh` able to connect during network partition? | yes | yes | yes | no | no | yes |
`gfsh` able to connect after auto-restart? | n/a | yes | no | no | no | yes |
Clients able to connect to any of the network partitions? | yes | no | no | no | no | yes |
Clients able to perform region operations? | yes | no | no | no | no | no |
Clients able to connect after network resolved? | yes | yes | yes | no | no | yes |
Manual cluster restart required? | no | no | no | yes | yes | no |
Based on the test results, we can provide guidelines for recovering from each network partition type.
In Type 0, one (1) cache server is isolated. A command sketch follows the steps below.
- Fix network issues if any.
- Restart the isolated cache server if it is unable to rejoin the cluster after the network issues have been resolved.
- Rebalance the cluster.
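The following is a minimal sketch of these steps, assuming the isolated cache server is member 1 of the `sb` cluster and that the locator endpoint is `192.168.56.16[10334]`; both are assumptions for illustration, so adjust them for your environment.

```bash
# Restart the isolated cache server only if it did not rejoin on its own
# (the member number is an assumption for illustration).
start_member -cluster sb -num 1

# Rebalance the cluster once the member has rejoined. The locator address
# and port are assumptions; use your cluster's locator endpoint.
gfsh -e "connect --locator=192.168.56.16[10334]" -e "rebalance"
```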
In Type 1, cache servers are isolated. See the sketch after the note below.
- Fix network issues if any.
- Check each isolated member's CPU usage and available system resources.
- Gracefully stop the members not in quorum.
- Identify the members that ran out of system resources.
- Increase system resources as needed for those members.
- Restart the stopped members.
- Wait for GemFire to auto-recover the restarted members.
- Once the restarted members have successfully rejoined the cluster, check for data loss.
- GemFire is expected to fully recover persistent data.
- Reingest non-persistent data.
❗ Be aware that auto-restart can take a long time to complete depending on the data volume, disk speeds, and the number of failed members. For our simple cluster, it took nearly five (5) minutes. Expect a much longer time to complete for larger clusters.
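A minimal sketch of the stop/restart portion of these steps follows, assuming members 4 and 5 landed on the minority split; the member numbers are illustrative only.

```bash
# Gracefully stop the cache servers that are not in quorum
# (member numbers are assumptions for illustration).
stop_member -cluster sb -num 4
stop_member -cluster sb -num 5

# ...increase system resources as needed, then restart the stopped members.
start_member -cluster sb -num 4
start_member -cluster sb -num 5

# Verify that the restarted members have rejoined and auto-recovery completes.
show_cluster -cluster sb
```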
In Type 2, locators and cache servers are isolated.
- Fix network issues if any.
- Check each isolated member's CPU usage and available system resources. Members here refer to both cache servers and locators.
- Gracefully stop the members not in quorum.
- Identify the members that ran out of system resources.
- Increase system resources as needed for those members.
- Restart the stopped locators first and then cache servers.
- Wait for GemFire to auto-recover the restarted members.
- Once the restarted members have successfully rejoined the cluster, check for data loss.
- GemFire is expected to fully recover persistent data.
- Reingest non-persistent data.
In Type 3, a quorum is established with only cache servers. The recovery steps are similar to Type 4; a clean-restart sketch applicable to both follows the steps below.
- Fix network issues if any.
- Check each member to see their CPU usage and available system resources.
- Stop or kill all cache servers and locators.
- Make sure to remove all data stores before restarting the cluster.
- Restart the cluster in a clean state.
- Reingest persistent and non-persistent data.
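A minimal clean-restart sketch, applicable to both Type 3 and Type 4, is shown below. It assumes the default PadoGrid layout in which each member's working directory, and therefore its disk stores, lives under the cluster's `run` directory; verify your disk store locations before removing anything.

```bash
# Stop (or kill) all cache servers and locators.
stop_cluster -cluster sb -all

# Remove the data stores. The 'run' directory is assumed to hold the member
# working directories and disk stores (default PadoGrid layout); double-check
# your disk store paths before deleting.
cd_cluster sb
rm -rf run/*

# Restart the cluster in a clean state, then reingest persistent and
# non-persistent data.
start_cluster -cluster sb
```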
In Type 4, all locators are isolated. The recovery steps are similar to Type 3.
- Fix network issues if any.
- Check each member to see their CPU usage and available system resources.
- Stop or kill all members including the locators.
- Make sure to remove all data stores before restarting the cluster.
- Restart the cluster in a clean state.
- Reingest persistent and non-persistent data.
In Type 5, members are able to communicate with locators but members themselves are not able to communicate with each other.
- Fix network issues if any.
If there were network issues and they have been resolved:
- The cluster should be back to normal. Data loss is not expected.
- Nonetheless, check for data loss and inconsistency, and be prepared to reingest data as needed.
- Check client applications.
If there were no network issues:
- Check each member's CPU usage and available system resources.
- Gracefully stop cache servers not in quorum. Locators need not be restarted.
- Identify cache servers that ran out of system resources.
- Increase system resources as needed for those cache servers.
- Restart the stopped cache servers.
- Wait for GemFire to auto-recover.
- Once the restarted cache servers have successfully rejoined the cluster, check for data loss.
- Data loss is not expected. Nonetheless, check for data loss and inconsistency, and be prepared to reingest data as needed. A verification sketch follows this list.
- Check client applications.
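One way to check for data loss after recovery is to rerun this bundle's `test_group -list` check from the `perf_test_sb` app. The expected region size of 10000 reflects the earlier test run in this document and is an assumption for your data set.

```bash
# List region sizes using the perf_test_sb app included in this bundle.
# Each region should report its expected size (10000 in this test run).
cd_app perf_test_sb
cd bin_sh
./test_group -list
```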
From your host OS or any of the VMs, execute the following:
stop_cluster -cluster sb -all
From your host OS, execute the following:
# This stops the VMs. You can restart them by running 'start_pod' later.
stop_pod -pod pod_sb
From your host OS, execute the following:
# This removes the Vagrant VMs.
remove_pod -pod pod_sb
- How Network Partitioning Management Works, https://docs.vmware.com/en/VMware-Tanzu-GemFire/9.15/tgf/GUID-managing-network_partitioning-how_network_partitioning_management_works.html
- Failure Detection and Membership Views, https://docs.vmware.com/en/VMware-Tanzu-GemFire/9.15/tgf/GUID-managing-network_partitioning-failure_detection.html
- Membership Coordinators, Lead Members and Member Weighting, https://docs.vmware.com/en/VMware-Tanzu-GemFire/9.15/tgf/GUID-managing-network_partitioning-membership_coordinators_lead_members_and_weighting.html
- Log Messages and Solutions, https://docs.vmware.com/en/VMware-Tanzu-GemFire/9.15/tgf/GUID-managing-troubleshooting-log_messages_and_solutions.html?hWord=N4IghgNiBcIM4BcCuBjA1gAgQCwE4FMwATEAXyA