
every node for itself #2215

@frail

Setup:

5 similar nodes:

btrainer-1.182  (192.168.1.182) (current master before the incident)
btrainer-1.186  (192.168.1.186)
btrainer-1.136  (192.168.1.136)
btrainer-13.137 (192.168.13.137)
btrainer-1.138  (192.168.1.138)

ES config (version 0.19.8):

cluster.name: btrainer
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [ "192.168.1.182:10300", "192.168.1.186:10300", "192.168.1.136:10300", "192.168.13.137:10300", "192.168.1.138:10300" ]
http.port: 10200
index.number_of_replicas: 4
transport.tcp.port: 10300
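
For reference, since question 2 below is about split-brain behaviour: a minimal sketch of the zen discovery quorum setting that 0.19.x supports. This line is NOT in the config above; the value 3 is just what the usual (master-eligible nodes / 2) + 1 rule gives for 5 nodes.

# hypothetical addition to elasticsearch.yml, not part of the setup above
discovery.zen.minimum_master_nodes: 3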

Java options:

-Des-foreground=yes 
-Des.path.home=/elasticsearch 
-Xms4096m 
-Xmx20480m 
-Djline.enabled=true 
-XX:+UseParNewGC 
-XX:+UseConcMarkSweepGC 
-XX:+CMSParallelRemarkEnabled 
-XX:SurvivorRatio=8 
-XX:MaxTenuringThreshold=1 
-XX:CMSInitiatingOccupancyFraction=75 
-XX:+UseCMSInitiatingOccupancyOnly 
-cp /elasticsearch/lib/*:/elasticsearch/lib/sigar/* 
org.elasticsearch.bootstrap.ElasticSearch

Problem:

This problem repeats itself every 5-12 hours. While everything is running smoothly (cluster is green), one node goes down and every node creates its own cluster (not a 1/4 split, but a 1/1/1/1/1 split). The sample incident happened at exactly 22:06; we have a job checking the cluster state every minute. This cluster is mainly used for training, so we see heavy traffic spikes on both reads and writes when jobs are triggered (plus some continuous small reads).
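
For reference, the per-minute cluster-state check amounts to polling the cluster health API on the HTTP port from the config above; a minimal sketch, assuming a curl-based cron job (the actual job script is not shown here, and the log path is an assumption):

# hypothetical cron entry; _cluster/health exists in 0.19.x, node/port taken from the config above
* * * * * curl -s 'http://192.168.1.182:10200/_cluster/health' >> /var/log/es-cluster-health.log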

  1. What happened to btrainer-1.138?
  2. Even if one node (btrainer-1.138) behaves irrationally, why didn't the cluster split 4/1? Why did the other nodes lose the master, btrainer-1.182?

Logs:

You can check the logs from the nodes here: https://gist.github.com/3510448
