Setup:
5 identical nodes:
btrainer-1.182 (192.168.1.182) (Current Master before incident)
btrainer-1.186 (192.168.1.186)
btrainer-1.136 (192.168.1.136)
btrainer-13.137 (192.168.13.137)
btrainer-1.138 (192.168.1.138)
ES Config (version 0.19.8):
cluster.name: btrainer
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [ "192.168.1.182:10300", "192.168.1.186:10300", "192.168.1.136:10300", "192.168.13.137:10300", "192.168.1.138:10300" ]
http.port: 10200
index.number_of_replicas: 4
transport.tcp.port: 10300
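
One thing worth noting about this config: there is no `discovery.zen.minimum_master_nodes` setting, so any node that loses contact with the master is free to elect itself, which is consistent with a 1/1/1/1/1 split. A hedged sketch of the missing line (the value 3 assumes all 5 nodes remain master-eligible, i.e. a majority of 5):

```yaml
# elasticsearch.yml (sketch) - require a strict majority of the 5
# master-eligible nodes before any master election can succeed
discovery.zen.minimum_master_nodes: 3
```

With this set, a single isolated node cannot form its own one-node cluster; it blocks instead of electing itself.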
Java Options:
-Des-foreground=yes
-Des.path.home=/elasticsearch
-Xms4096m
-Xmx20480m
-Djline.enabled=true
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-cp /elasticsearch/lib/*:/elasticsearch/lib/sigar/*
org.elasticsearch.bootstrap.ElasticSearch
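
A side observation on these flags, not stated in the issue itself: `-Xms4096m` and `-Xmx20480m` do not match, and growing toward a 20 GB CMS heap can produce long stop-the-world pauses during which a node misses zen-discovery pings and gets dropped. A hedged sketch of flags one might add to confirm or rule this out (the log path is illustrative):

```
-Xms20480m
-Xmx20480m
-verbose:gc
-Xloggc:/var/log/elasticsearch/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
```

Setting `-Xms` equal to `-Xmx` avoids heap-resize pauses, and the GC log timestamps can be correlated with the 22:06 incident below.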
Problem:
This problem repeats every 5-12 hours. While everything is running smoothly (cluster green), one node goes down and then every node forms its own cluster (not a 1/4 split, but a 1/1/1/1/1 split). The sample incident happened at exactly 22:06; we have a job checking cluster state every minute. This cluster is mainly used for training, so we get heavy traffic spikes on both reads and writes when jobs are triggered (plus some continuous small reads).
- What happened to btrainer-1.138?
- Even if one node (btrainer-1.138) behaves irrationally, why didn't the cluster split 1/4? Why did the other nodes lose the master, btrainer-1.182?
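
The second question is the classic split-brain symptom: without a quorum requirement, each partition (even a single node) is allowed to elect a master. The quorum rule itself is just "a strict majority of master-eligible nodes", sketched here (function name is mine, not an Elasticsearch API):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest node count that constitutes a strict majority,
    so at most one network partition can ever elect a master."""
    return master_eligible // 2 + 1

if __name__ == "__main__":
    # For the 5-node cluster in this issue, quorum is 3: a lone
    # node (1/1/1/1/1 split) could never win an election.
    print(minimum_master_nodes(5))  # -> 3
```

With quorum enforced, the worst case for this setup is that the 4-node side keeps (or re-elects) a master while the isolated node blocks, rather than five independent clusters.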
Logs:
You can check the logs from each node here: https://gist.github.com/3510448