Couldn't take snapshot error with replicas #2266

Closed · sameervitian opened this issue Mar 27, 2018 · 6 comments
Labels: kind/bug (Something is broken.)

Comments
@sameervitian (Contributor) commented Mar 27, 2018

If you suspect this could be a bug, follow the template.

  • What version of Dgraph are you using?
    Dgraph version : v1.0.4-dev
    Commit SHA-1 : 807976c
    Commit timestamp : 2018-03-22 14:55:24 +1100
    Branch : HEAD

  • Have you tried reproducing the issue with latest release?
    Yes

  • What is the hardware spec (RAM, OS)?
    ubuntu 14.04 / 16 core 32GB

  • Steps to reproduce the issue (command/config used to run Dgraph).
    Config used for the dgraph server:

export: export
gentlecommit: 0.33
idx: 1
memory_mb: 16087.0
trace: 0.33
postings: /data/dgraph/p
wal: /data/dgraph/w
debugmode: False
bindall: True
my: "<server_ip>:7080"
zero: "<zero_ip>:5080"

Three dgraph servers are running in a cluster with replicas set to 3. I can see from /state that all nodes are in groupId 1. The following starts appearing regularly in the logs, so something seems to be wrong:

2018/03/27 13:04:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644079]
2018/03/27 13:04:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644477]
2018/03/27 13:05:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644859]
2018/03/27 13:05:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [645205]
2018/03/27 13:06:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [645612]
2018/03/27 13:06:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [646052]
2018/03/27 13:07:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [646430]
2018/03/27 13:07:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [646890]
2018/03/27 13:08:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [647264]

Along with that, I see that the load average is in the range 6-13 on all servers. I am running this in production. The CPU utilization is very low and I am using SSDs for data.
Following are the CPU metrics from top:

 (%Cpu(s):  4.2 us,  0.7 sy,  0.0 ni, 94.8 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st )

This is what I see in vmstat:

r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 219784 201912 12934544    0    0     2    37    5    7  2  0 98  0  0
 2  0      0 219608 201912 12934552    0    0     0   592 5106 8252 12  1 87  0  0
 0  0      0 219448 201912 12934560    0    0     0   704 4640 7944  2  0 97  0  0
 0  1      0 219672 201912 12934568    0    0     0   300 2609 4504  1  0 99  0  0
 0  0      0 219736 201912 12934568    0    0     0   200 2193 3796  0  0 99  0  0
 0  0      0 219576 201912 12934588    0    0     4   688 4512 7827  1  1 97  0  0
 0  0      0 219576 201912 12934588    0    0     0   208 3463 5337  8  1 91  0  0
 0  0      0 219640 201912 12934604    0    0     8   480 3902 6796  1  0 98  0  0
 1  0      0 219320 201912 12934612    0    0     4   660 4703 8094  2  1 96  1  0
129  0      0 219704 201920 12934612    0    0     0   108 3388 2999 91  0  8  0  0

iostat result:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.24    0.00    0.62    0.37    0.00   95.77

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               0.00         0.00         0.00          0          0
vdb             100.00         0.00       416.00          0        416
dm-0            103.00         0.00       416.00          0        416
  • Expected behaviour and actual result.

As the CPU idle time is high and the wait time is low, I would expect the load average to be lower. Also, are these logs appearing so frequently a cause for alarm?
Could someone check what is wrong here?

@sameervitian sameervitian changed the title Dgraph high load Dgraph high load avg Mar 27, 2018
@pawanrawal (Contributor)

Hey @sameervitian

The error below is an interesting one and is linked to #2254. I am investigating it. What about the other replicas? Do they also have similar logs?

2018/03/27 13:04:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644079]
2018/03/27 13:04:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644477]
2018/03/27 13:05:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644859]

Is the high load average causing any performance degradation?

@sameervitian (Contributor, Author)

@pawanrawal yes, others also have similar logs.

I don't have a reference point to measure whether there is a degradation in performance. I can see that our read calls are taking an average of ~200ms (the /p folder is about 15GB). The machine used is 32GB/16-core, so I would definitely want to downsize the machine to cut down cost.

@pawanrawal (Contributor)

I was able to reproduce the Couldn't take snapshot issue with a cluster that has replicas, and I am working on a fix.

@pawanrawal pawanrawal changed the title Dgraph high load avg Couldn't take snapshot error with replicas Mar 28, 2018
@pawanrawal pawanrawal added the kind/bug Something is broken. label Mar 28, 2018
@pawanrawal pawanrawal self-assigned this Mar 28, 2018
@sameervitian (Contributor, Author) commented Mar 28, 2018

@pawanrawal is the Couldn't take snapshot issue also the reason behind the high load average?

@pawanrawal (Contributor)

I am not sure; how do you get this load average value? The Couldn't take snapshot issue could have caused OOM issues. It has been fixed on master with f66c7df, and the nightly binary should be updated soon. Also, since you have a 16-core machine, a load average below that should be fine.
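
For reference, the load average can be read directly on each host (the three numbers in /proc/loadavg are the 1-, 5-, and 15-minute averages); a minimal check, assuming a standard Linux box such as the Ubuntu 14.04 machines described above:

cat /proc/loadavg
uptime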

@dgraph-bot dgraph-bot added the kind/bug Something is broken. label Mar 28, 2018
@manishrjain manishrjain added this to the Sprint-000 milestone Mar 28, 2018
@pawanrawal (Contributor)

Closing this, as the fix for the main issue (which was a bug) has been merged.
