Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when online write #5456

Closed
JimWen opened this issue May 18, 2020 · 20 comments · Fixed by dgraph-io/badger#1337
Closed

Crash when online write #5456

JimWen opened this issue May 18, 2020 · 20 comments · Fixed by dgraph-io/badger#1337
Assignees
Labels
area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. priority/P0 Critical issue that requires immediate attention. status/accepted We accept to investigate/work on it.

Comments

@JimWen
Copy link

JimWen commented May 18, 2020

What version of Dgraph are you using?

Dgraph version : v1.2.4
Dgraph SHA-256 : 38ca7ecc19103d21bdd1eddf0d2695e872883b37d4c270916606b0b51f1d0a1b
Commit SHA-1 : b51171c
Commit timestamp : 2020-05-15 15:42:10 -0700
Branch : HEAD
Go version : go1.13.6

Have you tried reproducing the issue with the latest release?

v1.2.4

What is the hardware spec (RAM, OS)?

128G mem & 1.8T SSD

Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019

Steps to reproduce the issue (command/config used to run Dgraph).

Just online write with java api, its wierd.

Expected behaviour and actual result.

Alpha crah and cluster is blocked. The log is as followings

`
panic: runtime error: slice bounds out of range [:135467521] with capacity 7968

goroutine 10106572 [running]:
github.com/dgraph-io/badger/v2/table.(*blockIterator).setIdx(0xc001b41120, 0x0)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/iterator.go:89 +0x52c
github.com/dgraph-io/badger/v2/table.(*blockIterator).seekToFirst(...)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/iterator.go:147
github.com/dgraph-io/badger/v2/table.(*Iterator).next(0xc001b41110)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/iterator.go:314 +0x12b
github.com/dgraph-io/badger/v2/table.(*Iterator).next(0xc001b41110)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/iterator.go:323 +0x1b3
github.com/dgraph-io/badger/v2/table.(*Iterator).Next(0xc001b41110)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/iterator.go:379 +0x47
github.com/dgraph-io/badger/v2/table.(*ConcatIterator).Next(0xc682d840a0)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/iterator.go:497 +0x33
github.com/dgraph-io/badger/v2/table.(*node).next(0xc112aec210)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/merge_iterator.go:82 +0x3d
github.com/dgraph-io/badger/v2/table.(*MergeIterator).Next(0xc112aec210)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/merge_iterator.go:158 +0x38
github.com/dgraph-io/badger/v2/table.(*node).next(0xc112aec300)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/merge_iterator.go:80 +0x72
github.com/dgraph-io/badger/v2/table.(*MergeIterator).Next(0xc112aec2c0)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/table/merge_iterator.go:158 +0x38
github.com/dgraph-io/badger/v2.(*Iterator).parseItem(0xc22baf0000, 0x1)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/iterator.go:621 +0x17f
github.com/dgraph-io/badger/v2.(*Iterator).Next(0xc22baf0000)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/iterator.go:566 +0x10b
github.com/dgraph-io/badger/v2.(*Stream).produceKVs.func1(0xc83b156a40, 0x17, 0x20, 0xc83b156a60, 0x17, 0x20, 0x0, 0x0, 0x0)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/stream.go:176 +0xa48
github.com/dgraph-io/badger/v2.(*Stream).produceKVs(0xc5d2ba5ea0, 0x1a971a0, 0xc000038118, 0x0, 0x0)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/stream.go:234 +0x2c6
github.com/dgraph-io/badger/v2.(*Stream).Orchestrate.func1(0xc5d4a7daa0, 0xc5d2ba5ea0, 0x1a971a0, 0xc000038118, 0xc5d4a53ec0)
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/stream.go:337 +0x7f
created by github.com/dgraph-io/badger/v2.(*Stream).Orchestrate
/root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200515210839-ef28ef36b592/stream.go:334 +0x175
`

@JimWen
Copy link
Author

JimWen commented May 18, 2020

When i shutdown and restart alpha with v1.2.4, the problem is still there. But with v1.2.3, there is no problem.

@jarifibrahim jarifibrahim self-assigned this May 18, 2020
@jarifibrahim jarifibrahim added area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. labels May 18, 2020
@jarifibrahim
Copy link
Contributor

I've seen similar failures in badger on PR dgraph-io/badger#1308 . This is an issue in badger.

@harpy-wings
Copy link

harpy-wings commented May 18, 2020

we are on heavy load.the server crash every 10 minutes . how we can handle it?
increasing alpha groups could be an option ?

@jarifibrahim
Copy link
Contributor

@h4midr Can you downgrade to v20.03.1? The latest version has this failure. @JimWen you should also switch to v1.2.3 if the crash happens frequently.

We're working on fixing this and we'll do a patch release soon.

@harpy-wings
Copy link

my dgraph version is 20.03.2
i recorded my screen and i can share it with you. and i really need to find an option. there is project working on it more than 1.5 years. we grow up with you and now , in real use and the first global use,we screwed. if you interested in records or can help me , i really looking forward to hearing from you.
hassaniyaqoti@gmail.com

@jarifibrahim
Copy link
Contributor

@h4midr Did you try downgrading your version? Dgraph v20.03.1 should work fine.

@lgalatin lgalatin added the status/accepted We accept to investigate/work on it. label May 18, 2020
@danielmai
Copy link
Contributor

@h4midr Have you tried using Dgraph v20.03.1 and are still seeing this issue? We have reached out to you over email as well.

@JimWen
Copy link
Author

JimWen commented May 19, 2020

@h4midr Can you downgrade to v20.03.1? The latest version has this failure. @JimWen you should also switch to v1.2.3 if the crash happens frequently.

We're working on fixing this and we'll do a patch release soon.

yes v1.2.3 is fine

@harpy-wings
Copy link

@h4midr Have you tried using Dgraph v20.03.1 and are still seeing this issue? We have reached out to you over email as well.

i'll do it but we are not in situation that have options for test.as soon as exams end i'm going to simulate the process to see how it happens again.
thank you all friends.

@jarifibrahim
Copy link
Contributor

@JimWen @h4midr when does the crash happen? Do you have any logs or any information about when the crash shows up? I'm trying to reproduce the failure.

@JimWen
Copy link
Author

JimWen commented May 19, 2020

@JimWen @h4midr when does the crash happen? Do you have any logs or any information about when the crash shows up? I'm trying to reproduce the failure.

Replace with lastest v1.2.4, crash happen. log related is in the issue.

@harpy-wings
Copy link

it sounds like when the mutations request increase, it happens.the duration between two failures depends on request rates.
in low rates. it's fine.
in reading request it's fine.
i can help you to reproduce it. but after my this work done here on 28th May.

@jarifibrahim
Copy link
Contributor

jarifibrahim commented May 19, 2020

This is a bug in badger and not in dgraph. It can be reproduced by the following steps

git clone https://github.com/dgraph-io/badger
cd badger
git fetch origin ibrahim/panic-test
git checkout ibrahim/panic-test
cd badger
go run main.go bank test --dir ./badgerdb -d 24h 2>&1

@jarifibrahim jarifibrahim added the priority/P0 Critical issue that requires immediate attention. label May 19, 2020
@harpy-wings
Copy link

so finally. do version 1.2.3 has this problem too?

@jarifibrahim
Copy link
Contributor

@h4midr No, this affects only version 1.2.4 and v20.03.2 .

jarifibrahim pushed a commit to dgraph-io/badger that referenced this issue May 22, 2020
Fixes hypermodeinc/dgraph#5456 .

This PR fixes the crash that could occur when a block was
read from the cache.
There was a logical race condition. The following sequence
of events could occur which would cause the crash.
1. An iterator makes `t.Block(idx)` call
2. The `t.Block` function finds the block in the cache.
   The newly found block has `ref=1` which means it was
   held only by the cache.
3. The `t.Block` function is holding the block and at the
   same time the block gets evicted from the cache. The
   existing ref of the block was `1` so the cache eviction
   would decrement the ref and make it `0`. When the ref
   becomes `0`, the block is added to the `sync.Pool` and
   is ready to be reused.
4. While the block got evicted from the cache, the iterator
   received the block and it incremented the ref from
   `0` to `1` and starts using this.

Since the block was inserted into the syncPool in the 3rd
event, it could be modified by anyone while the iterator is
using it.
@jarifibrahim
Copy link
Contributor

This is not yet fixed in Dgraph. I'll close this issue once badger is updated in Dgraph.

@jarifibrahim
Copy link
Contributor

This has been fixed by updating badger in #5404

@jarifibrahim
Copy link
Contributor

Hey @JimWen and @h4midr, we've released a new patch release candidate with the fix for this issue https://github.com/dgraph-io/dgraph/releases v20.03.3-rc1 and v1.2.5-rc1 . I would really appreciate it if you guys can help us test it.

We believe it is fixed but it would be very helpful if someone running dgraph can also confirm it.

@harpy-wings
Copy link

thank you very much.
we are in the production environment and our customers are not very bored.but i try to test it locally.

@JimWen
Copy link
Author

JimWen commented Jun 4, 2020

Hey @JimWen and @h4midr, we've released a new patch release candidate with the fix for this issue https://github.com/dgraph-io/dgraph/releases v20.03.3-rc1 and v1.2.5-rc1 . I would really appreciate it if you guys can help us test it.

We believe it is fixed but it would be very helpful if someone running dgraph can also confirm it.

i test with v20.03.3, online write has worked fine until now for 2 days. @jarifibrahim
but it seems to introduce another crash , please have a look #5573

jarifibrahim pushed a commit to dgraph-io/badger that referenced this issue Oct 2, 2020
Fixes hypermodeinc/dgraph#5456 .

This PR fixes the crash that could occur when a block was
read from the cache.
There was a logical race condition. The following sequence
of events could occur which would cause the crash.
1. An iterator makes `t.Block(idx)` call
2. The `t.Block` function finds the block in the cache.
   The newly found block has `ref=1` which means it was
   held only by the cache.
3. The `t.Block` function is holding the block and at the
   same time the block gets evicted from the cache. The
   existing ref of the block was `1` so the cache eviction
   would decrement the ref and make it `0`. When the ref
   becomes `0`, the block is added to the `sync.Pool` and
   is ready to be reused.
4. While the block got evicted from the cache, the iterator
   received the block and it incremented the ref from
   `0` to `1` and starts using this.

Since the block was inserted into the syncPool in the 3rd
event, it could be modified by anyone while the iterator is
using it.
manishrjain pushed a commit to outcaste-io/outserv that referenced this issue Jul 6, 2022
Fixes hypermodeinc/dgraph#5456 .

This PR fixes the crash that could occur when a block was
read from the cache.
There was a logical race condition. The following sequence
of events could occur which would cause the crash.
1. An iterator makes `t.Block(idx)` call
2. The `t.Block` function finds the block in the cache.
   The newly found block has `ref=1` which means it was
   held only by the cache.
3. The `t.Block` function is holding the block and at the
   same time the block gets evicted from the cache. The
   existing ref of the block was `1` so the cache eviction
   would decrement the ref and make it `0`. When the ref
   becomes `0`, the block is added to the `sync.Pool` and
   is ready to be reused.
4. While the block got evicted from the cache, the iterator
   received the block and it incremented the ref from
   `0` to `1` and starts using this.

Since the block was inserted into the syncPool in the 3rd
event, it could be modified by anyone while the iterator is
using it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. priority/P0 Critical issue that requires immediate attention. status/accepted We accept to investigate/work on it.
Development

Successfully merging a pull request may close this issue.

5 participants