[Bug]: [datanode]flush pack with error, DataNode quit now. panic! #28549
Comments
/assign @jiaoew1991
Marked as high priority, please take a look at it.
`A timeout exceeded while waiting to proceed with the request` -> I think this might be related to the MinIO quota being exceeded.
@jiaoew1991 How many delta logs are there in total, and how many stats logs? We also need to think about how to handle these bugs.
@Cactus-L
Actually, I don't do a lot of operations.
And do you know how to raise the MinIO rate limit? I will try to modify it and watch the service status.
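Not something spelled out in this thread, but for reference: the error in the logs (`please reduce your request rate`) is MinIO's API-throttling response, and the throttle is tunable. A hedged sketch using `mc`, assuming a MinIO build that exposes the `api` subsystem settings; the alias `myminio` and the values are illustrative only, not recommendations:

```shell
# Raise the cap on concurrent S3 API requests MinIO accepts before queueing,
# and the time a queued request may wait before being rejected with the
# "timeout exceeded while waiting to proceed" error seen in the datanode log.
mc admin config set myminio/ api requests_max=1600 requests_deadline=15s

# Restart the server processes so the new settings take effect.
mc admin service restart myminio/
```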
// upsert data:
// search data:
The reason for the frequent deletes is the upserts: each upsert triggers one delete and one insert. How many TPS are you looking for? Could you provide the datanode log?
Why do all inserts have to be upserts?
Could you give us a detailed scenario in which you need to update the vectors that frequently?
@yanliang567
I think this is due to frequent deletes -> causing frequent flush/compaction -> MinIO cannot keep up with the write concurrency. @aoiasd can you investigate the reason the flushes happen? If it's due to the memory limit, adding more datanodes could help.
Similar to the scenario where ES builds a data index: each piece of data has a unique primary key. When real-time data is synchronized into the system, we usually perform an upsert, because it is impossible to tell whether a given record is a create or an update. (If you need to differentiate, you have to pay a higher cost.)
Upsert on Milvus is very expensive. Maybe you can query and check whether the entity exists before you upsert; see the sketch below.
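A minimal sketch of that check-before-upsert idea in pymilvus; the connection details, collection name, and `pk` field are hypothetical, not taken from this issue:

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("docs")  # assumes an int64 primary key field named "pk"

def write(pk: int, vector: list[float]) -> None:
    # query() requires the collection to be loaded. Only pay the upsert cost
    # (one delete + one insert) when the entity already exists; a plain insert
    # produces no delete record, so it causes far less flush/compaction work.
    exists = collection.query(expr=f"pk == {pk}", output_fields=["pk"])
    if exists:
        collection.upsert([[pk], [vector]])
    else:
        collection.insert([[pk], [vector]])
```

Note the trade-off: this adds one read per write and is racy if multiple writers touch the same key, so it only pays off when most writes are genuinely new entities.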
There are a couple of things to do:
The delete mark problem will be fully solved by the L0 Delete feature in 2.4.
If I set dataCoord.compaction.enableAutoCompaction=false, will the problem be alleviated?
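For context, that dotted path corresponds to this block in milvus.yaml; a sketch of the layout implied by the setting name (check your deployment's config file or Helm values for the exact location):

```yaml
dataCoord:
  compaction:
    # Disables automatic background compaction on the data coordinator;
    # manually triggered compaction remains available.
    enableAutoCompaction: false
```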
This has nothing to do with compaction. I think the most important thing is to avoid frequent upserts and deletes.
Upgrading to 2.3.4 could help a little bit.
From the logs and your description, it seems that the frequent upserts are causing a high rate of delete operations, leading to excessive flush and compaction activity. This is likely overwhelming MinIO's capacity and causing the timeouts. To alleviate this issue:

- Avoid frequent upserts and deletes; where possible, check whether an entity exists and use a plain insert instead of an upsert.
- Upgrade to 2.3.4, which helps a little.
- Raise MinIO's request-rate limit or add capacity.
- If the flushes are driven by the datanode memory limit, add more datanodes.

Implementing these measures should help mitigate the issue.
Thanks for the clarification!
Verified: not reproducible on 2.4.5; closing for now.
Is there an existing issue for this?
Environment
Current Behavior
[2023/11/17 10:04:19.751 +00:00] [WARN] [datanode/flush_task.go:232] ["flush task error detected"] [error="attempt #0: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #1: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #2: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #3: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #4: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #5: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #6: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #7: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #8: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #9: A timeout exceeded while waiting to proceed with the request, please reduce your request rate"] []
[2023/11/17 10:04:19.751 +00:00] [ERROR] [datanode/flush_manager.go:853] ["flush pack with error, DataNode quit now"] [error="execution failed"] [errorVerbose="execution failed\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).getFlushPack\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:233\n | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:190\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) execution failed\nError types: (1) *withstack.withStack (2) *errutil.leafError"] [stack="github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:853\ngithub.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:206"]
panic: execution failed

goroutine 10007367 [running]:
panic({0x43bdca0, 0xc0d63bcd38})
	/usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc0015a3978 sp=0xc0015a38b8 pc=0x178ec1b
github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1(0xc0d6394dc0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:855 +0x1571 fp=0xc0015a3f78 sp=0xc0015a3978 pc=0x3765e11
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish(0xc12d850a80, 0xc00a094ab0, 0xc1d10824f0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:206 +0xde fp=0xc0015a3fb8 sp=0xc0015a3f78 pc=0x37675be
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1.1()
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:122 +0x2e fp=0xc0015a3fe0 sp=0xc0015a3fb8 pc=0x376680e
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0015a3fe8 sp=0xc0015a3fe0 pc=0x17c89a1
created by github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:122 +0xf8

goroutine 1 [chan receive, 22 minutes]:
runtime.gopark(0xc00027a100?, 0xc001841790?, 0xfe?, 0x2d?, 0x4185fa0?)
	/usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0020c7738 sp=0xc0020c7718 pc=0x17920d6
runtime.chanrecv(0xc0000f4a20, 0x0, 0x1)
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
full log: https://pan.baidu.com/s/1m02d1kp4_x6BQHcFK4tJhg?pwd=mmvh
dd_part.log
Anything else?
No response