
[Bug]: Datanode panic: segment not found and search failed: context canceled with timeout 10min #28748

Closed
1 task done
ThreadDao opened this issue Nov 27, 2023 · 5 comments
Assignees
Labels
kind/bug Issues or changes related a bug severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@ThreadDao
Contributor

ThreadDao commented Nov 27, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20231124-39be3580
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar 
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.2.post1.dev9
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. create collection fouram_8KChK8Zc and enable partition_key
  2. create index -> insert 5m -> flush -> load
  3. concurrent search (timeout 10min) + delete + flush + insert -> some search requests failed
[2023-11-24 08:59:20,401 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65535, message=failed to search/query delegator 9 for channel test-delete-key-rootcoord-dml_3_445858528336745947v1: fail to Search, QueryNode ID=9, reason=worker(11) query failed: context canceled)>, <Time:{'RPC start': '2023-11-24 08:59:13.536836', 'RPC error': '2023-11-24 08:59:20.401311'}> (decorators.py:128)
  4. datanode panic
    datanode_pre.log
[2023/11/24 09:04:34.580 +00:00] [ERROR] [conc/options.go:54] ["Conc pool panicked"] [panic="segment not found[segment=445858528339571212]"] [stack="github.com/milvus-io/milvus/pkg/util/conc.(*poolOption).antsOptions.func1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/options.go:54\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1.1\n\t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:54\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1.1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:72\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:884\ngithub.com/milvus-io/milvus/internal/datanode/writebuffer.(*writeBufferBase).getSyncTask.func2\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/writebuffer/write_buffer.go:394\ngithub.com/milvus-io/milvus/internal/datanode/syncmgr.(*SyncTask).handleError\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/task.go:77\ngithub.com/milvus-io/milvus/internal/datanode/syncmgr.(*SyncTask).Run\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/task.go:129\ngithub.com/milvus-io/milvus/internal/datanode/syncmgr.(*keyLockDispatcher[...]).Submit.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/key_lock_dispatcher.go:34\ngithub.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:79\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1\n\t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67"]
panic: segment not found[segment=445858528339571212] [recovered]
    panic: segment not found[segment=445858528339571212] [recovered]
    panic: segment not found[segment=445858528339571212]

goroutine 76486 [running]:
panic({0x4c8f540, 0xc00ac35410})
    /usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc009cfd788 sp=0xc009cfd6c8 pc=0x19ebc5b
github.com/milvus-io/milvus/pkg/util/conc.(*poolOption).antsOptions.func1({0x4c8f540, 0xc00ac35410})
    /go/src/github.com/milvus-io/milvus/pkg/util/conc/options.go:56 +0x15b fp=0xc009cfd850 sp=0xc009cfd788 pc=0x3365f3b
github.com/panjf2000/ants/v2.(*goWorker).run.func1.1()
    /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:54 +0x75 fp=0xc009cfd8c8 sp=0xc009cfd850 pc=0x3363035
runtime.deferCallSave(0xc009cfd998, 0xc009cfdfb8?)

Expected Behavior

No response

Steps To Reproduce

- 4am argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/b92cd6a5-73c0-4a84-805b-336c61b0a84f?nodeId=test-delete-stable-partition-key-2

Milvus Log

test-delete-key-etcd-0                                            1/1     Running            0                 3d11h   10.104.24.85    4am-node29   <none>           <none>
test-delete-key-etcd-1                                            1/1     Running            0                 3d11h   10.104.21.116   4am-node24   <none>           <none>
test-delete-key-etcd-2                                            1/1     Running            0                 3d11h   10.104.17.172   4am-node23   <none>           <none>
test-delete-key-milvus-datanode-cd4d79769-5swfz                   1/1     Running            0                 2d19h   10.104.1.104    4am-node10   <none>           <none>
test-delete-key-milvus-datanode-cd4d79769-8ksgd                   1/1     Running            1 (2d18h ago)     2d19h   10.104.20.13    4am-node22   <none>           <none>
test-delete-key-milvus-indexnode-6756f95c4c-rrgmp                 1/1     Running            0                 2d19h   10.104.18.123   4am-node25   <none>           <none>
test-delete-key-milvus-mixcoord-67785b5858-njndm                  1/1     Running            0                 2d19h   10.104.16.232   4am-node21   <none>           <none>
test-delete-key-milvus-proxy-59fcf9b894-4mhfb                     1/1     Running            0                 2d19h   10.104.12.41    4am-node17   <none>           <none>
test-delete-key-milvus-querynode-5b98987787-4qjkm                 1/1     Running            0                 2d19h   10.104.17.188   4am-node23   <none>           <none>
test-delete-key-milvus-querynode-5b98987787-pnsl8                 1/1     Running            0                 2d19h   10.104.1.105    4am-node10   <none>           <none>
test-delete-key-minio-0                                           1/1     Running            0                 3d11h   10.104.24.86    4am-node29   <none>           <none>
test-delete-key-minio-1                                           1/1     Running            0                 3d11h   10.104.21.115   4am-node24   <none>           <none>
test-delete-key-minio-2                                           1/1     Running            0                 3d11h   10.104.17.173   4am-node23   <none>           <none>
test-delete-key-minio-3                                           1/1     Running            0                 3d11h   10.104.23.205   4am-node27   <none>           <none>
test-delete-key-pulsar-bookie-0                                   1/1     Running            0                 3d11h   10.104.21.118   4am-node24   <none>           <none>
test-delete-key-pulsar-bookie-1                                   1/1     Running            0                 3d11h   10.104.17.174   4am-node23   <none>           <none>
test-delete-key-pulsar-bookie-2                                   1/1     Running            0                 3d11h   10.104.23.209   4am-node27   <none>           <none>
test-delete-key-pulsar-bookie-init-kbp6r                          0/1     Completed          0                 3d11h   10.104.4.153    4am-node11   <none>           <none>
test-delete-key-pulsar-broker-0                                   1/1     Running            0                 3d11h   10.104.1.120    4am-node10   <none>           <none>
test-delete-key-pulsar-proxy-0                                    1/1     Running            0                 3d11h   10.104.14.232   4am-node18   <none>           <none>
test-delete-key-pulsar-pulsar-init-zmmt4                          0/1     Completed          0                 3d11h   10.104.4.152    4am-node11   <none>           <none>
test-delete-key-pulsar-recovery-0                                 1/1     Running            0                 3d11h   10.104.6.228    4am-node13   <none>           <none>
test-delete-key-pulsar-zookeeper-0                                1/1     Running            0                 3d11h   10.104.21.117   4am-node24   <none>           <none>
test-delete-key-pulsar-zookeeper-1                                1/1     Running            0                 3d11h   10.104.17.176   4am-node23   <none>           <none>
test-delete-key-pulsar-zookeeper-2                                1/1     Running            0                 3d11h   10.104.24.88    4am-node29   <none>           <none>

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 27, 2023
@ThreadDao ThreadDao added this to the 2.4.0 milestone Nov 27, 2023
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Nov 27, 2023
@ThreadDao
Contributor Author

/assign @congqixia

@congqixia
Contributor

Shall be the same root cause as #28736.
Working on it.

congqixia added a commit to congqixia/milvus that referenced this issue Nov 27, 2023
Related to milvus-io#28736 milvus-io#28748
See also milvus-io#27675
Previous PR: milvus-io#28646

This PR fixes the `SegmentNotFound` issue that occurs when compaction happens multiple
times and the buffer of a first-generation segment is synced due to the stale
policy.

Now the `CompactSegments` API of metacache updates the compactTo
field of segmentInfo when the compactTo segment is itself compacted, keeping
the lineage clean.

Also adds the `CompactedSegment` SyncPolicy to sync compacted
segments as soon as possible, keeping the metacache clean.

`SyncPolicy` is now an interface instead of a function type, so that
when it selects segments to sync, the reason and target segments can be
logged.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Nov 27, 2023
…8755)

@yanliang567
Contributor

/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 27, 2023
@congqixia
Contributor

congqixia commented Nov 28, 2023

/assign @ThreadDao
Please verify after PR #28755 is merged.

@ThreadDao
Copy link
Contributor Author

The panic is fixed on master-20231128-4bd426db-amd64:
https://argo-workflows.zilliz.cc/archived-workflows/qa/381aab7a-4a2e-44f4-a92c-a5c59a46aeae
The "search failed: context canceled" error will be checked in another issue.
