[Bug]: [laion1b-test] querynode terminated due to a SIGSEGV error #29417

Closed
1 task done
ThreadDao opened this issue Dec 22, 2023 · 10 comments
Labels
kind/bug Issues or changes related a bug severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: cardinal-milvus-io-2.3-af54ce9-20231219
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar 
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.3
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. milvus with config
  config:
    dataCoord:
      compaction:
        rpcTimeout: 600 
      enableActiveStandby: true
      segment:
        expansionRate: 1.15
        maxSize: 4096
        sealProportion: 0.08
    dataNode:
      dataSync:
        maxParallelSyncTaskNum: 64
    log:
      level: debug
    minio:
      accessKeyID: xxx
      bucketName: xx
      secretAccessKey: xxx
    queryCoord:
      enableActiveStandby: true
    rootCoord:
      dmlChannelNum: 16
      enableActiveStandby: true
  2. concurrent search + query + insert_delete_flush -> querynode terminated with exit code 139
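
For readers unfamiliar with the 139 code: shells and container runtimes conventionally report a process killed by signal N as exit status 128 + N, so 139 corresponds to signal 11. A minimal sketch of that arithmetic follows; the helper name `decode_exit_code` is hypothetical and not part of Milvus or pymilvus.

```python
import signal


def decode_exit_code(code: int) -> str:
    """Map a shell-style exit status (128 + N) to the name of signal N.

    Hypothetical helper for illustration only.
    """
    if code <= 128:
        return f"exit status {code} (not a signal)"
    signo = code - 128
    try:
        # signal.Signals maps a signal number to its symbolic name.
        return signal.Signals(signo).name
    except ValueError:
        return f"unknown signal {signo}"


print(decode_exit_code(139))
```

On Linux, signal 11 is SIGSEGV, which matches the crash reported here.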

Loki log: https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22B%22,%22expr%22:%22%7Bcluster%3D%5C%224am%5C%22,namespace%3D%5C%22qa-milvus%5C%22,pod%3D%5C%22laion1b-test-2-milvus-querynode-d9f5478d6-v84kc%5C%22%7D%22,%22hide%22:false%7D%5D,%22range%22:%7B%22from%22:%221703183663000%22,%22to%22:%221703183700000%22%7D%7D

Expected Behavior

No response

Steps To Reproduce

argo: https://argo-workflows.zilliz.cc/workflows/qa/laion1b-test-cron-pr6vl?tab=workflow&nodeId=laion1b-test-cron-pr6vl&nodePanelView=inputs-outputs

Milvus Log

qn_v84kc.log

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 22, 2023
@ThreadDao ThreadDao added this to the 2.3.4 milestone Dec 22, 2023
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Dec 22, 2023
@yanliang567
Contributor

/assign @chyezh
/unassign

@sre-ci-robot sre-ci-robot assigned chyezh and unassigned yanliang567 Dec 22, 2023
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 22, 2023
@chyezh
Contributor

chyezh commented Dec 26, 2023

From the coredump file.

(gdb) print $_siginfo
$5 = {si_signo = 11, si_errno = 0, si_code = -6, _sifields = {_pad = {8,
      0 <repeats 27 times>}, _kill = {si_pid = 8, si_uid = 0}, _timer = {si_tid = 8,
      si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {
      si_pid = 8, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}},
    _sigchld = {si_pid = 8, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0},
    _sigfault = {si_addr = 0x8, _addr_lsb = 0, _addr_bnd = {_lower = 0x0,
        _upper = 0x0}}, _sigpoll = {si_band = 8, si_fd = 0}}}

(gdb) bt
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:154
#1  0x0000000001a3e0f0 in runtime.raisebadsignal (sig=11, c=0x7f54f21deb90)
    at /usr/local/go/src/runtime/signal_unix.go:947
#2  0x0000000001a3e48c in runtime.badsignal (sig=11, c=0x7f54f21deb90)
    at /usr/local/go/src/runtime/signal_unix.go:1064
#3  0x0000000001a3cce5 in runtime.sigtrampgo (sig=11, info=0x7f54f21ded30,
    ctx=0x7f54f21dec00) at /usr/local/go/src/runtime/signal_unix.go:461
#4  0x0000000001a61666 in runtime.sigtramp ()
    at /usr/local/go/src/runtime/sys_linux_amd64.s:354
#5  0x00007f57f98f93c0 in ?? () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x0000000000000007 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb) frame 3
#3  0x0000000001a3cce5 in runtime.sigtrampgo (sig=11, info=0x7f54f21ded30,
    ctx=0x7f54f21dec00) at /usr/local/go/src/runtime/signal_unix.go:461
461			badsignal(uintptr(sig), c)
(gdb) print *info
$6 = {siginfoFields = {si_signo = 11, si_errno = 0, si_code = 128, si_addr = 0},
  _ = '\000' <repeats 103 times>}
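  • si_signo=11 with si_code=-6 (SI_TKILL) is generated by tgkill; it most likely comes from the Go runtime re-raising the signal (frames #0 to #2), not from the original fault.
  • The real signal is si_signo=11 with si_code=128, but I cannot find where si_code=128 comes from.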

@chyezh
Contributor

chyezh commented Dec 26, 2023

  • In the Milvus builder environment, no si_code=128 is defined for SIGSEGV.
root@24aa49a90d4d:/usr/include# grep -A 20 'SIGSEGV si_codes' asm-generic/siginfo.h
 * SIGSEGV si_codes
 */
#define SEGV_MAPERR	1	/* address not mapped to object */
#define SEGV_ACCERR	2	/* invalid permissions for mapped object */
#define SEGV_BNDERR	3	/* failed address bound checks */
#ifdef __ia64__
# define __SEGV_PSTKOVF	4	/* paragraph stack overflow */
#else
# define SEGV_PKUERR	4	/* failed protection key checks */
#endif
#define SEGV_ACCADI	5	/* ADI not enabled for mapped object */
#define SEGV_ADIDERR	6	/* Disrupting MCD error */
#define SEGV_ADIPERR	7	/* Precise MCD exception */
#define NSIGSEGV	7
  • It seems that the signal was sent by the kernel. Did some system call crash?
#define SI_KERNEL	0x80		/* sent by the kernel from somewhere */
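
To make the two si_code values from the coredump (-6 and 128) easier to read, here is a minimal sketch of a lookup table built from the constants in asm-generic/siginfo.h. The function name `describe_si_code` is hypothetical. As a side note, on x86-64 the kernel commonly delivers SIGSEGV with SI_KERNEL and si_addr=0 for a general protection fault (for example, an access to a non-canonical address); nothing in this thread confirms that this is the cause here.

```python
SIGSEGV = 11

# Generic si_code values from asm-generic/siginfo.h (valid for any signal).
GENERIC_SI_CODES = {
    0: "SI_USER",
    0x80: "SI_KERNEL",
    -1: "SI_QUEUE",
    -2: "SI_TIMER",
    -3: "SI_MESGQ",
    -4: "SI_ASYNCIO",
    -5: "SI_SIGIO",
    -6: "SI_TKILL",
}

# SIGSEGV-specific si_code values (non-ia64), as listed in the grep output.
SEGV_SI_CODES = {
    1: "SEGV_MAPERR",
    2: "SEGV_ACCERR",
    3: "SEGV_BNDERR",
    4: "SEGV_PKUERR",
    5: "SEGV_ACCADI",
    6: "SEGV_ADIDERR",
    7: "SEGV_ADIPERR",
}


def describe_si_code(signo: int, code: int) -> str:
    """Return the symbolic name for a (si_signo, si_code) pair.

    Hypothetical helper for illustration only.
    """
    if code in GENERIC_SI_CODES:
        return GENERIC_SI_CODES[code]
    if signo == SIGSEGV and code in SEGV_SI_CODES:
        return SEGV_SI_CODES[code]
    return f"unknown ({code})"


for code in (-6, 128):
    print(code, describe_si_code(SIGSEGV, code))
```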

@xiaofan-luan
Collaborator

Does this only happen on cardinal?
@liliu-z

@liliu-z
Member

liliu-z commented Dec 26, 2023

Checking

@liliu-z
Member

liliu-z commented Dec 26, 2023

/assign @foxspy

@chyezh
Contributor

chyezh commented Dec 26, 2023

@foxspy the coredump can be found in the pod at laion1b-test-2-milvus-querynode-89866ddf4-56vdc:/tmp/cores

dlv is installed at /root/go/bin/dlv

The previous coredump analysis is based on the file core-laion1b-test-2-milvus-querynode-89866ddf4-56vdc-cardinal_search-8-1703254037

@chyezh
Contributor

chyezh commented Dec 27, 2023

SI_KERNEL was also seen in #29339, so the signal may not have been sent by a system call.

@chyezh
Contributor

chyezh commented Dec 29, 2023

3 more crashes were found on laion1b-test-2-milvus-querynode-f65bb9464-mpgfx.

(gdb) print *info
$6 = {siginfoFields = {si_signo = 11, si_errno = 0, si_code = 1, si_addr = 140661649626852}, _ = '\000' <repeats 103 times>}

(gdb) print *info
$2 = {siginfoFields = {si_signo = 11, si_errno = 0, si_code = 1, si_addr = 16}, _ = '\000' <repeats 103 times>}

(gdb) print *info
$1 = {siginfoFields = {si_signo = 11, si_errno = 0, si_code = 1, si_addr = 140051262240128}, _ = '\000' <repeats 103 times>}

In these crashes si_code is SEGV_MAPERR (1).
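
gdb prints the si_addr values above in decimal, which makes them hard to compare with addresses elsewhere in a coredump. A minimal sketch for viewing them in hex and flagging addresses that fall inside the first page follows. The 4096-byte page size and the helper name `describe_fault_address` are assumptions for illustration. A fault address such as 16 is often a field access through a null pointer, though the thread does not confirm that.

```python
PAGE_SIZE = 4096  # assumed page size; addresses below it are treated as near-null


def describe_fault_address(si_addr: int) -> str:
    """Format a faulting address in hex and tag it as near-null or other.

    Hypothetical helper for illustration only.
    """
    kind = "near-null" if si_addr < PAGE_SIZE else "other"
    return f"{si_addr:#x} ({kind})"


# si_addr values copied from the three coredumps above.
for addr in (140661649626852, 16, 140051262240128):
    print(describe_fault_address(addr))
```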

sre-ci-robot pushed a commit that referenced this issue Jan 3, 2024
…dex (#29628)

Cherry pick from master
pr: #29627 
related to: #29417

Signed-off-by: xianliang <xianliang.li@zilliz.com>
czs007 pushed a commit that referenced this issue Jan 7, 2024
related to : #29417 

Cardinal indexes upload index files in the `Serialize` interface and throw an
exception when `Serialize` fails.

Signed-off-by: xianliang <xianliang.li@zilliz.com>
@yanliang567 yanliang567 modified the milestones: 2.3.4, 2.3.5 Jan 12, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.5, 2.3.6 Jan 22, 2024
@ThreadDao
Contributor Author

Not reproduced on cardinal-foxspy-fix_binlog_index_concurrency-1367e8e-20240114
