
[Bug]: [datanode]flush pack with error, DataNode quit now. panic! #28549

Closed
Richard-lrg opened this issue Nov 17, 2023 · 27 comments
Labels: kind/bug, priority/critical-urgent, triage/accepted

Comments

@Richard-lrg

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.3
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): Java SDK
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 32 cores / 128 GB, ×3 nodes
- GPU: 
- Others:

Current Behavior

[2023/11/17 10:04:19.751 +00:00] [WARN] [datanode/flush_task.go:232] ["flush task error detected"] [error="attempt #0: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #1: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #2: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #3: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #4: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #5: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #6: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #7: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #8: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #9: A timeout exceeded while waiting to proceed with the request, please reduce your request rate"] []
[2023/11/17 10:04:19.751 +00:00] [ERROR] [datanode/flush_manager.go:853] ["flush pack with error, DataNode quit now"] [error="execution failed"] [errorVerbose="execution failed\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).getFlushPack\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:233\n | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:190\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) execution failed\nError types: (1) *withstack.withStack (2) *errutil.leafError"] [stack="github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:853\ngithub.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:206"]
panic: execution failed

goroutine 10007367 [running]:
panic({0x43bdca0, 0xc0d63bcd38})
	/usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc0015a3978 sp=0xc0015a38b8 pc=0x178ec1b
github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1(0xc0d6394dc0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:855 +0x1571 fp=0xc0015a3f78 sp=0xc0015a3978 pc=0x3765e11
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish(0xc12d850a80, 0xc00a094ab0, 0xc1d10824f0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:206 +0xde fp=0xc0015a3fb8 sp=0xc0015a3f78 pc=0x37675be
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1.1()
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:122 +0x2e fp=0xc0015a3fe0 sp=0xc0015a3fb8 pc=0x376680e
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0015a3fe8 sp=0xc0015a3fe0 pc=0x17c89a1
created by github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:122 +0xf8

goroutine 1 [chan receive, 22 minutes]:
runtime.gopark(0xc00027a100?, 0xc001841790?, 0xfe?, 0x2d?, 0x4185fa0?)
	/usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0020c7738 sp=0xc0020c7718 pc=0x17920d6
runtime.chanrecv(0xc0000f4a20, 0x0, 0x1)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

full log: https://pan.baidu.com/s/1m02d1kp4_x6BQHcFK4tJhg?pwd=mmvh

dd_part.log

Anything else?

No response

@Richard-lrg Richard-lrg added the kind/bug and needs-triage labels Nov 17, 2023
@yanliang567
Contributor

/assign @jiaoew1991
/unassign

@yanliang567 yanliang567 added the triage/accepted label and removed the needs-triage label Nov 18, 2023
@yanliang567 yanliang567 added this to the 2.3.4 milestone Nov 18, 2023
@xiaofan-luan
Collaborator

Marking as high priority, please take a look at it.

@xiaofan-luan
Collaborator

xiaofan-luan commented Nov 19, 2023

"A timeout exceeded while waiting to proceed with the request" -> I think this might be related to the MinIO quota being exceeded.

@xiaofan-luan
Collaborator

@jiaoew1991
Can we add a command in birdwatcher to get stats of the files recorded in etcd?
Like:

How many delta logs are there in total, and how many stats logs are there in total?
This might be very important for such use cases.

We also need to think about how to handle these bugs.

@xiaofan-luan
Collaborator

@Cactus-L
On the other hand, could you share your code?
If you don't do frequent flushes or deletes, there shouldn't be a reason for MinIO to rate-limit.

@Richard-lrg
Author

@Cactus-L On the other hand, could you share your code? If you don't do frequent flushes or deletes, there shouldn't be a reason for MinIO to rate-limit.

Actually I don't do a lot of operations.
I saw from the log that the problem was caused by a large number of deletion operations, but they were not triggered by me; it seems to be caused by some special problem.
I have described this in detail in another issue: #28514

@Richard-lrg
Author

@Cactus-L On the other hand, could you share your code? If you don't do frequent flushes or deletes, there shouldn't be a reason for MinIO to rate-limit.

And do you know how to raise the MinIO rate limit? I will try to modify it and watch the service status.

@Richard-lrg
Author

Richard-lrg commented Nov 20, 2023

@Cactus-L On the other hand, could you share your code? If you don't do frequent flushes or deletes, there shouldn't be a reason for MinIO to rate-limit.

// upsert data:
public void upsertRows(String collection, List<JSONObject> rowList) {
    UpsertParam upsertParam = UpsertParam.newBuilder()
            .withCollectionName(collection)
            .withRows(rowList)
            .build();
    R<MutationResult> responseR = getMilvusClient().upsert(upsertParam);
    if (Objects.isNull(responseR) || Objects.isNull(responseR.getStatus())
            || !responseR.getStatus().equals(R.Status.Success.getCode())) {
        String message = responseR.getMessage();
        if (StringUtils.isNotBlank(message) && message.contains("CollectionNotExists")) {
            throw new MilvusCollectionNotExistsException(message);
        } else {
            throw new MilvusException(message);
        }
    }
}

// search data:
public SearchResultsWrapper search(List<List<Float>> searchVectors, String collection, String searchFiled,
                                   Integer topK, List<String> outputFields, String searchParam,
                                   String filterExpr) throws BaseException {
    collection = collection.replace("-", "");
    if (StringUtils.isEmpty(filterExpr)) {
        filterExpr = "";
    }
    SearchParam requestParam = SearchParam.newBuilder()
            .withCollectionName(collection)
            .withConsistencyLevel(ConsistencyLevelEnum.STRONG)
            .withMetricType(MetricType.IP)
            .withOutFields(outputFields)
            .withTopK(topK)
            .withVectors(searchVectors)
            .withVectorFieldName(searchFiled)
            .withParams(searchParam)
            .withExpr(filterExpr)
            .build();
    R<SearchResults> respSearch = getMilvusClient()
            .withTimeout(MILVUS_SEARCH_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .search(requestParam);
    if (respSearch.getStatus() != R.Status.Success.getCode()) {
        String exceptionMsg = String.format("search failed, collection : %s, error msg : %s",
                collection, respSearch.getMessage());
        if (respSearch.getMessage() != null && respSearch.getMessage().contains("CollectionNotExists")) {
            throw new MilvusCollectionNotExistsException(exceptionMsg);
        } else {
            throw new MilvusException(exceptionMsg);
        }
    }
    SearchResultsWrapper wrapperSearch = new SearchResultsWrapper(respSearch.getData().getResults());
    return wrapperSearch;
}

@xiaofan-luan
Collaborator

The reason for the frequent deletes is the upserts.

Each upsert triggers one delete and one insert.

How much TPS are you looking for? Could you offer the datanode log?

@Richard-lrg
Author

The reason for the frequent deletes is the upserts.

Each upsert triggers one delete and one insert.

How much TPS are you looking for? Could you offer the datanode log?

1k.
Here is the full log of the datanode:
[image: screenshot of the datanode log]

@yanliang567 yanliang567 added the priority/urgent label Nov 21, 2023
@xiaofan-luan
Collaborator

Why do all inserts have to be upserts?

@xiaofan-luan
Collaborator

Could you give us a detailed scenario in which you need to update vectors that frequently?

@xiaofan-luan
Collaborator

@yanliang567
Could we set up a test environment in-house with 1000 upsert TPS and check?

@xiaofan-luan
Collaborator

I think this is due to frequent deletes -> causing frequent flush/compaction -> MinIO cannot serve the write concurrency.

@aoiasd can you investigate why the flushes happen? If it's due to the memory limit, adding more datanodes could help.

@Richard-lrg
Author

Why do all inserts have to be upserts?

Similar to the scenario where ES builds a data index: each piece of data has a unique primary key. When real-time data is synchronized into the system, we usually perform an upsert operation because it is impossible to distinguish whether the data is a create or an update (if you need to differentiate them, you have to pay a higher cost).

@xiaofan-luan
Collaborator

Upsert on Milvus is very expensive.

Maybe you can query and check whether the entity exists before you upsert.
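
A minimal sketch of that query-before-upsert idea, assuming it lives in the same client wrapper class as the snippets above (so getMilvusClient() and upsertRows(...) are available); the primary-key field name "id" and the fastjson row layout are assumptions, not anything confirmed in this thread:

import java.util.Collections;
import com.alibaba.fastjson.JSONObject;
import io.milvus.grpc.MutationResult;
import io.milvus.grpc.QueryResults;
import io.milvus.param.R;
import io.milvus.param.dml.InsertParam;
import io.milvus.param.dml.QueryParam;
import io.milvus.response.QueryResultsWrapper;

// Hypothetical helper: only pay for the expensive upsert (delete + insert)
// when the primary key already exists; otherwise do a plain insert.
public void insertOrUpsert(String collection, JSONObject row) {
    long pkValue = row.getLongValue("id");  // "id" is an assumed primary-key field name

    // Check whether an entity with this primary key already exists.
    QueryParam queryParam = QueryParam.newBuilder()
            .withCollectionName(collection)
            .withExpr("id == " + pkValue)
            .withOutFields(Collections.singletonList("id"))
            .build();
    R<QueryResults> queryResp = getMilvusClient().query(queryParam);

    // Result inspection may differ slightly across SDK versions.
    boolean exists = queryResp.getStatus() == R.Status.Success.getCode()
            && new QueryResultsWrapper(queryResp.getData())
                    .getFieldWrapper("id").getRowCount() > 0;

    if (exists) {
        // Existing entity: fall back to the upsertRows(...) method shown above.
        upsertRows(collection, Collections.singletonList(row));
    } else {
        // New entity: a plain insert avoids generating a delete mark.
        InsertParam insertParam = InsertParam.newBuilder()
                .withCollectionName(collection)
                .withRows(Collections.singletonList(row))
                .build();
        R<MutationResult> insertResp = getMilvusClient().insert(insertParam);
        // Check insertResp's status the same way upsertRows(...) does above.
    }
}

Note that the extra query adds read load and is not atomic, so concurrent writers can still race on the same key; this only helps when most upserts are actually first-time inserts.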

@xiaofan-luan
Collaborator

There are a couple of things to do on our side:
1. Under upsert, we should think of a way to reduce delete marks (maybe we can read once on delete, or we can handle this on flush). @aoiasd
2. Reduce duplicated flushes; this happens while syncing is queued, and a new flush may be submitted again and again.
3. Add a jitter so that the sync period becomes a random period and flushes won't all happen at the same time (see the sketch after this comment).

A few things to do on your side:

  1. Setting your dataNode.segment.syncPeriod to a larger value may reduce the flush pressure. 1800 might be a proper value?
  2. Scale the MinIO cluster or use SSDs (sorry about this, but it could help).
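
A minimal illustration of the jitter in item 3 of the first list. This is a hypothetical Java sketch only (Milvus's actual datanode sync logic is written in Go), with the class, method, and parameter names invented for illustration and the base period standing in for dataNode.segment.syncPeriod:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: run a periodic segment sync, but randomize each cycle's
// delay so that segments created at the same time do not all flush together.
public class JitteredSyncScheduler {
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private final long basePeriodSeconds; // stands in for dataNode.segment.syncPeriod
    private final double jitterRatio;     // e.g. 0.5 -> each delay falls in [0.5 * base, base]

    public JitteredSyncScheduler(long basePeriodSeconds, double jitterRatio) {
        this.basePeriodSeconds = basePeriodSeconds;
        this.jitterRatio = jitterRatio;
    }

    public void start(Runnable syncSegments) {
        scheduleNext(syncSegments);
    }

    private void scheduleNext(Runnable syncSegments) {
        long minDelay = (long) (basePeriodSeconds * (1.0 - jitterRatio));
        long delay = ThreadLocalRandom.current().nextLong(minDelay, basePeriodSeconds + 1);
        executor.schedule(() -> {
            try {
                syncSegments.run();
            } finally {
                scheduleNext(syncSegments); // pick a fresh random delay for the next cycle
            }
        }, delay, TimeUnit.SECONDS);
    }
}

The commit referenced later in this thread applies this kind of jitter inside Milvus itself, so upgrading is the practical route; the sketch only shows the idea.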

@xiaofan-luan
Collaborator

The delete-mark problem will be fully solved by the L0 Delete feature in 2.4.
Search for more details

czs007 pushed a commit that referenced this issue Dec 20, 2023
related to #28549
pr: #28626

1. avoid duplicated sync segments under syncing states
2. add jitter to avoid sync segments at the same time

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
@Richard-lrg
Author

There are a couple of things to do on our side: 1. Under upsert, we should think of a way to reduce delete marks (maybe we can read once on delete, or we can handle this on flush). @aoiasd 2. Reduce duplicated flushes; this happens while syncing is queued, and a new flush may be submitted again and again. 3. Add a jitter so that the sync period becomes a random period and flushes won't all happen at the same time.

A few things to do on your side:

  1. Setting your dataNode.segment.syncPeriod to a larger value may reduce the flush pressure. 1800 might be a proper value?
  2. Scale the MinIO cluster or use SSDs (sorry about this, but it could help).

If I set dataCoord.compaction.enableAutoCompaction=false, will the problem be alleviated?
(The indexType of all my collections is FLAT.)

@xiaofan-luan
Collaborator

There are a couple of things to do on our side: 1. Under upsert, we should think of a way to reduce delete marks (maybe we can read once on delete, or we can handle this on flush). @aoiasd 2. Reduce duplicated flushes; this happens while syncing is queued, and a new flush may be submitted again and again. 3. Add a jitter so that the sync period becomes a random period and flushes won't all happen at the same time.
A few things to do on your side:

  1. Setting your dataNode.segment.syncPeriod to a larger value may reduce the flush pressure. 1800 might be a proper value?
  2. Scale the MinIO cluster or use SSDs (sorry about this, but it could help).

If I set dataCoord.compaction.enableAutoCompaction=false, will the problem be alleviated? (The indexType of all my collections is FLAT.)

This has nothing to do with compaction. I think the most important thing is to avoid frequent upserts and deletes.
Another important action is to make sure your MinIO cluster has enough IOPS. Deploying MinIO on SSDs might help.

@xiaofan-luan
Collaborator

Upgrading to 2.3.4 could help a little bit.
Is there any specific reason you have so many deletes?

@yanliang567 yanliang567 modified the milestones: 2.3.4, 2.3.5 Jan 12, 2024
@hemangjoshi37a

From the logs and your description, it seems that the frequent upserts are causing a high rate of delete operations, leading to excessive flush and compaction activity. This is likely overwhelming MinIO's capacity and causing timeouts.

To alleviate this issue:

  1. Consider reducing the frequency of upsert operations. Frequent upserts can strain the system, particularly in a high-throughput environment.
  2. Increase MinIO's IOPS capacity, preferably by deploying it on SSDs, to handle the increased load.
  3. Adjust dataNode.segment.syncPeriod to a higher value like 1800 to reduce flush pressure.
  4. Await the L0 Delete feature in version 2.4 for a more robust solution to handle frequent deletes.

Implementing these measures should help mitigate the issue.

@xiaofan-luan
Collaborator

From the logs and your description, it seems that the frequent upserts are causing a high rate of delete operations, leading to excessive flush and compaction activity. This is likely overwhelming MinIO's capacity and causing timeouts.

To alleviate this issue:

  1. Consider reducing the frequency of upsert operations. Frequent upserts can strain the system, particularly in a high-throughput environment.
  2. Increase MinIO's IOPS capacity, preferably by deploying it on SSDs, to handle the increased load.
  3. Adjust dataNode.segment.syncPeriod to a higher value like 1800 to reduce flush pressure.
  4. Await the L0 Delete feature in version 2.4 for a more robust solution to handle frequent deletes.

Implementing these measures should help mitigate the issue.

Thanks for the clarification!

@yanliang567 yanliang567 modified the milestones: 2.3.5, 2.3.6 Jan 22, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.6, 2.3.7 Jan 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.7, 2.3.9, 2.3.10 Feb 18, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.10, 2.3.11 Feb 28, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.11, 2.3.12 Mar 11, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.12, 2.3.13 Mar 22, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.13, 2.3.14 Apr 15, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.14, 2.3.15 Apr 23, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.15, 2.3.16 May 16, 2024
@yanliang567
Contributor

Verified no reproduction on 2.4.5; closing for now.
