
[Bug]: [datanode]flush pack with error, DataNode quit now. panic! #28549

Closed
Richard-lrg opened this issue Nov 17, 2023 · 27 comments
Labels: kind/bug, priority/critical-urgent, triage/accepted

Comments

@Richard-lrg

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.3
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): Java SDK
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 32 cores / 128 GB, ×3 nodes
- GPU: 
- Others:

Current Behavior

[2023/11/17 10:04:19.751 +00:00] [WARN] [datanode/flush_task.go:232] ["flush task error detected"] [error="attempt #0: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #1: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #2: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #3: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #4: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #5: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #6: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #7: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #8: A timeout exceeded while waiting to proceed with the request, please reduce your request rate: attempt #9: A timeout exceeded while waiting to proceed with the request, please reduce your request rate"] []
[2023/11/17 10:04:19.751 +00:00] [ERROR] [datanode/flush_manager.go:853] ["flush pack with error, DataNode quit now"] [error="execution failed"] [errorVerbose="execution failed\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).getFlushPack\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:233\n | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:190\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) execution failed\nError types: (1) *withstack.withStack (2) *errutil.leafError"] [stack="github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:853\ngithub.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:206"]
panic: execution failed

goroutine 10007367 [running]:
panic({0x43bdca0, 0xc0d63bcd38})
	/usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc0015a3978 sp=0xc0015a38b8 pc=0x178ec1b
github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1(0xc0d6394dc0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:855 +0x1571 fp=0xc0015a3f78 sp=0xc0015a3978 pc=0x3765e11
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish(0xc12d850a80, 0xc00a094ab0, 0xc1d10824f0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:206 +0xde fp=0xc0015a3fb8 sp=0xc0015a3f78 pc=0x37675be
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1.1()
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:122 +0x2e fp=0xc0015a3fe0 sp=0xc0015a3fb8 pc=0x376680e
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0015a3fe8 sp=0xc0015a3fe0 pc=0x17c89a1
created by github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1
	/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:122 +0xf8

goroutine 1 [chan receive, 22 minutes]:
runtime.gopark(0xc00027a100?, 0xc001841790?, 0xfe?, 0x2d?, 0x4185fa0?)
	/usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0020c7738 sp=0xc0020c7718 pc=0x17920d6
runtime.chanrecv(0xc0000f4a20, 0x0, 0x1)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

full log: https://pan.baidu.com/s/1m02d1kp4_x6BQHcFK4tJhg?pwd=mmvh

dd_part.log

Anything else?

No response

@Richard-lrg Richard-lrg added the kind/bug and needs-triage labels Nov 17, 2023
@yanliang567
Contributor

/assign @jiaoew1991
/unassign

@yanliang567 yanliang567 added the triage/accepted label and removed the needs-triage label Nov 18, 2023
@yanliang567 yanliang567 added this to the 2.3.4 milestone Nov 18, 2023
@xiaofan-luan
Collaborator

Marking as high priority, please take a look at it.

@xiaofan-luan
Collaborator

xiaofan-luan commented Nov 19, 2023

"A timeout exceeded while waiting to proceed with the request" -> I think this might be related to the MinIO quota being exceeded.

@xiaofan-luan
Collaborator

@jiaoew1991
Can we add a command in birdwatcher to get stats of the files recorded in etcd?
Like:

How many delta logs are there in total, and how many stats logs are there in total?
This might be very important for such use cases.

We also need to think about how to handle these bugs.

@xiaofan-luan
Collaborator

@Cactus-L
On the other hand, could you share your code?
If you don't do frequent flushes or deletes, there shouldn't be a reason for MinIO to rate-limit.

@Richard-lrg
Author

@Cactus-L On the other hand, could you share your code? If you don't do frequent flushes or deletes, there shouldn't be a reason for MinIO to rate-limit.

Actually I don't do a lot of operations.
I saw from the log that the problem was caused by a large number of deletion operations, but they were not triggered by me; it seems to be caused by some special problem.
I have described this in detail in another issue: #28514

@Richard-lrg
Author

@Cactus-L On the other hand, could you share your code? If you don't do frequent flushes or deletes, there shouldn't be a reason for MinIO to rate-limit.

And do you know how to raise the MinIO rate limit? I will try to modify it and watch the service status.

@Richard-lrg
Author

Richard-lrg commented Nov 20, 2023

@Cactus-L On the other hand, could you share your code? If you don't do frequent flushes or deletes, there shouldn't be a reason for MinIO to rate-limit.

// upsert data:
public void upsertRows(String collection, List<JSONObject> rowList) {
    UpsertParam upsertParam = UpsertParam.newBuilder()
            .withCollectionName(collection)
            .withRows(rowList)
            .build();
    R<MutationResult> responseR = getMilvusClient().upsert(upsertParam);
    if (Objects.isNull(responseR) || Objects.isNull(responseR.getStatus())
            || !responseR.getStatus().equals(R.Status.Success.getCode())) {
        String message = responseR.getMessage();
        if (StringUtils.isNotBlank(message) && message.contains("CollectionNotExists")) {
            throw new MilvusCollectionNotExistsException(message);
        } else {
            throw new MilvusException(message);
        }
    }
}

// search data:
public SearchResultsWrapper search(List<List<Float>> searchVectors, String collection, String searchFiled,
                                   Integer topK, List<String> outputFields, String searchParam,
                                   String filterExpr) throws BaseException {
    collection = collection.replace("-", "");
    if (StringUtils.isEmpty(filterExpr)) {
        filterExpr = "";
    }
    SearchParam requestParam = SearchParam.newBuilder()
            .withCollectionName(collection)
            .withConsistencyLevel(ConsistencyLevelEnum.STRONG)
            .withMetricType(MetricType.IP)
            .withOutFields(outputFields)
            .withTopK(topK)
            .withVectors(searchVectors)
            .withVectorFieldName(searchFiled)
            .withParams(searchParam)
            .withExpr(filterExpr)
            .build();
    R<SearchResults> respSearch = getMilvusClient()
            .withTimeout(MILVUS_SEARCH_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .search(requestParam);
    if (respSearch.getStatus() != R.Status.Success.getCode()) {
        String exceptionMsg = String.format("search failed, collection : %s, error msg : %s",
                collection, respSearch.getMessage());
        if (respSearch.getMessage() != null && respSearch.getMessage().contains("CollectionNotExists")) {
            throw new MilvusCollectionNotExistsException(exceptionMsg);
        } else {
            throw new MilvusException(exceptionMsg);
        }
    }
    SearchResultsWrapper wrapperSearch = new SearchResultsWrapper(respSearch.getData().getResults());
    return wrapperSearch;
}

@xiaofan-luan
Collaborator

The reason for the frequent deletes is the upserts.

Each upsert triggers one delete and one insert.

How much TPS are you looking for? Could you offer the datanode log?

@Richard-lrg
Author

The reason for the frequent deletes is the upserts.

Each upsert triggers one delete and one insert.

How much TPS are you looking for? Could you offer the datanode log?

1k.
Here is the full log of the datanode:
[image: screenshot of the datanode log]

@yanliang567 yanliang567 added the priority/urgent label Nov 21, 2023
@xiaofan-luan
Collaborator

Why do all inserts have to be upserts?

@xiaofan-luan
Collaborator

Could you give us a detailed scenario in which you need to update vectors that frequently?

@xiaofan-luan
Collaborator

@yanliang567
Could we set up a test environment in-house with 1000 upsert TPS and check?

@xiaofan-luan
Collaborator

I think this is due to frequent deletes -> causing frequent flush/compaction -> MinIO cannot serve the write concurrency.

@aoiasd can you investigate why the flushes happen? If it's due to the memory limit, adding more datanodes could help.

@Richard-lrg
Author

Why do all inserts have to be upserts?

Similar to the scenario where ES builds a data index: each piece of data has a unique primary key. When real-time data is synchronized into the system, we usually perform an upsert operation because it is impossible to distinguish whether the data is a create or an update (if you need to differentiate them, you have to pay a higher cost).

@xiaofan-luan
Collaborator

Upsert on Milvus is very expensive.

Maybe you can query and check whether the entity exists before you upsert.
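
A minimal sketch of that query-before-upsert idea, assuming it lives in the same client wrapper class as the snippets above (so getMilvusClient() and upsertRows(...) are available); the primary-key field name "id" and the fastjson row layout are assumptions, not anything confirmed in this thread:

import java.util.Collections;
import com.alibaba.fastjson.JSONObject;
import io.milvus.grpc.MutationResult;
import io.milvus.grpc.QueryResults;
import io.milvus.param.R;
import io.milvus.param.dml.InsertParam;
import io.milvus.param.dml.QueryParam;
import io.milvus.response.QueryResultsWrapper;

// Hypothetical helper: only pay for the expensive upsert (delete + insert)
// when the primary key already exists; otherwise do a plain insert.
public void insertOrUpsert(String collection, JSONObject row) {
    long pkValue = row.getLongValue("id");  // "id" is an assumed primary-key field name

    // Check whether an entity with this primary key already exists.
    QueryParam queryParam = QueryParam.newBuilder()
            .withCollectionName(collection)
            .withExpr("id == " + pkValue)
            .withOutFields(Collections.singletonList("id"))
            .build();
    R<QueryResults> queryResp = getMilvusClient().query(queryParam);

    // Result inspection may differ slightly across SDK versions.
    boolean exists = queryResp.getStatus() == R.Status.Success.getCode()
            && new QueryResultsWrapper(queryResp.getData())
                    .getFieldWrapper("id").getRowCount() > 0;

    if (exists) {
        // Existing entity: fall back to the upsertRows(...) method shown above.
        upsertRows(collection, Collections.singletonList(row));
    } else {
        // New entity: a plain insert avoids generating a delete mark.
        InsertParam insertParam = InsertParam.newBuilder()
                .withCollectionName(collection)
                .withRows(Collections.singletonList(row))
                .build();
        R<MutationResult> insertResp = getMilvusClient().insert(insertParam);
        // Check insertResp's status the same way upsertRows(...) does above.
    }
}

Note that the extra query adds read load and is not atomic, so concurrent writers can still race on the same key; this only helps when most upserts are actually first-time inserts.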

@xiaofan-luan
Collaborator

There are a couple of things to do on our side:
1. Under upsert, we should think of a way to reduce delete marks (maybe we can read once on delete, or we can handle this on flush). @aoiasd
2. Reduce duplicated flushes; this happens while syncing is queued, and a new flush may be submitted again and again.
3. Add a jitter so that the sync period becomes a random period and flushes won't all happen at the same time (see the sketch after this comment).

A few things to do on your side:

  1. Setting your dataNode.segment.syncPeriod to a larger value may reduce the flush pressure. 1800 might be a proper value?
  2. Scale the MinIO cluster or use SSDs (sorry about this, but it could help).
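
A minimal illustration of the jitter in item 3 of the first list. This is a hypothetical Java sketch only (Milvus's actual datanode sync logic is written in Go), with the class, method, and parameter names invented for illustration and the base period standing in for dataNode.segment.syncPeriod:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: run a periodic segment sync, but randomize each cycle's
// delay so that segments created at the same time do not all flush together.
public class JitteredSyncScheduler {
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private final long basePeriodSeconds; // stands in for dataNode.segment.syncPeriod
    private final double jitterRatio;     // e.g. 0.5 -> each delay falls in [0.5 * base, base]

    public JitteredSyncScheduler(long basePeriodSeconds, double jitterRatio) {
        this.basePeriodSeconds = basePeriodSeconds;
        this.jitterRatio = jitterRatio;
    }

    public void start(Runnable syncSegments) {
        scheduleNext(syncSegments);
    }

    private void scheduleNext(Runnable syncSegments) {
        long minDelay = (long) (basePeriodSeconds * (1.0 - jitterRatio));
        long delay = ThreadLocalRandom.current().nextLong(minDelay, basePeriodSeconds + 1);
        executor.schedule(() -> {
            try {
                syncSegments.run();
            } finally {
                scheduleNext(syncSegments); // pick a fresh random delay for the next cycle
            }
        }, delay, TimeUnit.SECONDS);
    }
}

The commit referenced later in this thread applies this kind of jitter inside Milvus itself, so upgrading is the practical route; the sketch only shows the idea.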

@xiaofan-luan
Collaborator

The delete-mark problem will be fully solved by the L0 Delete feature in 2.4.
Search for more details

czs007 pushed a commit that referenced this issue Dec 20, 2023
related to #28549
pr: #28626

1. avoid duplicated sync segments under syncing states
2. add jitter to avoid sync segments at the same time

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
@Richard-lrg
Author

There are a couple of things to do on our side: 1. Under upsert, we should think of a way to reduce delete marks (maybe we can read once on delete, or we can handle this on flush). @aoiasd 2. Reduce duplicated flushes; this happens while syncing is queued, and a new flush may be submitted again and again. 3. Add a jitter so that the sync period becomes a random period and flushes won't all happen at the same time.

A few things to do on your side:

  1. Setting your dataNode.segment.syncPeriod to a larger value may reduce the flush pressure. 1800 might be a proper value?
  2. Scale the MinIO cluster or use SSDs (sorry about this, but it could help).

If I set dataCoord.compaction.enableAutoCompaction=false, will the problem be alleviated?
(The indexType of all my collections is FLAT.)

@xiaofan-luan
Collaborator

There are a couple of things to do on our side: 1. Under upsert, we should think of a way to reduce delete marks (maybe we can read once on delete, or we can handle this on flush). @aoiasd 2. Reduce duplicated flushes; this happens while syncing is queued, and a new flush may be submitted again and again. 3. Add a jitter so that the sync period becomes a random period and flushes won't all happen at the same time.
A few things to do on your side:

  1. Setting your dataNode.segment.syncPeriod to a larger value may reduce the flush pressure. 1800 might be a proper value?
  2. Scale the MinIO cluster or use SSDs (sorry about this, but it could help).

If I set dataCoord.compaction.enableAutoCompaction=false, will the problem be alleviated? (The indexType of all my collections is FLAT.)

This has nothing to do with compaction. I think the most important thing is to avoid frequent upserts and deletes.
Another important action is to make sure your MinIO cluster has enough IOPS. Deploying MinIO on SSDs might help.

@xiaofan-luan
Collaborator

Upgrading to 2.3.4 could help a little bit.
Is there any specific reason you have so many deletes?

@yanliang567 yanliang567 modified the milestones: 2.3.4, 2.3.5 Jan 12, 2024
@hemangjoshi37a

From the logs and your description, it seems that the frequent upserts are causing a high rate of delete operations, leading to excessive flush and compaction activity. This is likely overwhelming MinIO's capacity and causing timeouts.

To alleviate this issue:

  1. Consider reducing the frequency of upsert operations. Frequent upserts can strain the system, particularly in a high-throughput environment.
  2. Increase MinIO's IOPS capacity, preferably by deploying it on SSDs, to handle the increased load.
  3. Adjust dataNode.segment.syncPeriod to a higher value like 1800 to reduce flush pressure.
  4. Await the L0 Delete feature in version 2.4 for a more robust solution to handle frequent deletes.

Implementing these measures should help mitigate the issue.

@xiaofan-luan
Collaborator

From the logs and your description, it seems that the frequent upserts are causing a high rate of delete operations, leading to excessive flush and compaction activity. This is likely overwhelming MinIO's capacity and causing timeouts.

To alleviate this issue:

  1. Consider reducing the frequency of upsert operations. Frequent upserts can strain the system, particularly in a high-throughput environment.
  2. Increase MinIO's IOPS capacity, preferably by deploying it on SSDs, to handle the increased load.
  3. Adjust dataNode.segment.syncPeriod to a higher value like 1800 to reduce flush pressure.
  4. Await the L0 Delete feature in version 2.4 for a more robust solution to handle frequent deletes.

Implementing these measures should help mitigate the issue.

Thanks for the clarification!

@yanliang567 yanliang567 modified the milestones: 2.3.5, 2.3.6 Jan 22, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.6, 2.3.7 Jan 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.7, 2.3.9, 2.3.10 Feb 18, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.10, 2.3.11 Feb 28, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.11, 2.3.12 Mar 11, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.12, 2.3.13 Mar 22, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.13, 2.3.14 Apr 15, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.14, 2.3.15 Apr 23, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.15, 2.3.16 May 16, 2024
@yanliang567
Contributor

Verified no reproduction on 2.4.5; closing for now.
