[Bug]: When deleting a varchar collection, deleteBufferSize expands 7 times and compaction is not triggered in time #37582

Closed
ThreadDao opened this issue Nov 11, 2024 · 5 comments
Labels
kind/bug Issues or changes related to a bug · priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. · triage/accepted Indicates an issue or PR is ready to be actively worked on.

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20241106-20534a3f-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server config

qn: 5 * 8c32g (5 query nodes, 8 CPU / 32 GB memory each)

    dataCoord:
      compaction:
        taskPrioritizer: default
      enableActiveStandby: true
      segment:
        expansionRate: 1.15
        maxSize: 2048
        sealProportion: 0.12
    queryNode:
      levelZeroForwardPolicy: RemoteLoad
      streamingDeltaForwardPolicy: Direct
    quotaAndLimits:
      dml:
        deleteRate:
          max: 2
        enabled: true
        insertRate:
          max: 16
      limitWriting:
        deleteBufferRowCountProtection:
          enabled: true
          highWaterLevel: 25000000
          lowWaterLevel: 12000000
        deleteBufferSizeProtection:
          enabled: true
          highWaterLevel: 1073741824 # 1 GB
          lowWaterLevel: 268435456 # 256 MB
        growingSegmentsSizeProtection:
          enabled: true
          highWaterLevel: 0.2
          lowWaterLevel: 0.1
          minRateRatio: 0.5
        l0SegmentsRowCountProtection:
          enabled: true
          highWaterLevel: 50000000
          lowWaterLevel: 25000000
        memProtection:
          dataNodeMemoryHighWaterLevel: 0.85
          dataNodeMemoryLowWaterLevel: 0.75
          queryNodeMemoryHighWaterLevel: 0.85
          queryNodeMemoryLowWaterLevel: 0.75
      limits:
        complexDeleteLimitEnable: true
        maxOutputSize: 209715200

test steps

  1. The collection has a varchar field of max length 64 (the primary key) and a vector field of dim 128.
  2. Delete 60 million PKs out of the 100 million total, in batches of 60,000, while searches run concurrently (see the pymilvus sketch below).
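
For reference, a minimal pymilvus sketch of the delete workload above. The collection name, PK field name (`pk`), PK values, and connection details are placeholders, and the concurrent search load is omitted:

    import json
    from pymilvus import connections, Collection

    connections.connect(alias="default", host="127.0.0.1", port="19530")  # placeholder endpoint
    coll = Collection("compact_opt_rate_100m")  # placeholder collection name

    TO_DELETE = 60_000_000  # delete 60 million primary keys in total
    BATCH = 60_000          # delete batch size used in the test

    for start in range(0, TO_DELETE, BATCH):
        # varchar PKs of length 64, zero-padded purely for illustration
        batch = [f"{i:064d}" for i in range(start, start + BATCH)]
        # json.dumps yields double-quoted values, which Milvus boolean expressions accept
        expr = f"pk in {json.dumps(batch)}"
        coll.delete(expr)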

test results

  • [metrics of compact-opt-rate-100m-1](https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=P1809F7CD0C75ACF3&var-namespace=qa-milvus&var-instance=compact-opt-rate-100m-1&var-collection=All&var-app_name=milvus&from=1731295244000&to=1731299335348)
  1. deleteBufferSize
    The queryNode metrics show that deleteBufferSize is far too high: it is about 7 times larger than the actual deleteBufferRowCount implies. Expected size from the row count: 2000000*(64+8)/1024/1024 ≈ 137 MB; actual deleteBufferSize ≈ 1 GB (see the arithmetic sketch after this list).
    (screenshot)

  2. querynode memory usage
    During the target update, query node memory fluctuated by about 30% (roughly 10 GiB). Please help confirm whether this is expected and whether it can be optimized. FYI, levelZeroForwardPolicy is RemoteLoad and segment maxSize is 2048.
    (screenshot)

  3. compaction trigger
    Compaction was only triggered 18 minutes after the deletion started. Why does it take so long?
    (screenshot)
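
A quick check of the figures in result 1, assuming each buffered delete costs a 64-byte varchar PK plus an 8-byte timestamp (the 64+8 figure from the report):

    rows = 2_000_000               # observed deleteBufferRowCount
    expected = rows * (64 + 8)     # expected bytes per the report
    print(expected / 1024 / 1024)  # ~137 MB expected
    observed = 1 * 1024 ** 3       # ~1 GB reported deleteBufferSize
    print(observed / expected)     # ~7.5, matching the roughly 7x expansion reported above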

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

pods:

compact-opt-rate-100m-1-milvus-datanode-677f6dfd9f-2s7fd          1/1     Running                  0                5d3h    10.104.24.14    4am-node29   <none>           <none>
compact-opt-rate-100m-1-milvus-indexnode-86d4bbc5f6-kmxt7         1/1     Running                  0                5d3h    10.104.4.224    4am-node11   <none>           <none>
compact-opt-rate-100m-1-milvus-indexnode-86d4bbc5f6-nsw9r         1/1     Running                  0                5d3h    10.104.15.2     4am-node20   <none>           <none>
compact-opt-rate-100m-1-milvus-mixcoord-78fd7d5865-skv2v          1/1     Running                  0                5d3h    10.104.13.63    4am-node16   <none>           <none>
compact-opt-rate-100m-1-milvus-proxy-567b6694bf-l4ttz             1/1     Running                  0                5d3h    10.104.1.97     4am-node10   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-5qxl8       1/1     Running                  1 (3d15h ago)    5d3h    10.104.30.179   4am-node38   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-82ddx       1/1     Running                  0                26h     10.104.18.10    4am-node25   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-c9pw4       1/1     Running                  0                2d20h   10.104.16.219   4am-node21   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-gg5n2       1/1     Running                  12 (3d15h ago)   5d3h    10.104.25.45    4am-node30   <none>           <none>
compact-opt-rate-100m-1-milvus-querynode-0-7d49995585-npv9j       1/1     Running                  4 (3d15h ago)    5d3h    10.104.17.169   4am-node23   <none>           <none>

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 11, 2024
@ThreadDao
Contributor Author

/assign @XuanYang-cn Please help investigate

@ThreadDao ThreadDao added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 11, 2024
@ThreadDao ThreadDao added this to the 2.4.15 milestone Nov 11, 2024
@XuanYang-cn
Contributor

Actually, L0 compaction executes so fast that the compaction task num metric increments and decrements within 30 s, so the activity does not show up in that metric, but the latency metrics and the logs prove it ran.
(screenshot)
(screenshot)

@XuanYang-cn
Contributor

Compaction triggers and executes quickly, but the number of L0 segments cannot be controlled.

Picked 2 segments out of 37; some config changes might be needed for varchar.
(screenshot)

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 12, 2024
@XuanYang-cn
Contributor

For a UUID string of length 36, the actual size reported for the PrimaryKey is about 7 times the expected size.

=== RUN   TestVarCharPrimaryKey/size
    primary_key_test.go:19: 
        	Error Trace:	/home/yangxuan/Github/milvus/internal/storage/primary_key_test.go:19
        	Error:      	Not equal: 
        	            	expected: int(44)
        	            	actual  : int64(296)
        	Test:       	TestVarCharPrimaryKey/size
        	Messages:   	uuid: f99f07ce-b546-4639-a24a-013929475a99
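
The numbers in the failing test are consistent with the size accounting charging 8 bytes per character of the varchar value instead of 1. This is only a back-of-the-envelope check based on the output above, not the actual implementation:

    uuid_len = 36                  # len("f99f07ce-b546-4639-a24a-013929475a99")
    expected = uuid_len + 8        # 44: one byte per character plus 8 bytes of overhead
    actual = 8 * uuid_len + 8      # 296: eight bytes per character plus 8 bytes of overhead
    print(expected, actual, actual / expected)  # 44 296 ~6.7, the ~7x expansion seen above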

XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue Nov 12, 2024
See also: milvus-io#37582

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue Nov 12, 2024
See also: milvus-io#37582

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue Nov 12, 2024
See also: milvus-io#37582
pr: milvus-io#37617

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
@yanliang567 yanliang567 modified the milestones: 2.4.15, 2.4.16 Nov 14, 2024
sre-ci-robot pushed a commit that referenced this issue Nov 14, 2024
See also: #37582

---------

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Nov 14, 2024
See also: #37582
pr: #37617

---------

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
@ThreadDao
Contributor Author

fixed
