[Bug]: Queries intermittently failing #36370

Open
nairan-deshaw opened this issue Sep 19, 2024 · 17 comments
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
@nairan-deshaw

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.7
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0
- OS(Ubuntu or CentOS): RHEL
- CPU/Memory: 512 / 1TB
- GPU: NA
- Others: NA

Current Behavior

When a query is performed, it occasionally fails transiently with the exception below on the client side:

[20240918 14:50:51:406 decorators.py:146 ERROR] RPC error: [search], <MilvusException: (code=65535, message=fail to search on QueryNode 5317: worker(5312) query failed: segment 451624937786038787 belongs to partition 451624937797599857, which is not in [451624937797649778 451624937788893281 451624937803354980 451624937789864283 451624937799482995 451624937798419393 451624937803341501 451624937804113533 451624937801063954 451624937804901395 451624937799447824 451624937801057015 451624937801061594 451624937801061586 451624937800808398 451624937789514905 451624937798683167 451624937802856486 452056571429041840 451624937800263288 451624937799747707 451624937804934821 451624937804606764 451624937789507282])>, <Time:{'RPC start': '2024-09-18 14:50:50.725078', 'RPC error': '2024-09-18 14:50:51.406550'}> [20240918 14:50:51:408 milvus_client.py:318 ERROR] Failed to search collection: <redacted>

This keeps happening intermittently, and we usually work around it by killing the affected query node. That resolves the error, but it continues to recur every now and then. I found a couple of related issues in #24048 and #27311.

Expected Behavior

The queries return correctly

Steps To Reproduce

Try running a query operation
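For illustration, the search call is roughly of the following shape (a sketch only, with placeholder names and vectors, not our exact production code; the pymilvus MilvusClient API as seen in the log above):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")

# only a subset of partitions is loaded; searches are scoped to those partitions
results = client.search(
    collection_name="my_collection",        # placeholder
    data=[[0.1] * 8],                       # placeholder query vector
    limit=10,
    partition_names=["part_a", "part_b"],   # placeholder partition names
)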

Milvus Log

To be attached soon

Anything else?

NA

@nairan-deshaw nairan-deshaw added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 19, 2024
@yanliang567
Contributor

@nairan-deshaw could you please try to release and reload the collection? Also, we'd like to have the complete Milvus logs for investigation; please refer to this doc to export the whole Milvus logs.
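With pymilvus that would look roughly like this (a sketch; the collection name is a placeholder):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")

# release the collection and load it again
client.release_collection(collection_name="my_collection")
client.load_collection(collection_name="my_collection")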

/assign @nairan-deshaw
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 20, 2024
@nairan-deshaw
Author

Hi @yanliang567

could you please try to release and reload the collection?

We'll try this if we encounter the issue again, but ideally we do not want to release the partitions when this error shows up, as it will lead to increased latency in the applications.

Attaching all the logs from the cluster during the 5 minute crash window

logs.tar.gz

@yanliang567
Contributor

@congqixia please help to take a look; I did not find any critical errors in the logs.
/assign @congqixia
/unassign

@congqixia
Contributor

OK, I shall inspect the log for detailed clues.
Just some quick questions for @nairan-deshaw
The quick assumption here is that the queried collection is loaded via LoadPartitions and only some partitions are loaded here. Am I right?
Also, could you please provide the meta backup taken with birdwatcher, so that it is easier to map the missing segment ID to the whole picture?

@nairan-deshaw
Author

Hi @congqixia

The quick assumption here is that the queried collection is loaded via LoadPartitions and only some partitions are loaded here. Am I right?

Yes that is correct.

Also, could you please provide the meta backup

Can you let me know which command you would like me to get the info from?

@nairan-deshaw
Author

@congqixia following up on the last update.

@nairan-deshaw
Author

@congqixia can you please help with the next steps here?

@yanliang567
Contributor

Hi @congqixia

The quick assumption here is that the queried collection is loaded via LoadPartitions and only some partitions are loaded here. Am I right?

Yes that is correct.

Also, could you please provide the meta backup

Can you let me know which command you would like me to get the info from?

Please refer to this doc: https://github.com/milvus-io/birdwatcher to back up the etcd meta with birdwatcher.

@nairan-deshaw
Author

I have collected the outputs of show segment and show segment-loaded-grpc. Attaching them in the tar file below. Please let me know if any other specific command is needed for investigation.
logs.tar.gz

@yanliang567
Contributor

@congqixia please help to take a look

@yanliang567 yanliang567 added this to the 2.4.14 milestone Oct 15, 2024
congqixia added a commit to congqixia/milvus that referenced this issue Oct 15, 2024
Related to milvus-io#36370

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@congqixia
Contributor

The problem can be reproduced with the following script:

// using the master client package
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"

	"github.com/milvus-io/milvus/client/v2"
	"github.com/milvus-io/milvus/client/v2/entity"
	"github.com/milvus-io/milvus/client/v2/index"
)

var (
	milvusAddr                     = `127.0.0.1:19530`
	collectionName                 = `multi_parts`
	idCol, randomCol, embeddingCol = "ID", "random", "embeddings"
	nEntities                      = 1000
	partNum                        = 8
	dim                            = 8
)

func main() {
	ctx := context.Background()

	log.Println("start connecting to Milvus")
	c, err := client.New(ctx, &client.ClientConfig{
		Address: milvusAddr,
	})
	if err != nil {
		log.Fatal("failed to connect to milvus, err: ", err.Error())
	}
	defer c.Close(ctx)

	log.Println("Test with collection: ", collectionName)

	if has, err := c.HasCollection(ctx, client.NewHasCollectionOption(collectionName)); err != nil {
		log.Fatal("failed to check collection exists or not", err.Error())
	} else if has {
		log.Println("exist hello_milvus, start to drop first")
		c.DropCollection(ctx, client.NewDropCollectionOption(collectionName))
	}
	schema := entity.NewSchema().WithName(collectionName).
		WithField(entity.NewField().WithName(idCol).WithDataType(entity.FieldTypeInt64).WithIsPrimaryKey(true)).
		WithField(entity.NewField().WithName(embeddingCol).WithDataType(entity.FieldTypeFloatVector).WithDim(int64(dim)))

	err = c.CreateCollection(ctx, client.NewCreateCollectionOption(collectionName, schema))
	if err != nil {
		log.Fatal("failed to create collection", err.Error())
	}

	_, err = c.CreateIndex(ctx, client.NewCreateIndexOption(collectionName, embeddingCol, index.NewFlatIndex(entity.L2)))
	if err != nil {
		log.Fatal("failed to create index: ", err.Error())
	}

	for i := 0; i < partNum; i++ {
		partitionName := fmt.Sprintf("part_%d", i)
		err := c.CreatePartition(ctx, client.NewCreatePartitionOption(collectionName, partitionName))
		if err != nil {
			log.Fatal("failed to create partition:", err.Error())
		}

		idData := make([]int64, 0, nEntities)
		vectorData := make([][]float32, 0, nEntities)
		// generate data
		for i := 0; i < nEntities; i++ {
			idData = append(idData, int64(i))
			vec := make([]float32, 0, 8)
			for j := 0; j < 8; j++ {
				vec = append(vec, rand.Float32())
			}

			vectorData = append(vectorData, vec)
		}
		_, err = c.Insert(ctx, client.NewColumnBasedInsertOption(collectionName).
			WithPartition(partitionName).
			WithInt64Column(idCol, idData).
			WithFloatVectorColumn(embeddingCol, dim, vectorData))
		if err != nil {
			log.Fatal("failed to insert data:", err.Error())
		}
	}
	task, err := c.LoadPartitions(ctx, client.NewLoadPartitionsOption(collectionName, []string{"_default"}))
	if err != nil {
		log.Println("partition load failed: ", err.Error())
	}
	task.Await(ctx)

	log.Println("partitions prepared")

	vec := make([]float32, 0, 8)
	for j := 0; j < 8; j++ {
		vec = append(vec, rand.Float32())
	}

	type Row struct {
		ID     int64     `milvus:"name:ID"`
		Vector []float32 `milvus:"name:embeddings"`
	}

	type Count struct {
		Count int64 `milvus:"name:count(*)"`
	}

	go func() {
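		// keep loading then releasing every partition in a loop, racing with the
		// search/query loop below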
		for {
			for i := 0; i < partNum; i++ {
				partition := fmt.Sprintf("part_%d", i)
				task, err := c.LoadPartitions(ctx, client.NewLoadPartitionsOption(collectionName, []string{partition}))
				if err != nil {
					log.Println("failed to load partition")
					continue
				}
				task.Await(ctx)
			}

			for i := 0; i < partNum; i++ {
				partition := fmt.Sprintf("part_%d", i)
				c.ReleasePartitions(ctx, client.NewReleasePartitionsOptions(collectionName, partition))
			}
		}
	}()

	for {
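		// search and query continuously; when the goroutine above releases a
		// partition mid-request, the reported failure can be observed here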
		rss, err := c.Search(ctx, client.NewSearchOption(collectionName, 10, []entity.Vector{entity.FloatVector(vec)}))
		if err != nil {
			log.Println("failed to search collection: ", err.Error())
			break
		}

		var rows []*Row
		for _, rs := range rss {
			rs.Unmarshal(&rows)
			for _, row := range rows {
				log.Print(row.ID, " ")
			}
			log.Printf("\n%d rows returned\n", len(rows))
		}

		ds, err := c.Query(ctx, client.NewQueryOption(collectionName).WithOutputFields([]string{"count(*)"}))
		if err != nil {
			log.Println("failed to search collection: ", err.Error())
			break
		}
		var crs []*Count
		ds.Unmarshal(&crs)
		for _, rs := range crs {
			log.Println("Count(*): ", rs.Count)
		}
	}
}

The loaded-partition meta is maintained and read multiple times during the whole search procedure, which causes a concurrency problem when a search/query runs while a partition load/release is in progress.

#36879 makes the partition check happen only in the delegator

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Oct 15, 2024
@xiaofan-luan
Collaborator

@congqixia
I don't quite understand why this is a problem. It's not designed for concurrent release and search, and the behaviour seems to be unpredictable.

@xiaofan-luan
Collaborator

@nairan-deshaw

To save memory, the recommended approach is to mmap the collection rather than manually loading/releasing partitions (a rough sketch follows at the end of this comment).

At present, this behavior doesn't appear to be a bug. However, it's worth noting that without a distributed locking mechanism in place, there's no guarantee against concurrent search operations on the same partition or collection while load or release operations are in progress.
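A rough sketch of enabling mmap on a collection (assuming the pymilvus ORM Collection.set_properties API and the mmap.enabled collection property in Milvus 2.4; please check the mmap documentation for your exact version):

from pymilvus import Collection, connections

connections.connect(uri="http://127.0.0.1:19530")

collection = Collection("my_collection")            # placeholder name
collection.release()                                # change properties while the collection is released
collection.set_properties({"mmap.enabled": True})   # assumed property key; see the mmap docs
collection.load()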

sre-ci-robot pushed a commit that referenced this issue Oct 15, 2024
Related to #36370

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@congqixia
Contributor

@xiaofan-luan concurrent read & release of a partition is the use case here.
At the least, the system should provide stable behavior for this scenario:

  1. block until the release is done
  2. or, search the partition as a whole, or exclude the partition being released when searching the whole collection

@xiaofan-luan
Collaborator

@xiaofan-luan concurrent read & release of a partition is the use case here. At the least, the system should provide stable behavior for this scenario:

  1. block until the release is done
  2. or, search the partition as a whole, or exclude the partition being released when searching the whole collection

It is weird, since you don't know the exact partitions you are searching.

@congqixia
Contributor

@xiaofan-luan concurrent read & release of a partition is the use case here. At the least, the system should provide stable behavior for this scenario:

  1. block until the release is done
  2. or, search the partition as a whole, or exclude the partition being released when searching the whole collection

It is weird, since you don't know the exact partitions you are searching.

It is. But it's even weirder that a continuous search may fail when ops release a partition. At the least, the system could provide read service consistent with the state either before or after the release.

The way Milvus is used here could be improper. I was wondering if @nairan-deshaw could elaborate a little more on the actual requirement here?

@nairan-deshaw
Author

@congqixia
We perform searches wherein multiple users load partitions, run searches, and then release the loaded partitions. The loading and releasing of partitions is coordinated through distributed locking on our end, to ensure that a partition is loaded only when it is not already loaded and released only when no users are actively searching it.
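Roughly, the coordination looks like the sketch below (illustrative only, not our actual code; a local lock and in-memory counters stand in for the distributed locking, and names are placeholders):

import threading
from collections import defaultdict

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")
lock = threading.Lock()        # stand-in for the distributed lock
readers = defaultdict(int)     # number of active searches per partition

def search_partition(collection: str, partition: str, vectors):
    # load the partition only if nobody else has it loaded yet
    with lock:
        if readers[partition] == 0:
            client.load_partitions(collection_name=collection,
                                   partition_names=[partition])
        readers[partition] += 1
    try:
        return client.search(collection_name=collection, data=vectors,
                             limit=10, partition_names=[partition])
    finally:
        # release the partition only when no active searches remain
        with lock:
            readers[partition] -= 1
            if readers[partition] == 0:
                client.release_partitions(collection_name=collection,
                                          partition_names=[partition])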

The bug looks related to a search happening on a partition that has just been released?

Let me know if you need more details around the usage from our end.

@yanliang567 yanliang567 modified the milestones: 2.4.14, 2.4.16 Nov 14, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.16, 2.4.17, 2.4.18 Nov 21, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19, 2.4.20 Dec 24, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.20, 2.4.21 Jan 6, 2025