[Bug]: Queries intermittently failing #36370

Open
nairan-deshaw opened this issue Sep 19, 2024 · 17 comments
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
@nairan-deshaw

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.7
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0
- OS(Ubuntu or CentOS): RHEL
- CPU/Memory: 512 / 1TB
- GPU: NA
- Others: NA

Current Behavior

When a query is performed, it occasionally fails transiently with the exception below on the client side:

[20240918 14:50:51:406 decorators.py:146 ERROR] RPC error: [search], <MilvusException: (code=65535, message=fail to search on QueryNode 5317: worker(5312) query failed: segment 451624937786038787 belongs to partition 451624937797599857, which is not in [451624937797649778 451624937788893281 451624937803354980 451624937789864283 451624937799482995 451624937798419393 451624937803341501 451624937804113533 451624937801063954 451624937804901395 451624937799447824 451624937801057015 451624937801061594 451624937801061586 451624937800808398 451624937789514905 451624937798683167 451624937802856486 452056571429041840 451624937800263288 451624937799747707 451624937804934821 451624937804606764 451624937789507282])>, <Time:{'RPC start': '2024-09-18 14:50:50.725078', 'RPC error': '2024-09-18 14:50:51.406550'}> [20240918 14:50:51:408 milvus_client.py:318 ERROR] Failed to search collection: <redacted>

This keeps happening intermittently, and we usually work around it by killing the affected query node. That resolves the error, but it continues to recur every now and then. I found a couple of related issues in #24048 and #27311.

Expected Behavior

The queries return correctly

Steps To Reproduce

Try running a query operation
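For illustration, the search call is roughly of the following shape (a sketch only, with placeholder names and vectors, not our exact production code; the pymilvus MilvusClient API as seen in the log above):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")

# only a subset of partitions is loaded; searches are scoped to those partitions
results = client.search(
    collection_name="my_collection",        # placeholder
    data=[[0.1] * 8],                       # placeholder query vector
    limit=10,
    partition_names=["part_a", "part_b"],   # placeholder partition names
)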

Milvus Log

To be attached soon

Anything else?

NA

@nairan-deshaw nairan-deshaw added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 19, 2024
@yanliang567
Contributor

@nairan-deshaw could you please try to release and reload the collection? Also, we'd like to have the complete Milvus logs for investigation; please refer to this doc to export the whole Milvus logs.
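With pymilvus that would look roughly like this (a sketch; the collection name is a placeholder):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")

# release the collection and load it again
client.release_collection(collection_name="my_collection")
client.load_collection(collection_name="my_collection")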

/assign @nairan-deshaw
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 20, 2024
@nairan-deshaw
Author

Hi @yanliang567

could you please try to release and reload the collection?

We'll try this if we encounter the issue again, but ideally we do not want to release the partitions when this error shows up, as it will lead to increased latency in the applications.

Attaching all the logs from the cluster during the 5 minute crash window

logs.tar.gz

@yanliang567
Contributor

@congqixia please help to take a look; I did not find any critical errors in the logs.
/assign @congqixia
/unassign

@congqixia
Contributor

OK, I shall inspect the log for detailed clues.
Just some quick questions for @nairan-deshaw
The quick assumption here is that the queried collection is loaded via LoadPartitions and only some partitions are loaded here. Am I right?
Also, could you please provide the meta backup taken with birdwatcher, so that it is easier to map the missing segment ID to the whole picture?

@nairan-deshaw
Author

Hi @congqixia

The quick assumption here is that the queried collection is loaded via LoadPartitions and only some partitions are loaded here. Am I right?

Yes that is correct.

Also, could you please provide the meta backup

Can you let me know which command you would like me to get the info from?

@nairan-deshaw
Author

@congqixia following up on the last update.

@nairan-deshaw
Author

@congqixia can you please help with the next steps here?

@yanliang567
Contributor

Hi @congqixia

The quick assumption here is that the queried collection is loaded via LoadPartitions and only some partitions are loaded here. Am I right?

Yes that is correct.

Also, could you please provide the meta backup

Can you let me know which command you would like me to get the info from?

Please refer to this doc: https://github.com/milvus-io/birdwatcher to back up the etcd meta with birdwatcher.

@nairan-deshaw
Author

I have collected the outputs of show segment and show segment-loaded-grpc. Attaching them in the tar file below. Please let me know if any other specific command is needed for investigation.
logs.tar.gz

@yanliang567
Contributor

@congqixia please help to take a look

@yanliang567 yanliang567 added this to the 2.4.14 milestone Oct 15, 2024
congqixia added a commit to congqixia/milvus that referenced this issue Oct 15, 2024
Related to milvus-io#36370

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@congqixia
Contributor

The problem can be reproduced with the following script:

// using the master client package
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"

	"github.com/milvus-io/milvus/client/v2"
	"github.com/milvus-io/milvus/client/v2/entity"
	"github.com/milvus-io/milvus/client/v2/index"
)

var (
	milvusAddr                     = `127.0.0.1:19530`
	collectionName                 = `multi_parts`
	idCol, randomCol, embeddingCol = "ID", "random", "embeddings"
	nEntities                      = 1000
	partNum                        = 8
	dim                            = 8
)

func main() {
	ctx := context.Background()

	log.Println("start connecting to Milvus")
	c, err := client.New(ctx, &client.ClientConfig{
		Address: milvusAddr,
	})
	if err != nil {
		log.Fatal("failed to connect to milvus, err: ", err.Error())
	}
	defer c.Close(ctx)

	log.Println("Test with collection: ", collectionName)

	if has, err := c.HasCollection(ctx, client.NewHasCollectionOption(collectionName)); err != nil {
		log.Fatal("failed to check collection exists or not", err.Error())
	} else if has {
		log.Println("exist hello_milvus, start to drop first")
		c.DropCollection(ctx, client.NewDropCollectionOption(collectionName))
	}
	schema := entity.NewSchema().WithName(collectionName).
		WithField(entity.NewField().WithName(idCol).WithDataType(entity.FieldTypeInt64).WithIsPrimaryKey(true)).
		WithField(entity.NewField().WithName(embeddingCol).WithDataType(entity.FieldTypeFloatVector).WithDim(int64(dim)))

	err = c.CreateCollection(ctx, client.NewCreateCollectionOption(collectionName, schema))
	if err != nil {
		log.Fatal("failed to create collection", err.Error())
	}

	_, err = c.CreateIndex(ctx, client.NewCreateIndexOption(collectionName, embeddingCol, index.NewFlatIndex(entity.L2)))
	if err != nil {
		log.Fatal("failed to create index: ", err.Error())
	}

	for i := 0; i < partNum; i++ {
		partitionName := fmt.Sprintf("part_%d", i)
		err := c.CreatePartition(ctx, client.NewCreatePartitionOption(collectionName, partitionName))
		if err != nil {
			log.Fatal("failed to create partition:", err.Error())
		}

		idData := make([]int64, 0, nEntities)
		vectorData := make([][]float32, 0, nEntities)
		// generate data
		for i := 0; i < nEntities; i++ {
			idData = append(idData, int64(i))
			vec := make([]float32, 0, 8)
			for j := 0; j < 8; j++ {
				vec = append(vec, rand.Float32())
			}

			vectorData = append(vectorData, vec)
		}
		_, err = c.Insert(ctx, client.NewColumnBasedInsertOption(collectionName).
			WithPartition(partitionName).
			WithInt64Column(idCol, idData).
			WithFloatVectorColumn(embeddingCol, dim, vectorData))
		if err != nil {
			log.Fatal("failed to insert data:", err.Error())
		}
	}
	task, err := c.LoadPartitions(ctx, client.NewLoadPartitionsOption(collectionName, []string{"_default"}))
	if err != nil {
		log.Println("partition load failed: ", err.Error())
	}
	task.Await(ctx)

	log.Println("partitions prepared")

	vec := make([]float32, 0, 8)
	for j := 0; j < 8; j++ {
		vec = append(vec, rand.Float32())
	}

	type Row struct {
		ID     int64     `milvus:"name:ID"`
		Vector []float32 `milvus:"name:embeddings"`
	}

	type Count struct {
		Count int64 `milvus:"name:count(*)"`
	}

	go func() {
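		// keep loading then releasing every partition in a loop, racing with the
		// search/query loop below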
		for {
			for i := 0; i < partNum; i++ {
				partition := fmt.Sprintf("part_%d", i)
				task, err := c.LoadPartitions(ctx, client.NewLoadPartitionsOption(collectionName, []string{partition}))
				if err != nil {
					log.Println("failed to load partition")
					continue
				}
				task.Await(ctx)
			}

			for i := 0; i < partNum; i++ {
				partition := fmt.Sprintf("part_%d", i)
				c.ReleasePartitions(ctx, client.NewReleasePartitionsOptions(collectionName, partition))
			}
		}
	}()

	for {
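		// search and query continuously; when the goroutine above releases a
		// partition mid-request, the reported failure can be observed here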
		rss, err := c.Search(ctx, client.NewSearchOption(collectionName, 10, []entity.Vector{entity.FloatVector(vec)}))
		if err != nil {
			log.Println("failed to search collection: ", err.Error())
			break
		}

		var rows []*Row
		for _, rs := range rss {
			rs.Unmarshal(&rows)
			for _, row := range rows {
				log.Print(row.ID, " ")
			}
			log.Printf("\n%d rows returned\n", len(rows))
		}

		ds, err := c.Query(ctx, client.NewQueryOption(collectionName).WithOutputFields([]string{"count(*)"}))
		if err != nil {
			log.Println("failed to search collection: ", err.Error())
			break
		}
		var crs []*Count
		ds.Unmarshal(&crs)
		for _, rs := range crs {
			log.Println("Count(*): ", rs.Count)
		}
	}
}

The loaded-partition meta is maintained and read multiple times during the whole search procedure, which causes a concurrency problem when a search/query runs while a partition load/release is in progress.

#36879 makes the partition check happen only in the delegator

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Oct 15, 2024
@xiaofan-luan
Collaborator

@congqixia
I don't quite understand why this is a problem. It's not designed for concurrent release and search, and the behaviour seems to be unpredictable.

@xiaofan-luan
Collaborator

@nairan-deshaw

To save memory, the recommended approach is to mmap the collection rather than manually loading/releasing partitions (a rough sketch follows at the end of this comment).

At present, this behavior doesn't appear to be a bug. However, it's worth noting that without a distributed locking mechanism in place, there's no guarantee against concurrent search operations on the same partition or collection while load or release operations are in progress.
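A rough sketch of enabling mmap on a collection (assuming the pymilvus ORM Collection.set_properties API and the mmap.enabled collection property in Milvus 2.4; please check the mmap documentation for your exact version):

from pymilvus import Collection, connections

connections.connect(uri="http://127.0.0.1:19530")

collection = Collection("my_collection")            # placeholder name
collection.release()                                # change properties while the collection is released
collection.set_properties({"mmap.enabled": True})   # assumed property key; see the mmap docs
collection.load()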

sre-ci-robot pushed a commit that referenced this issue Oct 15, 2024
Related to #36370

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@congqixia
Contributor

@xiaofan-luan concurrent read & release of a partition is the use case here.
At the least, the system should provide stable behavior for this scenario:

  1. block until the release is done
  2. or, search the partition as a whole, or exclude the partition being released when searching the whole collection

@xiaofan-luan
Collaborator

@xiaofan-luan concurrent read & release of a partition is the use case here. At the least, the system should provide stable behavior for this scenario:

  1. block until the release is done
  2. or, search the partition as a whole, or exclude the partition being released when searching the whole collection

It is weird, since you don't know the exact partitions you are searching.

@congqixia
Contributor

@xiaofan-luan concurrent read & release of a partition is the use case here. At the least, the system should provide stable behavior for this scenario:

  1. block until the release is done
  2. or, search the partition as a whole, or exclude the partition being released when searching the whole collection

It is weird, since you don't know the exact partitions you are searching.

It is. But it's even weirder that a continuous search may fail when ops release a partition. At the least, the system could provide read service consistent with the state either before or after the release.

The way Milvus is used here could be improper. I was wondering if @nairan-deshaw could elaborate a little more on the actual requirement here?

@nairan-deshaw
Author

@congqixia
We perform searches wherein multiple users load partitions, run searches, and then release the loaded partitions. The loading and releasing of partitions is coordinated through distributed locking on our end, to ensure that a partition is loaded only when it is not already loaded and released only when no users are actively searching it.
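Roughly, the coordination looks like the sketch below (illustrative only, not our actual code; a local lock and in-memory counters stand in for the distributed locking, and names are placeholders):

import threading
from collections import defaultdict

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530")
lock = threading.Lock()        # stand-in for the distributed lock
readers = defaultdict(int)     # number of active searches per partition

def search_partition(collection: str, partition: str, vectors):
    # load the partition only if nobody else has it loaded yet
    with lock:
        if readers[partition] == 0:
            client.load_partitions(collection_name=collection,
                                   partition_names=[partition])
        readers[partition] += 1
    try:
        return client.search(collection_name=collection, data=vectors,
                             limit=10, partition_names=[partition])
    finally:
        # release the partition only when no active searches remain
        with lock:
            readers[partition] -= 1
            if readers[partition] == 0:
                client.release_partitions(collection_name=collection,
                                          partition_names=[partition])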

The bug looks related to a search happening on a partition that has just been released?

Let me know if you need more details around the usage from our end.

@yanliang567 yanliang567 modified the milestones: 2.4.14, 2.4.16 Nov 14, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.16, 2.4.17, 2.4.18 Nov 21, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19, 2.4.20 Dec 24, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.20, 2.4.21 Jan 6, 2025