[Bug]: Queries intermittently failing #36370
Comments
@yanliang567 could you please try to release and reload the collection? Also, we'd like to have the complete Milvus logs for investigation; please refer to this doc to export the whole Milvus logs. /assign @nairan-deshaw
Hi @yanliang567
We'll try this if we encounter the issue again, but ideally we do not want to release the partitions when this error shows up, as it will lead to increased latency in the applications. Attaching all the logs from the cluster during the 5-minute crash window.
@congqixia please help to take a look; I did not find any critical errors in the logs.
OK, I shall inspect the log for detailed clues.
Hi @congqixia
Yes, that is correct.
Can you let me know which commands you would like to get info from?
@congqixia following up on the last update.
@congqixia can you please help with the next steps here.
Please refer to this doc: https://github.com/milvus-io/birdwatcher to back up etcd with birdwatcher.
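For reference, a rough sketch of the kind of birdwatcher session this refers to; the install command, flags, and the etcd endpoint/root path below are assumptions and placeholders, so check the birdwatcher README for the exact usage in your version:

```shell
# install birdwatcher (verify the module path against the repo README)
go install github.com/milvus-io/birdwatcher@latest

# start the interactive birdwatcher shell
birdwatcher

# at the birdwatcher prompt:
# connect to the etcd instance backing the Milvus cluster
# (endpoint and rootPath are placeholders -- use your deployment's values)
connect --etcd 127.0.0.1:2379 --rootPath by-dev

# dump the Milvus metadata stored in etcd into a local backup file,
# which can then be attached to this issue
backup
```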
I have collected the outputs of the suggested commands.
@congqixia please help to take a look.
Related to milvus-io#36370 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
The problem can be reproduced with the following script (using the master client package):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"

	"github.com/milvus-io/milvus/client/v2"
	"github.com/milvus-io/milvus/client/v2/entity"
	"github.com/milvus-io/milvus/client/v2/index"
)

var (
	milvusAddr                     = `127.0.0.1:19530`
	collectionName                 = `multi_parts`
	idCol, randomCol, embeddingCol = "ID", "random", "embeddings"
	nEntities                      = 1000
	partNum                        = 8
	dim                            = 8
)

func main() {
	ctx := context.Background()

	log.Println("start connecting to Milvus")
	c, err := client.New(ctx, &client.ClientConfig{
		Address: milvusAddr,
	})
	if err != nil {
		log.Fatal("failed to connect to milvus, err: ", err.Error())
	}
	defer c.Close(ctx)

	log.Println("Test with collection: ", collectionName)
	if has, err := c.HasCollection(ctx, client.NewHasCollectionOption(collectionName)); err != nil {
		log.Fatal("failed to check collection exists or not", err.Error())
	} else if has {
		log.Println("exist hello_milvus, start to drop first")
		c.DropCollection(ctx, client.NewDropCollectionOption(collectionName))
	}

	schema := entity.NewSchema().WithName(collectionName).
		WithField(entity.NewField().WithName(idCol).WithDataType(entity.FieldTypeInt64).WithIsPrimaryKey(true)).
		WithField(entity.NewField().WithName(embeddingCol).WithDataType(entity.FieldTypeFloatVector).WithDim(int64(dim)))
	err = c.CreateCollection(ctx, client.NewCreateCollectionOption(collectionName, schema))
	if err != nil {
		log.Fatal("failed to create collection", err.Error())
	}

	_, err = c.CreateIndex(ctx, client.NewCreateIndexOption(collectionName, embeddingCol, index.NewFlatIndex(entity.L2)))
	if err != nil {
		log.Fatal("failed to create index: ", err.Error())
	}

	for i := 0; i < partNum; i++ {
		partitionName := fmt.Sprintf("part_%d", i)
		err := c.CreatePartition(ctx, client.NewCreatePartitionOption(collectionName, partitionName))
		if err != nil {
			log.Fatal("failed to create partition:", err.Error())
		}

		idData := make([]int64, 0, nEntities)
		vectorData := make([][]float32, 0, nEntities)
		// generate data
		for i := 0; i < nEntities; i++ {
			idData = append(idData, int64(i))
			vec := make([]float32, 0, 8)
			for j := 0; j < 8; j++ {
				vec = append(vec, rand.Float32())
			}
			vectorData = append(vectorData, vec)
		}
		_, err = c.Insert(ctx, client.NewColumnBasedInsertOption(collectionName).
			WithPartition(partitionName).
			WithInt64Column(idCol, idData).
			WithFloatVectorColumn(embeddingCol, dim, vectorData))
		if err != nil {
			log.Fatal("failed to insert data:", err.Error())
		}
	}

	task, err := c.LoadPartitions(ctx, client.NewLoadPartitionsOption(collectionName, []string{"_default"}))
	if err != nil {
		log.Println("partition load failed: ", err.Error())
	}
	task.Await(ctx)

	log.Println("partitions prepared")

	vec := make([]float32, 0, 8)
	for j := 0; j < 8; j++ {
		vec = append(vec, rand.Float32())
	}

	type Row struct {
		ID     int64     `milvus:"name:ID"`
		Vector []float32 `milvus:"name:embeddings"`
	}

	type Count struct {
		Count int64 `milvus:"name:count(*)"`
	}

	go func() {
		for {
			for i := 0; i < partNum; i++ {
				partition := fmt.Sprintf("part_%d", i)
				task, err := c.LoadPartitions(ctx, client.NewLoadPartitionsOption(collectionName, []string{partition}))
				if err != nil {
					log.Println("failed to load partition")
					continue
				}
				task.Await(ctx)
			}
			for i := 0; i < partNum; i++ {
				partition := fmt.Sprintf("part_%d", i)
				c.ReleasePartitions(ctx, client.NewReleasePartitionsOptions(collectionName, partition))
			}
		}
	}()

	for {
		rss, err := c.Search(ctx, client.NewSearchOption(collectionName, 10, []entity.Vector{entity.FloatVector(vec)}))
		if err != nil {
			log.Println("failed to search collection: ", err.Error())
			break
		}
		var rows []*Row
		for _, rs := range rss {
			rs.Unmarshal(&rows)
			for _, row := range rows {
				log.Print(row.ID, " ")
			}
			log.Printf("\n%d rows returned\n", len(rows))
		}

		ds, err := c.Query(ctx, client.NewQueryOption(collectionName).WithOutputFields([]string{"count(*)"}))
		if err != nil {
			log.Println("failed to search collection: ", err.Error())
			break
		}
		var crs []*Count
		ds.Unmarshal(&crs)
		for _, rs := range crs {
			log.Println("Count(*): ", rs.Count)
		}
	}
}
```

The loaded partition meta is maintained and read multiple times during the whole search procedure, which causes a concurrency problem when a search/query runs while a partition load/release is in progress. #36879 makes the partition check happen only in the delegator.
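To illustrate the class of race being described, here is a minimal, self-contained Go sketch (not Milvus internals; all names are illustrative) of a check-then-use pattern on shared partition metadata, where a concurrent release between the two reads makes an in-flight search fail:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// loadedPartitions stands in for the partition meta that the search path
// consults more than once; names are illustrative, not Milvus internals.
var (
	mu               sync.RWMutex
	loadedPartitions = map[string]bool{"part_0": true}
)

// search reads the meta twice: once to validate the request and once while
// "serving" it. If a release lands between the two reads, the second read
// observes a different state and the in-flight request fails.
func search(partition string) error {
	mu.RLock()
	ok := loadedPartitions[partition]
	mu.RUnlock()
	if !ok {
		return fmt.Errorf("partition %s not loaded", partition)
	}

	time.Sleep(time.Millisecond) // search work happens here; a release can interleave

	mu.RLock()
	defer mu.RUnlock()
	if !loadedPartitions[partition] {
		return fmt.Errorf("partition %s released during search", partition)
	}
	return nil
}

func main() {
	// concurrent load/release loop, analogous to the goroutine in the repro script
	go func() {
		for {
			mu.Lock()
			loadedPartitions["part_0"] = !loadedPartitions["part_0"]
			mu.Unlock()
			time.Sleep(time.Millisecond)
		}
	}()

	for i := 0; i < 1000; i++ {
		if err := search("part_0"); err != nil {
			fmt.Println(err)
		}
	}
}
```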
@congqixia
To save memory, the recommended approach is to mmap the collection rather than manually load/release partitions. At present, this behavior doesn't appear to be a bug. However, it's worth noting that without a distributed locking mechanism in place, there's no guarantee against concurrent search operations on the same partition or collection while load or release operations are in progress.
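For context, a minimal sketch of what enabling mmap could look like in the server configuration; the key names below are assumptions and may differ between Milvus versions, so verify them against the configuration reference for the release in use:

```yaml
# milvus.yaml (excerpt) -- key names assumed, verify for your version
queryNode:
  mmap:
    # serve sealed segment data through memory-mapped files instead of
    # loading it fully into RAM, trading some latency for lower memory use
    mmapEnabled: true
```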
Related to #36370 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@xiaofan-luan concurrent read & release of a partition is the use case here
It is weird, since you don't know the exact partitions you are searching.
It is. But it's even weirder that a continuous search may fail when ops release a partition; at the least, the system could provide the read service either before or after the release. The way Milvus is used here could be improper. I was wondering if @nairan-deshaw could elaborate a little bit more on the actual requirement here?
@congqixia The bug looks related to the fact that a search is happening on a released partition, right? Let me know if you need more details about the usage on our end.
Is there an existing issue for this?
Environment
Current Behavior
When a query is performed, there are times when it fails transiently with the below exception on the client side:
This keeps happening from time to time, and we usually work around it by killing the affected query node. That resolves the error, but it recurs every now and then. I found a couple of related issues: #24048 and #27311.
Expected Behavior
The queries return correctly
Steps To Reproduce
Milvus Log
To be attached soon
Anything else?
NA