[Feature]: Distinct field values #25343

Open
1 task done
filip-halt opened this issue Jul 5, 2023 · 22 comments
Labels
kind/feature Issues related to feature request from users

Comments

@filip-halt
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

Find all the unique field values in a collection without having to iterate through all data.

Describe the solution you'd like.

Something equivalent to the SQL query: SELECT DISTINCT field_name FROM mytable
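
For comparison, without a server-side DISTINCT the closest equivalent is a client-side scan, which is exactly the full iteration this request hopes to avoid. A minimal sketch, assuming rows arrive as batches of dicts (for example from paginated queries); `distinct_values` and the sample data are hypothetical names for illustration, not part of Milvus:

```python
from typing import Any, Hashable, Iterable


def distinct_values(batches: Iterable[list[dict[str, Any]]], field: str) -> set[Hashable]:
    """Collect the unique values of `field` across batches of row dicts."""
    seen: set[Hashable] = set()
    for batch in batches:
        for row in batch:
            if field in row:  # skip rows that do not carry the field
                seen.add(row[field])
    return seen


# Hypothetical sample standing in for paginated query results.
sample_batches = [
    [{"id": 1, "field_name": "a"}, {"id": 2, "field_name": "b"}],
    [{"id": 3, "field_name": "a"}],
]
print(sorted(distinct_values(sample_batches, "field_name")))  # ['a', 'b']
```

The cost grows with the size of the collection, since every row has to be transferred to the client, which is why a native operation would help.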

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

@filip-halt filip-halt added the kind/feature Issues related to feature request from users label Jul 5, 2023
@xiaofan-luan
Collaborator

Is there a specific use case for the distinct clause?

@xiaofan-luan
Collaborator

Is it needed in search, or only in query?

@faileon

faileon commented Sep 25, 2023

Are there any plans for this? My use case would be to present the user with a list of available values that can be used for filtering in future queries. Without this, I have to manage a list of distinct values myself elsewhere.

@xiaofan-luan
Collaborator

Can you describe your data model and the specific use case so I can give more advice?

@faileon

faileon commented Sep 26, 2023

Let me try. I let my users store arbitrary documents in Milvus, with a separate collection for each tenant. Users define which fields of the original documents should be used to make embeddings and which should be kept as metadata for filtering. Let's say one of my users has a collection of "articles" and defined "category" as a metadata field that can hold any string value ("sport", "news", ...). I would like to get the distinct values of that "category" field. Is that possible within Milvus?

@cardoso-neto

I also couldn't find how to do this.

@xiaofan-luan
Collaborator

I think the group-by feature is what you are looking for.
You can group by a field name and get the top-k most relevant groups rather than individual entities.
Is that what you are looking for, @cardoso-neto?
This feature will be released in 2.4.
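
To make the difference from DISTINCT concrete, here is a rough pure-Python sketch of the semantics described above: keep the best hit per group value, then return the top-k groups. It illustrates the idea only and is not Milvus's implementation; the `(distance, group_value)` hit format and the assumption that a smaller distance means a closer match are hypothetical.

```python
def top_k_groups(hits, k):
    """Return the k group values whose best hit is closest to the query."""
    best = {}
    for distance, group in hits:
        # Remember only the closest hit seen so far for each group.
        if group not in best or distance < best[group]:
            best[group] = distance
    ranked = sorted(best.items(), key=lambda item: item[1])
    return ranked[:k]


# Hypothetical search hits: (distance to query vector, value of the group-by field).
hits = [(0.10, "sport"), (0.15, "sport"), (0.30, "news"), (0.55, "tech")]
print(top_k_groups(hits, 2))  # [('sport', 0.1), ('news', 0.3)]
```

Note that "tech" never shows up for k=2: a grouped search returns the groups nearest to the query vector, not an exhaustive list of values in the field.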

@cardoso-neto

This would work indeed. Looking forward.

@cardoso-neto

My use case is reading all unique values of a Milvus collection column, specifically the column I use as the partition key. Since Milvus "maps" that to a standardized name (_default_i), I couldn't use Collection().partitions for that.

@xiaofan-luan
Collaborator

> _default_i

So you're saying you want to know how many partition keys there are in total?

@xiaofan-luan
Collaborator

That is, counting the distinct partition keys.

@xiaofan-luan
Collaborator

/assign @jaime0815
Sounds like something we need to work on.

@lehotskysamuel

I have a similar use case: I take a book, split it into chunks, and store the book title in a scalar column for each chunk. I then process n books. When doing the vector search, I want to filter by one book (or several).

With this functionality I could:

  1. Query Milvus to get all distinct values from the column (all book titles) --> THIS IS WHAT THIS TICKET IS ABOUT
  2. Display the list in the user interface and let the user pick the books to search across
  3. Do the vector search, filtered by the selected books
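
Step 3 in the list above can be sketched as building an `in` filter expression from the titles the user picked. This is a minimal sketch: `build_in_filter` and the `book_title` field name are hypothetical, and it assumes JSON-style double-quoted string literals are acceptable to the filter parser.

```python
import json


def build_in_filter(field: str, values: list[str]) -> str:
    """Build a filter such as: book_title in ["Dune", "Emma"]."""
    if not values:
        raise ValueError("select at least one value")
    # json.dumps adds the double quotes and escapes embedded quotes/backslashes.
    quoted = ", ".join(json.dumps(value, ensure_ascii=False) for value in values)
    return f"{field} in [{quoted}]"


print(build_in_filter("book_title", ["Dune", "Emma"]))
# book_title in ["Dune", "Emma"]
```

The resulting string would be passed as the filter of the vector search. Step 1 is the part that still has no native support.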

@Izukimat

I second @lehotskysamuel.
In a RAG application, all of steps 1-3 are essential. I wonder if we could achieve this without setting up another database.

@LeoHemamou

Does anyone know whether this has been solved?

@ckrapu-nv

This feature makes it much easier to support an incremental update.

@xiaofan-luan
Collaborator

Can you explain a little bit about your use case?
We do have something called grouping search, where you can group by one field and get the top-K groups rather than the top-K entities.

See https://milvus.io/docs/single-vector-search.md#Grouping-search

@faileon

faileon commented Sep 29, 2024

Has there been any progress on this?

My use case is still the same, for example:
Let's say I have a column "colors" which is an array of varchar - how can I retrieve all distinct colors from the collection?

@xiaofan-luan
Collaborator

> Has there been any progress on this?
>
> My use case is still the same, for example: Let's say I have a column "colors" which is an array of varchar - how can I retrieve all distinct colors from the collection?

So do you want a top-k search across different colors, or simply to count the different colors in this collection?
Can you give an example?

@faileon

faileon commented Sep 30, 2024

> > Has there been any progress on this?
> > My use case is still the same, for example: Let's say I have a column "colors" which is an array of varchar - how can I retrieve all distinct colors from the collection?
>
> So do you want a top-k search across different colors, or simply to count the different colors in this collection? Can you give an example?

I simply need to know all the colors that exist in the collection, so I can display appropriate filters on the frontend for the users. Right now I have to store this information in a different database.

EDIT: To explain the colors example further: I do not know what colors there are in the collection, and neither does the user. I need to pass this information to the UI so users can choose to search only for documents that are "red" or "green".
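
For an array field like "colors", the client-side workaround has to flatten each row's array before deduplicating. A minimal sketch, assuming the rows have already been fetched as dicts; `array_field_facets` and the sample rows are hypothetical. It also counts how many rows carry each value, which is handy for showing filter options in a UI.

```python
from collections import Counter
from typing import Any, Iterable


def array_field_facets(rows: Iterable[dict[str, Any]], field: str) -> Counter:
    """Count, per distinct array element, how many rows contain it."""
    facets: Counter = Counter()
    for row in rows:
        # set() so a value repeated inside one row is counted once for that row.
        facets.update(set(row.get(field) or []))
    return facets


# Hypothetical rows standing in for query results.
sample_rows = [
    {"id": 1, "colors": ["red", "green"]},
    {"id": 2, "colors": ["red", "red"]},
    {"id": 3, "colors": []},
]
facets = array_field_facets(sample_rows, "colors")
print(sorted(facets.items()))  # [('green', 1), ('red', 2)]
```

The keys of the returned counter are the distinct colors. This still requires reading every row, so it is a stopgap rather than a substitute for the requested feature.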

@moonSandra

Hi.
Is there any progress on this?
My use case is unique IDs: I don't want duplicate data. It is easier to generate a unique ID based on what I have, so I can be sure that if a duplicate occurs, the document is either ignored or the same row is simply updated (as in traditional DBMSs).
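
One way to read "generate a unique ID based on what I have" is a deterministic ID derived from the document's content, so the same document always maps to the same primary key. A minimal sketch using only the standard library; `content_id` is a hypothetical helper, and the choice of SHA-256 is an assumption, not something this thread prescribes.

```python
import hashlib


def content_id(*parts: str) -> str:
    """Derive a stable hex ID from the given text parts."""
    digest = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") and ("a", "bc") from colliding.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


# Same input gives the same ID, so re-inserting a document targets the same row.
print(content_id("My Book", "chunk text") == content_id("My Book", "chunk text"))  # True
```

Used as a string primary key together with an upsert-style write, a repeated document would overwrite its earlier row instead of creating a duplicate. Whether that fits depends on the client version and schema in use.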

@faileon

faileon commented Jan 8, 2025

> Hi. Is there any progress on this? My use case is unique IDs: I don't want duplicate data. It is easier to generate a unique ID based on what I have, so I can be sure that if a duplicate occurs, the document is either ignored or the same row is simply updated (as in traditional DBMSs).

I think it's on the roadmap for the middle of 2025:

Aggregations
Scalar field aggregations, e.g. min, max, count, distinct.


10 participants