Description
While implementing the PostgreSQL memory store for Python (#1354), I have been digging into the pgvector docs and referencing the C# implementation to maintain parity and cross-compatibility. This has led me to the following findings, which I believe are misconceptions that made their way into the C# implementation.
1. Indexing occurs before table contains data
Based on the documentation for pgvector, the index should not be created until there is some data in the table.
Create the index after the table has some data
With the current implementation, the index is created when the collection is created, and therefore before the table contains any data.
Since indexing takes some time (depending on the volume of data), it may not make sense for index creation to be part of the normal flow.
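One way to take index creation out of the normal flow is to expose it as a separate step that only runs once the table has rows. The following is a minimal sketch, not the C# or Python store's actual code: the function name `build_index_if_populated`, the `min_rows` threshold, and the `embedding` column name are assumptions for illustration, and `cursor` is assumed to be any DB-API 2.0 style cursor.

```python
from typing import Any


def build_index_if_populated(
    cursor: Any, table: str, lists: int, min_rows: int = 1
) -> bool:
    """Create an ivfflat index only when the table already holds data.

    `cursor` is assumed to be a DB-API 2.0 cursor (execute/fetchone).
    `table` must be a trusted identifier because it is interpolated
    directly into the SQL text. Returns True if CREATE INDEX was issued.
    """
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    (row_count,) = cursor.fetchone()
    if row_count < min_rows:
        # Nothing to index yet, so skip and let the caller retry later.
        return False
    cursor.execute(
        f"CREATE INDEX ON {table} USING ivfflat (embedding vector_l2_ops) "
        f"WITH (lists = {int(lists)})"
    )
    return True
```

A caller could invoke this after a bulk upsert, or from an explicit maintenance method, instead of during collection creation.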
2. Bad default for number of lists when indexing
Based on the documentation for pgvector, I believe that the default of 1000 lists is a poor choice and that any default should be a function of the number of rows rather than a fixed number.
Choose an appropriate number of lists - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows
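The starting point quoted above can be sketched as a small helper. The function name `recommended_lists` and the clamp to a minimum of one list are assumptions added for illustration; the thresholds come from the pgvector guidance, and the square root is rounded down to an integer.

```python
import math


def recommended_lists(row_count: int) -> int:
    """Suggest an ivfflat `lists` value from the table's row count.

    Follows the quoted guidance: rows / 1000 up to 1M rows,
    sqrt(rows) above that. Clamping to at least 1 is an assumption
    so that small tables still get a valid value.
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, math.isqrt(row_count))


print(recommended_lists(50_000))     # 50
print(recommended_lists(4_000_000))  # 2000
```

With this heuristic, the fixed default of 1000 lists only matches a table of about one million rows.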
3. Number of lists does not improve recall at cost of speed
Based on the documentation for pgvector, creating an index improves performance at the cost of recall. I believe the misunderstanding comes from the guidance on the number of probes used at query time, which is the setting that trades speed for recall.
When querying, specify an appropriate number of probes (higher is better for recall, lower is better for speed) - a good place to start is sqrt(lists)
To speed up queries with an index, increase the number of inverted lists (at the expense of recall).
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);
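Since recall is tuned at query time through probes rather than through the number of lists, the quoted starting point of sqrt(lists) can be sketched as follows. The helper names `recommended_probes` and `probes_statement` are hypothetical, and rounding to the nearest integer is an assumption; `ivfflat.probes` is the pgvector session setting that controls how many lists are searched.

```python
import math


def recommended_probes(lists: int) -> int:
    """Suggest a starting `ivfflat.probes` value: roughly sqrt(lists)."""
    if lists < 1:
        raise ValueError("lists must be at least 1")
    return max(1, round(math.sqrt(lists)))


def probes_statement(lists: int) -> str:
    """Build the SET statement to run on the session before querying."""
    return f"SET ivfflat.probes = {recommended_probes(lists)};"


print(probes_statement(100))  # SET ivfflat.probes = 10;
```

Raising the value above this starting point should improve recall at the cost of query speed, and lowering it does the opposite.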
Let me know if there are any questions. I would be happy to engage in a deeper discussion on these topics.