Understanding of pgvector indexing seems incorrect

Since I have been implementing the PostgreSQL memory store for Python (#1354), I have been digging into the pgvector docs and referencing the C# implementation to maintain parity and cross-compatibility. This has led me to the following findings, which I believe are misconceptions that made their way into the C# implementation.


### 1. Indexing occurs before table contains data

https://github.com/microsoft/semantic-kernel/blob/3105aa13d9e01056010ebdb07052aa3f51b920dd/dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresDbClient.cs#L76

Based on the [documentation](https://github.com/pgvector/pgvector#indexing) for pgvector, the index should not be created until there is some data in the table. 

> Create the index after the table has some data

With the current implementation, the index is created at the same time as the collection, so the table contains no data at that point.

Since indexing will take some time (depending on the volume of data), it may not make sense for index creation to be part of the normal collection-creation flow.

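As a rough illustration of deferring index creation, the sketch below only emits the `CREATE INDEX` statement once a table has some rows. The helper names, the `embedding` column default, and the `min_rows` threshold of 1000 are all hypothetical choices made for this example; none of them come from pgvector or from either connector. The SQL follows the form shown in the pgvector README.

```python
from typing import Optional


def ivfflat_index_sql(table: str, lists: int, column: str = "embedding") -> str:
    """Build an ivfflat CREATE INDEX statement (form taken from the pgvector README).

    `table` and `column` are interpolated directly, so they must be trusted
    identifiers, never user input.
    """
    return (
        f"CREATE INDEX ON {table} "
        f"USING ivfflat ({column} vector_l2_ops) WITH (lists = {lists});"
    )


def maybe_index_sql(
    table: str, row_count: int, lists: int, min_rows: int = 1000
) -> Optional[str]:
    """Return the index statement only when the table already holds data.

    `min_rows` is an assumed threshold for illustration; pgvector only says
    to create the index "after the table has some data".
    """
    if row_count < min_rows:
        return None  # too little data: skip indexing for now
    return ivfflat_index_sql(table, lists)
```

A caller could run this as a separate maintenance step (for example after a bulk upsert) instead of inside collection creation.
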
### 2. Bad default for number of lists when indexing

https://github.com/microsoft/semantic-kernel/blob/3105aa13d9e01056010ebdb07052aa3f51b920dd/dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresMemoryStore.cs#L26

Based on the [documentation](https://github.com/pgvector/pgvector#indexing) for pgvector, I believe that the default of 1000 lists is unfavorable, and that any default should be a function of the number of rows rather than a fixed number.

> Choose an appropriate number of lists - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows

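The rule of thumb quoted above could be turned into a row-count-based default along these lines. The function name `suggested_lists` and the floor of 1 for very small tables are assumptions made for this sketch; the two formulas are the ones from the pgvector README.

```python
import math


def suggested_lists(rows: int) -> int:
    """Starting point for the ivfflat `lists` parameter, per the pgvector README.

    rows / 1000 for up to 1M rows, sqrt(rows) above that. The floor of 1 is an
    assumption added here so tiny tables still get a valid value.
    """
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return max(1, round(math.sqrt(rows)))
```

For example, a table with 50,000 rows would start at 50 lists, while the fixed default of 1000 only matches this guidance at about one million rows.
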
### 3. Number of lists does not improve recall at cost of speed

https://github.com/microsoft/semantic-kernel/blob/3105aa13d9e01056010ebdb07052aa3f51b920dd/dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresMemoryStore.cs#L34

Based on the [documentation](https://github.com/pgvector/pgvector#indexing) for pgvector, creating an index improves query speed at the cost of recall, and increasing the number of lists pushes further in that direction: faster queries, lower recall. I believe the misunderstanding comes from the guidance on the number of probes used at query time, where a higher value does trade speed for recall.

> When querying, specify an appropriate number of [probes](https://github.com/pgvector/pgvector#query-options) (higher is better for recall, lower is better for speed) - a good place to start is sqrt(lists)

> To speed up queries with an index, increase the number of inverted lists (at the expense of recall).
> `CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);`

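To make the lists/probes distinction concrete, here is a small sketch of the query-time side. The helper names are hypothetical; `SET ivfflat.probes` is the query option described in the pgvector README, and `sqrt(lists)` is the starting point it recommends.

```python
import math


def suggested_probes(lists: int) -> int:
    """Starting point for ivfflat probes: sqrt(lists), per the pgvector README.

    Higher values favor recall, lower values favor speed.
    """
    return max(1, round(math.sqrt(lists)))


def set_probes_sql(lists: int) -> str:
    """Build the session-level statement that sets the probe count before querying."""
    return f"SET ivfflat.probes = {suggested_probes(lists)};"
```

With this split, `lists` is fixed when the index is built, and recall versus speed is tuned per session through `probes`.
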

Let me know if there are any questions. I would be happy to engage in a deeper discussion on these topics.

Understanding of pgvector indexing seems incorrect #1810

cschadewitz
opened on Jul 2, 2023
