Understanding of pgvector indexing seems incorrect #1810

Closed

Description

While implementing the PostgreSQL memory store for Python (#1354), I have been digging into the pgvector docs and referencing the C# implementation to maintain parity and cross-compatibility. This led me to the following findings, which I believe are misconceptions that made their way into the C# implementation.

1. Indexing occurs before table contains data

await this.CreateIndexAsync(connection, collectionName, cancellationToken).ConfigureAwait(false);

Based on the documentation for pgvector, the index should not be created until there is some data in the table.

Create the index after the table has some data

In the current implementation, the index is created as part of collection creation, and therefore before the table contains any data.

Since indexing takes some time (depending on the volume of data), it may not make sense for index creation to be part of the normal collection-creation flow.
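As a rough illustration of deferring index creation, here is a minimal Python sketch that only emits the `CREATE INDEX` statement once the table has some rows. The helper names, the `min_rows` threshold, and the `embedding` column name (borrowed from the pgvector example quoted later in this issue) are assumptions for illustration, not part of either implementation.

```python
from typing import Optional


def quote_ident(name: str) -> str:
    """Double-quote a PostgreSQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def ivfflat_index_sql(table: str, row_count: int, lists: int,
                      min_rows: int = 1000) -> Optional[str]:
    """Return a CREATE INDEX statement, or None while the table is too small.

    `min_rows` is an illustrative threshold, not a value from the pgvector docs.
    """
    if row_count < min_rows:
        # Defer: without an index, pgvector performs exact search.
        return None
    return (
        f"CREATE INDEX ON {quote_ident(table)} "
        f"USING ivfflat (embedding vector_l2_ops) WITH (lists = {int(lists)});"
    )
```

A caller could run this after a bulk upsert rather than at collection creation, skipping the index entirely while it returns `None`.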

2. Bad default for number of lists when indexing

Based on the pgvector documentation, I believe the default of 1000 lists is a poor choice, and that any default should be a function of the number of rows rather than a fixed number.

Choose an appropriate number of lists - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows
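The guidance quoted above could be turned into a default along these lines. This is only a sketch: the function name is hypothetical, and the floor of 1 for very small tables is my own assumption, since the docs do not say what to do below 1000 rows.

```python
import math


def suggested_lists(row_count: int) -> int:
    """Starting point for the ivfflat `lists` parameter.

    rows / 1000 for up to 1M rows, sqrt(rows) above that.
    The floor of 1 is an assumption for tables with fewer than 1000 rows.
    """
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, math.isqrt(row_count))
```

For example, a table of 50,000 rows would get 50 lists, well below the current fixed default of 1000.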

3. Number of lists does not improve recall at cost of speed

/// <param name="numberOfLists">Specifies the number of lists for indexing. Higher values can improve recall but may impact performance. The default value is 1000. More info <see href="/pgvector/pgvector#indexing"/></param>

Based on the pgvector documentation, creating an index improves query speed at the cost of recall, and increasing the number of lists pushes further in that direction rather than improving recall. I believe the misunderstanding comes from the guidance on the number of probes used at query time, which is the setting that trades speed for recall.

When querying, specify an appropriate number of probes (higher is better for recall, lower is better for speed) - a good place to start is sqrt(lists)

To speed up queries with an index, increase the number of inverted lists (at the expense of recall).
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);
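To make the distinction concrete, here is a small sketch of the query-time side: deriving a starting `probes` value from `lists` using the sqrt(lists) guidance quoted above, and producing the corresponding `SET ivfflat.probes` statement. The function names are hypothetical, and rounding to the nearest integer is my own choice.

```python
import math


def suggested_probes(lists: int) -> int:
    """Starting point for ivfflat.probes: roughly sqrt(lists), at least 1."""
    return max(1, round(math.sqrt(lists)))


def probes_statement(lists: int) -> str:
    """SQL to run on the connection before querying.

    Higher probes favors recall; lower probes favors speed.
    """
    return f"SET ivfflat.probes = {suggested_probes(lists)};"
```

With the 1000 lists from the example above, this gives a starting value of 32 probes. Recall is then tuned by raising `probes`, not `lists`.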

Let me know if there are any questions. I would be happy to discuss these topics in more depth.
