Skip to content

Support push-down filters on vss indices #2

Open
@asg017

Description

@asg017

tl;dr "The Missing WHERE Clause in Vector Search" describes this exact problem

Problem

Say your database looks like this:

create table articles(
  headline text, 
  headline_embedding blob,
  newsroom text,
  published_at date
);

create virtual table vss_articles using vss(
  headline_embedding(384)
);

Most queries go like "show me the 10 closest articles that talk about 'global warming'", which is currently supported and easy:

select rowid
from vss_articles
where vss_search(headline_embedding, embedding_api('global warming'))
limit 10;

Now you have the rowids of the 10 closest articles that talk about "global warming", and you're good to go!

But now consider you want the same 10 closest articles, but with extra filters. Say "show me the 10 closest articles that talk about 'global warming', that published in 2016-2019 from the NY times".

.... well crap. You have a few options, but none will 100% get the correct answer. You could cast a wide net by getting the nearest 1,000 matches then filtering those:

w

Solution: indexed non-vector columns in vss0

create virtual table vss_articles using vss0(
  headline_embeddings(384),
  published_at date,
  newsroom text
);

insert into vss_articles
  select headline_embedding, published_at, newsroom
  from articles;
select rowid
from vss_articles
where vss_search(headline_embeddings, embedding_api('global warming'))
  and published_at between '2016-01-01' and '2019-12-31'
  and newsroom = 'NY Times'
limit 20;

Implementation

We could support non-vector columns in vss0, which get added to the vss_xyz_data shadow table. Then, we can create indices on those columns on the shadow tables like so:

create index idx_vss0_vss_articles_data_published_at 
  on vss_articles_data(published_at);
create index idx_vss0_vss_articles_data_newsroom 
  on vss_articles_data(newsroom);

Then, in xFilter, we determine "candidate vectors" by querying those new data columns using the new indicies. We'll get all the candidate rowids and pass it into SearchParameters->IDSelector, with maybe some perf addtions (bitmap, bloom filter, etc.).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions