Document Embeddings

This repository illustrates how to use vector embeddings for document search.

Setup

Install the required Python packages:

pip install einops transformers torch

Download the required models from Hugging Face:

from huggingface_hub import snapshot_download

# Download models
snapshot_download("jinaai/xlm-roberta-flash-implementation")
snapshot_download("jinaai/jina-embeddings-v3")  # This might take a while, grab a coffee!

Generate Embeddings

Run the following script to generate embeddings:

python scripts/embeddings.py -cc Review -qc Name tests/fixtures/query.parquet tests/fixtures/corpus.parquet embeddings/ --model-path jinaai/jina-embeddings-v3

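-cc and -qc select the text column to embed in the corpus file and the query file, respectively (here Review and Name). The positional arguments are the query file, the corpus file, and the output directory. The Vector Search steps below read embeddings/query.parquet and embeddings/corpus.parquet from that directory.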
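
For readers who want to see how such a command line can be parsed, here is a minimal sketch using argparse. It is not the contents of scripts/embeddings.py, which this README does not show. The destination names (query, corpus, output_dir, query_column, corpus_column) are hypothetical and chosen only to mirror the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that accepts the same arguments as the command above."""
    parser = argparse.ArgumentParser(
        description="Generate embeddings for a query file and a corpus file."
    )
    # Positional arguments, in the order used by the command above.
    parser.add_argument("query", help="Parquet file holding the queries")
    parser.add_argument("corpus", help="Parquet file holding the corpus")
    parser.add_argument("output_dir", help="Directory that receives the embeddings")
    # Column selectors: which text column to embed in each file.
    parser.add_argument("-qc", dest="query_column", required=True,
                        help="Column of the query file to embed")
    parser.add_argument("-cc", dest="corpus_column", required=True,
                        help="Column of the corpus file to embed")
    parser.add_argument("--model-path", help="Model name or local path")
    return parser


# Parse the exact argument list from the command above.
args = build_parser().parse_args([
    "-cc", "Review",
    "-qc", "Name",
    "tests/fixtures/query.parquet",
    "tests/fixtures/corpus.parquet",
    "embeddings/",
    "--model-path", "jinaai/jina-embeddings-v3",
])
print(args.corpus_column, args.query_column, args.output_dir)
```

Passing an explicit list to parse_args keeps the sketch runnable without a shell; a real script would call parse_args() with no arguments to read sys.argv.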

Vector Search

Run the vector search with DuckDB and its VSS extension. First, start the DuckDB shell:

duckdb

Install and load the VSS extension:

INSTALL vss;
LOAD vss;

Create tables for the corpus and the queries. FLOAT[1024] is a fixed-size array whose length matches the 1024-dimensional embeddings:

CREATE TABLE Corpus (
    Review VARCHAR,
    Review_embedding FLOAT[1024]
);

CREATE TABLE Query (
    Name VARCHAR,
    Name_embedding FLOAT[1024]
);

Insert data from the parquet files:

INSERT INTO Corpus (Review, Review_embedding)
  SELECT Review,
    CAST(Review_embedding AS FLOAT[1024])
  FROM read_parquet('embeddings/corpus.parquet');

INSERT INTO Query (Name, Name_embedding)
  SELECT Name,
    CAST(Name_embedding AS FLOAT[1024])
  FROM read_parquet('embeddings/query.parquet');

Create HNSW indices for efficient vector search:

CREATE INDEX my_hnsw_index ON Corpus USING HNSW (Review_embedding);
CREATE INDEX my_hnsw_index2 ON Query USING HNSW (Name_embedding);

Compute the Euclidean distance between every query and every corpus entry, sorted by query name and then by distance:

SELECT 
    q.Name AS Query_Name,
    c.Review AS Corpus_Review,
    array_distance(q.Name_embedding, c.Review_embedding) AS distance
FROM 
    Query q
CROSS JOIN 
    Corpus c
ORDER BY 
    Query_Name, distance;
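
To make the query above concrete, the following sketch reproduces the same computation in plain Python: a Euclidean distance for every (query, corpus) pair, sorted by query name and then by distance. The function names and the toy 2-dimensional vectors are made up for illustration; the real embeddings have 1024 dimensions.

```python
import math
from itertools import product


def euclidean_distance(a, b):
    """Euclidean (L2) distance, the metric computed by array_distance."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def all_distances(queries, corpus):
    """Pair every query with every corpus entry, like the CROSS JOIN."""
    rows = [
        (name, review, euclidean_distance(q_vec, c_vec))
        for (name, q_vec), (review, c_vec) in product(queries, corpus)
    ]
    # Same ordering as ORDER BY Query_Name, distance.
    rows.sort(key=lambda row: (row[0], row[2]))
    return rows


# Toy data (hypothetical): one query and two corpus entries.
queries = [("q1", [0.0, 0.0])]
corpus = [("far", [3.0, 4.0]), ("near", [1.0, 0.0])]

rows = all_distances(queries, corpus)
for name, review, distance in rows:
    print(name, review, distance)
```

The number of rows grows as the product of the two table sizes, so this exhaustive form is best suited to small inputs such as the test fixtures.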

Keep only the two closest corpus entries for each query:

SELECT 
    Query_Name,
    Corpus_Review,
    distance
FROM (
    SELECT 
        q.Name AS Query_Name,
        c.Review AS Corpus_Review,
        array_distance(q.Name_embedding, c.Review_embedding) AS distance,
        ROW_NUMBER() OVER (PARTITION BY q.Name ORDER BY array_distance(q.Name_embedding, c.Review_embedding) ASC) AS rank
    FROM 
        Query q
    CROSS JOIN 
        Corpus c
) ranked
WHERE rank <= 2
ORDER BY Query_Name, rank;
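
The ranking step can be sketched in plain Python as well. This is an illustration under simplifying assumptions, not code from the repository: query names are assumed to be unique, the function name top_k_per_query is hypothetical, and the toy vectors are 2-dimensional instead of 1024-dimensional.

```python
import heapq
import math


def top_k_per_query(queries, corpus, k=2):
    """Return the k nearest corpus entries for each query name.

    Mirrors ROW_NUMBER() OVER (PARTITION BY q.Name ORDER BY distance)
    followed by WHERE rank <= k, assuming query names are unique.
    """
    results = {}
    for name, q_vec in queries:
        # Lazily score every corpus entry against this query.
        scored = ((math.dist(q_vec, c_vec), review) for review, c_vec in corpus)
        # nsmallest returns the k smallest (distance, review) pairs, ascending.
        results[name] = heapq.nsmallest(k, scored)
    return results


# Toy data (hypothetical): one query and three corpus entries.
queries = [("q1", [0.0, 0.0])]
corpus = [("a", [3.0, 4.0]), ("b", [1.0, 0.0]), ("c", [0.0, 2.0])]

top = top_k_per_query(queries, corpus, k=2)
for name, hits in top.items():
    for rank, (distance, review) in enumerate(hits, start=1):
        print(name, rank, review, distance)
```

Using heapq.nsmallest avoids sorting all distances for a query when only the first few are needed.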

Example Application

An example application is accessible here

Run the example application:

cd example && bun run dev

Repository: sondalex/document-embeddings
