Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: postgresql store #157

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

milancermak
Copy link

Resolves #5

I've modeled the crate on the SQLite one.

On the DB side, the crate is relying on the pgvector extension. It creates an HSNW index and uses the cosine distane operator to compare embeddings.

Please let me know it this looks ok or if there are any parts of the code that should be changed.

@0xMochan
Copy link
Contributor

0xMochan commented Dec 18, 2024

Thanks for your contribution, nearly done with my review but I'll go ahead and approve and run the workflow. I think you might need to cargo fmt and pass cargo clippy as well.

@b0xtch
Copy link

b0xtch commented Dec 20, 2024

Hey great work, any update on this?

@0xMochan
Copy link
Contributor

Hey great work, any update on this?

I'm still in my review, I got wrapped up in some other responsibilities, will get it out ASAP!

Copy link
Contributor

@0xMochan 0xMochan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay on this review! Mostly looking good, a couple of changes I've requested here.

One note: I noticed that 'static was used a good bit here which was likely inherited from the sqlite implementation. I think this is due to how the connection requires an closure with an async block which makes borrowing and using such things really odd.

Recently, proper async closures which will get into Rust stable sometime in 2025 which should help us alleviate our usage of 'static here (i think!)

// tracing_subscriber::EnvFilter::from_default_env()
// .add_directive(tracing::Level::DEBUG.into())
// .add_directive("hyper=off".parse().unwrap())
// ).init();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would either leave this out or add this snippet back to the example!

let placeholders = placeholders.join(", ");

let query_string = &format!(
"INSERT INTO {} ({}) VALUES ({})",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses Rust string interpolation to evaluate the query string which introduces SQL injection risk. Nearly every instance of sql statements made in this crate can be converted to prepared statements to avoid this problem. This is a better practice to have since it will lead to a harder situation of user input leading to problems down the line (imagine prompt engineering -> hacking -> SQL injection 🤯).

This stackoverflow post has more information on this can be refactored.


EDIT: I realized you used prepared statements later, was there a specific reason you didn't use it here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed typed, prepared statements. I'm not too familiar with this, might be interested to look into.

.iter()
.map(|(_, v)| &**v)
.chain(std::iter::once(&embedding_vector as &(dyn ToSql + Sync)))
.collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation seems a bit odd to me, I'm not sure why iterators are used here. I would leave a comment explaining what you are doing to prepare your parameters here.

/// to store embeddings and perform similarity searches.
///
/// # Example
/// ```rust,ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen this "```rust,ignore"
Does this disable doctests?

<span style="font-size: 48px; margin: 0 20px; font-weight: regular; font-family: Open Sans, sans-serif;"> + </span>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://www.postgresql.org/media/img/about/press/elephant.png">
<source media="(prefers-color-scheme: light)" srcset="https://www.postgresql.org/media/img/about/press/elephant.png">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's the same logo for both, you prolly could just remove the media="(prefers-color-scheme: dark)" and just keep one <source>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also include basic instructions (esp. what a DATABASE_URL should be) and some links to install postgres.

Comment on lines +309 to +314
let mut map = serde_json::Map::new();
for (i, column) in row.columns().iter().enumerate().take(rlen - 1) {
let name = column.name();
let value = serde_json::Value::String(row.get(i));
map.insert(name.to_string(), value);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious what this would look like functionally (copilot assisted 😉)

Suggested change
let mut map = serde_json::Map::new();
for (i, column) in row.columns().iter().enumerate().take(rlen - 1) {
let name = column.name();
let value = serde_json::Value::String(row.get(i));
map.insert(name.to_string(), value);
}
let map = row.columns().iter().enumerate().take(rlen - 1).fold(
serde_json::Map::new(),
|mut acc, (i, column)| {
acc.insert(
column.name().to_string(),
serde_json::Value::String(row.get(i)),
);
acc
},
);

Pretty neat, you don't need to accept this unless you like functional design better.

if let Err(e) = connection.await {
tracing::error!("Connection error: {}", e);
}
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example, why are you spawning a task for connecting, why can't you just await the connection directly?

@milancermak
Copy link
Author

Thanks for the review. I'm still in holiday mode until New Year's, will address the comments on Jan 2 or 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Add support for PostgreSQL vector store
3 participants