chore: add documentation for Hybrid Search #233
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -686,6 +686,108 @@ | |
| "1. For new records, added via `VectorStore` embeddings are automatically generated." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Hybrid Search Vector Store\n", | ||
|
Please provide the easy way to get started, then a section for how to customize. |
||
| "\n", | ||
| "A Hybrid Search Vector Store combines multiple lookup strategies to provide more comprehensive and relevant search results. Specifically, it leverages both dense embedding vector search (for semantic similarity) and TSV (Text Search Vector) based keyword search (for lexical matching). This approach is particularly powerful for applications requiring efficient searching through customized text and metadata, especially when a specialized embedding model isn't feasible or necessary.\n", | ||
| "\n", | ||
| "By integrating both semantic and lexical capabilities, hybrid search helps overcome the limitations of each individual method:\n", | ||
| "\n", | ||
| "* **Semantic Search**: Excellent for understanding the meaning of a query, even if the exact keywords aren't present. However, it can sometimes miss highly relevant documents that contain the precise keywords but have a slightly different semantic context.\n", | ||
| "\n", | ||
| "* **Keyword Search**: Highly effective for finding documents with exact keyword matches and is generally fast. Its weakness lies in its inability to understand synonyms, misspellings, or conceptual relationships.\n", | ||
| "\n", | ||
| "With a `HybridSearchConfig` provided, the `PGVectorStore` class can efficiently manage a hybrid search vector store using PostgreSQL as the backend, automatically handling the creation and population of the necessary TSV columns when possible.\n", | ||
| "\n", | ||
| "\n", | ||
| "Assume a pre-existing PostgreSQL table named `products`, the same as in the section above, which stores product details for an e-commerce venture.\n", | ||
| "\n", | ||
| "Here is how this table is mapped to `PGVectorStore`:\n", | ||
| "\n", | ||
| "- **`id_column=\"product_id\"`**: The `product_id` column uniquely identifies each row in the `products` table.\n", | ||
| "\n", | ||
| "- **`content_column=\"description\"`**: The `description` column contains a text description of each product. The `embedding_service` uses this text to create the vectors stored in `embedding_column`, which represent the semantic meaning of each description.\n", | ||
| "\n", | ||
| "- **`embedding_column=\"embed\"`**: The `embed` column stores the vectors created from the product descriptions. These vectors are used to find products with similar descriptions.\n", | ||
| "\n", | ||
| "- **`metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"]`**: These columns are treated as metadata for each product. Metadata provides additional information about a product, such as its name, category, price, quantity available, SKU (Stock Keeping Unit), and an image URL. This information is useful for displaying product details in search results or for filtering and categorization.\n", | ||
| "\n", | ||
| "- **`metadata_json_column=\"metadata\"`**: The `metadata` column can store any additional information about the products in a flexible JSON format. This allows for storing varied and complex data that doesn't fit into the standard columns.\n" | ||
|
Comment on lines +710 to +718
This info is already provided above. Please outline how to use the HybridSearchConfig. |
||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from langchain_postgres.v2 import PGVectorStore\n", | ||
| "from langchain_postgres.v2.hybrid_search_config import (\n", | ||
| " HybridSearchConfig,\n", | ||
| " reciprocal_rank_fusion,\n", | ||
| ")\n", | ||
| "\n", | ||
| "TABLE_NAME = \"hybrid_search_products\"\n", | ||
| "\n", | ||
| "hybrid_search_config = HybridSearchConfig(\n", | ||
| " tsv_column=\"hybrid_description\",\n", | ||
| " tsv_lang=\"pg_catalog.english\",\n", | ||
| " fusion_function=reciprocal_rank_fusion,\n", | ||
| " fusion_function_parameters={\n", | ||
| " \"rrf_k\": 60,\n", | ||
| " \"fetch_top_k\": 10,\n", | ||
| " },\n", | ||
| ")\n", | ||
| "\n", | ||
| "# If a hybrid search config is provided during vector store table creation,\n", | ||
| "# the specified TSV column will be automatically created.\n", | ||
| "await pg_engine.ainit_vectorstore_table(\n", | ||
| " table_name=TABLE_NAME,\n", | ||
| " # schema_name=SCHEMA_NAME,\n", | ||
| " vector_size=VECTOR_SIZE,\n", | ||
| " id_column=\"product_id\",\n", | ||
| " content_column=\"description\",\n", | ||
| " embedding_column=\"embed\",\n", | ||
| " metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"],\n", | ||
| " metadata_json_column=\"metadata\",\n", | ||
| " hybrid_search_config=hybrid_search_config,\n", | ||
| " store_metadata=True,\n", | ||
| ")\n", | ||
|
Comment on lines +745 to +758
Can we create individual sections for each of these notes? The inline comments are hard to read. |
||
| "\n", | ||
| "\n", | ||
| "# If a hybrid search config is NOT provided to ainit_vectorstore_table (above)\n", | ||
| "# but only to PGVectorStore.create, the TSV column does not exist in the table,\n", | ||
| "# so TSV vectors are computed on the fly during hybrid search.\n", | ||
| "vs_hybrid = await PGVectorStore.create(\n", | ||
| " pg_engine,\n", | ||
| " table_name=TABLE_NAME,\n", | ||
| " # schema_name=SCHEMA_NAME,\n", | ||
| " embedding_service=embedding,\n", | ||
| " # Connect to existing VectorStore by customizing below column names\n", | ||
| " id_column=\"product_id\",\n", | ||
| " content_column=\"description\",\n", | ||
| " embedding_column=\"embed\",\n", | ||
| " metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"],\n", | ||
| " metadata_json_column=\"metadata\",\n", | ||
| " hybrid_search_config=hybrid_search_config,\n", | ||
| ")\n", | ||
| "\n", | ||
| "# Optionally, create an index on the hybrid search (TSV) column\n", | ||
| "await vs_hybrid.aapply_hybrid_search_index()\n", | ||
| "\n", | ||
| "# Fetch product documents from the previously created store\n", | ||
| "docs = await custom_store.asimilarity_search(\"products\", k=5)\n", | ||
| "# Add documents as usual; this also populates the TSV values in tsv_column\n", | ||
| "await vs_hybrid.aadd_documents(docs)\n", | ||
| "\n", | ||
| "# Use hybrid search\n", | ||
| "hybrid_docs = await vs_hybrid.asimilarity_search(\"products\", k=5)\n", | ||
| "print(hybrid_docs)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
|
|
||
I agree that a note in the README is helpful. We should make sure the code snippet is clear and concise. Instead of this, can we add a header for hybrid search and the smallest code snippet to get started, then link to the how-to?
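
For context on what the `fusion_function` in the `HybridSearchConfig` from this diff is doing, here is a minimal standalone sketch of reciprocal rank fusion. It is not the `langchain_postgres` implementation: the function name `rrf_sketch`, the 1-based ranks, the tie-breaking rule, and the sample SKU ids are all assumptions made for illustration. Only the parameter names `rrf_k` and `fetch_top_k` are taken from the diff.

```python
from collections import defaultdict


def rrf_sketch(ranked_lists, rrf_k=60, fetch_top_k=10):
    """Fuse several ranked lists of document ids (hypothetical helper).

    Each document scores 1 / (rrf_k + rank) for every list it appears in,
    with rank starting at 1 (an assumption; a library may count from 0).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties are broken by id to keep output stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:fetch_top_k]


# Made-up results from the two retrievers for one query.
dense_hits = ["sku-3", "sku-1", "sku-7"]    # semantic (embedding) search
keyword_hits = ["sku-1", "sku-9", "sku-3"]  # keyword (TSV) search

fused = rrf_sketch([dense_hits, keyword_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

In this toy run, `sku-1` and `sku-3` come out ahead of the other two because both retrievers return them, which is the behavior hybrid search relies on. A larger `rrf_k` flattens the gap between top and lower ranks.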