A Plone add-on that brings semantic vector search to your content management system. It converts text into vector embeddings using LLM-based models and finds semantically similar content through multi-stage approximate nearest neighbor search.
- VectorIndex for ZCatalog: A custom catalog index that stores and searches vector embeddings alongside traditional Plone indexes
- Multi-Stage Approximate Search: Three search algorithms with automatic fallback:
  - Exhaustive Cosine (default): Brute-force cosine similarity on all documents
  - ITQ-LSH 2-Stage: Hamming distance ranking via ITQ binary hashes, then cosine similarity on the top-K candidates
  - ITQ-LSH 3-Stage: Pivot-based triangle inequality filtering, then Hamming ranking, then cosine similarity
- Multiple Embedding Models:
  - All-MiniLM-L6-v2 (default, 384 dimensions, English, FastEmbed/CPU)
  - E5-Base Multilingual (768 dimensions, 100+ languages, FastEmbed/CPU)
  - E5-Base Multilingual GPU (768 dimensions, GPU-accelerated, requires the `[gpu]` extras)
- Annotation-Based Data Storage: Vector data stored in content annotations as the single source of truth
- FastEmbed by Default: CPU-friendly ONNX-optimized embeddings, no GPU required
- Optional GPU Support: Install the `[gpu]` extras for GPU-accelerated processing via PyTorch and Sentence Transformers
- Control Panel: Configure models, search algorithms, and parameters via Site Setup
- Pluggable Architecture: Add new embedding model providers from external packages
- Plone 6.0 or 6.1+
- Python 3.10 - 3.13
Install collective.vectorsearch by adding it to your buildout:
    [buildout]
    ...
    eggs =
        collective.vectorsearch

and then running `bin/buildout`.
Or install via pip:
    pip install collective.vectorsearch
For GPU-accelerated embedding with PyTorch and Sentence Transformers:
    pip install collective.vectorsearch[gpu]
Or in buildout:
    [buildout]
    ...
    eggs =
        collective.vectorsearch [gpu]
- Install the package via Site Setup -> Add-ons
- Go to Site Setup -> Vector Search to configure the embedding model
- The `llm_vector` index is automatically added to `portal_catalog`
- Content is automatically vectorized when created or modified
- Use the "Reindex All" button in the control panel to vectorize existing content
When content is created or modified, event subscribers automatically compute embeddings and store them in content annotations. The catalog indexers then read from these annotations to populate the VectorIndex and supporting indexes (pivot1-8, itq_hashes).
    Content created/modified
    |
    +-- Event subscriber: compute_and_store_vectors()
    |   +-- Embed text using configured model
    |   +-- Compute ITQ binary hashes (128-bit)
    |   +-- Compute pivot distances (8 pivots)
    |   +-- Store all data in content annotations
    |
    +-- Catalog indexing
        +-- VectorIndex: reads vectors from annotations
        +-- pivot1-8 KeywordIndex: reads pivot distances
        +-- itq_hashes metadata: reads ITQ hashes
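The pipeline above can be sketched in plain Python. This is a minimal illustration, not the package's actual API: the function names, annotation keys, and the toy hash-by-random-projection "embedding" are all hypothetical stand-ins (real ITQ uses a learned rotation, and the real models produce 384- or 768-dimensional vectors).

```python
import hashlib
import math
import random

DIMENSIONS = 8    # toy size; the real models use 384 or 768
HASH_BITS = 128   # matches the 128-bit ITQ hashes described above
NUM_PIVOTS = 8    # matches the 8 pivot distances

def embed(text):
    """Stand-in for the embedding model: a deterministic pseudo-random vector."""
    rng = random.Random(hashlib.sha256(text.encode()).digest())
    return [rng.uniform(-1, 1) for _ in range(DIMENSIONS)]

def binary_hash(vector, projections):
    """Binarize a vector by the sign of random projections (ITQ proper
    learns a rotation first; this is only the general LSH idea)."""
    return tuple(
        1 if sum(v * p for v, p in zip(vector, proj)) >= 0 else 0
        for proj in projections
    )

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

rng = random.Random(0)
projections = [[rng.gauss(0, 1) for _ in range(DIMENSIONS)] for _ in range(HASH_BITS)]
pivots = [embed(f"pivot-{i}") for i in range(NUM_PIVOTS)]

annotations = {}  # stands in for per-content IAnnotations storage

def compute_and_store_vectors(uid, text):
    """Mirror the event subscriber: embed, hash, pivot-distance, then store."""
    vector = embed(text)
    annotations[uid] = {
        "vector": vector,
        "itq_hash": binary_hash(vector, projections),
        "pivot_distances": [cosine_distance(vector, p) for p in pivots],
    }

compute_and_store_vectors("doc-1", "Semantic search for Plone")
print(len(annotations["doc-1"]["itq_hash"]))         # 128
print(len(annotations["doc-1"]["pivot_distances"]))  # 8
```

The catalog indexers then only read from this per-document record, which is what makes the annotations the single source of truth.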
The package implements a multi-stage approximate nearest neighbor search based on the lsh-cascade-poc research:
- Exhaustive Cosine (`exhaustive_cosine`): Computes cosine similarity against all indexed documents. Most accurate, but slowest for large datasets.
- ITQ-LSH 2-Stage (`itq_lsh_2stage`):
  - Compute the query's ITQ hash and rank all documents by Hamming distance
  - Compute cosine similarity on the top-K candidates (`itq_candidates`, default: 100)
- ITQ-LSH 3-Stage (`itq_lsh_3stage`):
  - Pivot filtering: Use 8 pivot distances with the triangle inequality to narrow candidates via KeywordIndex range queries
  - Hamming ranking: Rank the remaining candidates by ITQ Hamming distance, keep the top-K
  - Cosine similarity: Precise scoring on the final candidates
The system automatically falls back: 3-stage -> 2-stage -> exhaustive if the required ITQ or pivot data is unavailable.
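The 2-stage cascade can be sketched as follows. Everything here is illustrative (random sign-projection hashes instead of true ITQ, invented document IDs), but the control flow matches the description above: cheap Hamming ranking first, exact cosine scoring only on the survivors.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hamming(h1, h2):
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

def sign_hash(vector, projections):
    # Stand-in for an ITQ hash: sign bit of each random projection.
    return [int(sum(v * p for v, p in zip(vector, proj)) >= 0) for proj in projections]

def two_stage_search(query, docs, hashes, projections, itq_candidates=100, top_n=10):
    """Stage 1: Hamming ranking on binary hashes. Stage 2: cosine rerank."""
    q_hash = sign_hash(query, projections)
    # Stage 1: rank every document by Hamming distance, keep top-K candidates.
    candidates = sorted(docs, key=lambda uid: hamming(q_hash, hashes[uid]))
    candidates = candidates[:itq_candidates]
    # Stage 2: exact cosine similarity only on the surviving candidates.
    scored = [(uid, cosine_similarity(query, docs[uid])) for uid in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

rng = random.Random(42)
dims, bits = 16, 32
projections = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(bits)]
docs = {f"doc-{i}": [rng.gauss(0, 1) for _ in range(dims)] for i in range(200)}
hashes = {uid: sign_hash(vec, projections) for uid, vec in docs.items()}

# Querying with a stored vector should return that document first.
query = docs["doc-7"]
results = two_stage_search(query, docs, hashes, projections, itq_candidates=50)
print(results[0][0])  # doc-7
```

The fallback logic is then a simple dispatch: if ITQ hashes are missing for the indexed documents, skip the Hamming stage and score everything with cosine similarity directly.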
Access the control panel at Site Setup -> Vector Search to configure:
- Embedding Model: Select the model for generating embeddings
- Text Chunk Size: Maximum characters per chunk (100-10,000, default: 500)
- Approximation Algorithm: Search strategy (exhaustive_cosine, itq_lsh_2stage, itq_lsh_3stage)
- Pivot Threshold (Stage 1): Filtering threshold for pivot-based search (cosine distance x 1000, default: 200)
- ITQ Candidates (Stage 2): Number of candidates after Hamming ranking (default: 100)
- Storage Backend: Currently supports BTrees (internal storage)
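To illustrate how the Pivot Threshold prunes candidates in stage 1, here is a minimal sketch of triangle-inequality filtering. The numbers are invented, but they use the same scaled units as the control panel (distance × 1000): for a metric d, the triangle inequality gives d(q, x) >= |d(q, p) - d(x, p)| for every pivot p, so a document whose lower bound exceeds the threshold can be discarded without ever loading its full vector.

```python
def pivot_filter(query_pivot_dists, doc_pivot_dists, threshold):
    """Keep a document only if every pivot-based lower bound is within the threshold.

    Each pivot p yields a lower bound |d(q, p) - d(x, p)| on the true
    distance d(q, x); the tightest (largest) bound decides pruning.
    """
    lower_bound = max(
        abs(qd - xd) for qd, xd in zip(query_pivot_dists, doc_pivot_dists)
    )
    return lower_bound <= threshold

# Hypothetical stored distances to the 8 pivots, scaled by 1000.
query_dists = [120, 340, 560, 80, 910, 400, 230, 650]
near_doc    = [150, 300, 500, 90, 950, 380, 260, 600]  # largest bound: 60
far_doc     = [800, 340, 560, 80, 910, 400, 230, 650]  # first pivot bound: 680

print(pivot_filter(query_dists, near_doc, threshold=200))  # True
print(pivot_filter(query_dists, far_doc, threshold=200))   # False
```

In the package this check is expressed as KeywordIndex range queries (one range per pivot), so the catalog can intersect the candidate sets without scanning every document.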
| Model | Dimensions | GPU | Extras |
|---|---|---|---|
| All-MiniLM-L6-v2 (FastEmbed) | 384 | No | (default) |
| E5 Base Multilingual (FastEmbed) | 768 | No | (default) |
| E5 Base Multilingual (GPU) | 768 | Yes | [gpu] |
The package adds a VectorIndex named `llm_vector` to the portal catalog.
You can query it programmatically:
    from plone import api

    catalog = api.portal.get_tool('portal_catalog')
    index = catalog.Indexes['llm_vector']

    # Search for similar content. ``record`` is the query record object
    # that ZCatalog passes to the index when it is queried.
    results = index.query_index(record)
You can add additional VectorIndex instances via ZMI:
- Navigate to `/Plone/portal_catalog/manage_main`
- Select "VectorIndex" from the index type dropdown
- Enter an ID and optionally specify indexed attributes (comma-separated)
External packages can add new embedding models by implementing IEmbeddingModelProvider:
    from collective.vectorsearch.model_providers import BaseEmbeddingModelProvider

    class MyCustomProvider(BaseEmbeddingModelProvider):
        id = 'my-custom-model'
        title = 'My Custom Model'
        description = 'Custom model description'
        model_name = 'my-org/my-model'
        vector_dimensions = 768

        # Backend configuration
        backend = 'fastembed'  # or 'sentence_transformers'
        backend_name = 'FastEmbed (CPU/ONNX)'
        requires_gpu = False
        extras_name = None  # or 'gpu' for [gpu] extras
Register it in your package's configure.zcml:
    <utility
        factory=".providers.MyCustomProvider"
        provides="collective.vectorsearch.interfaces.IEmbeddingModelProvider"
        name="my-custom-model"
        />
FastEmbed downloads models on first use. For offline environments, pre-download models using the CLI command:
    vectorsearch-download
This downloads all supported models to `~/.cache/fastembed`.
Set the `FASTEMBED_CACHE_PATH` environment variable to use a different location.
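For example, in the shell that runs the download (the cache path here is arbitrary):

```shell
# Point FastEmbed at a custom, pre-seeded model cache
export FASTEMBED_CACHE_PATH=/var/cache/fastembed
vectorsearch-download
```

The same variable must also be set in the environment of the Zope process so that the runtime finds the pre-downloaded models.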
After reinstalling or upgrading this package, you must restart the Plone/Zope server. Without a restart, the model provider utilities may not be properly registered.
Recommended procedure:
- Reinstall or upgrade the package via Site Setup -> Add-ons
- Restart the Plone/Zope server
- Go to Site Setup -> Vector Search
- Click "Reindex All" to rebuild the vector index
Warning: Uninstalling this package will delete all vector data from the catalog.
The llm_vector index, pivot indexes, and all embeddings will be permanently removed.
If you need to preserve vector data while updating the package code, use the Upgrade feature instead of uninstall/reinstall.
To set up a development environment:
    git clone https://github.com/collective/collective.vectorsearch.git
    cd collective.vectorsearch
    make install
Run tests:
    make test
See DEVELOP.rst for detailed development instructions.
- Manabu TERADA (@terapyon)
- (Your name here)
- Issue Tracker: https://github.com/collective/collective.vectorsearch/issues
- Source Code: https://github.com/collective/collective.vectorsearch
If you are having issues, please open an issue on GitHub: https://github.com/collective/collective.vectorsearch/issues
The project is licensed under the GPLv2.