minsearch

A minimalistic search engine with both text-based and vector-based search. The library provides three implementations:

- Index: A basic search index using scikit-learn's TF-IDF vectorizer for text fields
- AppendableIndex: An appendable search index using an inverted index implementation that allows for incremental document addition
- VectorSearch: A vector search index using cosine similarity for pre-computed vectors

Features

- Text field indexing with TF-IDF and cosine similarity
- Vector search with cosine similarity for pre-computed embeddings
- Keyword field filtering with exact matching
- Field boosting for fine-tuning search relevance (text-based search)
- Stop word removal and custom tokenization
- Support for incremental document addition (AppendableIndex)
- Customizable tokenizer patterns and stop words
- Efficient search with filtering and boosting
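
To make the first feature concrete, here is a minimal pure-Python sketch of TF-IDF weighting combined with cosine similarity. It is an illustration only, not the library's code: Index relies on scikit-learn for this step, and the helper names `tokenize`, `tfidf_vectors`, and `cosine` are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(texts):
    """Return one sparse {term: weight} vector per text, plus the idf table."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, so a term present in every text still gets a positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "you can join the course at any time",
    "you need basic knowledge of programming",
]
doc_vectors, idf = tfidf_vectors(texts)

# Weight the query with the idf learned from the documents; unseen terms are dropped.
query_counts = Counter(tokenize("can i join the course"))
query_vector = {t: tf * idf[t] for t, tf in query_counts.items() if t in idf}

scores = [cosine(query_vector, vec) for vec in doc_vectors]
```

The first text shares several terms with the query and gets a positive score, while the second shares none and scores zero.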

Installation

pip install minsearch

Environment setup

For development purposes, use uv:

# Install uv if you haven't already
pip install uv
uv sync --extra dev

Usage

Basic Search with Index

from minsearch import Index

# Create documents
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]

# Create and fit the index
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

# Search with filters and boosts
query = "Can I join the course if it has already started?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}

results = index.search(query, filter_dict=filter_dict, boost_dict=boost_dict)

Incremental Search with AppendableIndex

from minsearch import AppendableIndex

# Create the index
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"]
)

# Add documents one by one
doc1 = {"title": "Python Programming", "description": "Learn Python programming", "course": "CS101"}
index.append(doc1)

doc2 = {"title": "Data Science", "description": "Python for data science", "course": "CS102"}
index.append(doc2)

# Search the documents added so far
results = index.search("python programming")

# Create an index with custom stop words
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    stop_words={"the", "a", "an"}  # Custom stop words
)
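
The idea behind an appendable inverted index can be sketched in a few lines. This is a simplified illustration, not the code in append.py: the class name `TinyInvertedIndex` is hypothetical, it handles a single text field, and it ranks by raw term counts instead of TF-IDF.

```python
import re
from collections import defaultdict


class TinyInvertedIndex:
    """Hypothetical single-field inverted index that supports appends."""

    def __init__(self, stop_words=None):
        self.stop_words = set(stop_words or [])
        self.docs = []                      # doc_id -> original document
        self.postings = defaultdict(dict)   # token -> {doc_id: term count}

    def _tokenize(self, text):
        tokens = re.split(r"\W+", text.lower())
        return [t for t in tokens if t and t not in self.stop_words]

    def append(self, doc, field="title"):
        # Only the postings of the new document's tokens are touched,
        # so nothing already indexed has to be rebuilt.
        doc_id = len(self.docs)
        self.docs.append(doc)
        for token in self._tokenize(doc[field]):
            self.postings[token][doc_id] = self.postings[token].get(doc_id, 0) + 1
        return doc_id

    def search(self, query):
        # Score = total count of query tokens in the document
        # (a crude stand-in for TF-IDF scoring).
        scores = defaultdict(int)
        for token in self._tokenize(query):
            for doc_id, count in self.postings.get(token, {}).items():
                scores[doc_id] += count
        ranked = sorted(scores, key=lambda doc_id: -scores[doc_id])
        return [self.docs[doc_id] for doc_id in ranked]


tiny = TinyInvertedIndex(stop_words={"the", "a", "an"})
tiny.append({"title": "Python Programming"})
tiny.append({"title": "Data Science"})
hits = tiny.search("the python course")
```

Because each append only updates the postings for the new document's tokens, documents can be added at any time without refitting.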

Vector Search with VectorSearch

from minsearch import VectorSearch
import numpy as np

# Create sample vectors and payload documents
vectors = np.random.rand(100, 768)  # 100 documents, 768-dimensional vectors
payload = [
    {"id": 1, "title": "Python Tutorial", "category": "programming", "level": "beginner"},
    {"id": 2, "title": "Data Science Guide", "category": "data", "level": "intermediate"},
    {"id": 3, "title": "Machine Learning Basics", "category": "ai", "level": "advanced"},
    # ... more documents
]

# Create and fit the vector search index
index = VectorSearch(keyword_fields=["category", "level"])
index.fit(vectors, payload)

# Search with a query vector
query_vector = np.random.rand(768)  # 768-dimensional query vector
filter_dict = {"category": "programming", "level": "beginner"}

results = index.search(query_vector, filter_dict=filter_dict, num_results=5)
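
Conceptually, a cosine-similarity search scores every stored vector against the query and keeps the best `num_results`. The sketch below shows that ranking in plain Python on tiny 3-dimensional vectors. The function names are hypothetical and this is not how vector.py is written; it only illustrates the ranking rule.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, payload, num_results=5):
    """Return the payload entries of the num_results most similar vectors."""
    scored = [(cosine_similarity(query, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [payload[i] for _, i in scored[:num_results]]


toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
toy_payload = [{"id": 1}, {"id": 2}, {"id": 3}]

# The query points almost exactly along the first axis,
# so ids 1 and 2 rank ahead of id 3.
nearest = top_k([1.0, 0.05, 0.0], toy_vectors, toy_payload, num_results=2)
```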

Advanced Features

Custom Tokenizer Pattern

from minsearch import AppendableIndex

# Create index with custom tokenizer pattern
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    tokenizer_pattern=r'[\s\W\d]+'  # Custom pattern to split on whitespace, non-word chars, and digits
)
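
To see what such a pattern does to text, the sketch below applies it with `re.split`. The `tokenize` helper is hypothetical, and lowercasing first and dropping empty tokens are assumptions of this example rather than documented library behavior.

```python
import re

pattern = re.compile(r"[\s\W\d]+")


def tokenize(text, stop_words=frozenset()):
    # Split on runs of whitespace, non-word characters, and digits,
    # then drop empty strings and stop words.
    tokens = pattern.split(text.lower())
    return [t for t in tokens if t and t not in stop_words]


tokens = tokenize("Python 3 for data-science!", stop_words={"for"})
# tokens == ["python", "data", "science"]
```

Note that the digit "3" disappears entirely with this pattern, which may or may not be what you want for queries that contain version numbers.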

Field Boosting (Text-based Search)

# Boost certain fields to increase their importance in search
boost_dict = {
    "title": 2.0,      # Title matches are twice as important
    "description": 1.0  # Normal importance for description
}
results = index.search("python", boost_dict=boost_dict)
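
One way to think about boosts is as weights in a sum of per-field relevance scores. The sketch below assumes that model and uses made-up per-field scores. The function `combine_scores` and the default weight of 1.0 for unlisted fields are assumptions for illustration, not documented library behavior.

```python
def combine_scores(field_scores, boost_dict=None, default_boost=1.0):
    """Weighted sum of per-field scores; fields without a boost use default_boost."""
    boost_dict = boost_dict or {}
    return sum(
        boost_dict.get(field, default_boost) * score
        for field, score in field_scores.items()
    )


# Hypothetical per-field similarity scores for two documents.
doc_a = {"title": 0.8, "description": 0.1}   # strong title match
doc_b = {"title": 0.2, "description": 0.9}   # strong description match

boosts = {"title": 2.0, "description": 1.0}

unboosted = (combine_scores(doc_a), combine_scores(doc_b))
boosted = (combine_scores(doc_a, boosts), combine_scores(doc_b, boosts))
```

Without boosts, doc_b ranks first; with the title weighted twice as heavily, doc_a overtakes it.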

Keyword Filtering

# Filter results by exact keyword matches; each key should be a field declared in keyword_fields
filter_dict = {
    "course": "CS101",
    "level": "beginner"
}
results = index.search("python", filter_dict=filter_dict)
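
In the sketch below, exact matching means a document passes only if every filtered field equals the requested value as a whole, so substrings and different casing do not match. The `matches` helper is hypothetical and stands in for the library's internals, which may differ.

```python
def matches(doc, filter_dict):
    """True if every field in filter_dict equals the document's value exactly."""
    return all(doc.get(field) == value for field, value in filter_dict.items())


docs = [
    {"title": "Python Programming", "course": "CS101", "level": "beginner"},
    {"title": "Data Science", "course": "CS102", "level": "beginner"},
    {"title": "Advanced Python", "course": "CS101", "level": "advanced"},
]

filter_dict = {"course": "CS101", "level": "beginner"}
kept = [doc for doc in docs if matches(doc, filter_dict)]
```

Only the first document satisfies both conditions, so it is the only one kept.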

Examples

Interactive Notebook

The repository includes an interactive Jupyter notebook (minsearch_example.ipynb) that demonstrates the library's features using real-world data. The notebook shows:

- Loading and preparing documents from a JSON source
- Creating and configuring the search index
- Performing searches with filters and boosts
- Working with real course-related Q&A data

To run the notebook:

uv run jupyter notebook

Then open minsearch_example.ipynb in your browser.

Development

Running Tests

uv run pytest

Building and Publishing

Install development dependencies:

uv sync --extra dev

Build the package:

uv run hatch build

Publish to test PyPI:

uv run hatch publish --repo test

Publish to PyPI:

uv run hatch publish

Clean up:

rm -r dist/

Alternatively, run the publish script:

python publish.py

Note: For Hatch publishing, you'll need to configure your PyPI credentials in ~/.pypirc or use environment variables.

PyPI Credentials Setup

Create a .pypirc file in your home directory with your PyPI credentials:

[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = pypi-your-main-api-token-here

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-your-test-api-token-here

Important Notes:

- Use __token__ as the username for API tokens
- Get your tokens from PyPI and Test PyPI
- Set file permissions: chmod 600 ~/.pypirc

Alternative: Environment Variables

export HATCH_INDEX_USER=__token__
export HATCH_INDEX_AUTH=your-pypi-token

Project Structure

- minsearch/: Main package directory
  - minsearch.py: Core Index implementation using scikit-learn
  - append.py: AppendableIndex implementation with inverted index
  - vector.py: VectorSearch implementation using cosine similarity
- tests/: Test suite
- minsearch_example.ipynb: Example notebook
- pyproject.toml: Package configuration and dependencies
- uv.lock: Locked dependency versions

Note: The minsearch.py file in the root directory is maintained for backward compatibility with the LLM Zoomcamp course.
