iscc-search

Warning

This project is in early development and not ready for production use.

The API and features are subject to significant changes. Use at your own risk.

A high-performance similarity search engine for variable-length binary ISCC codes, built around fast approximate nearest neighbor search.

Features

  • Fast approximate nearest neighbor search (ANNS) for variable-length binary vectors
  • Custom NPHD (Normalized Prefix Hamming Distance) metric optimized for ISCC codes
  • Support for 64-256 bit vectors (8-32 bytes)
  • Built on usearch with JIT-compiled Numba metrics
  • Cross-platform support (Linux, macOS, Windows)
  • Python 3.10-3.13 support

What is ISCC?

The International Standard Content Code (ISCC) is a similarity-preserving content identifier for digital media. ISCC codes are variable-length binary vectors that enable efficient similarity search across different media types. This library provides a specialized vector database for storing and querying ISCC codes at scale.

Installation

pip install iscc-search

For development installation:

git clone https://github.com/iscc/iscc-search.git
cd iscc-search
uv sync

Quick Start

from iscc_search import NphdIndex
import numpy as np

# Create index for up to 256-bit vectors
index = NphdIndex(max_dim=256)

# Add some binary vectors with integer keys
vectors = [
    np.array([18, 52, 86, 120], dtype=np.uint8),  # 32-bit vector
    np.array([171, 205, 239], dtype=np.uint8),  # 24-bit vector
    np.array([17, 34, 51, 68, 85], dtype=np.uint8),  # 40-bit vector
]
keys = [1, 2, 3]
index.add(keys, vectors)

# Search for similar vectors
query = np.array([18, 52, 86, 121], dtype=np.uint8)
matches = index.search(query, k=2)

print(f"Found {len(matches.keys)} matches")
print(f"Keys: {matches.keys}")
print(f"Distances: {matches.distances}")
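
To build intuition for what the query above is asking, the sketch below ranks the same three vectors by brute force in plain Python. It assumes NPHD is the Hamming distance over the shared prefix divided by the prefix length in bits, as described under Technical Details. The helpers nphd_sketch and brute_force_search are hypothetical and not part of iscc-search, and plain bytes stand in for the np.uint8 arrays. Because the index is approximate, its results are not guaranteed to match an exhaustive scan.

```python
def nphd_sketch(a: bytes, b: bytes) -> float:
    """Hamming distance over the common prefix, divided by its length in bits.

    This is an assumed reading of the NPHD description, not the library's code.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a[:n], b[:n]))
    return differing / (n * 8)


def brute_force_search(query: bytes, stored: dict[int, bytes], k: int) -> list[tuple[int, float]]:
    """Return the k (key, distance) pairs closest to the query by exhaustive scan."""
    scored = [(key, nphd_sketch(query, vec)) for key, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Same data as the Quick Start, with bytes standing in for np.uint8 arrays
stored = {
    1: bytes([18, 52, 86, 120]),
    2: bytes([171, 205, 239]),
    3: bytes([17, 34, 51, 68, 85]),
}
query = bytes([18, 52, 86, 121])

for key, distance in brute_force_search(query, stored, k=2):
    print(key, distance)
```

Under this assumed definition, key 1 ranks first because it differs from the query in a single bit out of 32.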

API Overview

NphdIndex

The main index class for ANNS with variable-length binary vectors.

NphdIndex(max_dim=256, **kwargs)
  • max_dim: Maximum vector dimension in bits (default: 256)
  • **kwargs: Additional arguments passed to usearch Index

Methods

  • add(keys, vectors): Add vectors with integer keys
  • search(query, k): Search for k nearest neighbors
  • get(keys): Retrieve vectors by keys
  • remove(keys): Remove vectors by keys

Development

This project uses uv for package management and poethepoet for task automation.

Prerequisites

  • Python 3.10 to 3.13 (the supported range listed under Features)
  • uv package manager

Available Commands

uv run poe format-code      # Format Python code with ruff
uv run poe format-markdown  # Format markdown files
uv run poe format           # Format all files
uv run poe test             # Run tests with coverage (100% coverage required)
uv run poe precommit        # Run pre-commit hooks
uv run poe all              # Format and test

Running Tests

# Run all tests with coverage
uv run poe test

# Run specific test
uv run pytest tests/test_nphd.py::test_pad_vectors

# Run tests in watch mode (stock pytest has no --watch flag; this needs a plugin that provides it)
uv run pytest --watch

Technical Details

NPHD Metric

The Normalized Prefix Hamming Distance (NPHD) is a valid metric specifically designed for variable-length prefix-compatible codes like ISCC. It normalizes the Hamming distance by the length of the common prefix, enabling meaningful similarity comparisons between vectors of different lengths.

Unlike standard Hamming distance, NPHD:

  • Correctly handles variable-length comparisons
  • Normalizes over common prefix length
  • Satisfies all metric axioms (non-negativity, identity, symmetry, triangle inequality)
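
One way to write the description above as a formula is given below. This is a reading of the prose in this README, not an authoritative definition: the symbols a, b, and p are introduced here for illustration, with |a| and |b| denoting lengths in bits and a_i the i-th bit.

```latex
\[
\operatorname{NPHD}(a, b) = \frac{1}{p} \sum_{i=1}^{p} \left[\, a_i \neq b_i \,\right],
\qquad p = \min(|a|, |b|)
\]

% Worked example with the Quick Start data:
% query [18, 52, 86, 121] vs. stored [18, 52, 86, 120]
% both are 32 bits long, so p = 32; only the last bit differs
\[
\operatorname{NPHD} = \frac{1}{32} = 0.03125
\]
```

When the two vectors differ in length, only the first p bits of the longer one take part in the comparison.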

Binary Vector Format

Vectors are stored as packed binary arrays (np.uint8) with an internal length prefix:

  • Each vector is prefixed with a length byte
  • Vectors are padded to uniform size for efficient indexing
  • pad_vectors() and unpad_vectors() handle conversions automatically
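
The layout described above can be sketched for a single vector as follows. This is an illustration under stated assumptions, not the library's pad_vectors() and unpad_vectors(): the function names are hypothetical, the length byte is assumed to count bytes (not bits), and the padding is assumed to be trailing zeros up to max_bytes.

```python
def pad_vector_sketch(vector: bytes, max_bytes: int = 32) -> bytes:
    """Prefix the vector with its byte length and zero-pad to max_bytes + 1 bytes."""
    if not 0 < len(vector) <= max_bytes:
        raise ValueError(f"vector must be 1..{max_bytes} bytes, got {len(vector)}")
    padding = bytes(max_bytes - len(vector))  # bytes(n) yields n zero bytes
    return bytes([len(vector)]) + vector + padding


def unpad_vector_sketch(padded: bytes) -> bytes:
    """Read the length byte and return only the original payload."""
    length = padded[0]
    return padded[1 : 1 + length]


# A 24-bit vector becomes a fixed-size 33-byte record (1 length byte + 32 data bytes)
original = bytes([171, 205, 239])
padded = pad_vector_sketch(original)
restored = unpad_vector_sketch(padded)
print(len(padded), padded[0], restored == original)
```

Storing the length alongside the data is what lets a fixed-size index entry still be compared over only the bits that were actually supplied.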

Custom usearch Build

This project depends on custom usearch 2.21.0 wheels, built per platform and hosted at iscc.github.io, to ensure consistent behavior across platforms.

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please ensure:

  • All tests pass (uv run poe test)
  • Code is formatted (uv run poe format)
  • Coverage remains at 100%
  • Changes are documented

See CONTRIBUTING.md for details.


Repository initiated with fpgmaas/cookiecutter-uv.
