@thewebscraping commented Nov 29, 2025

Standardize Query DSL and Enhance Adapter Architecture

Summary

This PR introduces a comprehensive Query DSL system with universal operator support across all vector database backends, enhances adapter capabilities, and improves the engine's document retrieval logic.

Key Changes

🎯 Query DSL System

  • New Q Class: Introduced composable query objects supporting 8 universal operators ($eq, $ne, $gt, $gte, $lt, $lte, $in, $nin)
  • Backend Compilers: Implemented compiler system for AstraDB, ChromaDB, Milvus, and PgVector
  • Universal Format: Standardized filter format that compiles to backend-specific syntax
  • Nested Field Support: Added field__subfield__operator syntax for nested metadata queries
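
To make the lookup syntax concrete, here is a minimal sketch of how a composable query object could turn keyword lookups into the universal dict format. It is a toy illustration, not the crossvector implementation: the `to_dict` method, the `$and`/`$or` connector keys, and the dot-joined nested field names are all assumptions made for this example.

```python
from __future__ import annotations

from typing import Any

# The eight universal operators listed above, without the "$" prefix.
OPERATORS = {"eq", "ne", "gt", "gte", "lt", "lte", "in", "nin"}


class Q:
    """Toy composable query node (hypothetical, for illustration only)."""

    def __init__(self, **lookups: Any) -> None:
        self.lookups = lookups
        self.connector = "$and"
        self.children: list[Q] = []

    def _combine(self, other: Q, connector: str) -> Q:
        # Build a parent node that holds both operands.
        node = Q()
        node.connector = connector
        node.children = [self, other]
        return node

    def __and__(self, other: Q) -> Q:
        return self._combine(other, "$and")

    def __or__(self, other: Q) -> Q:
        return self._combine(other, "$or")

    def to_dict(self) -> dict[str, Any]:
        if self.children:
            return {self.connector: [child.to_dict() for child in self.children]}
        out: dict[str, Any] = {}
        for key, value in self.lookups.items():
            parts = key.split("__")
            # A trailing operator name is optional; plain fields mean equality.
            op = parts.pop() if len(parts) > 1 and parts[-1] in OPERATORS else "eq"
            # Assumption: nested fields are joined with dots.
            field = ".".join(parts)
            out.setdefault(field, {})[f"${op}"] = value
        return out
```

With this sketch, `Q(info__lang__in=["en"]).to_dict()` yields `{"info.lang": {"$in": ["en"]}}`, which a backend compiler could then translate into its native filter syntax.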

🔧 Adapter Enhancements

PgVector

  • Added nested JSONB path support using #>> operator for dot-notation queries
  • Implemented numeric casting ::numeric for type-safe comparisons
  • Enhanced WHERE clause generation with proper SQL escaping
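
As a rough illustration of the points above, the sketch below compiles a flat universal filter into a parameterized WHERE fragment that uses `#>>` for nested paths and `::numeric` for numeric values. It is not the adapter's actual compiler: the function name `compile_where`, the `metadata` column name, and the use of bound parameters (rather than escaping) are assumptions, and it only handles string and numeric values.

```python
from typing import Any

# Comparison operators that map one-to-one onto SQL.
SQL_OPS = {"$eq": "=", "$ne": "<>", "$gt": ">", "$gte": ">=", "$lt": "<", "$lte": "<="}


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def compile_where(where: dict[str, Any], column: str = "metadata") -> tuple[str, list[Any]]:
    """Return (sql_fragment, params) for a flat universal filter (hypothetical helper)."""
    clauses: list[str] = []
    params: list[Any] = []
    for field, conditions in where.items():
        # "a.b" becomes the text[] literal "{a,b}" expected by #>>.
        path = "{" + ",".join(field.split(".")) + "}"
        for op, value in conditions.items():
            sample = value[0] if op in ("$in", "$nin") and value else value
            # `column` is a trusted identifier chosen by the caller, not user input.
            lhs = f"({column} #>> %s::text[])"
            if _is_number(sample):
                lhs += "::numeric"  # #>> returns text, so cast before comparing
            if op == "$in":
                clauses.append(f"{lhs} = ANY(%s)")
            elif op == "$nin":
                clauses.append(f"{lhs} <> ALL(%s)")
            else:
                clauses.append(f"{lhs} {SQL_OPS[op]} %s")
            params.extend([path, value])
    return " AND ".join(clauses), params
```

The returned fragment and parameter list are meant to be passed to a driver such as psycopg, which binds the `%s` placeholders and adapts Python lists to PostgreSQL arrays.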

All Adapters

  • Added capability flags: supports_metadata_only, REQUIRES_VECTOR, SUPPORTS_NESTED
  • Integrated where compilers for automatic Q/dict to native format conversion
  • Improved error messages with structured exceptions
  • Added similarity score injection in search results

🚀 Engine Improvements

  • Enhanced get_or_create: Multi-step lookup strategy (pk → vector similarity → metadata)
  • Enhanced update_or_create: Improved logic with separate create_defaults support
  • Better Embedding Management: Automatic re-embedding on text updates
  • Type Safety: Added proper type hints and validation
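
The multi-step lookup order (pk → vector similarity → metadata) can be sketched against a plain in-memory store. Everything here is hypothetical: the `Doc` dataclass, the dict-based store, the cosine threshold of 0.99, and the exact-match metadata rule are assumptions chosen to keep the example small, and the engine's real matching rules may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    id: str
    vector: list[float]
    metadata: dict[str, Any] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def get_or_create(store: dict[str, Doc], doc: Doc, threshold: float = 0.99) -> tuple[Doc, bool]:
    """Return (document, created) using a pk -> vector -> metadata lookup."""
    # Step 1: primary-key lookup.
    if doc.id in store:
        return store[doc.id], False
    # Step 2: most similar stored vector, accepted only above the threshold.
    best = max(store.values(), key=lambda d: cosine(d.vector, doc.vector), default=None)
    if best is not None and cosine(best.vector, doc.vector) >= threshold:
        return best, False
    # Step 3: exact metadata match, skipped when no metadata was given.
    if doc.metadata:
        for existing in store.values():
            if existing.metadata == doc.metadata:
                return existing, False
    # Nothing matched: insert the new document.
    store[doc.id] = doc
    return doc, True
```

Checking the cheapest and most specific key first means the vector and metadata scans only run when the primary key is unknown.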

⚙️ Configuration

  • Simplified settings keys (e.g., CHROMA_HOST instead of CHROMA_HTTP_HOST)
  • Added VECTOR_DIM setting for default embedding dimension
  • Standardized all adapter environment variable names

🧪 Testing

  • Added comprehensive Query DSL test suite (test_querydsl.py, test_compilers.py, test_adapters_where.py)
  • Updated engine tests for new get_or_create/update_or_create behavior
  • Improved mock adapters for better test coverage

📚 Documentation

  • Added docs/querydsl.md - Complete Query DSL guide
  • Added docs/architecture.md - System design and patterns
  • Updated docs/adapters/databases.md - Backend-specific capabilities
  • Updated all docs to reflect new API and features

🗑️ Cleanup

  • Removed deprecated scripts/tests/ directory (old integration test scripts)
  • Removed legacy querydsl/filters/ system
  • Consolidated utility functions

Breaking Changes

Configuration Keys

  • CHROMA_HTTP_HOST → CHROMA_HOST
  • CHROMA_HTTP_PORT → CHROMA_PORT
  • CHROMA_CLOUD_TENANT → CHROMA_TENANT
  • CHROMA_CLOUD_DATABASE → CHROMA_DATABASE
  • MILVUS_USER + MILVUS_PASSWORD → MILVUS_API_KEY
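
For deployments with many environments, the renames above could be applied with a small helper along these lines. This is a hypothetical sketch, not part of this PR: `migrate_env` is an illustrative name, and because the PR does not say how a Milvus user/password pair maps to an API key, the helper only reports those legacy keys instead of converting them.

```python
import os
from typing import MutableMapping

# One-to-one renames from the list above.
RENAMED = {
    "CHROMA_HTTP_HOST": "CHROMA_HOST",
    "CHROMA_HTTP_PORT": "CHROMA_PORT",
    "CHROMA_CLOUD_TENANT": "CHROMA_TENANT",
    "CHROMA_CLOUD_DATABASE": "CHROMA_DATABASE",
}
# Keys replaced by MILVUS_API_KEY; these need a manual decision.
REMOVED = ("MILVUS_USER", "MILVUS_PASSWORD")


def migrate_env(env: MutableMapping[str, str]) -> list[str]:
    """Copy old keys to their new names; return legacy keys needing manual action."""
    for old, new in RENAMED.items():
        # Never overwrite a value that was already set under the new name.
        if old in env and new not in env:
            env[new] = env[old]
    if "MILVUS_API_KEY" in env:
        return []
    return [key for key in REMOVED if key in env]


if __name__ == "__main__":
    leftover = migrate_env(os.environ)
    if leftover:
        print("Set MILVUS_API_KEY manually; legacy keys found:", ", ".join(leftover))
```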

API Changes

  • where parameter now accepts Q objects or universal dict format
  • Search results now include score in metadata when available
  • Adapter class names: ChromaDBAdapter → ChromaAdapter, MilvusDBAdapter → MilvusAdapter, PGVectorAdapter → PgVectorAdapter

Migration Guide

Updating Configuration

# Old
os.environ["CHROMA_HTTP_HOST"] = "localhost"
os.environ["MILVUS_USER"] = "user"
os.environ["MILVUS_PASSWORD"] = "pass"

# New
os.environ["CHROMA_HOST"] = "localhost"
os.environ["MILVUS_API_KEY"] = "token"  # or leave empty for local

Using Query DSL

# Old - dict with implicit $eq
results = engine.search(where={"category": "tech", "year": 2024})

# New - Q objects with explicit operators
from crossvector.querydsl import Q

results = engine.search(where=Q(category="tech") & Q(year__gte=2024))
# Or still use universal dict format
results = engine.search(where={"category": {"$eq": "tech"}, "year": {"$gte": 2024}})

Adapter Imports

# Old
from crossvector.dbs.chroma import ChromaDBAdapter
from crossvector.dbs.milvus import MilvusDBAdapter
from crossvector.dbs.pgvector import PGVectorAdapter

# New
from crossvector.dbs.chroma import ChromaAdapter
from crossvector.dbs.milvus import MilvusAdapter
from crossvector.dbs.pgvector import PgVectorAdapter

Testing

All tests passing:

  • ✅ Unit tests: 62/62 passing
  • ✅ Query DSL tests: All compilers validated
  • ✅ Engine tests: Updated for new behavior
  • ✅ Backend compatibility: Tested with all 4 databases
pytest tests/
# 62 passed

Backend Compatibility Matrix

Feature AstraDB ChromaDB Milvus PgVector
Metadata-only search
Nested metadata ⚠️ Flattened
All 8 operators
Numeric comparisons ✅ (with casting)
Vector similarity

Documentation

Complete documentation available:

Related Issues

Closes #[issue-number] (if applicable)

Checklist

  • Tests added/updated
  • Documentation updated
  • Breaking changes documented
  • All tests passing
  • Code formatted with ruff
  • Type hints added
  • Pre-commit hooks passing

- Remove wrapper types (UpsertRequest, SearchRequest, VectorStatus, UpsertInput)
- Engine methods now return ABC types directly (VectorDocument instead of dicts)
- Add helper methods: create_from_texts, upsert_from_texts, update_from_texts
- Remove types.py - replace DocumentIds with Union[str, Sequence[str]]
- Remove unused functions: normalize_documents, extract_unique_query
- Remove Document and normalize_documents from public API exports
- Add utils helpers: normalize_texts, normalize_metadatas, normalize_pks
- Enhanced search with offset and where filtering across all adapters
- Remove unique_fields parameter (only used by 1 of 4 adapters)
- Add collection management: add_collection, get_collection, get_or_create_collection
- Updated Quick Start examples to use create_from_texts() helper
- Added PRIMARY_KEY_MODE configuration docs
- Fixed test fixtures to return dict with texts/metadatas/pks
- Updated all test methods to use new API (no more wrapper types)
- Removed test_flexible_input.py (tests removed internal functions)
- Added missing ABC methods to MockDBAdapter (create, get_or_create, update, update_or_create)
- Fixed normalize_pks() to pad list with None values
- Fixed VectorDocument construction to use 'id' parameter instead of 'pk'

All engine tests now pass with the simplified API.
- Rename VectorDocument class (backward compat alias maintained)
- Remove SearchRequest/UpsertRequest wrappers - use direct method calls
- Add private _vector attribute with emb property
- Move generate_pk and helpers from schema to utils
- Reorganize utils.py into logical sections
- Update all docs to reflect new API and PK generation modes
- Fix integration tests to use new engine methods
- Delete obsolete test_schema.py
@thewebscraping changed the title from "Refactor: Simplify API and Reorganize Core Components" to "refactor: standardize logging, exceptions, and settings across adapters" Nov 29, 2025
@thewebscraping force-pushed the standard-dbs branch 8 times, most recently from 6c361b0 to 6d21ca6 November 29, 2025 20:16
- Add Logger class with configurable LOG_LEVEL setting
- Replace all module loggers with Logger class
- Use specific exceptions from exceptions.py instead of generic ValueError/Exception
- Replace os.getenv with direct api_settings access
- Update all adapters (astradb, chroma, milvus, pgvector) with consistent patterns
- Add Query DSL with Q class supporting 8 universal operators
- Implement backend-specific compilers for all databases
- Enhance PgVector with nested JSONB and numeric casting
- Add capability flags to adapters
- Improve get_or_create/update_or_create with multi-step lookup
- Standardize configuration settings
- Add comprehensive Query DSL test suite
- Remove deprecated test scripts
@thewebscraping changed the title from "refactor: standardize logging, exceptions, and settings across adapters" to "Standardize Query DSL and Enhance Adapter Architecture" Nov 30, 2025
@thewebscraping changed the title from "Standardize Query DSL and Enhance Adapter Architecture" to "feat: standardize query dsl and enhance adapter architecture" Nov 30, 2025
@thewebscraping changed the title from "feat: standardize query dsl and enhance adapter architecture" to "feat: Standardize Query DSL and Enhance Adapter Architecture" Nov 30, 2025
- Create tests/searches/ directory for backend integration tests
- Move test_search_*.py to tests/searches/test_*.py for clarity
- Add comprehensive README.md documenting test structure and requirements
- Add __init__.py with package documentation
- Tests now organized by functionality (searches) rather than mixed with unit tests
- Add scripts/tests/ with real backend integration tests
- Add tests/mock/ with in-memory adapter for DSL testing
- Fix Milvus operator mapping (IN/NOT IN uppercase)
- Document opt-in test strategy in README.md
- Remove deprecated tests/searches/ directory
Version 0.1.3 (2025-11-30):
- Test infrastructure reorganization (scripts/tests/ + tests/mock/)
- Query DSL improvements (Milvus operator fix)
- CI/CD updates (unit tests only in GitHub Actions)
- Documentation enhancements (README integration test guide)
- Bug fixes (fixture imports, unused variables)

Version 0.1.2 (2025-11-23):
- Refactor design with architecture improvements
- Enhanced Query DSL design patterns
- Improved adapter interface consistency

Bump version: 0.1.0 -> 0.1.3
Resolved conflict in pyproject.toml:
- Keep version 0.1.3 from standard-dbs branch
- Main branch had version 0.1.1
- Renamed tests/mock/test_common_mock.py → tests/test_querydsl_operators.py
- Moved InMemoryAdapter and fixtures to tests/conftest.py for global availability
- Removed tests/mock/ directory (no longer needed)
- Fixed fixture import issues that were causing pytest to hang
- All 77 unit tests passing (33% coverage)
@thewebscraping merged commit 7fb504d into main Nov 30, 2025
4 checks passed