
Add SQL filter compilation for properties JSON#1291

Open
edwinyyyu wants to merge 2 commits into MemMachine:main from edwinyyyu:sql_json_properties

Conversation

@edwinyyyu
Contributor

@edwinyyyu edwinyyyu commented Apr 1, 2026

Purpose of the change

Needed for the SQLAlchemySegmentStore and SQLite VectorStore implementations.

Description

Built on top of #1302 for easier rebasing, though it does not depend on it.

Adds a unified, well-tested way to encode/decode properties as SQL JSON, avoiding repeated reimplementation.

Consumers of existing behavior are modified to use raw JSON encoding; they may be moved to properties JSON in the future if appropriate. In particular, existing consumers of compile_sql_filter do not handle datetimes: raw JSON has no way to disambiguate colliding types, whereas the new method uses a discriminated union.
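As a concrete sketch, a discriminated-union encoding tags each value with its type so that, for example, a datetime and a string that happens to look like one cannot collide. The tag names (`t`, `v`) and the type set below are illustrative assumptions, not the PR's actual schema:

```python
import json
from datetime import datetime, timezone

def encode_property(value):
    """Encode a value as a type-tagged dict (hypothetical tag names)."""
    if isinstance(value, bool):  # bool before int: isinstance(True, int) is True
        return {"t": "bool", "v": value}
    if isinstance(value, int):
        return {"t": "int", "v": value}
    if isinstance(value, float):
        return {"t": "float", "v": value}
    if isinstance(value, str):
        return {"t": "str", "v": value}
    if isinstance(value, datetime):
        return {"t": "datetime", "v": value.isoformat()}
    raise TypeError(f"Unsupported property type: {type(value)!r}")

def decode_property(tagged):
    """Invert encode_property, restoring datetimes from ISO strings."""
    if tagged["t"] == "datetime":
        return datetime.fromisoformat(tagged["v"])
    return tagged["v"]

# A datetime and its ISO string survive a JSON round trip as distinct types.
dt = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
assert decode_property(json.loads(json.dumps(encode_property(dt)))) == dt
assert decode_property(encode_property(dt.isoformat())) == dt.isoformat()
```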

Type of change

Breaking: filters will no longer ignore unknown keys.

Treating unknown keys as None has confusing and potentially dangerous semantics, especially with more complex filters involving NOT, AND, and OR. New behavior: if a filter is meant to be sent to multiple memories simultaneously, then all memories must support the fields it references.

It might be acceptable if the old behavior obeyed SQL three-valued logic (3VL), but there is no case where a filter on one memory's field should affect the filter for another memory.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g., code style improvements, linting)
  • Documentation update
  • Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
  • Security (improves security without changing functionality)

How Has This Been Tested?

  • Unit Test
  • Integration Test
  • End-to-end Test
  • Test Script (please provide)
  • Manual verification (list step-by-step instructions)

Checklist


  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

@edwinyyyu edwinyyyu force-pushed the sql_json_properties branch from 6ccc211 to 7715ed4 on April 1, 2026 22:02
@edwinyyyu edwinyyyu marked this pull request as draft April 1, 2026 22:38
@edwinyyyu
Contributor Author

edwinyyyu commented Apr 1, 2026

Need to support system-defined metadata too. Will combine with existing compile_sql_filter.

@edwinyyyu edwinyyyu force-pushed the sql_json_properties branch from 7715ed4 to a9e83e5 on April 2, 2026 00:19
@edwinyyyu edwinyyyu marked this pull request as ready for review April 2, 2026 00:20
@edwinyyyu edwinyyyu changed the title Add SQL JSON properties utilities Add SQL filter compilation for properties JSON Apr 2, 2026
@edwinyyyu
Contributor Author

Discovered client tests were wrong all along -- filters silently failed before. Need to fix.

@edwinyyyu edwinyyyu force-pushed the sql_json_properties branch from 842325b to 631cc96 on April 2, 2026 05:21
@edwinyyyu edwinyyyu modified the milestones: v0.3.5, v0.3.4 Apr 2, 2026
@edwinyyyu edwinyyyu force-pushed the sql_json_properties branch 3 times, most recently from cea64ef to 6d3fd90 on April 6, 2026 20:30
@edwinyyyu edwinyyyu requested a review from jealous April 6, 2026 22:38
@edwinyyyu edwinyyyu force-pushed the sql_json_properties branch from 6d3fd90 to e1e6dd6 on April 6, 2026 22:43
Signed-off-by: Edwin Yu <edwinyyyu@gmail.com>
Signed-off-by: Edwin Yu <edwinyyyu@gmail.com>
@edwinyyyu edwinyyyu force-pushed the sql_json_properties branch from e1e6dd6 to 496802d on April 6, 2026 23:59
@sscargal
Contributor

sscargal commented Apr 7, 2026

Breaking Changes should be delayed or moved to 0.4.0, IMHO. We're not strictly following SemVer and are currently using MINOR versions for breaking changes.

@sscargal
Contributor

sscargal commented Apr 7, 2026

Reviewed these changes with Copilot.

What Changed

The breaking change is in compile_sql_filter and all _resolve_field callbacks across the codebase.

The Core Change: Unknown Filter Fields Now Raise Instead of Silently Returning None

Before — unknown filter fields were silently ignored. The resolver returned (None, False), and compile_sql_filter would swallow it, treating a filter on an unknown field as if it didn't exist (returning None):

# resolve_field returned (None, False) for unknown fields
def _resolve_episode_field(field: str) -> tuple[Any, bool] | tuple[None, bool]:
    ...
    return None, False  # ← unknown field: silently ignored

# compile_sql_filter then treated None as "no filter"
def compile_sql_filter(expr, resolve_field) -> ColumnElement[bool] | None:
    if isinstance(expr, And):
        left = compile_sql_filter(expr.left, resolve_field)
        right = compile_sql_filter(expr.right, resolve_field)
        if left is None:   # ← if one side was unknown, just skip it
            return right
        if right is None:
            return left
        return and_(left, right)

After — unknown fields raise ValueError, and compile_sql_filter always returns a valid expression (never None):

# resolve_field now raises for unknown fields
def _resolve_episode_field(field: str) -> tuple[ColumnElement, FieldEncoding]:
    ...
    raise ValueError(f"Unknown filter field: {field!r}")  # ← explicit error

# compile_sql_filter now always returns a clause, never None
def compile_sql_filter(expr, resolve_field) -> ColumnElement[bool]:
    if isinstance(expr, And):
        return and_(
            compile_sql_filter(expr.left, resolve_field),
            compile_sql_filter(expr.right, resolve_field),
        )

Before vs. After: Concrete Code Example

Scenario: Filtering on an unknown field

# BEFORE: No error. The unknown field is silently dropped.
# If you have two memories and one doesn't know about "category",
# the filter just vanishes — you get ALL records instead of an error.

results = memory.search(
    "What is my profession?",
    filter_dict={"category": "profession"},  # ← "category" unknown to this memory
    limit=10,
)
# Silently returns everything — dangerous, especially with NOT/AND/OR!
# e.g. NOT(unknown_field = x) would have matched ALL rows

# AFTER: Raises ValueError immediately — no silent data leak.

results = memory.search(
    "What is my profession?",
    filter_dict={"category": "profession"},  # ← unknown field
    limit=10,
)
# Raises: ValueError: Unknown filter field: 'category'

The integration tests in the PR demonstrate this directly:

# BEFORE: Test verified NO error occurred
await semantic_storage.get_feature_set(
    filter_expr=_expr("missing_column IN (foo)"),
)

# AFTER: Test now asserts an error IS raised
with pytest.raises(ValueError, match="Unknown filter field"):
    await semantic_storage.get_feature_set(
        filter_expr=_expr("missing_column IN (foo)"),
    )

Does the Breaking Change Require API Changes?

Yes — callers using filter_dict with unqualified or unrecognized field names will break. The PR already includes one such API-visible fix in the integration test client:

# BEFORE: bare key
filter_dict={"time": "morning"}
filter_dict={"category": "profession"}

# AFTER: must be prefixed with "m." to target the JSON metadata column
filter_dict={"m.time": "morning"}
filter_dict={"m.category": "profession"}

The m. (or metadata.) prefix is required for user-defined metadata fields. Without it, the resolver doesn't recognize the field and now raises a ValueError instead of silently passing through.
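A minimal sketch of this strict dispatch (the real resolvers return SQLAlchemy column expressions, and the system-field set shown here is an assumed subset for illustration):

```python
SYSTEM_FIELDS = {"created_at", "tag"}  # illustrative subset of known system fields

def resolve_field(field: str) -> tuple[str, str]:
    """Return a (target, encoding) pair, raising on unrecognized fields."""
    if field in SYSTEM_FIELDS:
        return field, "column"
    for prefix in ("m.", "metadata."):
        if field.startswith(prefix):
            return field[len(prefix):], "json"
    raise ValueError(f"Unknown filter field: {field!r}")

assert resolve_field("m.time") == ("time", "json")
assert resolve_field("created_at") == ("created_at", "column")
# resolve_field("time") raises ValueError instead of being silently dropped
```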

Summary of Required API Changes for Callers

  • Filter on a user metadata field: before {"time": "morning"}, after {"m.time": "morning"}. Action: add the m. prefix.
  • Filter on an unknown field shared across memories: before, silently ignored; after, ValueError raised. Action: fix or remove the filter.
  • NOT/AND/OR with an unknown field: before, unpredictably dropped; after, ValueError raised. Action: fix the field name.
  • Filter on known system fields (created_at, tag, etc.): works before and after. No change needed.

The change is externally visible to any code that builds filter_dict queries, whether through the Python SDK or the REST API's filter parameters. Any filter using a bare key for metadata (without the m. prefix) will now fail at runtime instead of silently misbehaving.

@sscargal
Contributor

sscargal commented Apr 7, 2026

Given this is a breaking change, we need to ensure all downstream components are updated:

  • Update the Python Client SDK code (#1305)
  • Update the TypeScript Client SDK code
  • Update examples
  • Update documentation
  • Ensure all SDK docstrings contain the updated information so the auto-generated openapi.json used by the documentation is updated and accurate (#1306)
  • Server, client, examples, and documentation must be merged in the same release to avoid breaking front-end/back-end compatibility

@sscargal sscargal removed this from the v0.3.4 milestone Apr 7, 2026
@sscargal sscargal added this to the v0.4.0 milestone Apr 7, 2026

Copilot AI left a comment


Pull request overview

Adds a unified SQLAlchemy filter compiler that can target both raw JSON fields and a new type-tagged “properties JSON” encoding, and updates storage backends/tests to treat unknown filter fields as errors (breaking behavior change).

Changes:

  • Introduce properties_json encode/decode helpers for type-tagged JSON storage (incl. timezone-preserving datetimes).
  • Expand compile_sql_filter to support "column", "json", and "properties_json" field encodings and to error on unknown fields.
  • Update SQLAlchemy + Neo4j semantic storage and tests to use the new resolver contract and enforce unknown-field errors.
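The timezone preservation mentioned above can be had from Python's own ISO 8601 round trip; this is a sketch of the idea, not necessarily the PR's exact serialization:

```python
from datetime import datetime, timedelta, timezone

# isoformat()/fromisoformat() keep the UTC offset, so an aware datetime
# survives being stored as a JSON string.
tz = timezone(timedelta(hours=-7))
dt = datetime(2026, 4, 6, 15, 30, tzinfo=tz)
encoded = dt.isoformat()  # '2026-04-06T15:30:00-07:00'
decoded = datetime.fromisoformat(encoded)
assert decoded == dt
assert decoded.utcoffset() == timedelta(hours=-7)
```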

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Per-file summary:

  • packages/server/src/memmachine_server/semantic_memory/storage/sqlalchemy_pgvector_semantic.py: updates filter field resolution to return (ColumnElement, FieldEncoding) and raise on unknown fields.
  • packages/server/src/memmachine_server/semantic_memory/storage/neo4j_semantic_storage.py: adds unknown-field validation for filters via a _KNOWN_FIELDS allowlist.
  • packages/server/src/memmachine_server/common/vector_store/vector_store.py: adds a config abstract property to VectorStoreCollection and fixes a docstring return type.
  • packages/server/src/memmachine_server/common/properties_json.py: new module for type-tagged JSON property encoding/decoding with datetime timezone preservation.
  • packages/server/src/memmachine_server/common/filter/sql_filter_util.py: reworks SQLAlchemy filter compilation with field encodings and typed JSON support; now raises on unknown fields.
  • packages/server/src/memmachine_server/common/episode_store/episode_sqlalchemy_store.py: updates the episode filter resolver to the new (ColumnElement, FieldEncoding) contract and raises on unknown fields.
  • packages/server/server_tests/memmachine_server/semantic_memory/storage/test_semantic_storage.py: updates semantic storage tests to expect errors on unknown filter fields.
  • packages/server/server_tests/memmachine_server/semantic_memory/storage/in_memory_semantic_storage.py: aligns in-memory filter behavior to raise on unknown fields.
  • packages/server/server_tests/memmachine_server/common/test_properties_json.py: adds unit tests for properties-JSON encode/decode round trips and error cases.
  • packages/server/server_tests/memmachine_server/common/filter/test_sql_filter_util.py: expands filter compiler tests for raw JSON and properties JSON (SQLite + PostgreSQL).
  • packages/client/client_tests/test_integration_complete.py: updates client integration tests to use m.-prefixed metadata keys in filters.


Comment on lines +78 to +86

def _cast_json_value(
    column: ColumnElement,
    value: bool | float | str,
) -> ColumnElement:
    """Cast a raw JSON path element based on the Python value type."""
    if isinstance(value, bool):
        return column.as_boolean()
    if isinstance(value, int):
        return column.as_integer()

Copilot AI Apr 7, 2026


In _cast_json_value, the value parameter type annotation excludes int even though the implementation has an isinstance(value, int) branch. This makes the helper’s typing misleading and may cause type-checking issues; include int in the accepted union (and keep bool checked first to avoid bool-as-int confusion).
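The pitfall behind the ordering advice is easy to demonstrate in plain Python: `bool` is a subclass of `int`, so an `int` check that runs first also captures booleans. The functions below are a hypothetical illustration, not the PR's code:

```python
assert isinstance(True, int)  # bool is a subclass of int

def cast_kind(value):
    """Wrong order: the int branch captures booleans first."""
    if isinstance(value, int):
        return "integer"
    if isinstance(value, bool):  # unreachable for True/False
        return "boolean"
    return "string"

def cast_kind_fixed(value):
    """Correct order: check bool before int."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    return "string"

assert cast_kind(True) == "integer"        # misclassified
assert cast_kind_fixed(True) == "boolean"  # correct
```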

Comment on lines +149 to +158

if isinstance(expr, In):
    if not expr.values:
        return false()
    first_value = expr.values[0]
    type_name = PROPERTY_TYPE_TO_PROPERTY_TYPE_NAME[type(first_value)]
    value_path = column[PROPERTY_VALUE_KEY]
    type_check = column[PROPERTY_TYPE_KEY].as_string() == type_name
    if isinstance(first_value, int):
        return and_(type_check, value_path.as_integer().in_(expr.values))
    return and_(type_check, value_path.as_string().in_(expr.values))

Copilot AI Apr 7, 2026


_compile_properties_json_leaf handles IN by casting everything that isn’t an int as a string, and it treats bool as int due to isinstance(True, int). This can produce incorrect SQL (e.g., boolean values cast as integers, and any future float/datetime IN values compared as strings). Consider (1) explicitly handling bool before int with .as_boolean(), and (2) validating/raising for unsupported IN element types rather than silently falling back to string casting.
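The suggested fix can be sketched as a dispatch that checks bool before int and rejects unsupported element types outright. The function below only names the cast to apply; it is a hypothetical illustration, not the PR's implementation:

```python
def in_cast_for(first_value):
    """Pick the JSON cast for an IN clause, raising on unsupported types."""
    if isinstance(first_value, bool):  # must precede the int check
        return "as_boolean"
    if isinstance(first_value, int):
        return "as_integer"
    if isinstance(first_value, str):
        return "as_string"
    raise ValueError(f"Unsupported IN element type: {type(first_value).__name__}")

assert in_cast_for(True) == "as_boolean"  # no longer treated as an integer
assert in_cast_for(7) == "as_integer"
```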

Comment on lines +1105 to +1116

_KNOWN_FIELDS: frozenset[str] = frozenset(
    {
        "set_id",
        "set",
        "semantic_category_id",
        "category_name",
        "category",
        "tag_id",
        "tag",
        "feature",
        "feature_name",
        "value",

Copilot AI Apr 7, 2026


_KNOWN_FIELDS includes several alias/synonym names (e.g. set, semantic_category_id, tag_id, feature_name) that don’t correspond to properties created on Feature nodes (which use set_id, category_name, tag, feature). Because _resolve_field_reference returns f"{alias}.{field}" for these, filters using the aliases will silently evaluate against missing properties and return no matches instead of behaving like other backends. Map these aliases to the actual Neo4j property names (or remove them and raise ValueError for unsupported aliases) to keep filter semantics consistent across storages.
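One way to keep the aliases while fixing the semantics is an explicit alias table mapped to the actual node property names. The mapping below is a hypothetical sketch inferred from the comment, not verified against the Neo4j schema:

```python
# Hypothetical alias -> actual Feature-node property mapping.
_FIELD_ALIASES = {
    "set": "set_id",
    "category": "category_name",
    "tag_id": "tag",
    "feature_name": "feature",
}

def resolve_neo4j_field(field: str, alias: str = "f") -> str:
    """Map filter-field synonyms onto real node properties before qualifying."""
    canonical = _FIELD_ALIASES.get(field, field)
    return f"{alias}.{canonical}"

assert resolve_neo4j_field("feature_name") == "f.feature"
assert resolve_neo4j_field("tag") == "f.tag"
```

Alternatively, dropping the aliases and raising ValueError would keep behavior consistent with the other backends.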

@edwinyyyu
Contributor Author

edwinyyyu commented Apr 8, 2026

Given this is a breaking change, we need to ensure all downstream components are updated:

* [ ]  Update the Python Client SDK code ([Update client libraries to support strict filter field validation and metadata prefixing for PR #1291 #1305](https://github.com/MemMachine/MemMachine/issues/1305))

* [ ]  Update the Typescript Client SDK code

* [ ]  Update Examples

* [ ]  Update documentation

* [ ]  Ensure all SDK Docstrings contain the updated information so the auto-generated `openapi.json` used by the documentation is successfully updated and accurate ([Update OpenAPI documentation and docstrings for strict filter field and prefixing changes (PR #1291) #1306](https://github.com/MemMachine/MemMachine/issues/1306))

* [ ]  Server, Client, Examples, and Documentation must be merged in the same Release to avoid breaking the front-end and back-end (causing incompatibility issues)

The TypeScript client itself needs no updates, but users may need to update their filters. However, the original logic was unintuitive (inconsistent with any common boolean logic system), and filters do not appear anywhere in the documentation, so we can assume anyone depending on the old logic was depending on undefined behavior. The filter is submitted to the server as a string.

More changes may be necessary to bring the client and server to a good state, so I am not confident documentation changes would be worthwhile at the moment.

The Python client does not even support the filter expressions, according to Claude. Only the TypeScript client supports them.
