feat: add metadata filters (project, session_id, git_branch) to search#63

Open
jwk2601 wants to merge 1 commit into obra:main from jwk2601:feat/metadata-search-filters
@jwk2601 jwk2601 commented Feb 24, 2026

Summary

The exchanges table already has indexed columns for project, session_id, and git_branch, but the search API only exposes time-based filters (after/before). This PR adds metadata filtering to both the MCP tool and the CLI, enabling project-specific and branch-specific conversation searches.

Motivation: Multi-project users frequently need to scope search results to a specific project or git branch. The data and indexes already exist in the database — this change simply exposes them through the existing search interfaces.

Changes

src/search.ts

  • Extended SearchOptions interface with project, session_id, git_branch
  • Added validateMetadataFilter() — regex validation + length check, matching the existing validateISODate() pattern
  • Generalized the timeFilter array into a filters array used to build the WHERE clause
  • Added 3x over-fetch for vector search when metadata filters are active (vec0 virtual table applies KNN before WHERE post-filter)
  • Added session_id, git_branch to SELECT columns and result mapping
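
A minimal sketch of what a regex-plus-length validator like validateMetadataFilter() might look like. The PR text does not show the actual pattern, maximum length, or function signature, so `METADATA_PATTERN`, `MAX_METADATA_LENGTH`, and the `fieldName` parameter below are assumptions for illustration only.

```typescript
// Hypothetical limits: the PR does not state the real pattern or length cap.
const METADATA_PATTERN = /^[A-Za-z0-9._\/-]+$/;
const MAX_METADATA_LENGTH = 256;

// Throws if the value is empty, too long, or contains characters outside
// the allow-list. Quotes, semicolons, and spaces are all rejected, which is
// what makes later string interpolation into SQL tolerable.
function validateMetadataFilter(value: string, fieldName: string): void {
  if (value.length === 0 || value.length > MAX_METADATA_LENGTH) {
    throw new Error(
      `Invalid ${fieldName}: length must be between 1 and ${MAX_METADATA_LENGTH}`
    );
  }
  if (!METADATA_PATTERN.test(value)) {
    throw new Error(`Invalid ${fieldName}: contains disallowed characters`);
  }
}
```

An allow-list pattern is used here rather than a block-list, so anything not explicitly permitted fails validation.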

src/mcp-server.ts

  • Added Zod schema entries for the new parameters (each with .min(1) and .optional())
  • Added JSON schema entries in ListToolsRequestSchema handler
  • Passed new parameters through in both single and multi-concept search paths

src/search-cli.ts

  • Added --project, --session-id, --git-branch CLI flags
  • Updated help text with new options and examples

test/integration.test.ts

  • Added describe('Metadata Filtering') block with 9 tests:
    • Filter by project / non-matching project returns empty / filter by git_branch / filter by session_id
    • Combined metadata + time filters
    • SQL injection rejection / overly long input rejection
    • Vector search mode with metadata filter
    • Backward compatibility (no filter = same results)

Design decisions

  • Input validation: Follows the existing pattern of regex validation + string interpolation (same as validateISODate). A full parameterized query refactor is out of scope for this PR.
  • Over-fetch factor (3x): Conservative multiplier to compensate for vec0's KNN-first behavior when metadata filters reduce the result set. The results are trimmed back to the requested limit after filtering.
  • Exact match only: All three filters use = (exact match). Prefix/fuzzy matching can be added in a follow-up if needed.
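
The over-fetch reasoning can be illustrated with a small in-memory simulation. This is only a sketch: `Row`, `knnThenFilter`, and the `factor` parameter are hypothetical names, and the sort-and-slice step stands in for vec0's KNN stage rather than reproducing it.

```typescript
interface Row {
  id: number;
  project: string;
  distance: number;
}

const OVERFETCH_FACTOR = 3; // the multiplier described in the PR

// Simulates "KNN first, metadata filter second": take the k nearest rows,
// apply an exact-match filter on project, then trim to the requested limit.
function knnThenFilter(
  rows: Row[],
  limit: number,
  project?: string,
  factor: number = OVERFETCH_FACTOR
): Row[] {
  // Over-fetch only when a metadata filter is active.
  const k = project ? limit * factor : limit;
  const nearest = [...rows].sort((a, b) => a.distance - b.distance).slice(0, k);
  const filtered = project ? nearest.filter((r) => r.project === project) : nearest;
  return filtered.slice(0, limit);
}
```

If the nearest `limit` rows all belong to another project, a factor of 1 returns nothing, while a larger factor can still surface matches further down the ranking. A fixed multiplier cannot guarantee `limit` results when matching rows are rare.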

Backward compatibility

All new parameters are optional. When omitted, behavior is identical to the current version — the filters array remains empty and produces the same SQL as the previous timeFilter approach.
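
The backward-compatibility claim can be checked against a small sketch of the clause builder. The interpolation mirrors the filters code quoted in the review of src/search.ts; the wrapper function `buildFilterClause` and the `FilterOptions` interface are hypothetical, and values are assumed to have been validated already.

```typescript
interface FilterOptions {
  after?: string;
  before?: string;
  project?: string;
  session_id?: string;
  git_branch?: string;
}

// Builds the trailing "AND ..." fragment appended to the search query.
// Assumes every value has already passed validation, since values are
// interpolated directly rather than bound as parameters.
function buildFilterClause(opts: FilterOptions): string {
  const filters: string[] = [];
  if (opts.after) filters.push(`e.timestamp >= '${opts.after}'`);
  if (opts.before) filters.push(`e.timestamp <= '${opts.before}'`);
  if (opts.project) filters.push(`e.project = '${opts.project}'`);
  if (opts.session_id) filters.push(`e.session_id = '${opts.session_id}'`);
  if (opts.git_branch) filters.push(`e.git_branch = '${opts.git_branch}'`);
  // With no options set, the clause is empty and the SQL is unchanged.
  return filters.length > 0 ? `AND ${filters.join(' AND ')}` : '';
}
```

With only `after` or `before` set, the output matches what a time-only filter would have produced, which is the property the PR relies on.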

Test plan

  • npm run build — compiles without errors
  • npm test — all new tests pass; no regressions in existing tests
  • Existing tests with no filters behave identically

Summary by CodeRabbit

Release Notes

  • New Features
    • Added metadata filtering to search: users can now filter results by project name, session ID, and git branch.
    • Search results now include session ID and git branch information for better context and traceability.
    • CLI now supports --project, --session-id, and --git-branch filter options for refined queries.
    • Input validation ensures filter values meet safety requirements.

The exchanges table already has indexed columns for project, session_id,
and git_branch, but the search API only exposes time-based filters
(after/before). This adds metadata filtering to enable project-specific
and branch-specific searches.

Changes:
- SearchOptions: add project, session_id, git_branch fields
- search.ts: extend WHERE clause for both vector and text search
- search.ts: validate metadata inputs (regex + length check)
- search.ts: over-fetch 3x for vector search with metadata filters
  (vec0 applies KNN before WHERE post-filter)
- search.ts: include session_id, git_branch in SELECT and result mapping
- mcp-server.ts: add Zod schema + JSON schema for new parameters
- search-cli.ts: add --project, --session-id, --git-branch flags
- integration.test.ts: 9 new tests for metadata filtering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot commented Feb 24, 2026

Walkthrough

This pull request adds optional metadata filtering capabilities (project, session_id, git_branch) across the system stack. The filters are introduced in the MCP server schemas, CLI arguments, and search implementation, with corresponding validation and database query integration.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MCP Server Schemas: src/mcp-server.ts | Added optional filter fields (project, session_id, git_branch) to SearchInputSchema and ListToolsRequestSchema; passed through these fields in CallToolRequestSchema handling for both multi-concept and single-concept search paths. |
| CLI Interface: src/search-cli.ts | Added CLI argument parsing for --project, --session-id, and --git-branch flags with corresponding internal variables; updated help text and extended options passed to search functions. |
| Search Implementation: src/search.ts | Introduced metadata filter validation via validateMetadataFilter function; extended query construction with combined filterClause for metadata and time constraints; adjusted vector search logic with effectiveK over-fetch strategy for metadata filtering; added session_id and git_branch to SELECT statements and result mapping as sessionId/gitBranch fields. |
| Test Suite: test/integration.test.ts | Exported SearchOptions type and added comprehensive "Metadata Filtering" test suite covering project, git_branch, session_id filtering, combinations with time filters, input validation, and vector/text search modes. |

Sequence Diagram

sequenceDiagram
    actor CLI
    participant MCP as MCP Server
    participant Search as Search Engine
    participant DB as Database
    
    CLI->>MCP: search with metadata filters<br/>(project, session_id, git_branch)
    MCP->>MCP: validate filter fields<br/>against schema
    MCP->>Search: searchConversations(queries,<br/>options with filters)
    Search->>Search: validateMetadataFilter<br/>for each filter
    Search->>Search: construct combined<br/>filterClause with<br/>metadata + time constraints
    Search->>DB: SELECT with filterClause<br/>+ effectiveK (limit × 3)
    DB-->>Search: rows with session_id,<br/>git_branch metadata
    Search->>Search: post-filter results<br/>to requested limit
    Search->>Search: map session_id/git_branch<br/>to sessionId/gitBranch
    Search-->>MCP: filtered results with<br/>metadata fields
    MCP-->>CLI: formatted results

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A whisker-twitch of joy!
New filters bloom like clover in the search—
Project, branch, and session dance as one,
With validation guards and queries well-merged.
The data finds its way, precise and bright! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding metadata filters (project, session_id, git_branch) to search functionality across multiple files. |

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
src/search-cli.ts (1)

69-74: Guard against missing values for new flags.

If a user forgets the value, the parser will silently consume the next flag as the value (or leave the variable undefined when the flag is the last argument). Consider validating the next token and exiting with a helpful error (the same guard can optionally apply to other flags too).

Suggested tweak
  } else if (arg === '--project') {
-    project = args[++i];
+    const value = args[++i];
+    if (!value || value.startsWith('--')) {
+      console.error('Missing value for --project');
+      process.exit(1);
+    }
+    project = value;
  } else if (arg === '--session-id') {
-    sessionId = args[++i];
+    const value = args[++i];
+    if (!value || value.startsWith('--')) {
+      console.error('Missing value for --session-id');
+      process.exit(1);
+    }
+    sessionId = value;
  } else if (arg === '--git-branch') {
-    gitBranch = args[++i];
+    const value = args[++i];
+    if (!value || value.startsWith('--')) {
+      console.error('Missing value for --git-branch');
+      process.exit(1);
+    }
+    gitBranch = value;
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/search-cli.ts` around lines 69 - 74, The flag parsing for --project,
--session-id, and --git-branch (variables project, sessionId, gitBranch) can
consume a missing value or the next flag; update the argument-parsing logic to
validate the next token after each flag exists and is not another flag (e.g.,
not starting with '-') before assigning args[++i]; if validation fails, print a
clear error like "Missing value for --project/--session-id/--git-branch" and
exit (or throw) so the CLI fails fast; apply the same guard pattern used
elsewhere in the parser loop so other flags are protected too.
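
One way to avoid repeating the guard three times is a small helper. This is a sketch, not code from the PR: `requireValue` and `parseMetadataFlags` are hypothetical names, and the helper throws instead of calling process.exit so the caller decides how to fail.

```typescript
interface MetadataFlags {
  project?: string;
  sessionId?: string;
  gitBranch?: string;
}

// Returns the token at `index`, or throws if it is missing or looks like
// another long flag (same check as the suggested tweak above).
function requireValue(args: string[], index: number, flag: string): string {
  const value = args[index];
  if (!value || value.startsWith('--')) {
    throw new Error(`Missing value for ${flag}`);
  }
  return value;
}

// Parses only the three metadata flags; other arguments are ignored here.
function parseMetadataFlags(args: string[]): MetadataFlags {
  const flags: MetadataFlags = {};
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg === '--project') {
      flags.project = requireValue(args, ++i, arg);
    } else if (arg === '--session-id') {
      flags.sessionId = requireValue(args, ++i, arg);
    } else if (arg === '--git-branch') {
      flags.gitBranch = requireValue(args, ++i, arg);
    }
  }
  return flags;
}
```

A CLI entry point could wrap the call in try/catch, print the message with console.error, and exit with a non-zero status.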
src/search.ts (1)

58-65: Prefer parameterized filters over string interpolation.

Even with regex validation, binding parameters is more robust and future-proof (e.g., if validation rules ever loosen).

Possible refactor (parameterized filters)
-  const filters: string[] = [];
-  if (after) filters.push(`e.timestamp >= '${after}'`);
-  if (before) filters.push(`e.timestamp <= '${before}'`);
-  if (project) filters.push(`e.project = '${project}'`);
-  if (session_id) filters.push(`e.session_id = '${session_id}'`);
-  if (git_branch) filters.push(`e.git_branch = '${git_branch}'`);
-  const filterClause = filters.length > 0 ? `AND ${filters.join(' AND ')}` : '';
+  const filters: string[] = [];
+  const filterParams: string[] = [];
+  if (after) { filters.push('e.timestamp >= ?'); filterParams.push(after); }
+  if (before) { filters.push('e.timestamp <= ?'); filterParams.push(before); }
+  if (project) { filters.push('e.project = ?'); filterParams.push(project); }
+  if (session_id) { filters.push('e.session_id = ?'); filterParams.push(session_id); }
+  if (git_branch) { filters.push('e.git_branch = ?'); filterParams.push(git_branch); }
+  const filterClause = filters.length > 0 ? `AND ${filters.join(' AND ')}` : '';

-    results = stmt.all(
-      Buffer.from(new Float32Array(queryEmbedding).buffer),
-      effectiveK
-    );
+    results = stmt.all(
+      Buffer.from(new Float32Array(queryEmbedding).buffer),
+      effectiveK,
+      ...filterParams
+    );

-    const textResults = textStmt.all(`%${query}%`, `%${query}%`, limit);
+    const textResults = textStmt.all(`%${query}%`, `%${query}%`, ...filterParams, limit);

Also applies to: 93-100, 124-130

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/search.ts` around lines 58 - 65, The code builds SQL filterClause via
string interpolation using variables (after, before, project, session_id,
git_branch) stored in the filters array and filterClause — change this to use
parameterized bindings: instead of embedding values into filters, push condition
templates like "e.timestamp >= $1" (or "?" depending on your DB client) and
collect corresponding values into a params array, then join conditions into
filterClause and pass params to the query execution; apply the same refactor to
the other similar blocks mentioned (the filters usage around the sections that
build filters at 93-100 and 124-130) so all dynamic values are bound rather than
interpolated.
ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6feaa5b and 31e1998.

⛔ Files ignored due to path filters (4)
  • dist/mcp-server.js is excluded by !**/dist/**
  • dist/search-cli.js is excluded by !**/dist/**
  • dist/search.d.ts is excluded by !**/dist/**
  • dist/search.js is excluded by !**/dist/**
📒 Files selected for processing (4)
  • src/mcp-server.ts
  • src/search-cli.ts
  • src/search.ts
  • test/integration.test.ts
