Conversation

@jairad26
Contributor

@jairad26 jairad26 commented Oct 23, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • This PR adds support for discovering new metadata keys during log flushing for local Chroma
  • New functionality
    • ...

Test plan

How are these changes tested?

Added e2e test support to run against distributed, local, and single-node Chroma to ensure metadata discovery and related tests run.

  • [x] Tests pass locally with pytest for Python, yarn test for JS, cargo test for Rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Contributor Author

jairad26 commented Oct 23, 2025

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from b139cb0 to 6162b25 Compare October 23, 2025 18:15
@jairad26 jairad26 force-pushed the jai/local-schema-support branch from 9ff8b57 to f2c2971 Compare October 23, 2025 18:15
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 23, 2025
@jairad26 jairad26 force-pushed the jai/local-schema-support branch from f2c2971 to edc07ff Compare October 23, 2025 20:08
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from 6162b25 to 25cc1e2 Compare October 23, 2025 20:08
@jairad26 jairad26 force-pushed the jai/local-schema-support branch from edc07ff to c2e5338 Compare October 23, 2025 20:11
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from 25cc1e2 to 5f6b983 Compare October 23, 2025 20:11
@jairad26 jairad26 force-pushed the jai/local-schema-support branch from c2e5338 to 67ee627 Compare October 23, 2025 20:43
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch 3 times, most recently from fee4f7e to c8d1659 Compare October 23, 2025 22:33
@jairad26 jairad26 force-pushed the jai/local-schema-support branch 2 times, most recently from 5e2d880 to a0e7c9c Compare October 23, 2025 23:17
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch 2 times, most recently from 9fdfe9a to b1db218 Compare October 23, 2025 23:25
@jairad26 jairad26 force-pushed the jai/local-schema-support branch 2 times, most recently from 9c47af4 to 7e9e084 Compare October 23, 2025 23:46
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch 2 times, most recently from 3ff589e to ac13295 Compare October 23, 2025 23:58
@jairad26 jairad26 force-pushed the jai/local-schema-support branch from 7e9e084 to 387c681 Compare October 23, 2025 23:58
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from ac13295 to 038041f Compare October 24, 2025 00:22
@jairad26 jairad26 force-pushed the jai/local-schema-support branch from 387c681 to 17e078b Compare October 24, 2025 00:22
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from 038041f to d2242b5 Compare October 24, 2025 06:37
@jairad26 jairad26 force-pushed the jai/local-schema-support branch 2 times, most recently from 32d4b95 to 1ce0bbb Compare October 24, 2025 06:48
@jairad26 jairad26 marked this pull request as ready for review October 24, 2025 18:54
@propel-code-bot
Contributor

propel-code-bot bot commented Oct 24, 2025

Dynamic Schema Discovery & Persistence During Local Compaction

Adds end-to-end support for discovering previously unseen metadata keys when processing local log compaction. rust/segment now detects new keys while flushing logs, mutates an in-memory Schema, and returns an ApplyLogsOutcome that signals if the schema changed. rust/log persists that new schema (when the collection already had a schema) and keeps metadata/segment updates in sync. The PR also hardens schema rules (blocks keys starting with CHROMA_), refactors error types, bumps deps (adds serde_json), and updates the large Python E2E suite to cope with conditional SPANN behaviour.

Key Changes

• Extended the SqliteMetadataWriter.apply_logs() signature: it now takes a schema: Option<Schema> parameter and returns ApplyLogsOutcome { schema_update }
• Introduced the ApplyLogsOutcome struct and the helper ensure_schema_for_update_value(), which auto-populates the Schema on new-key discovery while filtering system keys (CHROMA_KEY); see the sketch after this list
• Added update_collection_schema() method in SqliteMetadataWriter for in-place schema persistence (includes JSON serialization error handling)
• Updated LocalCompactionManager logic: tracks whether a schema was already persisted, writes logs, updates schema inside same backfill flow
• Strengthened Schema.ensure_key_from_metadata() guard against special keys; added CHROMA_KEY constant
• Cargo: added serde_json to chroma-segment
• Large test-suite rewrite (chromadb/test/api/test_schema_e2e.py) to adapt to the new conditional schema-persistence paths; removes the SPANN skips in favor of runtime checks
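
The following is an illustrative sketch of the discovery path these bullets describe, pieced together from the code excerpts quoted in the review comments below; the exact signatures, the UpdateMetadataValue parameter type, and the free-function form are assumptions, not the merged implementation:

```rust
// Sketch only: returns true when the key was added to the schema, i.e. the
// schema changed and must be persisted after the flush. Keys carrying the
// CHROMA_ system prefix (the CHROMA_KEY constant) are never added.
fn ensure_schema_for_update_value(
    schema: &mut Option<Schema>,
    key: &str,
    value: &UpdateMetadataValue, // assumed parameter type
) -> bool {
    if key.starts_with(CHROMA_KEY) {
        return false;
    }
    if let Some(schema_mut) = schema.as_mut() {
        if let Ok(metadata_value) = MetadataValue::try_from(value) {
            return schema_mut.ensure_key_from_metadata(key, metadata_value.value_type());
        }
    }
    false
}
```

apply_logs() reports any such modification back to its caller through ApplyLogsOutcome { schema_update } (an Option, per the excerpt quoted below), which is what lets LocalCompactionManager decide whether to persist the updated schema (see the sketch near the end of this thread).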

Affected Areas

• rust/segment/src/sqlite_metadata.rs
• rust/log/src/local_compaction_manager.rs
• rust/types/src/collection_schema.rs
• Python e2e schema tests (chromadb/test/api/test_schema_e2e.py)
• rust/segment/Cargo.toml dependency list

This summary was automatically generated by @propel-code-bot

@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from 38172c2 to 09a866f Compare October 24, 2025 23:19
@jairad26 jairad26 force-pushed the jai/local-schema-support branch from bee55b4 to 296af1b Compare October 24, 2025 23:19
Comment on lines +318 to +317
```rust
if let Some(schema_mut) = schema.as_mut() {
    if let Ok(metadata_value) = MetadataValue::try_from(value) {
```

[BestPractice]

Error handling gap: MetadataValue::try_from(value) failure is silently ignored in schema discovery logic. If the conversion fails, the key won't be added to the schema even though it might be valid metadata that should be discoverable.

```rust
if let Ok(metadata_value) = MetadataValue::try_from(value) {
    return schema_mut
        .ensure_key_from_metadata(key, metadata_value.value_type());
} else {
    // Log the conversion failure for debugging - using standard tracing pattern
    tracing::warn!(key = %key, "Failed to convert metadata value during schema discovery");
}
```

This could lead to inconsistent schema discovery where some valid metadata keys are missed due to conversion issues. The enhanced logging follows tracing best practices with structured fields.

File: rust/segment/src/sqlite_metadata.rs, line 319

@jairad26 jairad26 force-pushed the jai/local-schema-support branch from 296af1b to dd3f346 Compare October 27, 2025 17:44
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from 09a866f to 67f3870 Compare October 27, 2025 17:44
Comment on lines +989 to +991
```rust
runtime.block_on(sqlite_seg_writer.apply_logs(
    data,
    metadata_seg_id,
    test_data
        .collection_and_segments
        .collection
        .schema
        .clone(),
    &mut *tx,
))
.expect("Should be able to apply logs");
```

[BestPractice]

The apply_logs method signature change adds a new schema parameter, but in the test cases you're passing the collection's schema from test_data. However, the production code in local_compaction_manager.rs passes either the collection's schema or None based on schema_previously_persisted. This inconsistency could lead to different behavior between tests and production.

Consider aligning the test calls with the production logic:

```rust
runtime.block_on(sqlite_seg_writer.apply_logs(
    data,
    metadata_seg_id,
    if schema_previously_persisted {
        test_data.collection_and_segments.collection.schema.clone()
    } else {
        None
    },
    &mut *tx,
))
```

File: rust/segment/src/sqlite_metadata.rs, line 999

Comment on lines 228 to 232
```rust
metadata_writer
    .update_collection_schema(
        collection_and_segments.collection.collection_id,
        &updated_schema,
    )
    .await
    .map_err(|_| CompactionManagerError::MetadataApplyLogsFailed)?;
```

[BestPractice]

Missing error handling for schema update: If update_collection_schema() fails (lines 228-234), the error is converted to MetadataApplyLogsFailed which loses important context about what specifically went wrong during schema persistence. This could make debugging schema-related issues very difficult.

```rust
// Consider more specific error handling
metadata_writer
    .update_collection_schema(
        collection_and_segments.collection.collection_id,
        &updated_schema,
    )
    .await
    .map_err(|e| CompactionManagerError::SchemaUpdateFailed(e))?;
```

File: rust/log/src/local_compaction_manager.rs, line 234

@jairad26 jairad26 force-pushed the jai/local-schema-support branch from dd3f346 to df8c8c8 Compare October 27, 2025 19:47
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from 67f3870 to a43db01 Compare October 27, 2025 19:47
@jairad26 jairad26 changed the base branch from jai/local-schema-support to graphite-base/5728 October 27, 2025 22:49
@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from a43db01 to 2dd65c5 Compare October 27, 2025 22:49
@jairad26 jairad26 changed the base branch from graphite-base/5728 to main October 27, 2025 22:50
Comment on lines +382 to +381
```rust
for (key, value) in meta.iter() {
    if Self::ensure_schema_for_update_value(&mut schema, key, value) {
        schema_modified = true;
    }
}
```

[PerformanceOptimization]

Potential performance issue: The ensure_schema_for_update_value function is called for every key-value pair in every metadata update within the log processing loop. For large batches with many metadata fields, this could create significant overhead.

Consider batching schema updates or using a HashSet to track which keys have already been processed to avoid redundant schema checks. For better performance, pre-size the HashSet if you can estimate the number of unique keys:

```rust
// Pre-size for better performance
let mut processed_keys = HashSet::with_capacity(estimated_unique_keys);
for (key, value) in meta.iter() {
    if !processed_keys.contains(key) {
        if Self::ensure_schema_for_update_value(&mut schema, key, value) {
            schema_modified = true;
            processed_keys.insert(key.clone());
        }
    }
}
```

Note: HashSet operations are O(1) average case, but the overhead of hashing and set operations may not be beneficial for very small key sets. Consider the trade-off based on typical batch sizes in your use case.

File: rust/segment/src/sqlite_metadata.rs, line 386

@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch 3 times, most recently from ed52a09 to f9c8123 Compare October 28, 2025 17:56
Comment on lines +334 to +342
```diff
+    schema: Option<Schema>,
     tx: &mut C,
-) -> Result<(), SqliteMetadataError>
+) -> Result<ApplyLogsOutcome, SqliteMetadataError>
 where
     for<'connection> &'connection mut C: sqlx::Executor<'connection, Database = sqlx::Sqlite>,
 {
     if logs.is_empty() {
-        return Ok(());
+        return Ok(ApplyLogsOutcome {
+            schema_update: None,
+        });
     }
+    let mut schema = schema;
```

[CriticalError]

The changed apply_logs method signature adds a schema parameter but doesn't handle errors when the schema is None and metadata processing tries to access it. In ensure_schema_for_update_value, if schema.as_mut() returns None, the function silently returns false, but the caller expects schema discovery to work.

Consider adding error handling or initialization:

```rust
let mut schema = schema.unwrap_or_else(|| Schema::new());
```

This ensures schema discovery can always function properly during log flushing.

File: rust/segment/src/sqlite_metadata.rs, line 345

@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch 2 times, most recently from 8f4f55e to 7826dad Compare October 28, 2025 18:22
```rust
    .sysdb
    .get_collection_with_segments(message.collection_id)
    .await?;
let schema_previously_persisted = collection_and_segments.collection.schema.is_some();
```

[CriticalError]

Potential Race Condition in Schema State Check: The schema_previously_persisted boolean is determined before schema reconciliation, but the schema could be modified during reconcile_schema_with_config(). This creates a potential inconsistency where:

  1. Schema starts as None (schema_previously_persisted = false)
  2. reconcile_schema_with_config() creates/modifies schema
  3. Later logic uses the stale schema_previously_persisted value

This could lead to schema updates being skipped when they should be persisted.

```rust
// Current approach - schema state checked before reconciliation
let schema_previously_persisted = collection_and_segments.collection.schema.is_some();
collection_and_segments
    .collection
    .reconcile_schema_with_config(KnnIndex::Hnsw)?;

// Later usage may be inconsistent with actual schema state
if schema_previously_persisted { /* ... */ }
```

Consider checking schema state after reconciliation or using a different approach to determine when schema updates should be persisted.

File: rust/log/src/local_compaction_manager.rs, line 143

jairad26 (Contributor, Author) replied:

This is intentional: if there's no schema_str found in the DB, we shouldn't use and modify it in compaction.
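
For readers following the thread, here is a minimal sketch of the gating being described, reconstructed from the excerpts quoted earlier; the surrounding transaction handling, the outcome binding, and the exact error mapping are assumptions rather than the merged code:

```rust
// Check whether a schema row already existed for this collection before compaction.
let schema_previously_persisted = collection_and_segments.collection.schema.is_some();

// Only hand an existing schema to apply_logs; otherwise pass None so compaction
// never creates or modifies one (the intentional behaviour described above).
let schema_arg = if schema_previously_persisted {
    collection_and_segments.collection.schema.clone()
} else {
    None
};

let outcome = metadata_writer
    .apply_logs(data, metadata_seg_id, schema_arg, &mut *tx)
    .await
    .map_err(|_| CompactionManagerError::MetadataApplyLogsFailed)?;

// Persist newly discovered keys only when a schema was already stored.
if schema_previously_persisted {
    if let Some(updated_schema) = outcome.schema_update {
        metadata_writer
            .update_collection_schema(
                collection_and_segments.collection.collection_id,
                &updated_schema,
            )
            .await
            .map_err(|_| CompactionManagerError::MetadataApplyLogsFailed)?;
    }
}
```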

@jairad26 jairad26 force-pushed the jai/add-new-schema-keys-local-compaction branch from 7826dad to 5bc00ac Compare October 28, 2025 18:44
@jairad26 jairad26 enabled auto-merge (squash) October 28, 2025 19:00
@jairad26 jairad26 merged commit a1ea81a into main Oct 28, 2025
62 checks passed