feat: constraints #679
Conversation
- replace '|' separator with record separator (\x1E) to avoid collisions
- implement escaping mechanism: double the separator if it appears in data
- add validation for malformed splits with graceful error handling
- update tests to cover edge cases with separator characters
- remove backward compatibility as not needed
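A minimal sketch of the doubling-based escaping described in the commit message above (the function names and the strict error behavior are illustrative, not the PR's actual helpers):

```python
SEP = "\x1e"  # ASCII record separator: very unlikely to occur in real data

def merge_values(values: list[str]) -> str:
    # escape literal separators by doubling them, then join with single separators
    return SEP.join(v.replace(SEP, SEP * 2) for v in values)

def split_merged(merged: str, n_columns: int) -> list[str]:
    # scan left to right: a doubled separator is an escaped literal,
    # a single separator is a field boundary
    parts: list[str] = []
    buf: list[str] = []
    i = 0
    while i < len(merged):
        if merged[i] == SEP and i + 1 < len(merged) and merged[i + 1] == SEP:
            buf.append(SEP)  # escaped separator -> literal character
            i += 2
        elif merged[i] == SEP:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(merged[i])
            i += 1
    parts.append("".join(buf))
    if len(parts) != n_columns:
        raise ValueError(f"malformed merged value: expected {n_columns} fields, got {len(parts)}")
    return parts

# round-trips values containing both the old '|' separator and the new one
assert split_merged(merge_values(["a|b", f"c{SEP}d"]), 2) == ["a|b", f"c{SEP}d"]
```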
```python
key = "|".join(columns)
hash_suffix = hashlib.md5(key.encode()).hexdigest()[:8]
columns_str = "_".join(col.upper() for col in columns)
return f"__TABULAR_CONSTRAINT_{prefix}_{columns_str}_{hash_suffix}__"
```
why not just __CONSTRAINT_ rather than __TABULAR_CONSTRAINT_?
the initial thought was to be specific about the model, as well. Doesn't have to be, and might be meaningless in possible future constraint types (e.g. cross table).
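For context, wrapping the snippet above into a standalone function (the wrapper name and the `FC` prefix are made up for illustration) shows what the generated internal column names look like:

```python
import hashlib

def constraint_column_name(prefix: str, columns: list[str]) -> str:
    key = "|".join(columns)
    hash_suffix = hashlib.md5(key.encode()).hexdigest()[:8]
    columns_str = "_".join(col.upper() for col in columns)
    return f"__TABULAR_CONSTRAINT_{prefix}_{columns_str}_{hash_suffix}__"

print(constraint_column_name("FC", ["city", "zip"]))
# -> __TABULAR_CONSTRAINT_FC_CITY_ZIP_<8 hex chars>__
```

The md5 suffix is computed on the original column names, so it presumably keeps names unique even when upper-casing would make two different column sets collide.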
```python
)


class FixedCombinationHandler(ConstraintHandler):
```
should we separate out constraints into a dedicated directory, with each constraint type being defined in their own file?
why not. With only 2 constraint types it's quite compact. Once we add a few more, I can see the benefit of separate modules.
mostlyai/sdk/domain.py (Outdated)
```python
# presumably not initialized yet, so we skip this validation
continue

if isinstance(typed_constraint, FixedCombination):
```
can't we generalize this? i.e. can't the constraint handler itself register that method?
maybe the easiest would be to add a validation method as part of an interface, which would be the base for FixedCombination & Inequality
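A sketch of that idea (class and method names are hypothetical): each constraint type implements a common validation hook, and the model validator just dispatches, with no `isinstance` chains.

```python
from abc import ABC, abstractmethod

class ConstraintValidation(ABC):
    """Hypothetical shared interface for FixedCombination and Inequality."""

    @abstractmethod
    def validate_against_table(self, table_columns: set[str], idx: int) -> None:
        """Raise ValueError if the constraint does not fit the table schema."""

class FixedCombination(ConstraintValidation):
    def __init__(self, columns: list[str]):
        self.columns = columns

    def validate_against_table(self, table_columns: set[str], idx: int) -> None:
        missing = [c for c in self.columns if c not in table_columns]
        if missing:
            raise ValueError(f"constraint #{idx}: unknown columns {missing}")

# the pydantic model validator then collapses to a generic loop:
#   for idx, constraint in enumerate(constraints):
#       constraint.validate_against_table(table_columns, idx)
```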
mostlyai/sdk/domain.py (Outdated)
```python
return self


def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
```
this feels like it should belong to the same file, where we defined the FixedCombination constraint
yes, good point! 👍🏼
```python
return self


def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
```
as mentioned, this feels out of place here.
tools/model.py (Outdated)
```python
return self


def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
```
as mentioned, this feels out of place here
```python
col.model_encoding_type in datetime_encodings
for col in table.columns
if col.name in {self.low_column, self.high_column}
)
```
Empty generator in all() incorrectly returns True
Medium Severity
The _is_datetime detection uses all() on a generator expression that filters columns by name. If no columns match the filter (i.e., neither low_column nor high_column is found in table.columns), the generator is empty and all([]) returns True. This would incorrectly set _is_datetime = True for numeric constraints, causing the handler to treat numeric data as datetime and apply incorrect epoch-based transformations. While validation typically catches missing columns, this is still a logical error that could cause data corruption if validation is bypassed.
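One possible guard (a sketch using the names from the hunk above, not the PR's actual fix): materialize the matching columns first and require that both were found.

```python
matched = [
    col for col in table.columns
    if col.name in {self.low_column, self.high_column}
]
# all() over an empty sequence is vacuously True, so insist on finding
# both columns before checking their encoding types
self._is_datetime = len(matched) == 2 and all(
    col.model_encoding_type in datetime_encodings for col in matched
)
```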
```python
return ["_RARE_"] * len(self.columns)
try:
    values = json.loads(merged_value)
    return [str(v) if v is not None else "" for v in values]
```
NA values converted to empty strings in round-trip
Medium Severity
In to_original, the split_row function converts None values (which represent original NA/null values in the JSON-serialized data) to empty strings "" instead of preserving them as proper null values. The expression str(v) if v is not None else "" loses null information. When categorical columns contain NA values, they will become empty strings after the round-trip transformation, causing data loss. The same issue exists on line 67 where all columns get empty strings when merged_value is NA.
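A null-preserving variant might look like this (a sketch; `pd.NA` is one possible missing-value marker, the actual codebase may use another):

```python
import json
import pandas as pd

def split_row(merged_value, n_columns: int):
    if pd.isna(merged_value):
        return [pd.NA] * n_columns  # keep a missing merged value missing
    values = json.loads(merged_value)
    # preserve JSON null as a real missing value instead of ""
    return [str(v) if v is not None else pd.NA for v in values]

print(split_row(None, 2))           # [<NA>, <NA>]
print(split_row('["a", null]', 2))  # ['a', <NA>]
```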
```python
return ["_RARE_"] * len(self.columns)
try:
    values = json.loads(merged_value)
    return [str(v) if v is not None else "" for v in values]
```
Missing element count validation causes crash on malformed data
Medium Severity
The split_row function in to_original parses JSON values but doesn't validate that the number of elements matches len(self.columns). If the synthetic model generates a malformed merged value with fewer elements than expected (e.g., '["a"]' when 3 columns are expected), the resulting split_df will have fewer columns. When the loop accesses split_df[i] where i exceeds the actual column count, a KeyError is raised. This could crash the generation pipeline when processing synthetic data.
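A sketch of the missing guard inside `split_row`, reusing the `_RARE_` fallback already present in the hunk (self-contained here for illustration):

```python
import json

def split_row(merged_value, columns):
    try:
        values = json.loads(merged_value)
    except (TypeError, ValueError):
        return ["_RARE_"] * len(columns)
    # reject malformed merged values instead of letting split_df[i] raise KeyError later
    if not isinstance(values, list) or len(values) != len(columns):
        return ["_RARE_"] * len(columns)
    return [str(v) if v is not None else "" for v in values]

print(split_row('["a"]', ["x", "y", "z"]))  # ['_RARE_', '_RARE_', '_RARE_']
```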
mostlyai/sdk/domain.py (Outdated)
```diff
  append = "APPEND"
- replace = "REPLACE"
+ replace_ = "REPLACE"
```
Enum member renamed breaks public API compatibility
High Severity
The IfExists enum member was renamed from replace to replace_. This is a breaking API change that will cause an AttributeError for any code using IfExists.replace. External consumers of this SDK who use this enum member for connector write operations will experience runtime failures after upgrading.
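If the old spelling must keep working, Enum's duplicate-value aliasing offers one option (a sketch; one caveat is that on a `str` mixin a member literally named `replace` shadows `str.replace` on instances, which may be why the member was renamed in the first place):

```python
from enum import Enum

class IfExists(str, Enum):
    append = "APPEND"
    replace_ = "REPLACE"
    # same value as replace_, so Enum registers this name as an alias
    replace = "REPLACE"

assert IfExists.replace is IfExists.replace_
```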
Note
Adds first-class constraints to ensure valid value combinations and inequalities in synthetic data.
- New constraint types (`FixedCombinations`, `Inequality`) and validators; new `Constraint`, `ConstraintConfig`, and `ConstraintType` in `domain.py`
- Transformations (`_data/constraints/transformations.py`) to merge/split columns and compute deltas; updates the `tgt-data` parquet and `encoding-types.json` during training
- Applies constraints in `step_pull_training_data.py` and reverts internal columns post-generation in `step_generate_data.py`
- `FixedCombinationsHandler` (JSON-merged categorical) and `InequalityHandler` (numeric/datetime delta, NA-safe)
- Adds `random_state` and `constraints` parameters

Written by Cursor Bugbot for commit 4777ff0.
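For intuition on the `InequalityHandler` summarized above, a minimal sketch of the delta encoding (illustrative only, not the PR's implementation): the model trains on `low` and a non-negative `delta = high - low`, and generation reconstructs `high = low + delta`, which guarantees `low <= high` while NA propagates through unchanged.

```python
import pandas as pd

def to_delta(low: pd.Series, high: pd.Series) -> pd.Series:
    # a non-negative delta guarantees low <= high after reconstruction
    return (high - low).clip(lower=0)

def from_delta(low: pd.Series, delta: pd.Series) -> pd.Series:
    # NA in either operand stays NA (NA-safe)
    return low + delta

low = pd.Series([1.0, 5.0, None], dtype="Float64")
high = pd.Series([3.0, 4.0, 7.0], dtype="Float64")
delta = to_delta(low, high)    # [2.0, 0.0, <NA>]
print(from_delta(low, delta))  # [3.0, 5.0, <NA>]: the violating pair (5, 4) is repaired
```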