Conversation

@michdr (Contributor) commented Nov 26, 2025

Note

Adds first-class constraints to ensure valid value combinations and inequalities in synthetic data.

  • Introduces typed constraints (FixedCombinations, Inequality) and validators; new Constraint, ConstraintConfig, and ConstraintType in domain.py
  • New transformation pipeline (_data/constraints/transformations.py) to merge/split columns and compute deltas; updates tgt-data parquet and encoding-types.json during training
  • Hooks into local pipeline: preprocess constraints in step_pull_training_data.py and revert internal columns post-generation in step_generate_data.py
  • Constraint handlers: FixedCombinationsHandler (JSON-merged categorical) and InequalityHandler (numeric/datetime delta, NA-safe)
  • SDK API docs updated with constraint usage example; local generator config now preserves random_state and constraints
  • Minor: suppress sklearn PCA RuntimeWarnings in model report; add auto column extraction from base64 data
  • Tests: unit tests for handlers/translator and an end-to-end constraints test
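As a rough illustration of the configuration surface described above (the class names `Constraint`, `ConstraintConfig`, and `ConstraintType` follow the PR summary, but the field names and shapes are assumptions):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ConstraintType(str, Enum):
    # the two constraint types introduced in domain.py, per the summary above
    FIXED_COMBINATIONS = "FIXED_COMBINATIONS"
    INEQUALITY = "INEQUALITY"


@dataclass
class Constraint:
    # hypothetical field names; the real domain.py may differ
    type: ConstraintType
    columns: list = field(default_factory=list)  # for FIXED_COMBINATIONS
    low_column: Optional[str] = None             # for INEQUALITY
    high_column: Optional[str] = None


@dataclass
class ConstraintConfig:
    constraints: list = field(default_factory=list)


config = ConstraintConfig(
    constraints=[
        # only (city, zip) pairs observed in training data may be generated
        Constraint(type=ConstraintType.FIXED_COMBINATIONS, columns=["city", "zip"]),
        # enforce start_date <= end_date by modeling a non-negative delta
        Constraint(
            type=ConstraintType.INEQUALITY,
            low_column="start_date",
            high_column="end_date",
        ),
    ]
)
```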

Written by Cursor Bugbot for commit 4777ff0. This will update automatically on new commits.

key = "|".join(columns)
hash_suffix = hashlib.md5(key.encode()).hexdigest()[:8]
columns_str = "_".join(col.upper() for col in columns)
return f"__TABULAR_CONSTRAINT_{prefix}_{columns_str}_{hash_suffix}__"
Contributor:

why not just __CONSTRAINT_ rather than __TABULAR_CONSTRAINT_?

Contributor Author:

The initial thought was to also be specific about the model. It doesn't have to be, and the model name might be meaningless for possible future constraint types (e.g. cross-table constraints).
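Filled out as a runnable sketch (the function name and signature are assumptions; only the body is taken from the diff above), the naming helper reads:

```python
import hashlib


def internal_column_name(prefix: str, columns: list) -> str:
    # deterministic, collision-resistant name for a constraint's internal column:
    # the md5 suffix disambiguates column sets that upper-case to the same string
    key = "|".join(columns)
    hash_suffix = hashlib.md5(key.encode()).hexdigest()[:8]
    columns_str = "_".join(col.upper() for col in columns)
    return f"__TABULAR_CONSTRAINT_{prefix}_{columns_str}_{hash_suffix}__"


# e.g. "__TABULAR_CONSTRAINT_MERGED_CITY_ZIP_<8 hex chars>__"
name = internal_column_name("MERGED", ["city", "zip"])
```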

)


class FixedCombinationHandler(ConstraintHandler):
Contributor:

should we separate out constraints into a dedicated directory, with each constraint type being defined in their own file?

Contributor Author:

why not. With only 2 constraint types it's quite compact. Once we add a few more, I can see the benefit of separate modules.

# presumably not initialized yet, so we skip this validation
continue

if isinstance(typed_constraint, FixedCombination):
Contributor:

can't we generalize this? i.e. can't the constraint handler itself register that method?

Contributor Author:

maybe the easiest would be to add a validation method as part of an interface, which would be the base for FixedCombination & Inequality
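The interface-based approach suggested above could look roughly like this (a sketch; the base-class and method names are assumptions), replacing the `isinstance` dispatch with polymorphic validation:

```python
from abc import ABC, abstractmethod


class TypedConstraint(ABC):
    """Hypothetical base interface so validation dispatches without isinstance checks."""

    @abstractmethod
    def validate(self, table_columns: set) -> None:
        """Raise ValueError if the constraint references unknown columns."""


class FixedCombination(TypedConstraint):
    def __init__(self, columns: list):
        self.columns = columns

    def validate(self, table_columns: set) -> None:
        missing = [c for c in self.columns if c not in table_columns]
        if missing:
            raise ValueError(f"FixedCombination references unknown columns: {missing}")


class Inequality(TypedConstraint):
    def __init__(self, low_column: str, high_column: str):
        self.low_column = low_column
        self.high_column = high_column

    def validate(self, table_columns: set) -> None:
        for col in (self.low_column, self.high_column):
            if col not in table_columns:
                raise ValueError(f"Inequality references unknown column: {col}")


# the caller no longer needs per-type branches:
for constraint in [FixedCombination(["a", "b"]), Inequality("lo", "hi")]:
    constraint.validate({"a", "b", "lo", "hi"})
```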


return self

def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
Contributor:

this feels like it should belong to the same file, where we defined the FixedCombination constraint

Contributor Author:

yes, good point! 👍🏼


return self

def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
Contributor:

as mentioned, this feels out of place here.

tools/model.py (outdated)

return self

def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
Contributor:

as mentioned, this feels out of place here

@michdr requested a review from @mplatzer on January 8, 2026 at 10:09
col.model_encoding_type in datetime_encodings
for col in table.columns
if col.name in {self.low_column, self.high_column}
)
Cursor Bugbot:

Empty generator in all() incorrectly returns True

Medium Severity

The _is_datetime detection uses all() on a generator expression that filters columns by name. If no columns match the filter (i.e., neither low_column nor high_column is found in table.columns), the generator is empty and all([]) returns True. This would incorrectly set _is_datetime = True for numeric constraints, causing the handler to treat numeric data as datetime and apply incorrect epoch-based transformations. While validation typically catches missing columns, this is still a logical error that could cause data corruption if validation is bypassed.


return ["_RARE_"] * len(self.columns)
try:
values = json.loads(merged_value)
return [str(v) if v is not None else "" for v in values]
Cursor Bugbot:

NA values converted to empty strings in round-trip

Medium Severity

In to_original, the split_row function converts None values (which represent original NA/null values in the JSON-serialized data) to empty strings "" instead of preserving them as proper null values. The expression str(v) if v is not None else "" loses null information. When categorical columns contain NA values, they will become empty strings after the round-trip transformation, causing data loss. The same issue exists on line 67 where all columns get empty strings when merged_value is NA.
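The null-erasing round-trip and an NA-preserving alternative can be contrasted like this (a sketch; keeping `None` lets pandas re-interpret it as NA downstream, while `""` is an ordinary category value):

```python
import json

def split_row_lossy(merged_value: str) -> list:
    # buggy: None (the JSON serialization of an original NA) becomes "" and the null is lost
    return [str(v) if v is not None else "" for v in json.loads(merged_value)]

def split_row_na_safe(merged_value: str) -> list:
    # fix: keep None so the split columns can carry a proper null value
    return [str(v) if v is not None else None for v in json.loads(merged_value)]

row = json.dumps(["Berlin", None, "BE"])
assert split_row_lossy(row) == ["Berlin", "", "BE"]      # null silently erased
assert split_row_na_safe(row) == ["Berlin", None, "BE"]  # null preserved
```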


return ["_RARE_"] * len(self.columns)
try:
values = json.loads(merged_value)
return [str(v) if v is not None else "" for v in values]
Cursor Bugbot:

Missing element count validation causes crash on malformed data

Medium Severity

The split_row function in to_original parses JSON values but doesn't validate that the number of elements matches len(self.columns). If the synthetic model generates a malformed merged value with fewer elements than expected (e.g., '["a"]' when 3 columns are expected), the resulting split_df will have fewer columns. When the loop accesses split_df[i] where i exceeds the actual column count, a KeyError is raised. This could crash the generation pipeline when processing synthetic data.
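One defensive variant (a sketch; the padding/truncation policy is an assumption, not the PR's chosen fix) normalizes every parsed value to exactly `len(columns)` entries so the downstream frame always has all columns:

```python
import json

def split_row_checked(merged_value: str, columns: list) -> list:
    """Pad or truncate malformed merged values instead of crashing downstream."""
    try:
        values = json.loads(merged_value)
    except (json.JSONDecodeError, TypeError):
        values = []
    if not isinstance(values, list):
        values = [values]
    # normalize to exactly len(columns) entries so split_df[i] never raises KeyError
    values = values[: len(columns)]
    values += [None] * (len(columns) - len(values))
    return values

cols = ["a", "b", "c"]
assert split_row_checked('["x"]', cols) == ["x", None, None]            # too few -> padded
assert split_row_checked('["x","y","z","w"]', cols) == ["x", "y", "z"]  # too many -> truncated
assert split_row_checked("not json", cols) == [None, None, None]        # unparseable -> all NA
```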



append = "APPEND"
replace = "REPLACE"
replace_ = "REPLACE"
Cursor Bugbot:

Enum member renamed breaks public API compatibility

High Severity

The IfExists enum member was renamed from replace to replace_. This is a breaking API change that will cause an AttributeError for any code using IfExists.replace. External consumers of this SDK who use this enum member for connector write operations will experience runtime failures after upgrading.
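A non-breaking way to introduce the new spelling is an enum value alias (a sketch; the member set beyond the quoted snippet is an assumption). Because both names share the value `"REPLACE"`, Python's `Enum` makes the second an alias of the first rather than a new member:

```python
from enum import Enum


class IfExists(str, Enum):
    append = "APPEND"
    replace = "REPLACE"   # original public name, kept for backward compatibility
    replace_ = "REPLACE"  # same value -> Enum treats this as an alias of `replace`


# both spellings resolve to the same member, so existing callers keep working
assert IfExists.replace is IfExists.replace_
assert IfExists("REPLACE") is IfExists.replace
```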


@mplatzer merged commit 066f35a into main on January 8, 2026 (14 checks passed).
@mplatzer deleted the feat-constraints branch on January 8, 2026 at 14:46.