feat: constraints #679
Conversation
- replace '|' separator with record separator (\x1E) to avoid collisions
- implement escaping mechanism: double the separator if it appears in data
- add validation for malformed splits with graceful error handling
- update tests to cover edge cases with separator characters
- remove backward compatibility as not needed
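A minimal sketch of the doubling-based escaping described in the commit message above (the function names and the strict error behavior are illustrative, not the PR's actual helpers):

```python
SEP = "\x1e"  # ASCII record separator: very unlikely to occur in real data

def merge_values(values: list[str]) -> str:
    # escape literal separators by doubling them, then join with single separators
    return SEP.join(v.replace(SEP, SEP * 2) for v in values)

def split_merged(merged: str, n_columns: int) -> list[str]:
    # scan left to right: a doubled separator is an escaped literal,
    # a single separator is a field boundary
    parts: list[str] = []
    buf: list[str] = []
    i = 0
    while i < len(merged):
        if merged[i] == SEP and i + 1 < len(merged) and merged[i + 1] == SEP:
            buf.append(SEP)  # escaped separator -> literal character
            i += 2
        elif merged[i] == SEP:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(merged[i])
            i += 1
    parts.append("".join(buf))
    if len(parts) != n_columns:
        raise ValueError(f"malformed merged value: expected {n_columns} fields, got {len(parts)}")
    return parts

# round-trips values containing both the old '|' separator and the new one
assert split_merged(merge_values(["a|b", f"c{SEP}d"]), 2) == ["a|b", f"c{SEP}d"]
```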
```python
key = "|".join(columns)
hash_suffix = hashlib.md5(key.encode()).hexdigest()[:8]
columns_str = "_".join(col.upper() for col in columns)
return f"__TABULAR_CONSTRAINT_{prefix}_{columns_str}_{hash_suffix}__"
```
why not just __CONSTRAINT_ rather than __TABULAR_CONSTRAINT_?
the initial thought was to be specific about the model, as well. Doesn't have to be, and might be meaningless in possible future constraint types (e.g. cross table).
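For context, wrapping the snippet above into a standalone function (the wrapper name and the `FC` prefix are made up for illustration) shows what the generated internal column names look like:

```python
import hashlib

def constraint_column_name(prefix: str, columns: list[str]) -> str:
    key = "|".join(columns)
    hash_suffix = hashlib.md5(key.encode()).hexdigest()[:8]
    columns_str = "_".join(col.upper() for col in columns)
    return f"__TABULAR_CONSTRAINT_{prefix}_{columns_str}_{hash_suffix}__"

print(constraint_column_name("FC", ["city", "zip"]))
# -> __TABULAR_CONSTRAINT_FC_CITY_ZIP_<8 hex chars>__
```

The md5 suffix is computed on the original column names, so it presumably keeps names unique even when upper-casing would make two different column sets collide.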
```python
)


class FixedCombinationHandler(ConstraintHandler):
```
should we separate out constraints into a dedicated directory, with each constraint type being defined in their own file?
why not. With only 2 constraint types it's quite compact. Once we add a few more, I can see the benefit of separate modules.
mostlyai/sdk/domain.py (Outdated)
```python
# presumably not initialized yet, so we skip this validation
continue

if isinstance(typed_constraint, FixedCombination):
```
can't we generalize this? i.e. can't the constraint handler itself register that method?
maybe the easiest would be to add a validation method as part of an interface, which would be the base for FixedCombination & Inequality
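A sketch of that idea (class and method names are hypothetical): each constraint type implements a common validation hook, and the model validator just dispatches, with no `isinstance` chains.

```python
from abc import ABC, abstractmethod

class ConstraintValidation(ABC):
    """Hypothetical shared interface for FixedCombination and Inequality."""

    @abstractmethod
    def validate_against_table(self, table_columns: set[str], idx: int) -> None:
        """Raise ValueError if the constraint does not fit the table schema."""

class FixedCombination(ConstraintValidation):
    def __init__(self, columns: list[str]):
        self.columns = columns

    def validate_against_table(self, table_columns: set[str], idx: int) -> None:
        missing = [c for c in self.columns if c not in table_columns]
        if missing:
            raise ValueError(f"constraint #{idx}: unknown columns {missing}")

# the pydantic model validator then collapses to a generic loop:
#   for idx, constraint in enumerate(constraints):
#       constraint.validate_against_table(table_columns, idx)
```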
mostlyai/sdk/domain.py (Outdated)
```python
return self


def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
```
this feels like it should belong to the same file, where we defined the FixedCombination constraint
yes, good point! 👍🏼
```python
return self


def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
```
as mentioned, this feels out of place here.
tools/model.py (Outdated)
```python
return self


def _validate_constraint_fixed_combination(self, constraint, table_columns, column_usage, idx):
```
as mentioned, this feels out of place here
```python
col.model_encoding_type in datetime_encodings
for col in table.columns
if col.name in {self.low_column, self.high_column}
)
```
Empty generator in all() incorrectly returns True
Medium Severity
The _is_datetime detection uses all() on a generator expression that filters columns by name. If no columns match the filter (i.e., neither low_column nor high_column is found in table.columns), the generator is empty and all([]) returns True. This would incorrectly set _is_datetime = True for numeric constraints, causing the handler to treat numeric data as datetime and apply incorrect epoch-based transformations. While validation typically catches missing columns, this is still a logical error that could cause data corruption if validation is bypassed.
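One possible guard (a sketch using the names from the hunk above, not the PR's actual fix): materialize the matching columns first and require that both were found.

```python
matched = [
    col for col in table.columns
    if col.name in {self.low_column, self.high_column}
]
# all() over an empty sequence is vacuously True, so insist on finding
# both columns before checking their encoding types
self._is_datetime = len(matched) == 2 and all(
    col.model_encoding_type in datetime_encodings for col in matched
)
```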
```python
return ["_RARE_"] * len(self.columns)
try:
    values = json.loads(merged_value)
    return [str(v) if v is not None else "" for v in values]
```
NA values converted to empty strings in round-trip
Medium Severity
In to_original, the split_row function converts None values (which represent original NA/null values in the JSON-serialized data) to empty strings "" instead of preserving them as proper null values. The expression str(v) if v is not None else "" loses null information. When categorical columns contain NA values, they will become empty strings after the round-trip transformation, causing data loss. The same issue exists on line 67 where all columns get empty strings when merged_value is NA.
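A null-preserving variant might look like this (a sketch; `pd.NA` is one possible missing-value marker, the actual codebase may use another):

```python
import json
import pandas as pd

def split_row(merged_value, n_columns: int):
    if pd.isna(merged_value):
        return [pd.NA] * n_columns  # keep a missing merged value missing
    values = json.loads(merged_value)
    # preserve JSON null as a real missing value instead of ""
    return [str(v) if v is not None else pd.NA for v in values]

print(split_row(None, 2))           # [<NA>, <NA>]
print(split_row('["a", null]', 2))  # ['a', <NA>]
```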
```python
return ["_RARE_"] * len(self.columns)
try:
    values = json.loads(merged_value)
    return [str(v) if v is not None else "" for v in values]
```
Missing element count validation causes crash on malformed data
Medium Severity
The split_row function in to_original parses JSON values but doesn't validate that the number of elements matches len(self.columns). If the synthetic model generates a malformed merged value with fewer elements than expected (e.g., '["a"]' when 3 columns are expected), the resulting split_df will have fewer columns. When the loop accesses split_df[i] where i exceeds the actual column count, a KeyError is raised. This could crash the generation pipeline when processing synthetic data.
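A sketch of the missing guard inside `split_row`, reusing the `_RARE_` fallback already present in the hunk (self-contained here for illustration):

```python
import json

def split_row(merged_value, columns):
    try:
        values = json.loads(merged_value)
    except (TypeError, ValueError):
        return ["_RARE_"] * len(columns)
    # reject malformed merged values instead of letting split_df[i] raise KeyError later
    if not isinstance(values, list) or len(values) != len(columns):
        return ["_RARE_"] * len(columns)
    return [str(v) if v is not None else "" for v in values]

print(split_row('["a"]', ["x", "y", "z"]))  # ['_RARE_', '_RARE_', '_RARE_']
```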
mostlyai/sdk/domain.py (Outdated)
```diff
  append = "APPEND"
- replace = "REPLACE"
+ replace_ = "REPLACE"
```
Enum member renamed breaks public API compatibility
High Severity
The IfExists enum member was renamed from replace to replace_. This is a breaking API change that will cause an AttributeError for any code using IfExists.replace. External consumers of this SDK who use this enum member for connector write operations will experience runtime failures after upgrading.
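If the old spelling must keep working, Enum's duplicate-value aliasing offers one option (a sketch; one caveat is that on a `str` mixin a member literally named `replace` shadows `str.replace` on instances, which may be why the member was renamed in the first place):

```python
from enum import Enum

class IfExists(str, Enum):
    append = "APPEND"
    replace_ = "REPLACE"
    # same value as replace_, so Enum registers this name as an alias
    replace = "REPLACE"

assert IfExists.replace is IfExists.replace_
```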
Note
Adds first-class constraints to ensure valid value combinations and inequalities in synthetic data.
- New constraint types (`FixedCombinations`, `Inequality`) and validators; new `Constraint`, `ConstraintConfig`, and `ConstraintType` in `domain.py`
- Transformations (`_data/constraints/transformations.py`) to merge/split columns and compute deltas; updates the `tgt-data` parquet and `encoding-types.json` during training
- Applies constraints in `step_pull_training_data.py` and reverts internal columns post-generation in `step_generate_data.py`
- `FixedCombinationsHandler` (JSON-merged categorical) and `InequalityHandler` (numeric/datetime delta, NA-safe)
- Adds `random_state` and `constraints` parameters

Written by Cursor Bugbot for commit 4777ff0.
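For intuition on the `InequalityHandler` summarized above, a minimal sketch of the delta encoding (illustrative only, not the PR's implementation): the model trains on `low` and a non-negative `delta = high - low`, and generation reconstructs `high = low + delta`, which guarantees `low <= high` while NA propagates through unchanged.

```python
import pandas as pd

def to_delta(low: pd.Series, high: pd.Series) -> pd.Series:
    # a non-negative delta guarantees low <= high after reconstruction
    return (high - low).clip(lower=0)

def from_delta(low: pd.Series, delta: pd.Series) -> pd.Series:
    # NA in either operand stays NA (NA-safe)
    return low + delta

low = pd.Series([1.0, 5.0, None], dtype="Float64")
high = pd.Series([3.0, 4.0, 7.0], dtype="Float64")
delta = to_delta(low, high)    # [2.0, 0.0, <NA>]
print(from_delta(low, delta))  # [3.0, 5.0, <NA>]: the violating pair (5, 4) is repaired
```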