Ak/spec #372
Conversation
Codecov Report

Coverage Diff (master vs. #372):

| | master | #372 |
|---|---|---|
| Coverage | ? | 89.41% |
| Files | ? | 53 |
| Lines | ? | 4450 |
| Branches | ? | 809 |
| Hits | ? | 3979 |
| Misses | ? | 325 |
| Partials | ? | 146 |
Pull request overview
This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.
Key Changes:
- New spec-based API with Pydantic models for defining data generation configurations
- Comprehensive validation framework with error collection and reporting
- Pydantic V1/V2 compatibility layer for broad environment support
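To make the "declarative, type-safe" shape concrete, here is a rough sketch of what a spec-based table definition could look like. The real API uses Pydantic models; stdlib dataclasses are used here so the snippet runs without extra dependencies, and all class, field, and option names are assumptions for illustration, not the PR's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the spec models described in this PR.
@dataclass
class ColumnDefinition:
    name: str
    type: str
    options: dict = field(default_factory=dict)

@dataclass
class TableDefinition:
    name: str
    rows: int
    columns: list

# A declarative table spec: the generator would consume this object
# instead of a chain of imperative builder calls.
spec = TableDefinition(
    name="users",
    rows=1000,
    columns=[
        ColumnDefinition("user_id", "long", {"primaryKey": True}),
        ColumnDefinition("email", "string", {"template": r"\w.\w@\w.com"}),
    ],
)
```

The point of the declarative form is that the whole configuration is a plain data object that can be validated up front, serialized, and diffed, before any Spark job runs.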
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_specs.py | Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations |
| tests/test_datasets_with_specs.py | Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications |
| tests/test_datagen_specs.py | Tests for DatagenSpec creation, validation, and generator options |
| pyproject.toml | Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores |
| makefile | Updated to use Pydantic version-specific test environments and removed .venv target |
| examples/datagen_from_specs/basic_user_datagen_spec.py | Example DatagenSpec factory for generating basic user data with pre-configured specs |
| examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py | Complex example with OHLC stock data generation including time-series and volatility modeling |
| examples/datagen_from_specs/README.md | Documentation for Pydantic-based dataset specifications with usage examples |
| dbldatagen/spec/validation.py | ValidationResult class for collecting and reporting validation errors and warnings |
| dbldatagen/spec/output_targets.py | Pydantic models for UCSchemaTarget and FilePathTarget output destinations |
| dbldatagen/spec/generator_spec_impl.py | Generator class implementing the spec-to-DataFrame conversion logic |
| dbldatagen/spec/generator_spec.py | Core DatagenSpec and TableDefinition models with comprehensive validation |
| dbldatagen/spec/compat.py | Pydantic V1/V2 compatibility layer enabling cross-version support |
| dbldatagen/spec/column_spec.py | ColumnDefinition model with validation for primary keys and constraints |
| dbldatagen/spec/__init__.py | Module initialization with lazy imports to avoid heavy dependencies |
| README.md | Updated feature list and formatting to mention new Pydantic-based API |
| CHANGELOG.md | Added entry for Pydantic-based specification API feature |
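The compat.py row above describes a Pydantic V1/V2 compatibility layer. A hedged sketch of the usual version-gated import pattern such a layer tends to use is below; the exact symbols the PR's compat module re-exports are not shown here and the detection logic is an assumption. The outer try/except keeps the snippet runnable even when pydantic is not installed at all.

```python
# Sketch of a V1/V2 detection shim (illustrative, not the PR's code).
try:
    import pydantic
    # Pydantic exposes VERSION as "2.x.y" on V2 and "1.x.y" on V1.
    PYDANTIC_V2 = pydantic.VERSION.startswith("2.")
except ImportError:
    pydantic = None
    PYDANTIC_V2 = False
```

Downstream code can then branch on the flag (or re-export V2 names aliased to their V1 equivalents) so the rest of the spec package stays version-agnostic.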
```python
conflicting_opts_for_pk = [
    "distribution", "template", "dataRange", "random", "omit",
    "min", "max", "uniqueValues", "values", "expr"
]

for opt_key in conflicting_opts_for_pk:
    if opt_key in kwargs:
        logger.warning(
            f"Primary key '{col_name}': Option '{opt_key}' may be ignored")
```
Why do we disallow min, max, and dataRange for primary keys?
If the primary key has a min/max, it might end up conflicting with the number of rows. E.g., if the number of rows is 100, but min is 10 and max is 50.
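The arithmetic behind that example, as a minimal sketch: a range of [10, 50] yields only 41 distinct integer values, so a 100-row table cannot give every row a unique primary-key value from it.

```python
# Numbers from the comment above: 100 rows, primary-key range [10, 50].
num_rows = 100
pk_min, pk_max = 10, 50

# An inclusive integer range holds (max - min + 1) distinct values.
distinct_values = pk_max - pk_min + 1
can_be_unique = distinct_values >= num_rows

print(distinct_values, can_be_unique)  # prints: 41 False
```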
```python
# Process each column
for col_def in table_spec.columns:
    kwargs = self._columnSpecToDatagenColumnSpec(col_def)
```
Maybe use column_options instead to avoid any conflicts.
What is the conflict you are referring to here?
```python
# Write data based on destination type
if isinstance(output_destination, FilePathTarget):
    output_path = posixpath.join(output_destination.base_path, table_name)
    df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
    logger.info(f"Wrote table '{table_name}' to file path: {output_path}")

elif isinstance(output_destination, UCSchemaTarget):
    output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
    df.write.mode("overwrite").saveAsTable(output_table)
    logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
```
We should use utils.write_data_to_output for this.
The method in utils relies on OutputDataset. See the other comments above.
We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?
imo, the output target in the spec is better suited here, since it validates UC names and file-path volumes. config.OutputDataset is generic, so producing specific errors would essentially mean copying that logic over to the new spec folder, or the other way round. Is there something I am missing here?
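To illustrate the "specific errors" argument, here is a rough sketch of target-specific validation. The class names UCSchemaTarget and FilePathTarget come from the PR's file summary; the validation rules themselves are invented for illustration and are not the PR's actual checks.

```python
# Hypothetical target-specific validators (illustrative rules only).
def validate_uc_target(catalog: str, schema: str) -> list:
    """Return target-specific error messages for a Unity Catalog target."""
    errors = []
    for label, value in (("catalog", catalog), ("schema", schema)):
        # Assumed rule: names must be non-empty alphanumeric/underscore.
        if not value or not value.replace("_", "").isalnum():
            errors.append(f"UCSchemaTarget: invalid {label} name: {value!r}")
    return errors

def validate_file_target(base_path: str) -> list:
    """Return target-specific error messages for a file-path target."""
    errors = []
    # Assumed rule: only Volumes/DBFS/local-style prefixes are accepted.
    if not base_path.startswith(("/Volumes/", "dbfs:/", "file:/")):
        errors.append(f"FilePathTarget: unsupported base path: {base_path!r}")
    return errors

print(validate_uc_target("main", "my_schema"))   # prints: []
print(validate_file_target("s3://bucket/data"))  # prints one error message
```

A generic OutputDataset can only say "invalid destination"; per-target validators like these can name the exact field and rule that failed, which is the trade-off being discussed.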
* added use of ABC to mark TextGenerator as abstract
* Lint text generators module
* Add persistence methods
* Add tests and docs; Update PR template
* Update hatch installation for push action
* Refactor
* Update method names and signatures

Co-authored-by: ronanstokes-db <ronan.stokes@databricks.com>
Co-authored-by: Ronan Stokes <42389040+ronanstokes-db@users.noreply.github.com>

* Add files via upload: an example notebook on how to leverage DBLDATAGEN in the creation of oil and gas datasets, currently placed in examples/notebooks
* Update and rename [DBLDATAGEN]Oil&GasWellHeaderDailyProductionTypeCurve.py to oil_gas_data_generation.py: implemented changes Greg suggested around function formatting
* Update oil_gas_data_generation.py: added documentation link and install clarification

Co-authored-by: Greg Hansen <163584195+ghanse@users.noreply.github.com>

* Add files via upload: a notebook that generates synthetic login data for various gamers. The gamers can have multiple devices that are consistent across them and are located in different areas.
* Update VideoGameLoginSyntheticDataGeneration.py: fixed code with comments from PR
* Update and rename VideoGameLoginSyntheticDataGeneration.py to gaming_data_generation.py

* Format data_analyzer.py and text_generator_plugins.py
* Update formatting and fix docstrings
* Update inferred max value for DecimalType columns
* Update docstrings and inferred max value for DecimalType columns

* Added CPG supply chain dbldatagen notebook
* Deleting original CPG_supply_chain_datagen.py notebook
* Adding new file with the changes, in the correct location in the repo
* Cleaned up comments

Co-authored-by: Greg Hansen <163584195+ghanse@users.noreply.github.com>
Changes
Linked issues
Resolves #..
Requirements