fix(examples): preserve UUIDs from YAML configs for dataset matching #37557
base: master
Conversation
The example loading flow has two code paths that weren't communicating UUIDs properly:
1. Data loading (load_parquet_table) created SqlaTable without a UUID
2. Config import (import_dataset) looks up datasets by UUID only

This caused dataset matching failures when charts/dashboards referenced datasets by UUID.

Changes:
- Add uuid parameter to load_parquet_table() and create_generic_loader()
- Extract uuid from dataset.yaml in get_dataset_config_from_yaml()
- Extract uuid from datasets/*.yaml in _get_multi_dataset_config()
- Pass uuid through discover_datasets() to the loaders
- Use UUID-first lookup to avoid unique constraint violations
- Backfill UUID on existing datasets found by table_name
- Add pylint disable comments for pre-existing transaction warnings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Refactor the UUID-first lookup logic into a reusable helper function that is used in both the early-return path (table exists) and the post-load metadata path (table missing/force). This ensures consistent behavior and avoids potential unique constraint violations when duplicate metadata rows exist and the physical table was dropped.

Changes:
- Add _find_dataset() helper for UUID-first, table_name fallback lookup
- Refactor both code paths to use the helper
- Add tests for the helper function
- Add test for duplicate rows + missing table edge case

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…hema collisions

Adds a schema parameter to the _find_dataset() fallback lookup so that two datasets with the same table_name in different schemas don't collide during UUID backfill. Adds a test to verify that schema-based lookup distinguishes same-name tables.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sets tbl.schema when creating new SqlaTable objects and backfills schema on existing tables that have schema=None. This ensures the schema-aware lookup in _find_dataset() can find datasets created before this fix. Adds tests for schema setting and backfilling behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds 13 new tests covering all Codex-suggested cases:

data_loading_test.py (5 new):
- test_get_dataset_config_from_yaml_schema_main: schema "main" → None
- test_get_dataset_config_from_yaml_empty_file: Empty YAML handling
- test_get_dataset_config_from_yaml_invalid_yaml: Invalid YAML handling
- test_get_multi_dataset_config_schema_main: schema "main" in multi-dataset
- test_get_multi_dataset_config_missing_table_name: Falls back to dataset_name

generic_loader_test.py (8 new):
- test_find_dataset_no_uuid_no_schema: Basic lookup without UUID/schema
- test_find_dataset_not_found: Returns (None, False) when nothing matches
- test_load_parquet_table_no_backfill_when_uuid_already_set: Preserve UUID
- test_load_parquet_table_no_backfill_when_schema_already_set: Preserve schema
- test_load_parquet_table_both_uuid_and_schema_backfill: Backfill both
- test_create_generic_loader_passes_schema: Schema propagation
- test_create_generic_loader_description_set: Description applied
- test_create_generic_loader_no_description: No description path

Total: 32 tests covering UUID/schema extraction, lookup, backfill, preservation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
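The uuid threading described in these commits can be pictured as a closure factory: `create_generic_loader()` captures the YAML values and hands them to `load_parquet_table()` when the loader is finally invoked. A minimal sketch — the function and parameter names follow the commit messages, but the bodies are illustrative, not Superset's actual implementation:

```python
from typing import Callable, Optional


def load_parquet_table(
    table_name: str,
    uuid: Optional[str] = None,
    schema: Optional[str] = None,
) -> dict:
    """Illustrative stand-in: record which identifiers reached the loader."""
    return {"table_name": table_name, "uuid": uuid, "schema": schema}


def create_generic_loader(
    table_name: str,
    uuid: Optional[str] = None,
    schema: Optional[str] = None,
) -> Callable[[], dict]:
    """Return a zero-argument loader that closes over uuid/schema from YAML."""
    def loader() -> dict:
        return load_parquet_table(table_name, uuid=uuid, schema=schema)
    return loader


# discover_datasets() would build one such loader per dataset.yaml entry:
loader = create_generic_loader("birth_names", uuid="uuid-from-yaml", schema="public")
print(loader()["uuid"])  # uuid-from-yaml
```

Because the uuid rides inside the closure, the CLI can invoke the loader later without knowing anything about YAML configs.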
Sequence Diagram

The PR threads UUIDs from dataset YAML configs through example discovery into the generic Parquet loader and changes dataset lookup to prefer UUID (with schema fallback). This prevents duplicate datasets and backfills UUID/schema on existing metadata when needed.

```mermaid
sequenceDiagram
    participant CLI
    participant DataLoading
    participant GenericLoader
    participant Database
    CLI->>DataLoading: discover_datasets() -> read dataset.yaml (includes uuid)
    DataLoading->>GenericLoader: create_generic_loader(..., uuid=from_yaml)
    CLI->>GenericLoader: invoke loader -> load_parquet_table(uuid)
    GenericLoader->>Database: _find_dataset(uuid first; else table_name+schema)
    alt Dataset found by UUID
        Database-->>GenericLoader: return existing SqlaTable (no changes)
    else Not found
        GenericLoader->>Database: create/load table, create/merge SqlaTable
        GenericLoader->>Database: set/backfill tbl.uuid and tbl.schema if provided
        Database-->>GenericLoader: merged SqlaTable
    end
    GenericLoader-->>CLI: return dataset (matched or created)
```
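The lookup order in the diagram can be modeled over plain rows. This is an illustrative in-memory sketch — the `_find_dataset` name comes from the PR, but dicts stand in for SqlaTable rows and no database is involved:

```python
from typing import Optional


def find_dataset(
    rows: list,
    table_name: str,
    database_id: int,
    uuid: Optional[str] = None,
    schema: Optional[str] = None,
) -> tuple:
    """UUID-first lookup with a table_name+schema fallback, as the PR describes."""
    if uuid:
        for row in rows:
            if row["uuid"] == uuid:
                return row, True  # found_by_uuid
    for row in rows:
        if (
            row["table_name"] == table_name
            and row["database_id"] == database_id
            and row["schema"] == schema
        ):
            return row, False
    return None, False


rows = [
    {"uuid": "abc", "table_name": "t1", "database_id": 1, "schema": "public"},
    {"uuid": None, "table_name": "t2", "database_id": 1, "schema": "public"},
]
# Matched by UUID first, even though the table_name differs:
print(find_dataset(rows, "other", 1, uuid="abc"))
# UUID misses, so the table_name+schema fallback finds the row:
print(find_dataset(rows, "t2", 1, uuid="missing", schema="public"))
```

The second return value (`found_by_uuid`) is what lets the caller decide whether a UUID backfill is needed.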
Generated by CodeAnt AI
```python
tbl = (
    db.session.query(SqlaTable)
    .filter_by(table_name=table_name, database_id=database_id, schema=schema)
    .first()
)
```
Suggestion: The _find_dataset fallback query filters by schema=schema, so when an existing dataset row has schema=None but the caller passes a non-null schema (the common backfill case), the row is not found; this prevents UUID/schema backfill from ever running and instead leads to creating a new dataset row for the same physical table, causing duplicates and leaving the original broken row in place. [logic error]
Severity Level: Critical 🚨
- ❌ Duplicate dataset rows created during examples import.
- ❌ Examples load creating new datasets instead of backfilling UUIDs.
- ⚠️ Backfill of legacy datasets (schema/UUID) fails silently.
- ⚠️ Example re-run produces dataset lookup collisions.

Suggested change:

```python
# First try to match on the exact schema for cross-schema safety
tbl = (
    db.session.query(SqlaTable)
    .filter_by(table_name=table_name, database_id=database_id, schema=schema)
    .first()
)
# If nothing matches the requested schema, fall back to rows with no schema
# so we can backfill schema/UUID on legacy datasets that have schema=None.
if not tbl and schema is not None:
    tbl = (
        db.session.query(SqlaTable)
        .filter_by(table_name=table_name, database_id=database_id, schema=None)
        .first()
    )
```
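The fallback behavior this suggestion adds can be illustrated with a tiny in-memory model. This is a sketch only — plain dicts stand in for SqlaTable rows — showing why the exact-schema filter misses a legacy NULL-schema row and the `schema=None` fallback finds it:

```python
from typing import Optional


def lookup(rows, table_name: str, schema: Optional[str], with_fallback: bool):
    """Exact-schema match first; optionally fall back to schema=None rows."""
    match = next(
        (r for r in rows if r["table_name"] == table_name and r["schema"] == schema),
        None,
    )
    if match is None and with_fallback and schema is not None:
        match = next(
            (r for r in rows if r["table_name"] == table_name and r["schema"] is None),
            None,
        )
    return match


# A legacy row created by the old loader: schema and uuid were never set.
legacy = [{"table_name": "birth_names", "schema": None, "uuid": None}]

# Old behavior: the exact filter misses the NULL-schema row, so the loader
# would treat the dataset as missing and create a duplicate.
print(lookup(legacy, "birth_names", "public", with_fallback=False))  # None

# Suggested behavior: the fallback finds the legacy row so uuid/schema
# can be backfilled instead.
print(lookup(legacy, "birth_names", "public", with_fallback=True) is legacy[0])  # True
```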
Steps of Reproduction ✅
1. Discover example loaders: discover_datasets() in
superset/examples/data_loading.py:143-150 calls create_generic_loader(...,
schema=config["schema"], uuid=config.get("uuid")) (file:
superset/examples/data_loading.py, lines 143-150). This yields a loader that closes over a
non-null schema value when dataset.yaml specifies a schema.
2. Invoke the loader created by create_generic_loader: create_generic_loader -> loader()
calls load_parquet_table(...) with the captured schema and uuid
(superset/examples/generic_loader.py:280-289). See create_generic_loader definition and
loader invocation (generic_loader.py, lines 272-289).
3. load_parquet_table receives schema!=None and uuid and reaches the dataset lookup: it
calls _find_dataset(table_name, database.id, uuid, schema) (generic_loader.py, line 216).
_find_dataset first tries UUID then falls back to a single filter_by that includes
schema=schema (generic_loader.py, lines 61-71 and 216-217).
4. Real-world problematic state: an existing SqlaTable row was created earlier by the old
data-loading path without setting schema (row.schema is NULL) and uuid is NULL. In that
case _find_dataset's fallback filter_by(..., schema=schema) will not match the NULL-schema
row because schema!=NULL, so tbl remains None and found_by_uuid False. This causes
load_parquet_table to treat the dataset as missing and create a new SqlaTable row for the
same physical table, leaving the legacy NULL-schema row intact (duplicate dataset row).
5. Concrete reproduction with current test scaffolding: run the sequence in
tests/unit_tests/examples/generic_loader_test.py using mocks that simulate (a)
discover_datasets passing schema (data_loading.py:143-150), (b) mock DB having an existing
row with schema=None and uuid=None, and (c) load_parquet_table being called with
schema="public" and uuid set. With the existing code path, _find_dataset will not return
the NULL-schema row (because schema="public" != None), and the loader will create a new
dataset row instead of backfilling the legacy row. The mock scenarios in tests demonstrate
related behaviors (see tests around lines 618-646 and 312-345 for backfill expectations).
6. Why the improved code fixes it: after trying an exact schema match, the suggested
change falls back to searching rows with schema=None when a schema was requested (i.e.,
schema is not None). This allows the loader to find legacy rows with NULL schema so the
code can backfill tbl.schema and/or tbl.uuid and avoid creating duplicate dataset rows for
the same physical table.
7. Note on intentionality: current code intentionally includes schema in the lookup to
avoid cross-schema collisions (tests cover schema-distinguishing behavior at tests lines
471-506). The suggested change preserves exact-schema matching first (maintaining
cross-schema safety) and only falls back to schema=None when no exact-schema match exists,
which is consistent with the test expectations and addresses legacy backfill scenarios.

Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** superset/examples/generic_loader.py
**Line:** 67:69
**Comment:**
*Logic Error: The `_find_dataset` fallback query filters by `schema=schema`, so when an existing dataset row has `schema=None` but the caller passes a non-null schema (the common backfill case), the row is not found; this prevents UUID/schema backfill from ever running and instead leads to creating a new dataset row for the same physical table, causing duplicates and leaving the original broken row in place.
Validate the correctness of the flagged issue. If correct, how can I resolve this? If you propose a fix, implement it and please make it concise.*
Pull request overview
This PR fixes the example loading flow by threading UUID values from YAML configurations through the data loading path, enabling proper dataset matching between the data loading and config import flows.
Changes:
- Added UUID extraction from YAML configs and propagation through the data loading pipeline
- Implemented UUID-first lookup with table_name fallback to avoid unique constraint violations
- Added schema handling with backfilling for datasets missing schema information
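The YAML extraction step in the first bullet can be sketched as follows. This is illustrative only — a plain dict stands in for the parsed YAML (PyYAML's `safe_load` would produce it in Superset), and the function is not the actual data_loading.py implementation:

```python
from typing import Any, Optional


def dataset_config_from_yaml(parsed: dict) -> dict:
    """Pull the fields the loaders need, preserving uuid when present."""
    return {
        # Per the PR's tests, table_name falls back to dataset_name when absent.
        "table_name": parsed.get("table_name") or parsed.get("dataset_name"),
        "schema": parsed.get("schema"),
        "uuid": parsed.get("uuid"),  # None when the YAML omits it
    }


# As if yaml.safe_load() had parsed a dataset.yaml:
parsed = {"table_name": "birth_names", "schema": "public", "uuid": "abc-123"}
print(dataset_config_from_yaml(parsed)["uuid"])  # abc-123

# A legacy config without uuid still works; the loader just gets uuid=None.
print(dataset_config_from_yaml({"dataset_name": "my_dataset"})["uuid"])  # None
```

Passing `uuid=None` through unchanged is what keeps older example configs loadable while newer ones gain stable identity.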
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| superset/examples/generic_loader.py | Adds _find_dataset() helper for UUID-first lookups, updates load_parquet_table() and create_generic_loader() to accept and set UUID parameter, implements backfilling logic for UUID and schema |
| superset/examples/data_loading.py | Updates YAML config readers to extract UUID field and passes it through to loader creation |
| tests/unit_tests/examples/generic_loader_test.py | Comprehensive unit tests for UUID handling, lookups, backfilling, and schema handling (731 lines) |
| tests/unit_tests/examples/data_loading_test.py | Unit tests for YAML config extraction including UUID handling (230 lines) |
| tests/unit_tests/examples/__init__.py | Standard Apache license header for new test package |
```python
# Backfill UUID if found by table_name (not UUID) and UUID not set
if uuid and not tbl.uuid and not found_by_uuid:
    tbl.uuid = uuid
    needs_update = True
# Backfill schema if existing table has no schema set
if schema and not tbl.schema:
    tbl.schema = schema
    needs_update = True
```
Copilot AI · Jan 30, 2026
The UUID and schema backfill logic is duplicated in two places: lines 137-144 (early return when table exists) and lines 223-229 (after data load). This duplication increases maintenance burden and the risk of inconsistencies. Consider extracting this logic into a helper function that both code paths can call.
```python
tbl, found_by_uuid = _find_dataset(table_name, database.id, uuid, schema)

if not tbl:
    tbl = SqlaTable(table_name=table_name, database_id=database.id)
```
Copilot AI · Jan 30, 2026
The UUID should be set when creating a new SqlaTable to make the code clearer and more maintainable. Currently, the UUID is set later in the "backfill" section (lines 224-225), which works but is confusing since it's not actually backfilling for new tables. Consider setting the UUID immediately when creating the table.
Suggested change:

```python
tbl = SqlaTable(
    table_name=table_name,
    database_id=database.id,
    uuid=uuid,
)
```
```python
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
    mock_existing_table
)
```
Copilot AI · Jan 30, 2026
The mock setup doesn't correctly simulate the scenario being tested. When looking up by UUID, the mock should handle UUID lookup separately from table_name lookup. The test should use a side_effect function to differentiate between the two lookups, similar to test_load_parquet_table_backfills_uuid_on_existing_table at lines 140-148.
Suggested change:

```python
query = mock_db.session.query.return_value

def _filter_by_side_effect(**kwargs: object) -> MagicMock:
    filtered_query = MagicMock()
    if "uuid" in kwargs or "table_name" in kwargs:
        filtered_query.first.return_value = mock_existing_table
    else:
        filtered_query.first.return_value = None
    return filtered_query

query.filter_by.side_effect = _filter_by_side_effect
```
Code Review Agent Run #e5adb3

Actionable Suggestions - 1
- tests/unit_tests/examples/data_loading_test.py - 1
  - Missing test for data_file override · Line 230

Additional Suggestions - 2
- tests/unit_tests/examples/generic_loader_test.py - 2
  - Incomplete test assertion · Line 347 — Mock load_parquet_table and assert that calling the generated loader() passes the correct keyword arguments: parquet_file, table_name, only_metadata, force, sample_rows, data_file, schema, and uuid.
  - Incorrect test assertion · Line 731 — The assertion allows the description to be set as long as it's not 'anything', but the test intent (per comment) is to ensure no description is assigned when None is passed. This weakens the test's ability to catch incorrect behavior.

    Code suggestion:

    ```diff
    @@ -731,1 +731,1 @@
    - assert not hasattr(mock_tbl, "description") or mock_tbl.description != "anything"
    + assert not hasattr(mock_tbl, "description")
    ```
Review Details
- Files reviewed - 5 · Commit Range: 124a1b0..8e20e38
  - superset/examples/data_loading.py
  - superset/examples/generic_loader.py
  - tests/unit_tests/examples/__init__.py
  - tests/unit_tests/examples/data_loading_test.py
  - tests/unit_tests/examples/generic_loader_test.py
- Files skipped - 0
- Tools
  - Whispers (Secret Scanner) - ✔︎ Successful
  - Detect-secrets (Secret Scanner) - ✔︎ Successful
  - MyPy (Static Code Analysis) - ✔︎ Successful
  - Astral Ruff (Static Code Analysis) - ✔︎ Successful
```python
# Falls back to dataset_name when table_name not in YAML
assert result["table_name"] == "my_dataset"
assert result["uuid"] == "test-uuid-5678"
```
It looks like the data_file override feature in _get_multi_dataset_config isn't tested. This could hide bugs in that logic. Consider adding a test where the YAML specifies a data_file that exists, and assert the result uses the overridden path.
Code suggestion
Check the AI-generated fix before applying
```python
def test_get_multi_dataset_config_data_file_override(tmp_path: Path) -> None:
    """Test that data_file can be overridden from YAML."""
    from superset.examples.data_loading import _get_multi_dataset_config

    datasets_dir = tmp_path / "datasets"
    datasets_dir.mkdir()
    data_dir = tmp_path / "data"
    data_dir.mkdir()
    # Create the overridden data file
    overridden_file = data_dir / "overridden.parquet"
    overridden_file.write_text("dummy")
    yaml_content = """table_name: my_dataset
data_file: overridden.parquet
uuid: test-uuid-override
"""
    dataset_yaml = datasets_dir / "my_dataset.yaml"
    dataset_yaml.write_text(yaml_content)
    original_data_file = tmp_path / "data" / "my_dataset.parquet"
    result = _get_multi_dataset_config(tmp_path, "my_dataset", original_data_file)
    assert result["data_file"] == overridden_file
    assert result["uuid"] == "test-uuid-override"
```
@sadpandajoe This workflow is deprecated! Please use the new Superset Showtime system instead.

Processing your ephemeral environment request here. Action: up. More information on how to use or configure ephemeral environments is available in the documentation.
Instead of checking if default_schema == 'main', check the database backend to determine if schemas are supported. This is more robust and extensible - just add backends to the no_schema_backends set. Currently only SQLite is in this set, but can easily add others if needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
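The backend check described in this commit might look like the following. A sketch only — the set of schema-less backends comes from the commit message (currently just SQLite), but the surrounding function name is hypothetical, not Superset's actual code:

```python
from typing import Optional

# Backends whose databases have no schema concept; extend as needed.
NO_SCHEMA_BACKENDS = {"sqlite"}


def normalize_schema(backend: str, default_schema: Optional[str]) -> Optional[str]:
    """Return None for backends without schema support, instead of
    special-casing magic names like SQLite's 'main'."""
    if backend in NO_SCHEMA_BACKENDS:
        return None
    return default_schema


print(normalize_schema("sqlite", "main"))        # None
print(normalize_schema("postgresql", "public"))  # public
```

Keying on the backend rather than the string `'main'` means a SQLite database with an attached schema of another name is handled the same way, and new schema-less engines only need a one-line set addition.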
@sadpandajoe Ephemeral environment spinning up at http://54.245.151.60:8080. Credentials are 'admin'/'admin'. Please allow several minutes for bootstrapping and startup.

🎪 Showtime deployed environment on GHA for feee7b8 • Environment: http://44.255.188.64:8080 (admin/admin)
SUMMARY

Fixes the example loading flow where datasets created by load_parquet_table() couldn't be matched by the import flow that looks up datasets by UUID.

Root cause: The example loading has two separate code paths:
- Data loading (data_loading.py → generic_loader.py): creates SqlaTable without setting UUID
- Config import (load_examples_from_configs()): looks up existing datasets by UUID only

When a dataset is created without UUID, the import flow can't find it and either fails or creates duplicates.

Solution: Thread UUID from YAML configs through the data loading path:
- Extract the uuid field from YAML configs in data_loading.py
- Pass it through create_generic_loader() and load_parquet_table()
- Set it on newly created SqlaTable objects
- Backfill it on existing datasets that have uuid=None

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A - Backend-only change
TESTING INSTRUCTIONS
Run the new unit tests:
All 32 tests should pass.
Manual verification:
- [ ] spin up testenv-up and verify charts and dashboards exist
- [ ] spin up showtime and verify charts and dashboards exist
ADDITIONAL INFORMATION