fix: manifest normalization #599

lmossman · 2025-06-14T00:18:52Z

What

This PR makes some changes to the manifest_normalizer logic to better align with how the SchemaForm-Based Builder needs things to be normalized.

Notably, this PR:

Removes the logic to extract schema_loader.schema fields into the schemas field and point to them with $refs in the streams. The reason for this is that maintaining this structure in the frontend without issue would require complicating the UI logic more than I'd like to at this time. Since we expect most connector maintainers to be using the Builder UI once SchemaForm is fully released, it doesn't seem a big deal to start including the schema definition inline in the stream definition directly.
Adds _replace_parent_streams_with_refs, which finds the parent_stream_configs in streams and check if their schema field value matches another stream definition in the manifest, and if so replace it with a $ref pointing to that stream. This cuts down a lot on duplication in the manifests and the Builder UI will be updated in a later PR to properly handle the $ref for that field by showing a dropdown for selecting a stream using a UI override.
Adds _clean_dangling_fields, which looks for fields under the definitions and schemas manifest fields which don't have any $ref pointing to them anywhere in the manifest, which indicates that they are unused, and removes them from the manifest. This ensures unnecessary values are removed as manifests are transitioned to the new structure.

Unit tests were adjusted to account for these changes.

Stay tuned for a separate Builder UI PR that will take advantage of these changes.

Summary by CodeRabbit

Refactor
- Enhanced manifest normalization by replacing embedded parent stream objects with references and pruning unreferenced definitions and schemas.
- Removed previous schema extraction and referencing logic for a simplified process.
Tests
- Added focused unit tests for cleaning dangling fields and replacing parent stream references.
- Removed obsolete tests and fixtures tied to prior schema referencing approaches.
Chores
- Deleted outdated test fixtures and resources related to removed schema referencing features.

Important

Auto-merge enabled.

This PR is set to merge automatically when all requirements are met.

Note

Auto-merge may have been disabled. Please check the PR status to confirm.

lmossman · 2025-06-14T00:25:26Z

/format-fix

github-actions · 2025-06-14T00:25:59Z

PyTest Results (Fast)

3 675 tests +9 3 664 ✅ +9 6m 27s ⏱️ +32s
1 suites ±0 11 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit ff36d87. ± Comparison against base commit dcdd18b.

This pull request removes 1 and adds 10 tests. Note that renamed tests count towards both.

unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_with_linked_definitions_url_base_authenticator_when_multiple_streams_reference_the_same_schema

unit_tests.sources.declarative.incremental.test_concurrent_perpartitioncursor ‑ test_incremental_parent_state[test_incremental_parent_state_all_records_without_cursor-manifest2-mock_requests2-expected_records2-2-initial_state2-expected_state2]
unit_tests.sources.declarative.incremental.test_concurrent_perpartitioncursor ‑ test_incremental_parent_state[test_incremental_parent_state_one_record_without_cursor-manifest1-mock_requests1-expected_records1-2-initial_state1-expected_state1]
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_keeps_parent_paths
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_empty_sections
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_unreferenced_definitions
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_unreferenced_schemas
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_handles_multiple_partition_routers
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_no_match
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_replaces_with_ref
unit_tests.sources.declarative.resolvers.test_parametrized_components_resolver ‑ test_dynamic_streams_with_parametrized_components_resolver

♻️ This comment has been updated with latest results.

Copilot

Pull Request Overview

This PR refactors the manifest normalization process to remove separate schema referencing, inline schemas directly in streams, and add cleanup steps to reduce duplication and remove unused definitions.

Renames the deduplication method to correct a typo and centralizes all cleanup under _deduplicate_manifest
Removes the old schema extraction/reference logic in favor of inline schemas
Introduces _replace_parent_streams_with_refs to collapse duplicate parent streams into $refs and _clean_dangling_fields to prune unused definitions/schemas

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py	Removed obsolete linked-schema tests and added tests for dangling-field cleanup and parent-stream `$ref` replacement
unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml	Deleted outdated test manifest resource
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py	Renamed `_deduplicate_minifest` → `_deduplicate_manifest`, removed schema‐extraction methods, and added `_replace_parent_streams_with_refs` & `_clean_dangling_fields`

Comments suppressed due to low confidence (1)

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py:110

The docstring for normalize still mentions "resolving schema references", but the old _reference_schemas logic was removed. Please update it to reference the new parent-stream replacement and dangling-field cleanup steps.

def normalize(self) -> ManifestType:

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py

coderabbitai · 2025-06-14T00:33:07Z

📝 Walkthrough

Walkthrough

This change overhauls the manifest normalization logic in the declarative source parser. It removes schema extraction and referencing, introduces new steps for replacing embedded parent stream objects with references, and cleans up unreferenced definitions and schemas. Test fixtures and unit tests are updated to reflect the revised normalization behavior.

Changes

File(s)	Change Summary
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py	Refactored manifest normalization: removed schema extraction/referencing methods, added `_replace_parent_streams_with_refs` and `_clean_dangling_fields`, and updated deduplication logic.
unit_tests/sources/declarative/parsers/conftest.py	Updated and removed fixtures to match new normalization logic: replaced schema references with inlined schemas or removed schema loaders and root schemas; deleted fixtures related to abnormal schemas.
unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml	Deleted the abnormal schemas manifest YAML file, removing a test resource for previously supported normalization cases.
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py	Removed a high-level test for linked definitions/authenticator normalization; added granular tests for `_clean_dangling_fields` and `_replace_parent_streams_with_refs` methods, verifying correct reference replacement and cleanup behavior.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ManifestNormalizer
    participant Manifest

    User->>ManifestNormalizer: normalize(manifest)
    ManifestNormalizer->>ManifestNormalizer: _deduplicate_manifest()
    ManifestNormalizer->>ManifestNormalizer: _replace_parent_streams_with_refs()
    ManifestNormalizer->>ManifestNormalizer: _clean_dangling_fields()
    ManifestNormalizer-->>User: normalized manifest

Possibly related PRs

airbytehq/airbyte-python-cdk#447: Introduced initial ManifestNormalizer with schema deduplication and referencing methods; this PR replaces and expands upon that approach.

Suggested labels

declarative-component-schema, airbyte-python-cdk

Suggested reviewers

maxi297

Would you like to review if the new normalization logic covers all your intended edge cases, or should we add more tests for additional manifest structures, wdyt?

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (5)

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (3)
73-80: Doc-string typos and stale wording

The class doc-string still mentions “appliying” and the (miss-spelled) normilize() extension point, but that method no longer exists. Shall we clean the wording and spelling so future readers aren’t confused, wdyt?

95-105: Public-API doc out of sync with implementation

normalize() advertises a second step “resolves and references schemas”, yet the schema-ref logic was removed in this PR. Could we update the doc-string (and maybe the numbered list) to reflect the new flow, wdyt?

156-161: Redundant local import copy

copy is already imported at module level (line 5). Dropping the in-function import would avoid shadowing and keeps import style consistent. Patch example – ok to apply?
-        import copy
-
-        streams = self._normalized_manifest.get(STREAMS_TAG, [])
-        # Use deep copy for comparison to avoid mutation issues
-        stream_copies = [copy.deepcopy(s) for s in streams]
+        streams = self._normalized_manifest.get(STREAMS_TAG, [])
+        # Use deep copy for comparison to avoid mutation issues
+        stream_copies = [copy.deepcopy(s) for s in streams]
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py (1)

51-82: Tests rely on private helpers

The new unit tests call the private methods _clean_dangling_fields and _replace_parent_streams_with_refs directly. Would it be safer to exercise these behaviours via the public normalize() API (or expose a public hook) so refactors don’t force test rewrites, wdyt?

unit_tests/sources/declarative/parsers/conftest.py (1)

146-171: Minor fixture optimisation

The repeated JSON-schema boilerplate ("$schema": …, additionalProperties) appears verbatim in every inline schema. Maybe extracting it into a shared constant or using pytest parametrize could shrink the fixture and improve readability – open to it?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dcdd18b and a4f1267.

📒 Files selected for processing (4)

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (3 hunks)
unit_tests/sources/declarative/parsers/conftest.py (6 hunks)
unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml (0 hunks)
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py (1 hunks)

💤 Files with no reviewable changes (1)

unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml

⏰ Context from checks skipped due to timeout of 90000ms (8)

GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-amplitude
GitHub Check: Check: source-intercom
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: SDM Docker Image Build

🔇 Additional comments (1)

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (1)

200-215: Potential false negatives on non-local $refs

find_all_refs strips the first two chars (#/) unconditionally. If a $ref points outside the document (e.g. "../common.yaml#/definitions/foo"), the leading chars won’t be #/, so the slice produces an off-by-two result. Should we guard with value.startswith("#/") before slicing to avoid truncating external refs, wdyt?

github-actions · 2025-06-14T00:46:50Z

PyTest Results (Full)

3 678 tests +9 3 667 ✅ +9 17m 41s ⏱️ -53s
1 suites ±0 11 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit ff36d87. ± Comparison against base commit dcdd18b.

This pull request removes 1 and adds 10 tests. Note that renamed tests count towards both.

unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_with_linked_definitions_url_base_authenticator_when_multiple_streams_reference_the_same_schema

unit_tests.sources.declarative.incremental.test_concurrent_perpartitioncursor ‑ test_incremental_parent_state[test_incremental_parent_state_all_records_without_cursor-manifest2-mock_requests2-expected_records2-2-initial_state2-expected_state2]
unit_tests.sources.declarative.incremental.test_concurrent_perpartitioncursor ‑ test_incremental_parent_state[test_incremental_parent_state_one_record_without_cursor-manifest1-mock_requests1-expected_records1-2-initial_state1-expected_state1]
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_keeps_parent_paths
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_empty_sections
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_unreferenced_definitions
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_unreferenced_schemas
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_handles_multiple_partition_routers
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_no_match
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_replaces_with_ref
unit_tests.sources.declarative.resolvers.test_parametrized_components_resolver ‑ test_dynamic_streams_with_parametrized_components_resolver

♻️ This comment has been updated with latest results.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (1)

150-192: Efficient parent stream reference replacement logic.

The hash-based lookup approach nicely addresses the previous performance concern about O(n²) comparisons. A few observations:

The handling of both single router and list of routers is robust

Only targeting SubstreamPartitionRouter type seems intentional and appropriate

The deep equality check via hashing ensures accurate matching

One question: should we consider adding a comment about the MD5 hash collision possibility, even though it's extremely low probability for this use case? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a1d874 and ff36d87.

📒 Files selected for processing (1)

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (3 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (8)

GitHub Check: Check: source-intercom
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-amplitude
GitHub Check: SDM Docker Image Build
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)

🔇 Additional comments (4)

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (4)

111-111: LGTM! Typo fix in method call.

Good catch on fixing the typo from _deduplicate_minifest to _deduplicate_manifest.

133-133: LGTM! Typo fix in method signature.

Consistent with the method call fix above.

143-146: Well-structured normalization pipeline.

The addition of _replace_parent_streams_with_refs() and _clean_dangling_fields() creates a logical flow: handle duplicates → replace parent streams → clean up unused references. This aligns well with the PR objectives to simplify manifest structure while reducing duplication.

194-252: Comprehensive cleanup logic with good performance characteristics.

The implementation effectively finds all $ref paths and removes unreferenced definitions/schemas. I appreciate that it:

Recursively traverses the entire manifest to find references

Only operates on top-level keys (as per previous feedback for performance)

Properly handles empty sections by removing them entirely

Uses startswith() to handle nested reference paths

The nested function approach keeps the logic clean and readable. The implementation looks solid for the cleanup requirements!

aldogonzalez8

APPROVED
This makes sense to me, lets 🚢

lmossman added 4 commits June 12, 2025 13:35

ignore streams without schemas when normalizing them

34f8796

remove schema normalization

970eb3f

add refs for parent streams

78c092b

uncomment replace parent streams with refs

b3b39f8

github-actions bot added bug Something isn't working security labels Jun 14, 2025

run poetry run ruff format

a4f1267

lmossman marked this pull request as ready for review June 14, 2025 00:28

lmossman requested review from Copilot and aldogonzalez8 June 14, 2025 00:28

Copilot AI reviewed Jun 14, 2025

View reviewed changes

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py Show resolved Hide resolved

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py Show resolved Hide resolved

coderabbitai bot reviewed Jun 14, 2025

View reviewed changes

coderabbitai bot approved these changes Jun 14, 2025

View reviewed changes

lmossman added 2 commits June 25, 2025 10:25

use stream hashing instead of direct comparisons

2a1d874

run format

ff36d87

coderabbitai bot reviewed Jun 25, 2025

View reviewed changes

aldogonzalez8 approved these changes Jun 25, 2025

View reviewed changes

lmossman enabled auto-merge (squash) June 27, 2025 01:15

lmossman disabled auto-merge June 27, 2025 01:15

lmossman merged commit 0b4195b into main Jun 27, 2025
27 of 28 checks passed

lmossman deleted the lmossman/fix-schema-normalization branch June 27, 2025 01:15

fix: manifest normalization #599

fix: manifest normalization #599

Uh oh!

Conversation

lmossman commented Jun 14, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Summary by CodeRabbit

Uh oh!

lmossman commented Jun 14, 2025

Uh oh!

github-actions bot commented Jun 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Jun 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested labels

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Jun 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Full)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

aldogonzalez8 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

lmossman commented Jun 14, 2025 •

edited by github-actions bot

Loading

github-actions bot commented Jun 14, 2025 •

edited

Loading

coderabbitai bot commented Jun 14, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

github-actions bot commented Jun 14, 2025 •

edited

Loading