Skip to content

fix: manifest normalization #599

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 27, 2025
Merged

Conversation

lmossman
Copy link
Contributor

@lmossman lmossman commented Jun 14, 2025

What

This PR makes some changes to the manifest_normalizer logic to better align with how the SchemaForm-Based Builder needs things to be normalized.

Notably, this PR:

  • Removes the logic to extract schema_loader.schema fields into the schemas field and point to them with $refs in the streams. The reason for this is that maintaining this structure in the frontend without issue would require complicating the UI logic more than I'd like to at this time. Since we expect most connector maintainers to be using the Builder UI once SchemaForm is fully released, it doesn't seem a big deal to start including the schema definition inline in the stream definition directly.
  • Adds _replace_parent_streams_with_refs, which finds the parent_stream_configs in streams and check if their schema field value matches another stream definition in the manifest, and if so replace it with a $ref pointing to that stream. This cuts down a lot on duplication in the manifests and the Builder UI will be updated in a later PR to properly handle the $ref for that field by showing a dropdown for selecting a stream using a UI override.
  • Adds _clean_dangling_fields, which looks for fields under the definitions and schemas manifest fields which don't have any $ref pointing to them anywhere in the manifest, which indicates that they are unused, and removes them from the manifest. This ensures unnecessary values are removed as manifests are transitioned to the new structure.

Unit tests were adjusted to account for these changes.

Stay tuned for a separate Builder UI PR that will take advantage of these changes.

Summary by CodeRabbit

  • Refactor

    • Enhanced manifest normalization by replacing embedded parent stream objects with references and pruning unreferenced definitions and schemas.
    • Removed previous schema extraction and referencing logic for a simplified process.
  • Tests

    • Added focused unit tests for cleaning dangling fields and replacing parent stream references.
    • Removed obsolete tests and fixtures tied to prior schema referencing approaches.
  • Chores

    • Deleted outdated test fixtures and resources related to removed schema referencing features.

Important

Auto-merge enabled.

This PR is set to merge automatically when all requirements are met.

Note

Auto-merge may have been disabled. Please check the PR status to confirm.

@github-actions github-actions bot added bug Something isn't working security labels Jun 14, 2025
@lmossman
Copy link
Contributor Author

/format-fix

Copy link

github-actions bot commented Jun 14, 2025

PyTest Results (Fast)

3 675 tests  +9   3 664 ✅ +9   6m 27s ⏱️ +32s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit ff36d87. ± Comparison against base commit dcdd18b.

This pull request removes 1 and adds 10 tests. Note that renamed tests count towards both.
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_with_linked_definitions_url_base_authenticator_when_multiple_streams_reference_the_same_schema
unit_tests.sources.declarative.incremental.test_concurrent_perpartitioncursor ‑ test_incremental_parent_state[test_incremental_parent_state_all_records_without_cursor-manifest2-mock_requests2-expected_records2-2-initial_state2-expected_state2]
unit_tests.sources.declarative.incremental.test_concurrent_perpartitioncursor ‑ test_incremental_parent_state[test_incremental_parent_state_one_record_without_cursor-manifest1-mock_requests1-expected_records1-2-initial_state1-expected_state1]
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_keeps_parent_paths
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_empty_sections
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_unreferenced_definitions
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_unreferenced_schemas
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_handles_multiple_partition_routers
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_no_match
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_replaces_with_ref
unit_tests.sources.declarative.resolvers.test_parametrized_components_resolver ‑ test_dynamic_streams_with_parametrized_components_resolver

♻️ This comment has been updated with latest results.

@lmossman lmossman marked this pull request as ready for review June 14, 2025 00:28
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the manifest normalization process to remove separate schema referencing, inline schemas directly in streams, and add cleanup steps to reduce duplication and remove unused definitions.

  • Renames the deduplication method to correct a typo and centralizes all cleanup under _deduplicate_manifest
  • Removes the old schema extraction/reference logic in favor of inline schemas
  • Introduces _replace_parent_streams_with_refs to collapse duplicate parent streams into $refs and _clean_dangling_fields to prune unused definitions/schemas

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py Removed obsolete linked-schema tests and added tests for dangling-field cleanup and parent-stream $ref replacement
unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml Deleted outdated test manifest resource
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py Renamed _deduplicate_minifest_deduplicate_manifest, removed schema‐extraction methods, and added _replace_parent_streams_with_refs & _clean_dangling_fields
Comments suppressed due to low confidence (1)

airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py:110

  • The docstring for normalize still mentions "resolving schema references", but the old _reference_schemas logic was removed. Please update it to reference the new parent-stream replacement and dangling-field cleanup steps.
def normalize(self) -> ManifestType:

Copy link
Contributor

coderabbitai bot commented Jun 14, 2025

📝 Walkthrough

Walkthrough

This change overhauls the manifest normalization logic in the declarative source parser. It removes schema extraction and referencing, introduces new steps for replacing embedded parent stream objects with references, and cleans up unreferenced definitions and schemas. Test fixtures and unit tests are updated to reflect the revised normalization behavior.

Changes

File(s) Change Summary
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py Refactored manifest normalization: removed schema extraction/referencing methods, added _replace_parent_streams_with_refs and _clean_dangling_fields, and updated deduplication logic.
unit_tests/sources/declarative/parsers/conftest.py Updated and removed fixtures to match new normalization logic: replaced schema references with inlined schemas or removed schema loaders and root schemas; deleted fixtures related to abnormal schemas.
unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml Deleted the abnormal schemas manifest YAML file, removing a test resource for previously supported normalization cases.
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py Removed a high-level test for linked definitions/authenticator normalization; added granular tests for _clean_dangling_fields and _replace_parent_streams_with_refs methods, verifying correct reference replacement and cleanup behavior.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ManifestNormalizer
    participant Manifest

    User->>ManifestNormalizer: normalize(manifest)
    ManifestNormalizer->>ManifestNormalizer: _deduplicate_manifest()
    ManifestNormalizer->>ManifestNormalizer: _replace_parent_streams_with_refs()
    ManifestNormalizer->>ManifestNormalizer: _clean_dangling_fields()
    ManifestNormalizer-->>User: normalized manifest
Loading

Possibly related PRs

  • airbytehq/airbyte-python-cdk#447: Introduced initial ManifestNormalizer with schema deduplication and referencing methods; this PR replaces and expands upon that approach.

Suggested labels

declarative-component-schema, airbyte-python-cdk

Suggested reviewers

  • maxi297

Would you like to review if the new normalization logic covers all your intended edge cases, or should we add more tests for additional manifest structures, wdyt?

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (3)

73-80: Doc-string typos and stale wording

The class doc-string still mentions “appliying” and the (miss-spelled) normilize() extension point, but that method no longer exists. Shall we clean the wording and spelling so future readers aren’t confused, wdyt?


95-105: Public-API doc out of sync with implementation

normalize() advertises a second step “resolves and references schemas”, yet the schema-ref logic was removed in this PR. Could we update the doc-string (and maybe the numbered list) to reflect the new flow, wdyt?


156-161: Redundant local import copy

copy is already imported at module level (line 5). Dropping the in-function import would avoid shadowing and keeps import style consistent. Patch example – ok to apply?

-        import copy
-
-        streams = self._normalized_manifest.get(STREAMS_TAG, [])
-        # Use deep copy for comparison to avoid mutation issues
-        stream_copies = [copy.deepcopy(s) for s in streams]
+        streams = self._normalized_manifest.get(STREAMS_TAG, [])
+        # Use deep copy for comparison to avoid mutation issues
+        stream_copies = [copy.deepcopy(s) for s in streams]
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py (1)

51-82: Tests rely on private helpers

The new unit tests call the private methods _clean_dangling_fields and _replace_parent_streams_with_refs directly. Would it be safer to exercise these behaviours via the public normalize() API (or expose a public hook) so refactors don’t force test rewrites, wdyt?

unit_tests/sources/declarative/parsers/conftest.py (1)

146-171: Minor fixture optimisation

The repeated JSON-schema boilerplate ("$schema": …, additionalProperties) appears verbatim in every inline schema. Maybe extracting it into a shared constant or using pytest parametrize could shrink the fixture and improve readability – open to it?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dcdd18b and a4f1267.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (3 hunks)
  • unit_tests/sources/declarative/parsers/conftest.py (6 hunks)
  • unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml (0 hunks)
  • unit_tests/sources/declarative/parsers/test_manifest_normalizer.py (1 hunks)
💤 Files with no reviewable changes (1)
  • unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-amplitude
  • GitHub Check: Check: source-intercom
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: SDM Docker Image Build
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (1)

200-215: Potential false negatives on non-local $refs

find_all_refs strips the first two chars (#/) unconditionally. If a $ref points outside the document (e.g. "../common.yaml#/definitions/foo"), the leading chars won’t be #/, so the slice produces an off-by-two result. Should we guard with value.startswith("#/") before slicing to avoid truncating external refs, wdyt?

Copy link

github-actions bot commented Jun 14, 2025

PyTest Results (Full)

3 678 tests  +9   3 667 ✅ +9   17m 41s ⏱️ -53s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit ff36d87. ± Comparison against base commit dcdd18b.

This pull request removes 1 and adds 10 tests. Note that renamed tests count towards both.
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_with_linked_definitions_url_base_authenticator_when_multiple_streams_reference_the_same_schema
unit_tests.sources.declarative.incremental.test_concurrent_perpartitioncursor ‑ test_incremental_parent_state[test_incremental_parent_state_all_records_without_cursor-manifest2-mock_requests2-expected_records2-2-initial_state2-expected_state2]
unit_tests.sources.declarative.incremental.test_concurrent_perpartitioncursor ‑ test_incremental_parent_state[test_incremental_parent_state_one_record_without_cursor-manifest1-mock_requests1-expected_records1-2-initial_state1-expected_state1]
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_keeps_parent_paths
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_empty_sections
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_unreferenced_definitions
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_clean_dangling_fields_removes_unreferenced_schemas
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_handles_multiple_partition_routers
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_no_match
unit_tests.sources.declarative.parsers.test_manifest_normalizer ‑ test_replace_parent_streams_with_refs_replaces_with_ref
unit_tests.sources.declarative.resolvers.test_parametrized_components_resolver ‑ test_dynamic_streams_with_parametrized_components_resolver

♻️ This comment has been updated with latest results.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (1)

150-192: Efficient parent stream reference replacement logic.

The hash-based lookup approach nicely addresses the previous performance concern about O(n²) comparisons. A few observations:

  • The handling of both single router and list of routers is robust
  • Only targeting SubstreamPartitionRouter type seems intentional and appropriate
  • The deep equality check via hashing ensures accurate matching

One question: should we consider adding a comment about the MD5 hash collision possibility, even though it's extremely low probability for this use case? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a1d874 and ff36d87.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-amplitude
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (4)

111-111: LGTM! Typo fix in method call.

Good catch on fixing the typo from _deduplicate_minifest to _deduplicate_manifest.


133-133: LGTM! Typo fix in method signature.

Consistent with the method call fix above.


143-146: Well-structured normalization pipeline.

The addition of _replace_parent_streams_with_refs() and _clean_dangling_fields() creates a logical flow: handle duplicates → replace parent streams → clean up unused references. This aligns well with the PR objectives to simplify manifest structure while reducing duplication.


194-252: Comprehensive cleanup logic with good performance characteristics.

The implementation effectively finds all $ref paths and removes unreferenced definitions/schemas. I appreciate that it:

  • Recursively traverses the entire manifest to find references
  • Only operates on top-level keys (as per previous feedback for performance)
  • Properly handles empty sections by removing them entirely
  • Uses startswith() to handle nested reference paths

The nested function approach keeps the logic clean and readable. The implementation looks solid for the cleanup requirements!

Copy link
Contributor

@aldogonzalez8 aldogonzalez8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVED
This makes sense to me, lets 🚢

@lmossman lmossman enabled auto-merge (squash) June 27, 2025 01:15
@lmossman lmossman disabled auto-merge June 27, 2025 01:15
@lmossman lmossman merged commit 0b4195b into main Jun 27, 2025
27 of 28 checks passed
@lmossman lmossman deleted the lmossman/fix-schema-normalization branch June 27, 2025 01:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working security
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants