-
Notifications
You must be signed in to change notification settings - Fork 24
fix: manifest normalization #599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/format-fix |
PyTest Results (Fast)3 675 tests +9 3 664 ✅ +9 6m 27s ⏱️ +32s Results for commit ff36d87. ± Comparison against base commit dcdd18b. This pull request removes 1 and adds 10 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the manifest normalization process to remove separate schema referencing, inline schemas directly in streams, and add cleanup steps to reduce duplication and remove unused definitions.
- Renames the deduplication method to correct a typo and centralizes all cleanup under
_deduplicate_manifest
- Removes the old schema extraction/reference logic in favor of inline schemas
- Introduces
_replace_parent_streams_with_refs
to collapse duplicate parent streams into$ref
s and_clean_dangling_fields
to prune unused definitions/schemas
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py | Removed obsolete linked-schema tests and added tests for dangling-field cleanup and parent-stream $ref replacement |
unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml | Deleted outdated test manifest resource |
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py | Renamed _deduplicate_minifest → _deduplicate_manifest , removed schema‐extraction methods, and added _replace_parent_streams_with_refs & _clean_dangling_fields |
Comments suppressed due to low confidence (1)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py:110
- The docstring for
normalize
still mentions "resolving schema references", but the old_reference_schemas
logic was removed. Please update it to reference the new parent-stream replacement and dangling-field cleanup steps.
def normalize(self) -> ManifestType:
📝 WalkthroughWalkthroughThis change overhauls the manifest normalization logic in the declarative source parser. It removes schema extraction and referencing, introduces new steps for replacing embedded parent stream objects with references, and cleans up unreferenced definitions and schemas. Test fixtures and unit tests are updated to reflect the revised normalization behavior. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant ManifestNormalizer
participant Manifest
User->>ManifestNormalizer: normalize(manifest)
ManifestNormalizer->>ManifestNormalizer: _deduplicate_manifest()
ManifestNormalizer->>ManifestNormalizer: _replace_parent_streams_with_refs()
ManifestNormalizer->>ManifestNormalizer: _clean_dangling_fields()
ManifestNormalizer-->>User: normalized manifest
Possibly related PRs
Suggested labels
Suggested reviewers
Would you like to review if the new normalization logic covers all your intended edge cases, or should we add more tests for additional manifest structures, wdyt? ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (3)
73-80
: Doc-string typos and stale wordingThe class doc-string still mentions “appliying” and the (miss-spelled)
normilize()
extension point, but that method no longer exists. Shall we clean the wording and spelling so future readers aren’t confused, wdyt?
95-105
: Public-API doc out of sync with implementation
normalize()
advertises a second step “resolves and references schemas”, yet the schema-ref logic was removed in this PR. Could we update the doc-string (and maybe the numbered list) to reflect the new flow, wdyt?
156-161
: Redundant localimport copy
copy
is already imported at module level (line 5). Dropping the in-function import would avoid shadowing and keeps import style consistent. Patch example – ok to apply?- import copy - - streams = self._normalized_manifest.get(STREAMS_TAG, []) - # Use deep copy for comparison to avoid mutation issues - stream_copies = [copy.deepcopy(s) for s in streams] + streams = self._normalized_manifest.get(STREAMS_TAG, []) + # Use deep copy for comparison to avoid mutation issues + stream_copies = [copy.deepcopy(s) for s in streams]unit_tests/sources/declarative/parsers/test_manifest_normalizer.py (1)
51-82
: Tests rely on private helpersThe new unit tests call the private methods
_clean_dangling_fields
and_replace_parent_streams_with_refs
directly. Would it be safer to exercise these behaviours via the publicnormalize()
API (or expose a public hook) so refactors don’t force test rewrites, wdyt?unit_tests/sources/declarative/parsers/conftest.py (1)
146-171
: Minor fixture optimisationThe repeated JSON-schema boilerplate (
"$schema": …
,additionalProperties
) appears verbatim in every inline schema. Maybe extracting it into a shared constant or usingpytest
parametrize could shrink the fixture and improve readability – open to it?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py
(3 hunks)unit_tests/sources/declarative/parsers/conftest.py
(6 hunks)unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml
(0 hunks)unit_tests/sources/declarative/parsers/test_manifest_normalizer.py
(1 hunks)
💤 Files with no reviewable changes (1)
- unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-amplitude
- GitHub Check: Check: source-intercom
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: SDM Docker Image Build
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (1)
200-215
: Potential false negatives on non-local$ref
s
find_all_refs
strips the first two chars (#/
) unconditionally. If a$ref
points outside the document (e.g."../common.yaml#/definitions/foo"
), the leading chars won’t be#/
, so the slice produces an off-by-two result. Should we guard withvalue.startswith("#/")
before slicing to avoid truncating external refs, wdyt?
PyTest Results (Full)3 678 tests +9 3 667 ✅ +9 17m 41s ⏱️ -53s Results for commit ff36d87. ± Comparison against base commit dcdd18b. This pull request removes 1 and adds 10 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (1)
150-192
: Efficient parent stream reference replacement logic.The hash-based lookup approach nicely addresses the previous performance concern about O(n²) comparisons. A few observations:
- The handling of both single router and list of routers is robust
- Only targeting
SubstreamPartitionRouter
type seems intentional and appropriate- The deep equality check via hashing ensures accurate matching
One question: should we consider adding a comment about the MD5 hash collision possibility, even though it's extremely low probability for this use case? wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-amplitude
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (4)
111-111
: LGTM! Typo fix in method call.Good catch on fixing the typo from
_deduplicate_minifest
to_deduplicate_manifest
.
133-133
: LGTM! Typo fix in method signature.Consistent with the method call fix above.
143-146
: Well-structured normalization pipeline.The addition of
_replace_parent_streams_with_refs()
and_clean_dangling_fields()
creates a logical flow: handle duplicates → replace parent streams → clean up unused references. This aligns well with the PR objectives to simplify manifest structure while reducing duplication.
194-252
: Comprehensive cleanup logic with good performance characteristics.The implementation effectively finds all
$ref
paths and removes unreferenced definitions/schemas. I appreciate that it:
- Recursively traverses the entire manifest to find references
- Only operates on top-level keys (as per previous feedback for performance)
- Properly handles empty sections by removing them entirely
- Uses
startswith()
to handle nested reference pathsThe nested function approach keeps the logic clean and readable. The implementation looks solid for the cleanup requirements!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
APPROVED
This makes sense to me, lets 🚢
What
This PR makes some changes to the
manifest_normalizer
logic to better align with how the SchemaForm-Based Builder needs things to be normalized.Notably, this PR:
schema_loader.schema
fields into theschemas
field and point to them with$refs
in the streams. The reason for this is that maintaining this structure in the frontend without issue would require complicating the UI logic more than I'd like to at this time. Since we expect most connector maintainers to be using the Builder UI once SchemaForm is fully released, it doesn't seem a big deal to start including the schema definition inline in the stream definition directly._replace_parent_streams_with_refs
, which finds the parent_stream_configs in streams and check if theirschema
field value matches another stream definition in the manifest, and if so replace it with a $ref pointing to that stream. This cuts down a lot on duplication in the manifests and the Builder UI will be updated in a later PR to properly handle the $ref for that field by showing a dropdown for selecting a stream using a UI override._clean_dangling_fields
, which looks for fields under thedefinitions
andschemas
manifest fields which don't have any $ref pointing to them anywhere in the manifest, which indicates that they are unused, and removes them from the manifest. This ensures unnecessary values are removed as manifests are transitioned to the new structure.Unit tests were adjusted to account for these changes.
Stay tuned for a separate Builder UI PR that will take advantage of these changes.
Summary by CodeRabbit
Refactor
Tests
Chores
Important
Auto-merge enabled.
This PR is set to merge automatically when all requirements are met.
Note
Auto-merge may have been disabled. Please check the PR status to confirm.