-
Notifications
You must be signed in to change notification settings - Fork 24
fix: (CDK) (Manifest) - Add Manifest Normalization
module (reduce commonalities + handle schema $refs)
#447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
airbyte_cdk/sources/declarative/parsers/manifest_reference_resolver.py
Outdated
Show resolved
Hide resolved
@bazarnov as discussed on our call, here are some examples of input YAML manifest and expected output: First input
First output
Second input (in this case there are more occurrences of "someotherapi" than "pokeapi", so "someotherapi" is the one that gets deduped and "pokeapi" is left in a duplicated state)
Second output
And here is an example of if the shareable fields are already being shared: Input
You can see in the output, we don't try to deduplicate that field in that case, since it is already pointing to Output
|
📝 WalkthroughWalkthroughThis update introduces manifest normalization to the declarative source pipeline. A new Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Handler as connector_builder_handler.py
participant Source as ManifestDeclarativeSource
participant Normalizer as ManifestNormalizer
User->>Handler: Provide config (may include __should_normalize)
Handler->>Handler: should_normalize_manifest(config)
Handler->>Source: create_source(config, normalize_manifest)
Source->>Source: Load declarative schema
alt normalize_manifest is True
Source->>Normalizer: normalize(manifest, schema)
Normalizer-->>Source: Normalized manifest
else normalize_manifest is False
Source->>Source: Use manifest as-is
end
Source-->>Handler: Source instance ready
Possibly related PRs
Suggested labels
Suggested reviewers
Would you like me to suggest adding any other reviewers or labels, or does this set look just right to you? Wdyt? ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (7)
airbyte_cdk/sources/declarative/parsers/custom_exceptions.py (1)
24-30
: Good addition of a specialized exception for manifest normalization!This new exception class
ManifestNormalizationException
will help differentiate normalization errors from other types of exceptions, making error handling more precise. The docstring mentions "circular reference" though - should it be more specific to deduplication failures instead? Wdyt?- """ - Raised when a circular reference is detected in a manifest. - """ + """ + Raised when manifest normalization fails, such as during deduplication. + """airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
2187-2195
: Minor grammar tweak on thepath
description?
The updated description reads “The Path the specific API endpoint that this stream represents…”—could we make it “The path of the specific API endpoint that this stream represents” for extra clarity? wdyt?unit_tests/sources/declarative/parsers/conftest.py (1)
10-173
: Fixture duplication & maintenance burden – consider trimming test data?The fixture
manifest_with_multiple_url_base()
embeds five almost‑identical stream definitions differing only byname
/path
and the requester reference. That makes the file bulky and hard to scan.
Would it be lighter to generate these streams programmatically inside the fixture (e.g. via a small helper) so future changes touch fewer lines, wdyt?airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)
123-132
: Fail‑open fallback may mask bugsIf
ManifestNormalizer
raises, the code silently keeps the un‑normalized manifest. Could we at least log the exception atDEBUG
so issues do not go unnoticed, wdyt?unit_tests/sources/declarative/parsers/test_manifest_normalizer.py (1)
6-14
: Tests depend on a “private” helperImporting
_get_declarative_component_schema()
ties the test to a non‑public API.
Would exposing a tiny public helper (or loading the YAML inside the test) make the contract clearer and avoid name‑mangling underscore imports, wdyt?airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (2)
72-80
: Typos in class docstring & method nameSmall polish point: appliying, normilize() and the method
_deduplicate_minifest
look misspelled.
Renaming the method to_deduplicate_manifest
(and updating the call site) plus fixing the docstring would aid readability, wdyt?
361-363
: Hashing viajson.dumps
– any concern about ordering of lists?
json.dumps(sort_keys=True)
guarantees key order for dicts but not element order for lists, so two logically identical lists with different ordering won’t be deduped.
Is that acceptable for your use‑case, or should we normalise lists (e.g., sort when elements are primitives) before hashing, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
airbyte_cdk/connector_builder/connector_builder_handler.py
(1 hunks)airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(3 hunks)airbyte_cdk/sources/declarative/manifest_declarative_source.py
(12 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(2 hunks)airbyte_cdk/sources/declarative/parsers/custom_exceptions.py
(1 hunks)airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py
(2 hunks)airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py
(1 hunks)airbyte_cdk/sources/declarative/parsers/manifest_reference_resolver.py
(2 hunks)unit_tests/connector_builder/test_connector_builder_handler.py
(3 hunks)unit_tests/sources/declarative/parsers/conftest.py
(1 hunks)unit_tests/sources/declarative/parsers/test_manifest_normalizer.py
(1 hunks)unit_tests/sources/declarative/parsers/test_manifest_reference_resolver.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Check: 'source-shopify' (skip=false)
🔇 Additional comments (15)
unit_tests/sources/declarative/parsers/test_manifest_reference_resolver.py (1)
154-157
: Nice code formatting improvement!This change improves readability by breaking the dictionary definition across multiple lines instead of having it on a single line. It makes the structure easier to scan visually while maintaining the same functionality.
unit_tests/connector_builder/test_connector_builder_handler.py (3)
500-506
: Good formatting improvement!The arguments to
handle_request
are now more readable, with each argument on its own line with proper indentation. This is consistent with Python coding best practices for readability.
518-524
: Consistent formatting applied here too, looks good!This matches the formatting pattern applied to the other
handle_request
call, maintaining consistency throughout the codebase.
928-934
: Consistent formatting applied here as well!The same multi-line argument formatting is applied to this
handle_request
call, completing the consistent pattern across the file.airbyte_cdk/sources/declarative/parsers/manifest_reference_resolver.py (2)
6-6
: Good type import addition!Adding the
Dict
import from typing to support the updated return type annotation.
102-102
: More precise return type annotation!Changed the return type from
Mapping[str, Any]
toDict[str, Any]
, which is more specific and accurately reflects what the method actually returns. This provides better type information for static analysis tools and developers.airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
2177-2186
: Alignurl_base
description with schema YAML
Great job adding the definite article to theurl_base
description so it now matches the declarative schema in the YAML—this consistency will help both users and the normalizer pick up the right metadata. wdyt?airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)
1872-1874
: Addedlinkable
property tourl_base
- looks great!The addition of
linkable: true
here is a key part of the manifest normalization feature, marking the URL base as a candidate for deduplication and reference-based sharing across streams. The title and description improvements also enhance clarity - nice touch adding "The" for more consistent documentation style, wdyt?
1891-1893
: Improved description formattingSimilar style improvement adding "The" to the path description for consistency. Good attention to detail in maintaining a consistent documentation style throughout the schema.
1908-1909
: Addedlinkable
property toauthenticator
- great!Adding the
linkable: true
attribute to the authenticator property follows the same pattern as forurl_base
, allowing the new normalization module to identify and deduplicate shared authenticator configurations. This will help reduce duplication in manifests where multiple streams use the same authentication method.airbyte_cdk/connector_builder/connector_builder_handler.py (2)
59-66
: Well-implemented check for manifest normalization flagThis new function cleanly encapsulates the logic for determining whether manifest normalization should be applied. The docstring is clear and descriptive, and the implementation follows good practices by providing a default value (False) if the flag is not present.
74-74
: Correctly integrated normalization flag with source creationThe function call to
should_normalize_manifest(config)
is appropriately placed to pass the normalization flag to theManifestDeclarativeSource
constructor. This integrates the new normalization feature while maintaining backward compatibility by defaulting to False.airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py (2)
7-7
: Improved import organizationThe import statement now includes
Dict
explicitly, which is needed for the return type annotation update below. Good practice to include all required types in the imports.
98-98
: More specific return type annotationChanging the return type from
Mapping[str, Any]
toDict[str, Any]
provides more specificity about the actual return value. This matches what the method actually returns (adict
instance), making the type hints more accurate. This kind of precision in type annotations improves type checking and IDE support.airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (1)
146-164
:_prepare_definitions
deletes every non‑linked
entryPurging
definitions.*
exceptlinked
works for the current UI flow, but could silently drop user‑defined reusable objects (e.g. auth configs) that are still referenced elsewhere.
Should we restrict the pruning to keys we know are safe to remove (e.g.streams
,requester_*
) or at least gate it behind an explicit flag, wdyt?
@bazarnov I tried testing this locally by running my local airbyte-platform-internal repo on my FE PR's branch, pointing to this PR's CDK branch. When I opened the Builder in YAML mode and put in this manifest: manifest.yaml
Then I got back this error from the
So it seems like the CDK is trying to validate the manifest again after you have deduplicated the I think the behavior I expected is that the
With this approach, we shouldn't ever try to validate the result of deduplicating the manifest against the json schema, as we will always expect that to fail. |
…ocessing > type propagation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)
65-76
: Reuse the parsed schema vialru_cache
(same suggestion as last time 👍)The helper reloads & parses the YAML on every
ManifestDeclarativeSource
instantiation. Given that the schema is immutable at runtime, would you like to decorate_get_declarative_component_schema()
with@functools.lru_cache(maxsize=1)
(and importfunctools
) so the IO happens only once, spreading the cost across all sources, wdyt?While you are touching the helper, you could also switch from the more verbose
yaml.load(..., Loader=yaml.SafeLoader)
to the dedicatedyaml.safe_load(...)
helper which conveys the intent more clearly.Example patch:
-from copy import deepcopy +from copy import deepcopy +from functools import lru_cache ... -import yaml +import yaml ... -def _get_declarative_component_schema() -> Dict[str, Any]: +@lru_cache(maxsize=1) +def _get_declarative_component_schema() -> Dict[str, Any]: ... - declarative_component_schema = yaml.load(raw_component_schema, Loader=yaml.SafeLoader) + declarative_component_schema = yaml.safe_load(raw_component_schema)
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/manifest_declarative_source.py (2)
94-108
: Parameter typing –bool
rather thanOptional[bool]
?
normalize_manifest
has a default value ofFalse
, which means the argument can never beNone
in practice. Would it be clearer (and help static analysers) to type‑hint it simply asbool = False
and drop theOptional
wrapper in both the signature and the attribute definition, wdyt?- normalize_manifest: Optional[bool] = False, + normalize_manifest: bool = False, ... - self._should_normalize = normalize_manifest + self._should_normalize: bool = normalize_manifest
148-161
: Docstring says “Returns None” but the function returns a manifest
_preprocess_manifest()
returnspropagated_manifest
, yet the docstring claims it returnsNone
. Updating the docstring (and possibly the return‑type hint) would avoid confusion for future readers. Something like:- Returns: - None + Returns: + Dict[str, Any]: the fully resolved and propagated manifestwdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/manifest_declarative_source.py
(11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
@bazarnov that original issue from my comment above is now fixed. However, I have encountered another issue here, with this manifest put into YAML mode of Builder UI pointed at this branch: manifest.yaml
Notice how in this manifest:
For this manifest, I would expect this branch's Expected response
where:
But what I actually received was this: Actual response
These are the things wrong with the actual response:
|
@lmossman The input manifest is incorrect: The schemas:
berries:
type: object
$schema: http://json-schema.org/draft-07/schema#
properties: {}
additionalProperties: true
pokemon:
type: object
$schema: http://json-schema.org/draft-07/schema#
properties: {}
additionalProperties: true
location:
type: object
$schema: http://json-schema.org/draft-07/schema#
properties: {}
additionalProperties: true
trainers:
type: object
$schema: http://json-schema.org/draft-07/schema#
properties: {}
additionalProperties: true
items:
type: object
$schema: http://json-schema.org/draft-07/schema#
properties: {}
additionalProperties: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (2)
312-321
:_collect
assumes parent objects always have"type"
_get_occurance_samples
accessesobj["type"]
; if a duplicate primitive sits under an object lacking that field, aKeyError
bubbles up and aborts normalization.
Would addingobj.get("type", "Unknown")
(or skipping such occurrences) make the process more robust, wdyt?- return ( - obj["type"], + return ( + obj.get("type", "Unknown"),
432-436
: 🛠️ Refactor suggestionConsider error handling for key access in _get_occurance_samples
This function assumes that
obj
will always have a"type"
key, but that might not be the case for all objects. Consider usingobj.get("type", "Unknown")
to handle cases where the key is missing.return ( - obj["type"], + obj.get("type", "Unknown"), path[-1], value, ) # Return the component's name as the last part of its path
🧹 Nitpick comments (8)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (5)
111-111
: Is there a typo in this method name?The method name
_deduplicate_minifest()
appears to contain a typo - I think you meant "manifest" instead of "minifest", wdyt?- self._deduplicate_minifest() + self._deduplicate_manifest()
134-146
: Method name should match its usageThis method definition contains the same typo as its invocation.
- def _deduplicate_minifest(self) -> None: + def _deduplicate_manifest(self) -> None:
117-118
: Consider adding actual debug loggingThe TODO comment indicates debug logging should be enabled, but no logging implementation exists yet. Consider adding proper debug logging using Python's
logging
module to help with troubleshooting.+import logging + +# At the beginning of the class + def __init__( + self, + resolved_manifest: ManifestType, + declarative_schema: DefinitionsType, + ) -> None: + self._logger = logging.getLogger("airbyte.sources.declarative.parsers.manifest_normalizer") + ... # In the normalize method - # if any error occurs, we just return the original manifest. - # TODO: enable debug logging + # if any error occurs, we just return the original manifest + self._logger.debug("Manifest normalization failed", exc_info=True)
400-418
: Parameter in docstring doesn't match method signatureThe docstring refers to a
definitions
parameter that doesn't exist in the method signature. Consider updating the docstring to match the actual method parameters.""" Get the value of a linked definition by its key. Args: key: The key to check - definitions: The definitions dictionary with definitions + type_key: The type key in linked definitions Returns: The value of the linked definition """
18-19
: Typo in class docstringThere's a small typo in the class docstring - "appliying" should be "applying".
""" - This class is responsible for normalizing the manifest by appliying processing such as: + This class is responsible for normalizing the manifest by applying processing such as: - removing duplicated definitions - replacing them with references.unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml (3)
212-212
: Add newline at end of fileStatic analysis detected that there's no newline character at the end of the file. It's a standard best practice to end files with a newline character.
location_type: type: string +
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 212-212: no new line character at the end of file
(new-line-at-end-of-file)
28-34
: Could authenticator also be in linked definitions?I notice that all streams use the same
ApiKeyAuthenticator
configuration. Have you considered moving this to the linked definitions as well to reduce duplication? The normalizer should handle this, but it might be clearer to define it explicitly if it's intentionally shared.
198-205
: MissingadditionalProperties
in pokemon schemaThe
pokemon
schema doesn't explicitly define whether additional properties are allowed, while the other schemas do (additionalProperties: true
). For consistency, would it make sense to add this?pokemon_type: type: integer + additionalProperties: true
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py
(1 hunks)unit_tests/sources/declarative/parsers/conftest.py
(1 hunks)unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml
(1 hunks)unit_tests/sources/declarative/parsers/test_manifest_normalizer.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- unit_tests/sources/declarative/parsers/conftest.py
- unit_tests/sources/declarative/parsers/test_manifest_normalizer.py
🧰 Additional context used
🪛 YAMLlint (1.35.1)
unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml
[error] 212-212: no new line character at the end of file
(new-line-at-end-of-file)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/parsers/manifest_normalizer.py (3)
89-90
: LGTM: Proper initialization of fieldsThe initialization properly makes a deep copy of the resolved manifest, initializes class variables, and retrieves the linkable tags from the schema. This ensures your class doesn't accidentally modify the input manifests.
110-115
: LGTM: Error handling in normalize() methodThe normalize method has proper error handling that falls back to the original manifest if any exceptions occur during normalization. This is a good defensive programming practice that ensures the system remains robust even if normalization fails.
317-320
: LGTM: Proper handling of nested dictionariesThe recursive traversal in
_collect
correctly processes nested dictionaries first and then checks for linkable keys. This approach ensures that the entire object structure is evaluated properly.unit_tests/sources/declarative/parsers/resources/abnormal_schemas_manifest.yaml (2)
10-14
: LGTM: Good use of linked definitionsThe manifest properly defines linked definitions for the
HttpRequester
with a sharedurl_base
. This structure will work well with the manifest normalizer to encourage reuse across streams.
17-43
: LGTM: Well-structured stream definitionThe pokemon stream is well-structured with proper configuration of the retriever, decoder, requester, authenticator, and record selector. The schema is also properly referenced using
$ref
.
Schemas are now extracted for each stream and set under the |
The behavior looks good to me now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py (4)
16-16
: Should this resolver be a fixture instead of a global variable?The
resolver
is defined at the module level. Would it make more sense to define it as a pytest fixture to better control its lifecycle and make the tests more isolated? wdyt?
19-33
: Test function looks good with clear purpose!The test is well-structured with descriptive naming and a helpful docstring. It follows a clear pattern of:
- Getting the schema
- Preprocessing the manifest
- Normalizing the manifest
- Asserting equality with the expected result
One small suggestion: would adding type hints for the function parameters improve readability? wdyt?
def test_when_multiple_url_base_are_resolved_and_most_frequent_is_shared( - manifest_with_multiple_url_base, - expected_manifest_with_multiple_url_base_normalized, + manifest_with_multiple_url_base: dict, + expected_manifest_with_multiple_url_base_normalized: dict, ) -> None:
51-73
: The third test covers a complex case but could use more clarityThe test is well-structured and covers an important case. The docstring provides good context but the term "abnormal schemas" might not be immediately clear to readers unfamiliar with the codebase.
Based on previous review comments, I understand this refers to cases where multiple streams reference the same schema. Would adding a brief explanation about this in the docstring be helpful for future maintainers? wdyt?
def test_with_linked_definitions_url_base_authenticator_when_multiple_streams_reference_the_same_schema( manifest_with_linked_definitions_url_base_authenticator_abnormal_schemas, expected_manifest_with_linked_definitions_url_base_authenticator_normalized, ) -> None: """ This test is to check that the manifest is normalized when the `url_base` and the `authenticator` is linked between the definitions and the `url_base` is present in the manifest. The `authenticator` is not a normal schema, but a reference to another schema. The test also verifies the `stream.schema_loader.schema` is properly extracted to the `schemas.<stream_name>`. + + "Abnormal schemas" refers to a case where not all streams have their own unique schema definition. + In this test, 5 streams are declared but only 3 schemas are defined (3 streams re-use the same schema, + while 2 other streams use 2 different schemas). """
1-74
: Consider adding a module-level docstringThis file would benefit from a module-level docstring explaining its purpose and relation to the manifest normalization feature. This would help future developers understand the test suite's overall purpose.
# # Copyright (c) 2023 Airbyte, Inc., all rights reserved. # +""" +Unit tests for the ManifestNormalizer, which deduplicates and restructures declarative manifests +by extracting commonly used definitions into a `definitions.linked` section. +""" from airbyte_cdk.sources.declarative.manifest_declarative_source import (
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (3)
unit_tests/sources/declarative/parsers/test_manifest_normalizer.py (3)
1-14
: Imports and setup look good!The imports are well-organized and the module is properly structured. The copyright header is present and correctly formatted.
35-49
: Second test follows good practices!This test follows the same pattern as the first one with clear naming and documentation. The same type hint suggestion applies here as well.
1-74
: Missing tests for error casesThe PR mentions handling circular references and errors during normalization via a new
ManifestNormalizationException
. Should there be tests for those error cases to ensure the exception is properly raised when circular references are detected? wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)
181-192
: Should we wrap normalization errors to keep exception surfacing consistent?This block directly calls
ManifestNormalizer.normalize()
, allowing aManifestNormalizationException
to bubble up. A previous review suggested catching it and re-raising aValidationError
(orAirbyteTracedException
) so callers see a familiar error type. That suggestion still applies here. Would you like to adopt that wrapper for a smoother DX, wdyt?
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/manifest_declarative_source.py (2)
148-171
: Docstring return type & description seem out of sync with implementationThe docstring states that
_preprocess_manifest
“modifies the input manifest in place” and “Returns: None”, yet the function actually returns a newDict[str, Any]
instance (propagated_manifest
). Could we update the docstring (or the implementation) so the contract is crystal-clear, wdyt?
94-108
: Minor typing nit — do we needOptional
for a boolean flag that defaults toFalse
?
normalize_manifest: Optional[bool] = False
effectively narrows tobool
becauseNone
is indistinguishable fromFalse
downstream. Switching to a plainbool
keeps the signature simpler (and static checkers slightly happier). Something like:- normalize_manifest: Optional[bool] = False, + normalize_manifest: bool = False,No behaviour change, just a little clarity—what do you think?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(3 hunks)airbyte_cdk/sources/declarative/manifest_declarative_source.py
(11 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)
65-81
: Could we memo-ize the schema load to avoid repeated I/O, wdyt?
_get_declarative_component_schema()
is executed for everyManifestDeclarativeSource
instantiation, which means the YAML file is re-read and parsed each time. Considering the schema is immutable at runtime, wrapping the function with@functools.lru_cache(maxsize=1)
(or promoting it to a module-level constant) would give us free caching while keeping the public surface untouched.+from functools import lru_cache ... - def _get_declarative_component_schema() -> Dict[str, Any]: +@lru_cache(maxsize=1) +def _get_declarative_component_schema() -> Dict[str, Any]:This micro-optimization removes unnecessary disk I/O and YAML parsing without changing call sites.
[ suggest_optional_refactor ]
What
Resolves:
How
Normalization of Declarative Manifest:
should_normalize_manifest
function to check if thenormalization
is needed, which is then applied in theManifestDeclarativeSource
class.ManifestNormalizer
todeduplicate
andrestructure
themanifest
, extracting commonly used definitions into adefinitions.linked
section.Declarative Component Schema:
linkable
property to schema definitions forshared
components likeurl_base
andauthenticator
, as of the current state of theForm-Based UI
project (MVP)Error Handling:
ManifestNormalizationException
for handling issues during normalization such as circular references.Unit Tests Enhancements:
These changes collectively enhance the functionality, reliability, and maintainability of the connector builder and declarative manifest handling in the Airbyte Python CDK.
User Impact
normalize_manifest: bool = False
by default, to not to have any regressions, before we're ready to use it within the UI (/resolve
should be havingnormalize: bool
flag to set the normalization to True)Related PRs
Manifest Migrations
- fix: (PoC) (do not merge) (CDK) (Manifest) - Migrate manifest fields #463Summary by CodeRabbit
Summary by CodeRabbit
New Features
Documentation
Style
Tests