-
Notifications
You must be signed in to change notification settings - Fork 24
fix(async-retriever): fix calling transformations twice #576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughThe Changes
Sequence Diagram(s)sequenceDiagram
participant Factory as ModelToComponentFactory
participant DownloadRetriever
participant AsyncRetriever
Factory->>DownloadRetriever: Construct with record_selector (no transformations, no filter, no schema normalization)
Factory->>AsyncRetriever: Construct with its own record_selector (handles transformations, filter, normalization)
Would you like a more detailed breakdown of how the record selector responsibilities are split between the download retriever and the async retriever, wdyt? Possibly related PRs
Suggested labels
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms (6)
🔇 Additional comments (2)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)
3883-3892
: Great test coverage for the transformation fix! Could we clean up a few minor issues, wdyt?The assertions perfectly validate that:
- Transformations are applied to the async retriever's record_selector
- The download retriever's record_selector has no transformations, filters, or schema normalization
A couple of minor suggestions:
- # Validate the transformations are just passed to the async retriever record_selector but not the download retriever record_selector + # Validate transformations are passed to async retriever record_selector + # but not the download retriever record_selectorThe line length warning is valid - could we break this comment into two lines to stay under 100 characters?
Also, accessing
_config
is a protected member, but this is common in unit tests for validation purposes, so it's probably fine here.🧰 Tools
🪛 Pylint (3.3.7)
[convention] 3883-3883: Line too long (136/100)
(C0301)
[warning] 3891-3891: Access to a protected member _config of a client class
(W0212)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py
(2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py
[convention] 3883-3883: Line too long (136/100)
(C0301)
[warning] 3891-3891: Access to a protected member _config of a client class
(W0212)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (2)
3831-3845
: LGTM! Clean transformation setup for testing.The transformations definition follows the established pattern from other tests in this file and creates a simple AddFields transformation that's perfect for validating the fix.
3853-3853
: Good addition to enable transformation testing.Adding the transformations parameter allows the test to validate the fix for the double transformation issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it was done as part of this PR. I see the reason for the extra_fields
stuff but I don't see the reason for the transformations here. Was there a bug when Artem added this? @brianjlai do you have more context on this as the reviewer?
@brianjlai if it helps I see here @artem1205 was working on "identified a gap in the functionality of AsyncRetriever needed for amazon ads". Not sure if there was a track for such gap. |
The only thing I can think of was that while fixing the other bug, he noticed that we weren't passing the transformations into the selector and thought it was another oversight so he added it in. But I don't think any of us questioned it and the oversight is we all didn't consider that the AsyncRetriever was responsible for owning the selector and transformation of records. I think we're fine to make this change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️
What
The async retriever is calling the record transformations twice. Additionally, it appears that the record_selector is duplicated, with one instance not including the record_filter, and then it performs transformations on unfiltered data, which is undesirable.
Yes, but why?
Async retriever requires a download retriever that under the hood is a SimpleRetriever with a record_selector, to this record_selector we pass transformations, but not schema normalization, nor filters.
Async retriever also holds a record_selector with transformations, schema normalization, and filters (all of those if applied in the manifest definition).
Then both record_selectors try to filter_and_transform, causing transformations to apply twice, but with no filters for the record_selector in the download retriever (simple retriever).
What can we do?
If we don't provide schema normalization or filters, then we don't need to apply transformations to this other record_selector unless I'm missing something obvious.
Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/13200
Summary by CodeRabbit