-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Source Apify Dataset: fix broken stream, manifest refactor #30428
Conversation
Before Merging a Connector Pull RequestWow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:
If the checklist is complete, but the CI check is failing,
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments.
airbyte-integrations/connectors/source-apify-dataset/metadata.yaml
Outdated
Show resolved
Hide resolved
...ations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection_wcc.json
Outdated
Show resolved
Hide resolved
...ations/connectors/source-apify-dataset/source_apify_dataset/schemas/item_collection_wcc.json
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-apify-dataset/source_apify_dataset/manifest.yaml
Outdated
Show resolved
Hide resolved
Thanks for the review @marcosmarxm , the code is updated. |
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
Thanks @vdusek I fixed a few problems, there is one thing remaining:
Do you think we should sync these properties (if yes we should add them to the schema)? If not we can also remove them with a remove transformation in the manifest (like here:
|
Hi @flash1293, in the WCC datasets, there should be no properties such as However, these fields are in the old schema (source_apify_dataset/schemas/item_collection.json). So it might have something to do with it. Here is an example of the WCC dataset: [
{
"url": "https://docs.apify.com/academy/web-scraping-for-beginners",
"crawl": {
"loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
"loadedTime": "2023-09-08T15:11:20.522Z",
"referrerUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
"depth": 0,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
"title": "Web scraping for beginners | Academy | Apify Documentation",
"description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.",
"author": null,
"keywords": null,
"languageCode": "en"
},
"screenshotUrl": null,
"text": "Web scraping for beginners...",
"markdown": "## Web scraping for beginners..."
},
{
"url": "https://docs.apify.com/academy/web-scraping-for-beginners/introduction",
"crawl": {
"loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners/introduction",
"loadedTime": "2023-09-08T15:11:32.622Z",
"referrerUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners/introduction",
"title": "Introduction | Academy | Apify Documentation",
"description": "Start learning about web scraping, web crawling, data extraction, and popular tools to start developing your own scraper.",
"author": null,
"keywords": null,
"languageCode": "en"
},
"screenshotUrl": null,
"text": "Introduction\nStart learning about web scraping, ..."
},
"...": "..."
] If I understand it correctly, you tested it with some dataset, and that dataset produced these fields. So it is probably some invalid dataset for this Stream (produced by another Actor). So I'd say we can go with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vdusek I changed the acceptance tests to actually run on a WCC dataset. I had to adjust the schemas a bit to make it work (some fields were missing in the provided schemas), but it's passing now
What
spec.yaml
andmanifest.yaml
, ...How
Recommended reading order
🚨 User Impact 🚨
Pre-merge Actions
Updating a connector
Community member or Airbyter
Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.