feat(source-azure-blob-storage): Add end_date filter and increase memory for check operation #70246
base: master
… to 4096Mi Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
/bump-version type=patch
Deploy preview for airbyte-docs ready! ✅ Preview Built with commit e74aeb1. |
/publish-connectors-prerelease
…ckfills
- Add end_date configuration field to SourceAzureBlobStorageSpec
- Add validator to ensure end_date is after start_date
- Update stream reader to filter files by end_date
- Update documentation with end_date option
Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
5. Optionally, enter the **Globs** which dictates which files to be synced. This is a regular expression that allows Airbyte to pattern match the specific files to replicate. If you are replicating all the files within your bucket, use `**` as the pattern. For more precise pattern matching options, refer to the [Path Patterns section](#path-patterns) below.
10. (Optional) Enter the endpoint to use for the data replication.
11. (Optional) Enter the desired start date from which to begin replicating data.
12. (Optional) Enter the desired end date to stop replicating data. This is useful for breaking down large backfills into smaller date ranges.
[markdownlint] reported by reviewdog 🐶
MD029/ol-prefix Ordered list item prefix [Expected: 10; Actual: 12; Style: 1/2/3]
airbyte-integrations/connectors/source-azure-blob-storage/source_azure_blob_storage/spec.py
…parameter. Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
- Use 'value' instead of named parameter for validator
- Add allow_reuse=True to fix duplicate validator error
Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
…age-memory' into devin/1764324657-azure-blob-storage-memory - resolve conflict with correct validator signature Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
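For readers following the validator commits above, here is a minimal sketch of the check they describe: reject an `end_date` that is not after `start_date` when both are set. It is a plain-function stand-in, not the validator from `spec.py`. The function name, the timestamp format, and the choice to reject equal dates are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional

# Assumed timestamp format; the PR only states that end_date uses the
# same format as start_date.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def validate_end_date(start_date: Optional[str], end_date: Optional[str]) -> Optional[str]:
    """Return end_date unchanged, or raise if it is not after start_date.

    Hypothetical helper: the connector performs this check inside a
    validator on its spec class rather than in a free function.
    """
    # Both fields are optional, so there is nothing to compare unless both are set.
    if start_date is None or end_date is None:
        return end_date

    start = datetime.strptime(start_date, DATE_FORMAT)
    end = datetime.strptime(end_date, DATE_FORMAT)

    # Assumption: "after" is read strictly, so equal dates are rejected too.
    if end <= start:
        raise ValueError(
            f"end_date ({end_date}) must be after start_date ({start_date})"
        )
    return end_date
```

Returning the value unchanged mirrors how a field validator usually hands the accepted value back to the model.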
What
This PR addresses OOM (Out of Memory) errors during CHECK operations for the Azure Blob Storage source connector by:
- Increasing memory for the `check_connection` operation to 4096Mi
- Adding an `end_date` filter to allow users to break down large backfills into smaller date ranges

A customer reported receiving `exit code 137` during CHECK operations, which indicates the container was killed due to memory limits being exceeded. Even with increased memory (tested at 1600Mi and 4096Mi), the issue persisted for very large blob containers. The `end_date` filter provides a way to limit the scope of files being processed.

How
Memory increase:
- Added `resourceRequirements` configuration to `metadata.yaml` targeting the `check_connection` job type with 4096Mi memory

End date filter:
- Added `end_date` field to `SourceAzureBlobStorageSpec` with the same format as `start_date`
- Added validator to ensure `end_date` is after `start_date` when both are provided
- Updated `stream_reader.py` to skip files with `last_modified > end_date` during file enumeration

Review guide
- `metadata.yaml` - Memory configuration and version bump
- `pyproject.toml` - Version bump to 0.8.6
- `source_azure_blob_storage/spec.py` - New `end_date` field and validator
- `source_azure_blob_storage/stream_reader.py` - End date filtering logic in `get_matching_files()`
- `docs/integrations/sources/azure-blob-storage.md` - Documentation updates and changelog

User Impact
- `end_date` filter to limit the date range of files being processed
- `end_date` field is optional and backward compatible

Can this PR be safely reverted and rolled back?

Human review checklist
- `last_modified > end_date`

Updates since last revision
- Added `end_date` configuration field to spec with validation

Requested by: Vai Ignatavicius (@vai-airbyte)
Link to Devin run: https://app.devin.ai/sessions/3e69b91978f642f28f34892c70280051
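
The end date filtering described above (skip files with `last_modified > end_date` during enumeration) can be sketched as follows. This is an illustration only, not the code in `stream_reader.py`: `BlobInfo` and `filter_by_end_date` are hypothetical names, and the real reader works with Azure SDK objects inside `get_matching_files()`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class BlobInfo:
    """Hypothetical stand-in for a listed blob; only the fields the filter needs."""
    name: str
    last_modified: datetime


def filter_by_end_date(
    blobs: Iterable[BlobInfo], end_date: Optional[datetime]
) -> Iterator[BlobInfo]:
    """Yield blobs whose last_modified is not later than end_date.

    With end_date unset, every blob passes, which keeps the option
    backward compatible.
    """
    for blob in blobs:
        if end_date is not None and blob.last_modified > end_date:
            continue  # modified after the cutoff: skip it
        yield blob


blobs = [
    BlobInfo("jan.csv", datetime(2024, 1, 15, tzinfo=timezone.utc)),
    BlobInfo("feb.csv", datetime(2024, 2, 15, tzinfo=timezone.utc)),
]
cutoff = datetime(2024, 1, 31, tzinfo=timezone.utc)
selected = [b.name for b in filter_by_end_date(blobs, cutoff)]
print(selected)  # ['jan.csv']
```

A blob modified exactly at `end_date` is kept, since the condition in the PR description is a strict `>`.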