@devin-ai-integration devin-ai-integration bot commented Nov 28, 2025

What

This PR addresses OOM (Out of Memory) errors during CHECK operations for the Azure Blob Storage source connector by:

  1. Increasing memory allocation for the check_connection operation to 4096Mi
  2. Adding an end_date filter to allow users to break down large backfills into smaller date ranges

A customer reported exit code 137 during CHECK operations. Exit code 137 is 128 + 9 (SIGKILL), which for a container typically means the process was killed for exceeding its memory limit. Even with increased memory (tested at 1600Mi and 4096Mi), the issue persisted for very large blob containers, so the end_date filter provides a way to limit the scope of files being processed.

How

Memory increase:

  • Added resourceRequirements configuration to metadata.yaml targeting the check_connection job type with 4096Mi memory

End date filter:

  • Added optional end_date field to SourceAzureBlobStorageSpec with the same format as start_date
  • Added validator to ensure end_date is after start_date when both are provided
  • Updated stream_reader.py to skip files with last_modified > end_date during file enumeration
  • Updated documentation with the new configuration option
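
The end-date filtering described above can be sketched roughly as follows. This is a standard-library-only illustration, not the connector's code: `BlobInfo`, `parse_config_date`, `skip_after_end_date`, and the timestamp format are assumptions made for the example, while the PR's actual logic lives in `get_matching_files()` in stream_reader.py.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional

# Assumed timestamp format for this sketch; the PR only states that end_date
# uses "the same format as start_date".
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


@dataclass
class BlobInfo:
    """Hypothetical stand-in for a listed blob (name plus modification time)."""

    name: str
    last_modified: datetime


def parse_config_date(value: Optional[str]) -> Optional[datetime]:
    """Parse a config date string into an aware UTC datetime, or return None."""
    if value is None:
        return None
    return datetime.strptime(value, DATE_FORMAT).replace(tzinfo=timezone.utc)


def skip_after_end_date(
    blobs: Iterable[BlobInfo], end_date: Optional[str]
) -> Iterator[BlobInfo]:
    """Yield blobs one at a time, skipping any with last_modified > end_date."""
    cutoff = parse_config_date(end_date)
    for blob in blobs:
        if cutoff is not None and blob.last_modified > cutoff:
            continue  # modified after the requested window: skip it
        yield blob
```

In this sketch a blob modified exactly at `end_date` is kept, matching the `last_modified > end_date` skip condition stated above, and passing `end_date=None` leaves the listing unfiltered.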

Review guide

  1. metadata.yaml - Memory configuration and version bump
  2. pyproject.toml - Version bump to 0.8.6
  3. source_azure_blob_storage/spec.py - New end_date field and validator
  4. source_azure_blob_storage/stream_reader.py - End date filtering logic in get_matching_files()
  5. docs/integrations/sources/azure-blob-storage.md - Documentation updates and changelog

User Impact

  • Users experiencing OOM errors (exit code 137) during connection checks can now use the end_date filter to limit the date range of files being processed
  • This allows breaking down large backfills into smaller, more manageable chunks
  • The end_date field is optional and backward compatible
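
As an illustration of breaking a backfill into smaller chunks, the sketch below generates consecutive (start_date, end_date) pairs that could be entered into the connector configuration one range at a time. The helper name `backfill_windows`, the window size, and the timestamp format are assumptions for this example and are not part of the PR.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

# Assumed timestamp format; the PR only says end_date matches start_date's format.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def backfill_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[str, str]]:
    """Yield consecutive (start_date, end_date) strings covering [start, end]."""
    if step <= timedelta(0):
        raise ValueError("step must be a positive duration")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # clamp the last window to the overall end
        yield cursor.strftime(DATE_FORMAT), upper.strftime(DATE_FORMAT)
        # Adjacent windows share a boundary instant, so a file modified exactly
        # on a boundary could fall into two windows if both ends are inclusive.
        cursor = upper


if __name__ == "__main__":
    for window_start, window_end in backfill_windows(
        datetime(2024, 1, 1), datetime(2024, 3, 1), timedelta(days=30)
    ):
        print(window_start, "->", window_end)
```

Smaller windows mean fewer files per run; the right window size depends on how many blobs the container holds per time period.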

Can this PR be safely reverted and rolled back?

  • YES 💚

Human review checklist

  • Verify end_date filtering logic correctly skips files with last_modified > end_date
  • Verify validator correctly ensures end_date >= start_date
  • Consider if unit tests should be added for the new end_date functionality
  • Confirm 4096Mi memory is appropriate as a baseline

Updates since last revision

  • Added end_date configuration field to spec with validation
  • Updated stream reader to filter files by end_date before yielding
  • Updated documentation (Cloud and OSS setup guides) to mention end_date option
  • Previous updates: Memory increased from 1600Mi to 4096Mi after initial testing

Requested by: Vai Ignatavicius (@vai-airbyte)

Link to Devin run: https://app.devin.ai/sessions/3e69b91978f642f28f34892c70280051

… to 4096Mi

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration
Contributor Author

Original prompt from Vai
Received message in Slack channel #ask-devin-ai:

Hey @Devin  my customer is using Azure Blob Storage v0.8.5 connector and getting an error `Warning from source: The main container of the CHECK operation returned an exit code 137`
I checked the `source-azure-blob-storage` metadata.yaml file (https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-azure-blob-storage/metadata.yaml) and do not see resourceRequirements configured, so I assume it uses the platform default config.
Can you please:
1. let me know what that default value is
2. Create a PR to increase the memory for check operation for this connector so we can test it using dev image.
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1764324479716909

@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@github-actions
Contributor

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • /format-fix - Fixes most formatting issues.
  • /bump-version - Bumps connector versions.
    • You can specify a custom changelog by passing changelog. Example: /bump-version changelog="My cool update"
    • Leaving the changelog arg blank will auto-populate the changelog from the PR title.
  • /run-cat-tests - Runs legacy CAT tests (Connector Acceptance Tests)
  • /run-live-tests - Runs live tests for the modified connector(s).
  • /run-regression-tests - Runs regression tests for the modified connector(s).
  • /build-connector-images - Builds and publishes a pre-release docker image for the modified connector(s).
  • /publish-connectors-prerelease - Publishes pre-release connector builds (tagged as {version}-dev.{git-sha}) for all modified connectors in the PR.
  • JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
    • /bump-bulk-cdk-version bump=patch changelog='foo' - Bump the Bulk CDK's version. bump can be major/minor/patch.
  • Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.

@devin-ai-integration
Contributor Author

/bump-version type=patch

devin-ai-integration bot and others added 2 commits November 28, 2025 10:16
Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@github-actions
Contributor

github-actions bot commented Nov 28, 2025

source-azure-blob-storage Connector Test Results

37 tests in 2 suites (2 files): 23 passed ✅, 14 skipped 💤, 0 failed ❌. Duration: 1m 35s ⏱️

Results for commit e74aeb1.

♻️ This comment has been updated with latest results.

@devin-ai-integration devin-ai-integration bot changed the title from "chore(source-azure-blob-storage): Increase memory for check operation to 4096Mi" to "chore(source-azure-blob-storage): Increase memory for check operation to 1600Mi" on Nov 28, 2025
@github-actions
Contributor

github-actions bot commented Nov 28, 2025

Deploy preview for airbyte-docs ready!

✅ Preview
https://airbyte-docs-kmcms2ofi-airbyte-growth.vercel.app

Built with commit e74aeb1.
This pull request is being automatically deployed with vercel-action

@vai-airbyte
Contributor

vai-airbyte commented Nov 28, 2025

/publish-connectors-prerelease

Pre-release Connector Publish Started

Publishing pre-release builds for all modified connectors in this PR.
Branch: devin/1764324657-azure-blob-storage-memory

Pre-release versions will be tagged as {version}-dev.2d8dc16af7
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish: FAILED

See workflow run for details and published image tags.

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration devin-ai-integration bot changed the title from "chore(source-azure-blob-storage): Increase memory for check operation to 1600Mi" to "chore(source-azure-blob-storage): Increase memory for check operation to 4096Mi" on Nov 28, 2025
…ckfills

- Add end_date configuration field to SourceAzureBlobStorageSpec
- Add validator to ensure end_date is after start_date
- Update stream reader to filter files by end_date
- Update documentation with end_date option

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration devin-ai-integration bot changed the title from "chore(source-azure-blob-storage): Increase memory for check operation to 4096Mi" to "feat(source-azure-blob-storage): Add end_date filter and increase memory for check operation" on Nov 28, 2025
5. Optionally, enter the **Globs** which dictates which files to be synced. This is a regular expression that allows Airbyte to pattern match the specific files to replicate. If you are replicating all the files within your bucket, use `**` as the pattern. For more precise pattern matching options, refer to the [Path Patterns section](#path-patterns) below.
10. (Optional) Enter the endpoint to use for the data replication.
11. (Optional) Enter the desired start date from which to begin replicating data.
12. (Optional) Enter the desired end date to stop replicating data. This is useful for breaking down large backfills into smaller date ranges.
[markdownlint] reported by reviewdog 🐶
MD029/ol-prefix Ordered list item prefix [Expected: 10; Actual: 12; Style: 1/2/3]

vai-airbyte and others added 3 commits November 28, 2025 12:40
…parameter.

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
- Use 'value' instead of named parameter for validator
- Add allow_reuse=True to fix duplicate validator error

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
…age-memory' into devin/1764324657-azure-blob-storage-memory - resolve conflict with correct validator signature

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
)

@validator("end_date", allow_reuse=True)
def validate_end_date(cls, value: Optional[str], values: Dict[str, Any]) -> Optional[str]:
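
The diff excerpt above shows only the validator's signature. A minimal sketch of what such a check might do is given below as a plain function, so that it runs without pydantic; the body, the error message, and the timestamp format are assumptions and are not taken from the PR. With a pydantic v1 `@validator`, `values` holds only fields validated earlier, so this approach assumes `start_date` is declared before `end_date` in the spec.

```python
from datetime import datetime
from typing import Any, Dict, Optional

# Assumed timestamp format; the PR only says end_date matches start_date's format.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def validate_end_date(value: Optional[str], values: Dict[str, Any]) -> Optional[str]:
    """Return value unchanged if it is unset or not earlier than start_date."""
    if value is None:
        return None  # end_date is optional
    end = datetime.strptime(value, DATE_FORMAT)  # raises ValueError if malformed
    start_raw = values.get("start_date")
    if start_raw is not None:
        start = datetime.strptime(start_raw, DATE_FORMAT)
        # Assumption: equal dates are accepted. The PR text says both "after"
        # and ">=", so the real validator may be stricter than this sketch.
        if end < start:
            raise ValueError("end_date must not be earlier than start_date")
    return value
```

Raising `ValueError` mirrors how pydantic v1 validators report failures, which are then surfaced to the user as a validation error on the field.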