feat: Azure blob batch export #44627
- Remove undefined azurite-data volume from docker-compose configs
- Add try/finally to ensure Azure client cleanup on fixture teardown
- Add non-empty check for expected_session_ids in test assertion
- Add UnsupportedCompressionError with proper error handling
- Add feature flag gating for Azure Blob destination
- Extract common utility functions to destinations/utils.py (get_key_prefix, get_manifest_key)
- Remove Azurite from main docker-compose files
- Switch tests to assert_clickhouse_records_in_azure_blob pattern (matching S3/BigQuery/Postgres) to properly account for export pipeline schema transformations
… into shared get_object_key function
- Move S3 key generation tests from s3/test_utils.py to shared test_utils.py
- Update README to include batch-exports compose file in docker command
Start Azurite container in backend tests for Azure Blob Storage tests.
We certainly don't need that many reviewers, holy moly 😅
Migration SQL Changes
Hey 👋, we've detected some migrations on this PR. Here's the SQL output for each migration; make sure they make sense:
🔍 Migration Risk Analysis
We've analyzed your migrations for potential risks. Summary: 0 Safe | 1 Needs Review | 0 Blocked
lol, it's a no-op migration, how is it not safe?
Visual regression: Storybook UI snapshots updated
Changes: 5 snapshots (5 modified, 0 added, 0 deleted)
6 issues found across 35 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="products/batch_exports/backend/temporal/destinations/utils.py">
<violation number="1" location="products/batch_exports/backend/temporal/destinations/utils.py:49">
P1: When `data_interval_start` is `None`, this will produce manifest keys like `"None-2024-01-01_manifest.json"` instead of using a sensible fallback like `"START"`. This is inconsistent with the handling in `get_allowed_template_variables` which uses `data_interval_start or "START"`.</violation>
<violation number="2" location="products/batch_exports/backend/temporal/destinations/utils.py:65">
P1: When `data_interval_start` is `None`, this will produce object keys with base names like `"None-2024-01-01"` instead of using a sensible fallback like `"START"`. This is inconsistent with the handling in `get_allowed_template_variables` which uses `data_interval_start or "START"`.</violation>
</file>
<file name="products/batch_exports/backend/tests/docker-compose.yml">
<violation number="1" location="products/batch_exports/backend/tests/docker-compose.yml:7">
P2: Typo in usage example: `producs` should be `products`. This would cause the command to fail if copied directly.</violation>
</file>
<file name="products/batch_exports/backend/tests/temporal/destinations/azure_blob/conftest.py">
<violation number="1" location="products/batch_exports/backend/tests/temporal/destinations/azure_blob/conftest.py:57">
P2: If `create_container()` raises an exception, the `BlobServiceClient` connection will never be closed since `client.close()` is only in the finally block after `yield`. Consider using async context manager for proper resource management.</violation>
</file>
<file name="products/batch_exports/backend/temporal/destinations/azure_blob_batch_export.py">
<violation number="1" location="products/batch_exports/backend/temporal/destinations/azure_blob_batch_export.py:156">
P2: Normalize `data_interval_start` to a stable sentinel (e.g. "START") when generating blob/manifest keys; otherwise backfills that set `data_interval_start=None` will create filenames starting with `None-...`.</violation>
<violation number="2" location="products/batch_exports/backend/temporal/destinations/azure_blob_batch_export.py:276">
P2: Missing schema normalization that exists in all other batch export destinations. After `wait_for_schema_or_producer`, other destinations normalize the schema to make all fields nullable to handle inconsistent nullability between batches. This could cause failures when processing data with mixed nullability constraints.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
"""Generate object storage key for batch export files."""
key_prefix = get_key_prefix(prefix, data_interval_start, data_interval_end, batch_export_model)

base_file_name = f"{data_interval_start}-{data_interval_end}"
```
P1: When data_interval_start is None, this will produce object keys with base names like "None-2024-01-01" instead of using a sensible fallback like "START". This is inconsistent with the handling in get_allowed_template_variables which uses data_interval_start or "START".
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At products/batch_exports/backend/temporal/destinations/utils.py, line 65:
<comment>When `data_interval_start` is `None`, this will produce object keys with base names like `"None-2024-01-01"` instead of using a sensible fallback like `"START"`. This is inconsistent with the handling in `get_allowed_template_variables` which uses `data_interval_start or "START"`.</comment>
<file context>
@@ -1,4 +1,83 @@
+ """Generate object storage key for batch export files."""
+ key_prefix = get_key_prefix(prefix, data_interval_start, data_interval_end, batch_export_model)
+
+ base_file_name = f"{data_interval_start}-{data_interval_end}"
+
+ if include_file_number:
</file context>
```python
) -> str:
    """Generate manifest file key."""
    key_prefix = get_key_prefix(prefix, data_interval_start, data_interval_end, batch_export_model)
    return posixpath.join(key_prefix, f"{data_interval_start}-{data_interval_end}_manifest.json")
```
P1: When data_interval_start is None, this will produce manifest keys like "None-2024-01-01_manifest.json" instead of using a sensible fallback like "START". This is inconsistent with the handling in get_allowed_template_variables which uses data_interval_start or "START".
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At products/batch_exports/backend/temporal/destinations/utils.py, line 49:
<comment>When `data_interval_start` is `None`, this will produce manifest keys like `"None-2024-01-01_manifest.json"` instead of using a sensible fallback like `"START"`. This is inconsistent with the handling in `get_allowed_template_variables` which uses `data_interval_start or "START"`.</comment>
<file context>
@@ -1,4 +1,83 @@
+) -> str:
+ """Generate manifest file key."""
+ key_prefix = get_key_prefix(prefix, data_interval_start, data_interval_end, batch_export_model)
+ return posixpath.join(key_prefix, f"{data_interval_start}-{data_interval_end}_manifest.json")
+
+
</file context>
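A minimal sketch of the fix the two P1 violations suggest: fall back to the `"START"` sentinel when `data_interval_start` is `None` (as backfills may set it), matching `get_allowed_template_variables`. This is a simplified stand-in — the real helper also derives `key_prefix` via `get_key_prefix`, which is elided here.

```python
import posixpath

def get_manifest_key(prefix, data_interval_start, data_interval_end):
    # Fall back to the "START" sentinel instead of letting None be
    # interpolated into the key (e.g. "None-2024-01-01_manifest.json").
    start = data_interval_start or "START"
    return posixpath.join(prefix, f"{start}-{data_interval_end}_manifest.json")

print(get_manifest_key("exports", None, "2024-01-01"))
# → exports/START-2024-01-01_manifest.json
```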
```yaml
# https://posthog.com/handbook/engineering/developing-locally
#
# Usage:
# docker compose -f producs/batch_exports/backend/tests/docker-compose.yml up
```
P2: Typo in usage example: producs should be products. This would cause the command to fail if copied directly.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At products/batch_exports/backend/tests/docker-compose.yml, line 7:
<comment>Typo in usage example: `producs` should be `products`. This would cause the command to fail if copied directly.</comment>
<file context>
@@ -0,0 +1,17 @@
+# https://posthog.com/handbook/engineering/developing-locally
+#
+# Usage:
+# docker compose -f producs/batch_exports/backend/tests/docker-compose.yml up
+#
+
</file context>
Suggested change:

```diff
-# docker compose -f producs/batch_exports/backend/tests/docker-compose.yml up
+# docker compose -f products/batch_exports/backend/tests/docker-compose.yml up
```
```python
@pytest_asyncio.fixture
async def azurite_container(container_name: str):
    """Create and cleanup Azurite container for test."""
    client = BlobServiceClient.from_connection_string(AZURITE_CONNECTION_STRING)
```
P2: If create_container() raises an exception, the BlobServiceClient connection will never be closed since client.close() is only in the finally block after yield. Consider using async context manager for proper resource management.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At products/batch_exports/backend/tests/temporal/destinations/azure_blob/conftest.py, line 57:
<comment>If `create_container()` raises an exception, the `BlobServiceClient` connection will never be closed since `client.close()` is only in the finally block after `yield`. Consider using async context manager for proper resource management.</comment>
<file context>
@@ -0,0 +1,118 @@
+@pytest_asyncio.fixture
+async def azurite_container(container_name: str):
+ """Create and cleanup Azurite container for test."""
+ client = BlobServiceClient.from_connection_string(AZURITE_CONNECTION_STRING)
+ container_client = client.get_container_client(container_name)
+
</file context>
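The cleanup concern can be addressed with `async with`, since the async `BlobServiceClient` supports the async context-manager protocol and closes its connection on exit even if container creation raises. The sketch below uses a fake stand-in for the Azure client so it runs without the SDK; `FakeBlobServiceClient` and its methods are illustrative, not part of the real API surface.

```python
import asyncio
import contextlib

class FakeBlobServiceClient:
    # Stand-in for azure.storage.blob.aio.BlobServiceClient, which
    # supports "async with" and closes its connection on __aexit__.
    def __init__(self):
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        self.closed = True

    async def create_container(self, name):
        raise RuntimeError("create_container failed")

async def azurite_container():
    client = FakeBlobServiceClient()
    # The context manager guarantees close even when create_container
    # raises, unlike a close() placed only after a fixture's yield.
    async with client:
        with contextlib.suppress(RuntimeError):
            await client.create_container("test")
    return client.closed

print(asyncio.run(azurite_container()))
# → True
```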
```python
return cls(
    container_client=container_client,
    prefix=inputs.prefix,
    data_interval_start=inputs.data_interval_start,
```
P2: Normalize data_interval_start to a stable sentinel (e.g. "START") when generating blob/manifest keys; otherwise backfills that set data_interval_start=None will create filenames starting with None-....
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At products/batch_exports/backend/temporal/destinations/azure_blob_batch_export.py, line 156:
<comment>Normalize `data_interval_start` to a stable sentinel (e.g. "START") when generating blob/manifest keys; otherwise backfills that set `data_interval_start=None` will create filenames starting with `None-...`.</comment>
<file context>
@@ -0,0 +1,382 @@
+ return cls(
+ container_client=container_client,
+ prefix=inputs.prefix,
+ data_interval_start=inputs.data_interval_start,
+ data_interval_end=inputs.data_interval_end,
+ batch_export_model=inputs.batch_export_model,
</file context>
Suggested change:

```diff
-    data_interval_start=inputs.data_interval_start,
+    data_interval_start=inputs.data_interval_start or "START",
```
```python
record_batch_schema = await wait_for_schema_or_producer(queue, producer_task)
if record_batch_schema is None:
    external_logger.info(
        "Batch export will finish early as there is no data matching specified filters in range %s - %s",
        inputs.data_interval_start or "START",
        inputs.data_interval_end or "END",
    )
    return BatchExportResult(records_completed=0, bytes_exported=0)
```
P2: Missing schema normalization that exists in all other batch export destinations. After wait_for_schema_or_producer, other destinations normalize the schema to make all fields nullable to handle inconsistent nullability between batches. This could cause failures when processing data with mixed nullability constraints.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At products/batch_exports/backend/temporal/destinations/azure_blob_batch_export.py, line 276:
<comment>Missing schema normalization that exists in all other batch export destinations. After `wait_for_schema_or_producer`, other destinations normalize the schema to make all fields nullable to handle inconsistent nullability between batches. This could cause failures when processing data with mixed nullability constraints.</comment>
<file context>
@@ -0,0 +1,382 @@
+ stage_folder=inputs.stage_folder,
+ )
+
+ record_batch_schema = await wait_for_schema_or_producer(queue, producer_task)
+ if record_batch_schema is None:
+ external_logger.info(
</file context>
Suggested change:

```diff
 record_batch_schema = await wait_for_schema_or_producer(queue, producer_task)
 if record_batch_schema is None:
     external_logger.info(
         "Batch export will finish early as there is no data matching specified filters in range %s - %s",
         inputs.data_interval_start or "START",
         inputs.data_interval_end or "END",
     )
     return BatchExportResult(records_completed=0, bytes_exported=0)
+import pyarrow as pa
+record_batch_schema = pa.schema(
+    [field.with_nullable(True) for field in record_batch_schema]
+)
```
❌ Hobby deploy smoke test: FAILED
Failing fast because: Timed out after 45 minutes. Connection errors: 173. Run 20851202131 | Consecutive failures: 2
Closing this. Cherry-picked snapshot commit into #43977.
Problem
Just want to cherry pick snapshots into #43977.
Changes
How did you test this code?