
Conversation

@tomasfarias
Contributor

Problem

Just want to cherry-pick snapshots into #43977.

Changes

How did you test this code?

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Publish to changelog?

buildwithmalik and others added 15 commits January 9, 2026 11:33
- Remove undefined azurite-data volume from docker-compose configs
- Add try/finally to ensure Azure client cleanup on fixture teardown
- Add non-empty check for expected_session_ids in test assertion
- Add UnsupportedCompressionError with proper error handling
- Add feature flag gating for Azure Blob destination
- Extract common utility functions to destinations/utils.py (get_key_prefix, get_manifest_key)
- Remove Azurite from main docker-compose files
- Switch tests to assert_clickhouse_records_in_azure_blob pattern (matching S3/BigQuery/Postgres) to properly account for export pipeline schema transformations
- Move S3 key generation tests from s3/test_utils.py to shared test_utils.py
- Update README to include batch-exports compose file in docker command
Start Azurite container in backend tests for Azure Blob Storage tests.
@tomasfarias marked this pull request as ready for review January 9, 2026 11:48
@tomasfarias
Contributor Author

We certainly don't need that many reviewers holy moly 😅

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

Migration SQL Changes

Hey 👋, we've detected some migrations on this PR. Here's the SQL output for each migration; make sure it makes sense:

posthog/migrations/0963_add_azure_blob_destination.py

BEGIN;
--
-- Alter field type on batchexportdestination
--
-- (no-op)
--
-- Alter field kind on integration
--
-- (no-op)
COMMIT;

Last updated: 2026-01-09 12:05 UTC (4f70c5f)

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

🔍 Migration Risk Analysis

We've analyzed your migrations for potential risks.

Summary: 0 Safe | 1 Needs Review | 0 Blocked

⚠️ Needs Review

May have performance impact

posthog.0963_add_azure_blob_destination
  └─ #1 ⚠️ AlterField
     Field alteration may cause table locks or data loss (check if changing type or constraints)
     model: batchexportdestination, field: type, field_type: CharField
  └─ #2 ⚠️ AlterField
     Field alteration may cause table locks or data loss (check if changing type or constraints)
     model: integration, field: kind, field_type: CharField

Last updated: 2026-01-09 12:06 UTC (4f70c5f)

@tomasfarias
Contributor Author

tomasfarias commented Jan 9, 2026

Summary: 0 Safe | 1 Needs Review | 0 Blocked

lol, it's a no-op migration, how is it not safe?

@tests-posthog
Contributor

tests-posthog bot commented Jan 9, 2026

Visual regression: Storybook UI snapshots updated

Changes: 5 snapshots (5 modified, 0 added, 0 deleted)

What this means:

  • Snapshots have been automatically updated to match current rendering
  • Next CI run will switch to CHECK mode to verify stability
  • If snapshots change again, CHECK mode will fail (indicates flapping)

Next steps:

  • Review the changes to ensure they're intentional
  • Approve if changes match your expectations
  • If unexpected, investigate component rendering

Review snapshot changes →

@cubic-dev-ai bot left a comment

6 issues found across 35 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="products/batch_exports/backend/temporal/destinations/utils.py">

<violation number="1" location="products/batch_exports/backend/temporal/destinations/utils.py:49">
P1: When `data_interval_start` is `None`, this will produce manifest keys like `"None-2024-01-01_manifest.json"` instead of using a sensible fallback like `"START"`. This is inconsistent with the handling in `get_allowed_template_variables` which uses `data_interval_start or "START"`.</violation>

<violation number="2" location="products/batch_exports/backend/temporal/destinations/utils.py:65">
P1: When `data_interval_start` is `None`, this will produce object keys with base names like `"None-2024-01-01"` instead of using a sensible fallback like `"START"`. This is inconsistent with the handling in `get_allowed_template_variables` which uses `data_interval_start or "START"`.</violation>
</file>

<file name="products/batch_exports/backend/tests/docker-compose.yml">

<violation number="1" location="products/batch_exports/backend/tests/docker-compose.yml:7">
P2: Typo in usage example: `producs` should be `products`. This would cause the command to fail if copied directly.</violation>
</file>

<file name="products/batch_exports/backend/tests/temporal/destinations/azure_blob/conftest.py">

<violation number="1" location="products/batch_exports/backend/tests/temporal/destinations/azure_blob/conftest.py:57">
P2: If `create_container()` raises an exception, the `BlobServiceClient` connection will never be closed since `client.close()` is only in the finally block after `yield`. Consider using async context manager for proper resource management.</violation>
</file>

<file name="products/batch_exports/backend/temporal/destinations/azure_blob_batch_export.py">

<violation number="1" location="products/batch_exports/backend/temporal/destinations/azure_blob_batch_export.py:156">
P2: Normalize `data_interval_start` to a stable sentinel (e.g. "START") when generating blob/manifest keys; otherwise backfills that set `data_interval_start=None` will create filenames starting with `None-...`.</violation>

<violation number="2" location="products/batch_exports/backend/temporal/destinations/azure_blob_batch_export.py:276">
P2: Missing schema normalization that exists in all other batch export destinations. After `wait_for_schema_or_producer`, other destinations normalize the schema to make all fields nullable to handle inconsistent nullability between batches. This could cause failures when processing data with mixed nullability constraints.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

"""Generate object storage key for batch export files."""
key_prefix = get_key_prefix(prefix, data_interval_start, data_interval_end, batch_export_model)

base_file_name = f"{data_interval_start}-{data_interval_end}"
@cubic-dev-ai bot Jan 9, 2026

P1: When data_interval_start is None, this will produce object keys with base names like "None-2024-01-01" instead of using a sensible fallback like "START". This is inconsistent with the handling in get_allowed_template_variables which uses data_interval_start or "START".

<file context>
@@ -1,4 +1,83 @@
+    """Generate object storage key for batch export files."""
+    key_prefix = get_key_prefix(prefix, data_interval_start, data_interval_end, batch_export_model)
+
+    base_file_name = f"{data_interval_start}-{data_interval_end}"
+
+    if include_file_number:
</file context>

) -> str:
"""Generate manifest file key."""
key_prefix = get_key_prefix(prefix, data_interval_start, data_interval_end, batch_export_model)
return posixpath.join(key_prefix, f"{data_interval_start}-{data_interval_end}_manifest.json")
@cubic-dev-ai bot Jan 9, 2026

P1: When data_interval_start is None, this will produce manifest keys like "None-2024-01-01_manifest.json" instead of using a sensible fallback like "START". This is inconsistent with the handling in get_allowed_template_variables which uses data_interval_start or "START".

<file context>
@@ -1,4 +1,83 @@
+) -> str:
+    """Generate manifest file key."""
+    key_prefix = get_key_prefix(prefix, data_interval_start, data_interval_end, batch_export_model)
+    return posixpath.join(key_prefix, f"{data_interval_start}-{data_interval_end}_manifest.json")
+
+
</file context>

# https://posthog.com/handbook/engineering/developing-locally
#
# Usage:
# docker compose -f producs/batch_exports/backend/tests/docker-compose.yml up
@cubic-dev-ai bot Jan 9, 2026

P2: Typo in usage example: producs should be products. This would cause the command to fail if copied directly.

<file context>
@@ -0,0 +1,17 @@
+# https://posthog.com/handbook/engineering/developing-locally
+#
+# Usage:
+#   docker compose -f producs/batch_exports/backend/tests/docker-compose.yml up
+#
+
</file context>
Suggested change
# docker compose -f producs/batch_exports/backend/tests/docker-compose.yml up
# docker compose -f products/batch_exports/backend/tests/docker-compose.yml up

@pytest_asyncio.fixture
async def azurite_container(container_name: str):
"""Create and cleanup Azurite container for test."""
client = BlobServiceClient.from_connection_string(AZURITE_CONNECTION_STRING)
@cubic-dev-ai bot Jan 9, 2026

P2: If create_container() raises an exception, the BlobServiceClient connection will never be closed since client.close() is only in the finally block after yield. Consider using async context manager for proper resource management.

<file context>
@@ -0,0 +1,118 @@
+@pytest_asyncio.fixture
+async def azurite_container(container_name: str):
+    """Create and cleanup Azurite container for test."""
+    client = BlobServiceClient.from_connection_string(AZURITE_CONNECTION_STRING)
+    container_client = client.get_container_client(container_name)
+
</file context>
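The leak the bot describes is the classic fixture pitfall: `client.close()` sits in a `finally` after the `yield`, so an exception in `create_container()` before the `yield` skips cleanup entirely. The suggested `async with` form closes the client on any exit path. A sketch of the pattern, using a hypothetical stand-in class rather than the real Azure SDK (the real `azure.storage.blob.aio.BlobServiceClient` is, to my understanding, also usable as an async context manager):

```python
import asyncio


class FakeBlobServiceClient:
    """Hypothetical stand-in for azure.storage.blob.aio.BlobServiceClient,
    used only to illustrate the cleanup pattern."""

    def __init__(self):
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        # The real client closes its connection here.
        self.closed = True

    async def create_container(self, name: str):
        raise RuntimeError("container creation failed")


async def demo() -> bool:
    client = FakeBlobServiceClient()
    try:
        # `async with` guarantees __aexit__ runs even when create_container
        # raises, unlike a try/finally placed after the fixture's yield.
        async with client:
            await client.create_container("test")
    except RuntimeError:
        pass
    return client.closed
```

Running `asyncio.run(demo())` returns `True`: the client is closed despite the setup failure.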

return cls(
container_client=container_client,
prefix=inputs.prefix,
data_interval_start=inputs.data_interval_start,
@cubic-dev-ai bot Jan 9, 2026

P2: Normalize data_interval_start to a stable sentinel (e.g. "START") when generating blob/manifest keys; otherwise backfills that set data_interval_start=None will create filenames starting with None-....

<file context>
@@ -0,0 +1,382 @@
+        return cls(
+            container_client=container_client,
+            prefix=inputs.prefix,
+            data_interval_start=inputs.data_interval_start,
+            data_interval_end=inputs.data_interval_end,
+            batch_export_model=inputs.batch_export_model,
</file context>
Suggested change
data_interval_start=inputs.data_interval_start,
data_interval_start=inputs.data_interval_start or "START",

Comment on lines +276 to +283
record_batch_schema = await wait_for_schema_or_producer(queue, producer_task)
if record_batch_schema is None:
external_logger.info(
"Batch export will finish early as there is no data matching specified filters in range %s - %s",
inputs.data_interval_start or "START",
inputs.data_interval_end or "END",
)
return BatchExportResult(records_completed=0, bytes_exported=0)
@cubic-dev-ai bot Jan 9, 2026

P2: Missing schema normalization that exists in all other batch export destinations. After wait_for_schema_or_producer, other destinations normalize the schema to make all fields nullable to handle inconsistent nullability between batches. This could cause failures when processing data with mixed nullability constraints.

<file context>
@@ -0,0 +1,382 @@
+            stage_folder=inputs.stage_folder,
+        )
+
+        record_batch_schema = await wait_for_schema_or_producer(queue, producer_task)
+        if record_batch_schema is None:
+            external_logger.info(
</file context>
Suggested change
record_batch_schema = await wait_for_schema_or_producer(queue, producer_task)
if record_batch_schema is None:
external_logger.info(
"Batch export will finish early as there is no data matching specified filters in range %s - %s",
inputs.data_interval_start or "START",
inputs.data_interval_end or "END",
)
return BatchExportResult(records_completed=0, bytes_exported=0)
record_batch_schema = await wait_for_schema_or_producer(queue, producer_task)
if record_batch_schema is None:
external_logger.info(
"Batch export will finish early as there is no data matching specified filters in range %s - %s",
inputs.data_interval_start or "START",
inputs.data_interval_end or "END",
)
return BatchExportResult(records_completed=0, bytes_exported=0)
import pyarrow as pa
record_batch_schema = pa.schema(
[field.with_nullable(True) for field in record_batch_schema]
)

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

❌ Hobby deploy smoke test: FAILED

Failing fast because: Timed out after 45 minutes

Connection errors: 173
HTTP 502 count: 0
Last error: ConnectionError


Run 20851202131 | Consecutive failures: 2

@tomasfarias
Contributor Author

Closing this. Cherry-picked snapshot commit into #43977.

@tomasfarias closed this Jan 9, 2026