Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(airbyte-cdk): Add limitation for number of partitions to PerPartitionCursor #42406

Conversation

tolik0
Copy link
Contributor

@tolik0 tolik0 commented Jul 22, 2024

What

Add limitation for the number of partitions in PerPartitionCursor

How

Reused the approach from file-based CDK. Old partitions are deleted when the state overflows.

User Impact

This update should help to avoid OOM errors.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Jul 22, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Sep 6, 2024 10:41am

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Jul 22, 2024
@tolik0 tolik0 force-pushed the tolik0/airbyte-cdk/add-partitions-limitation-to-perpartitioncursor branch from 55f2cb7 to d4a303b Compare August 21, 2024 15:19
@tolik0 tolik0 marked this pull request as ready for review August 21, 2024 15:21
@tolik0 tolik0 requested review from girarda, maxi297, bazarnov and a team August 21, 2024 15:22
Copy link
Contributor

@girarda girarda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tolik0 shouldn't we only use the limit on the number of partitions with the fallback to global state to avoid re-processing partitions as full refresh or is the rationale that re-processing is better than blowing up the state size?

@@ -41,6 +43,7 @@ class PerPartitionCursor(DeclarativeCursor):
Therefore, we need to manage state per partition.
"""

DEFAULT_MAX_PARTITIONS_NUMBER = 10000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can this be private?
_DEFAULT_MAX_PARTITIONS_NUMBER = 10_000

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second question: is it safe to increase this number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the same value as in file-based; the user can have multiple streams with such states, so it is not safe to use a bigger value.

@@ -49,12 +52,15 @@ class PerPartitionCursor(DeclarativeCursor):
def __init__(self, cursor_factory: CursorFactory, partition_router: PartitionRouter):
self._cursor_factory = cursor_factory
self._partition_router = partition_router
self._cursor_per_partition: MutableMapping[str, DeclarativeCursor] = {}
self._cursor_per_partition: OrderedDict[str, DeclarativeCursor] = OrderedDict()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment in the code explaining why the cursor_per_partition needs to be ordered?

@tolik0
Copy link
Contributor Author

tolik0 commented Aug 21, 2024

@girarda The GlobalCursor fallback will be introduced in a separate PR. This next update will handle deleted partitions using the global state instead of performing a full refresh.

Copy link
Contributor

@girarda girarda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎸

@tolik0
Copy link
Contributor Author

tolik0 commented Sep 5, 2024

The regression tests are failing for two streams due to unrelated error, the same as during the update to airbyte-cdk v5.
Screenshot from 2024-09-05 20-10-02
Screenshot from 2024-09-05 20-09-59
Screenshot from 2024-09-05 18-46-31

@darynaishchenko
Copy link
Collaborator

darynaishchenko commented Sep 6, 2024

/approve-regression-tests

Check job output.

✅ Approving regression tests

@tolik0 tolik0 merged commit 03b7e1a into master Sep 6, 2024
31 of 34 checks passed
@tolik0 tolik0 deleted the tolik0/airbyte-cdk/add-partitions-limitation-to-perpartitioncursor branch September 6, 2024 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants