Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add sync sharding #1891

Merged
merged 4 commits into from
Sep 18, 2024
Merged

feat: Add sync sharding #1891

merged 4 commits into from
Sep 18, 2024

Conversation

erezrokah
Copy link
Member

@erezrokah erezrokah commented Sep 11, 2024

Summary

Goes with cloudquery/plugin-pb-go#401. Still testing this so in draft

Part of https://github.com/cloudquery/cloudquery-issues/issues/2214 (internal issue)


Use the following steps to ensure your PR is ready to be reviewed

  • Read the contribution guidelines 🧑‍🎓
  • Run go fmt to format your code 🖊
  • Lint your changes via golangci-lint run 🚨 (install golangci-lint here)
  • Update or add tests 🧪
  • Ensure the status checks below are successful ✅

@github-actions github-actions bot added the feat label Sep 11, 2024
@erezrokah erezrokah force-pushed the feat/sharding branch 3 times, most recently from bd09fa9 to 70bc550 Compare September 17, 2024 14:01
@erezrokah erezrokah marked this pull request as ready for review September 17, 2024 16:13
@erezrokah erezrokah requested a review from a team as a code owner September 17, 2024 16:13
@github-actions github-actions bot added feat and removed feat labels Sep 17, 2024
@@ -1,6 +1,8 @@
module github.com/cloudquery/plugin-sdk/examples/simple_plugin
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes can be reverted once cloudquery/plugin-pb-go#401 is merged and released

@@ -40,27 +40,34 @@ func (s *syncClient) syncDfs(ctx context.Context, resolvedResources chan<- *sche
s.metrics.initWithClients(table, clients)
}

var wg sync.WaitGroup
tableClients := make([]tableClient, 0)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically the DFS scheduler didn't need to create table clients pairs, since we didn't do any sorting in the DFS scheduler. Because of the sharding support, we need to first the table client pairs, so we can shard them before the sync starts

@erezrokah erezrokah changed the title feat: Add sync sharding options feat: Add sync sharding Sep 17, 2024
@erezrokah erezrokah changed the title feat: Add sync sharding feat: Add sync Sep 17, 2024
@erezrokah erezrokah changed the title feat: Add sync feat: Add sync sharding Sep 17, 2024
@github-actions github-actions bot added feat and removed feat labels Sep 17, 2024
@@ -45,6 +45,7 @@ func (s *syncClient) syncShuffle(ctx context.Context, resolvedResources chan<- *
// so users have a little bit of control over the randomization.
seed := hashTableNames(tableNames)
shuffle(tableClients, seed)
tableClients = shardTableClients(tableClients, s.shard)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see that shuffle is deterministic (at the moment), but I still think it's a bad idea to shard after shuffling. I'd move it before the shuffle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK let me try and switch the order and re-run the tests. We shuffle (this is the default in AWS) to avoid rate limits. Don't think sharding before shuffling will make a difference in that aspect but I'll re-test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think sharding before shuffling will make a difference in that aspect but I'll re-test

I think it will be fine since we round-robin before we shuffle anyways

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok did a bit of testing and it looks good so we can shard before shuffle

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case where collecting the tables could be non-deterministic? Normally the tables are hardcoded in a plugin, so it should not be the case. If there is a plugin where the tables are dynamic, and they could change between syncs (e.g. if they are discovered by an API which is non-deterministic), sharding would not work.

In either case, I think the deterministic requirement is worth a one-liner comment.

Copy link
Member Author

@erezrokah erezrokah Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case where collecting the tables could be non-deterministic?

This is a good point, and definitely a limitation of this approach, see below ⬇️

  • It can happen due to a bug in the plugin https://github.com/cloudquery/cloudquery-private/pull/4299.
  • Plugins with dynamic tables don't use the scheduler, they do their own thing so they would need to implement sharding on the plugin's side (if needed. e.g. For Postgres source probably better to use a stronger machine instead of sharding)
  • I can think of other cases, e.g. someone creating an AWS account after shard 1/2 discovery and before shard 2/2 discovery. If we discover all accounts, that will mess up the sharding. A solution would be to hard code the accounts in the spec to avoid it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment about the requirement

scheduler/scheduler.go Show resolved Hide resolved
scheduler/scheduler.go Show resolved Hide resolved
@kodiakhq kodiakhq bot merged commit e1823f8 into main Sep 18, 2024
7 checks passed
@kodiakhq kodiakhq bot deleted the feat/sharding branch September 18, 2024 10:14
kodiakhq bot pushed a commit that referenced this pull request Sep 18, 2024
🤖 I have created a release *beep* *boop*
---


## [4.63.0](v4.62.0...v4.63.0) (2024-09-18)


### Features

* Add sync sharding ([#1891](#1891)) ([e1823f8](e1823f8))


### Bug Fixes

* **deps:** Update module github.com/cloudquery/plugin-pb-go to v1.22.3 ([#1895](#1895)) ([b05d24b](b05d24b))
* **deps:** Update module google.golang.org/grpc to v1.66.2 ([#1893](#1893)) ([6d70b88](6d70b88))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this pull request Sep 19, 2024
kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this pull request Sep 19, 2024
#### Summary

We'll need to release a few plugins with cloudquery/plugin-sdk#1891 first, hence the future date in the command description
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants