
Design for PrivateLink Connection Status History table #22936

Merged: 2 commits merged into MaterializeInc:main from privatelink-status-design-doc on Nov 9, 2023

Conversation

@rjobanp (Contributor) commented Nov 3, 2023

Motivation

Tips for reviewer

Rendered Version

Some code links for helpful context:

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@rjobanp requested review from benesch, petrosagg, jubrad, a team, and jkosh44 on November 3, 2023 20:26
@benesch (Member) left a comment:


Very nice!

@pH14 (Contributor) commented Nov 6, 2023

Curious if there's a way to link this up with https://github.com/MaterializeInc/database-issues/issues/6495 / https://github.com/MaterializeInc/database-issues/issues/6422, where the design here is the implementation detail of how PrivateLink connection history is fed into a more generic system collection. And maybe mz_aws_privatelink_connection_status_history is then a view on top of that data that prettifies PrivateLink connection history specifically. Food for thought, haven't personally thought about this problem long enough to have a strong opinion!

We'll also want to verify that our VPC Endpoint controller periodically resyncs to update the CR status. From a super quick read through, I'm not sure if an accepted / running connection ever gets reconciled again, so we might be missing events.

@rjobanp (Contributor, Author) commented Nov 6, 2023

We'll also want to verify that our VPC Endpoint controller periodically resyncs to update the CR status. From a super quick read through, I'm not sure if an accepted / running connection ever gets reconciled again, so we might be missing events.

@pH14 I also thought this, but from what I can tell the reconciliation loop should run about once per hour, since that's the default behavior of our k8s-controller, even when the apply method returns Ok(None) and doesn't explicitly request a re-queue:
https://github.com/MaterializeInc/k8s-controller/blob/689ddd0d7a63b892b531125598fdd3387cd212c4/src/controller.rs#L238-L248
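For illustration, here is a minimal sketch of that fallback pattern, assuming a kube-rs style Action; the constant and function names are hypothetical and not the actual k8s-controller code:

```rust
use std::time::Duration;

use kube::runtime::controller::Action;

// Illustrative values only, not the actual MaterializeInc/k8s-controller code.
const DEFAULT_REQUEUE_INTERVAL: Duration = Duration::from_secs(60 * 60);

/// Turn the result of a resource's `apply` into a requeue decision: honor an
/// explicit request if one was made, otherwise fall back to the default
/// hourly resync so the VpcEndpoint status is periodically re-checked.
fn requeue_action(requested: Option<Action>) -> Action {
    requested.unwrap_or_else(|| Action::requeue(DEFAULT_REQUEUE_INTERVAL))
}
```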

@rjobanp (Contributor, Author) commented Nov 6, 2023

@benesch @jkosh44 Thanks for the feedback - I believe I've addressed your comments. One thing to note is that I 'moved' the in-memory map from CloudResourceController.watch_vpc_endpoints to the new task in the Adapter Coordinator, since it seems cleaner to have it initialize from the Storage Controller there.
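To make that concrete, here is a hypothetical sketch (all names invented, not the actual implementation) of the kind of map such a coordinator task could hold: seeded from the storage-managed collection, it yields a new history row only when a connection's status changes:

```rust
use std::collections::BTreeMap;

// Everything here is a hypothetical sketch; the names do not come from the
// Materialize codebase.
#[derive(Clone, PartialEq, Eq, Debug)]
enum EndpointStatus {
    PendingServiceAcceptance,
    Available,
    Deleted,
}

struct StatusHistory {
    // Connection id -> last status written to the history collection,
    // initialized from the storage controller's existing contents.
    last_status: BTreeMap<String, EndpointStatus>,
}

impl StatusHistory {
    /// Record an observed status; return Some(status) only if it differs from
    /// the last recorded one, i.e. only when a new history row is needed.
    fn observe(&mut self, conn_id: &str, status: EndpointStatus) -> Option<EndpointStatus> {
        match self.last_status.get(conn_id) {
            Some(prev) if *prev == status => None, // unchanged: skip duplicate row
            _ => {
                self.last_status.insert(conn_id.to_string(), status.clone());
                Some(status)
            }
        }
    }
}
```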


Curious if there's a way to link this up with https://github.com/MaterializeInc/database-issues/issues/6495 / https://github.com/MaterializeInc/database-issues/issues/6422, where the design here is the implementation detail of how PrivateLink connection history is fed into a more generic system collection. And maybe mz_aws_privatelink_connection_status_history is then a view on top of that data that prettifies PrivateLink connection history specifically. Food for thought, haven't personally thought about this problem long enough to have a strong opinion!

@pH14 A more generic system collection could make sense for linking those up; however, my initial reaction is that it diverges from the implementation of mz_sink_status_history and mz_source_status_history, which currently exist as separate collections. Refactoring those to be views on top of a generic collection may not be trivial, so it could lead to some scope creep here.

@jkosh44 (Contributor) left a comment:


LGTM!

@benesch (Member) left a comment:


Very nice! Make sure you get storage eyes (@guswynn, @sploiselle, @bkirwi have all touched the source/sink status code in addition to @petrosagg) on this before getting too deep, but this looks really solid.

@guswynn (Contributor) left a comment:


I don't think we need governor, but other than that this looks good!

Please message me if you have any questions about the implementation!

Review thread on doc/developer/design/20231103_privatelink_status_table.md (outdated, resolved)
Comment on lines +138 to +142
2. Upon inspecting all the existing `VpcEndpoint`s in our `us-east-1` cluster
I noticed that they all had the exact same timestamp in the
`last_transition_time` field on their `Available` condition. This seems odd
so we should confirm that this field is being updated appropriately.
Inline comment (Contributor):


perhaps the controller for this checks them all at the same time?

@rjobanp (Contributor, Author) replied:


Potentially, though that seems a little unlikely since they should each have had their own reconciliation happen when they were created? @jubrad any ideas? You can run

kubectl get vpcendpoints -A -o custom-columns=":metadata.name,:status.conditions"

to replicate

@jubrad (Contributor) commented Nov 9, 2023:


I'm seeing this: lastTransitionTime: 2023-09-19T00:16:54Z ... this is odd. I actually wouldn't be that surprised if everything ran through reconciliation at pretty much the same time when the environment controller came up after this code was deployed. It's also possible that no PrivateLink connections have been created since (and are still around). However, I would expect that to have occurred right after the code merged, which was Aug 23rd, not Sept 19th.

Here's where this should be set.
https://github.com/MaterializeInc/cloud/pull/7174/files#diff-2bdf70ebf1d6a05b7c26a2fecdb1f31ded5cc5c683a3e8bb036ddcb13b9a772bR394
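For reference, a sketch of the usual Kubernetes convention for condition timestamps (assuming nothing about the cloud repo's actual implementation): last_transition_time is stamped only when the condition's status changes, which would explain identical timestamps if all endpoints were reconciled together after a deploy and never changed status since:

```rust
use std::time::SystemTime;

// Sketch of the common Kubernetes convention, not the cloud repo's code:
// last_transition_time only moves when the condition's status actually
// changes, so endpoints that all flipped to Available in one fleet-wide
// reconciliation (and never changed again) would share a single timestamp.
#[derive(Clone)]
struct Condition {
    status: &'static str, // "True" / "False" / "Unknown"
    last_transition_time: SystemTime,
}

fn updated_condition(existing: Option<&Condition>, new_status: &'static str) -> Condition {
    match existing {
        // Status unchanged: carry the previous transition time forward.
        Some(prev) if prev.status == new_status => prev.clone(),
        // Status changed (or condition is new): stamp a fresh transition time.
        _ => Condition {
            status: new_status,
            last_transition_time: SystemTime::now(),
        },
    }
}
```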

@rjobanp (Contributor, Author) replied:


Yep, that's what I see too. I'm not going to block merging this design doc on this, but I will see if I can add validation for the last_transition_time field in test_privatelink in the cloud repo to make sure it's being set. Hopefully it's just something about the deployment time that caused all the us-east-1 objects to get the same timestamp.

@rjobanp force-pushed the privatelink-status-design-doc branch from eba7b84 to ccc67bf on November 8, 2023 19:27
@rjobanp force-pushed the privatelink-status-design-doc branch from ccc67bf to d7967e2 on November 9, 2023 15:24
@rjobanp merged commit f5efe43 into MaterializeInc:main on Nov 9, 2023
9 checks passed
@rjobanp deleted the privatelink-status-design-doc branch on February 7, 2024 20:18