Add support to use public IP for the pod VM in Azure #2035

Open · wants to merge 2 commits into main

Conversation

bpradipt
Member

This is useful for environments where the K8s cluster runs on a developer workstation and the peer pod is created in Azure, for example for testing or for working on AI models that require large VMs.
Similar functionality already exists for the AWS provider, so this PR also brings functional parity.

This change allows CAA to use the public IP of the pod VM to make a
connection to the kata-agent.

A static public IP is created and attached to the VM NIC.
A dynamic public IP does not work, as the IP is not available immediately.
The Network Security Group (NSG) must be adjusted to allow connectivity to
port 15150 from the specific IP range of the systems running cloud-api-adaptor
(CAA).

Note that the communication between CAA and the pod VM uses TLS.

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
The static public IP doesn't get deleted automatically, hence
delete it after the VM is deleted.

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
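
For illustration, the explicit flow described in the commit messages corresponds roughly to the following sketch using the azure-sdk-for-go armnetwork package. This is not the actual CAA code; the module version, resource names, region, and subnet ID are placeholder assumptions.

```go
package main

import (
	"context"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v4"
)

func main() {
	ctx := context.Background()
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}

	const (
		subscriptionID = "<subscription-id>"
		resourceGroup  = "<resource-group>"
		region         = "eastus"
		subnetID       = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>"
	)

	// 1. Create a static Standard-SKU public IP. Static allocation is used
	//    because a dynamic IP is only assigned once it is attached to a
	//    running resource, i.e. it is not available immediately.
	pipClient, err := armnetwork.NewPublicIPAddressesClient(subscriptionID, cred, nil)
	if err != nil {
		log.Fatal(err)
	}
	pipPoller, err := pipClient.BeginCreateOrUpdate(ctx, resourceGroup, "podvm-example-pip",
		armnetwork.PublicIPAddress{
			Location: to.Ptr(region),
			SKU:      &armnetwork.PublicIPAddressSKU{Name: to.Ptr(armnetwork.PublicIPAddressSKUNameStandard)},
			Properties: &armnetwork.PublicIPAddressPropertiesFormat{
				PublicIPAllocationMethod: to.Ptr(armnetwork.IPAllocationMethodStatic),
			},
		}, nil)
	if err != nil {
		log.Fatal(err)
	}
	pip, err := pipPoller.PollUntilDone(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}

	// 2. Attach the public IP to the NIC that will be handed to the pod VM.
	//    The NSG guarding the subnet still has to allow inbound traffic on
	//    port 15150 from the address range of the systems running CAA.
	nicClient, err := armnetwork.NewInterfacesClient(subscriptionID, cred, nil)
	if err != nil {
		log.Fatal(err)
	}
	nicPoller, err := nicClient.BeginCreateOrUpdate(ctx, resourceGroup, "podvm-example-nic",
		armnetwork.Interface{
			Location: to.Ptr(region),
			Properties: &armnetwork.InterfacePropertiesFormat{
				IPConfigurations: []*armnetwork.InterfaceIPConfiguration{{
					Name: to.Ptr("ipconfig1"),
					Properties: &armnetwork.InterfaceIPConfigurationPropertiesFormat{
						Subnet:          &armnetwork.Subnet{ID: to.Ptr(subnetID)},
						PublicIPAddress: &pip.PublicIPAddress,
					},
				}},
			},
		}, nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := nicPoller.PollUntilDone(ctx, nil); err != nil {
		log.Fatal(err)
	}

	// Note: deleting the VM does not delete this public IP; it has to be
	// removed explicitly after the VM (and NIC) are gone, as noted above.
}
```
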
@bpradipt bpradipt marked this pull request as ready for review September 12, 2024 02:56
Contributor

@mkulke mkulke left a comment


I was wondering whether we could have those resources created (and deleted) implicitly in a createVM call instead of explicitly creating and deleting them (using a VirtualMachinePublicIPAddressConfiguration).

See this code sample: https://github.com/mkulke/mkosi-playground/blob/e7bdeed71f8a3820fa265bf1ca74c7fec2e0e6cb/launch-vm/main.go#L158-L176

I'm not sure why we weren't doing that for the NIC in the first place, so I need to test that. If it works like this, we wouldn't have to manage the lifecycle of public IPs and NICs manually, which can be prone to race conditions.
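
For reference, such an implicit configuration could look roughly like the sketch below, assuming the azure-sdk-for-go armcompute package; the module version, names, and the helper function are illustrative assumptions, not code from this PR or the linked sample.

```go
package example

import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5"
)

// podVMNetworkProfile builds a network profile in which the NIC (and,
// optionally, a public IP) are declared as part of the VM itself, so Azure
// creates them with the VM and, via DeleteOption: Delete, removes them when
// the VM is deleted -- no separate lifecycle management in CAA.
func podVMNetworkProfile(subnetID string, usePublicIP bool) *armcompute.NetworkProfile {
	ipConfig := &armcompute.VirtualMachineNetworkInterfaceIPConfiguration{
		Name: to.Ptr("ipconfig1"),
		Properties: &armcompute.VirtualMachineNetworkInterfaceIPConfigurationProperties{
			Primary: to.Ptr(true),
			Subnet:  &armcompute.SubResource{ID: to.Ptr(subnetID)},
		},
	}
	if usePublicIP {
		// A static public IP whose lifecycle is tied to the VM.
		ipConfig.Properties.PublicIPAddressConfiguration = &armcompute.VirtualMachinePublicIPAddressConfiguration{
			Name: to.Ptr("podvm-example-pip"),
			SKU:  &armcompute.PublicIPAddressSKU{Name: to.Ptr(armcompute.PublicIPAddressSKUNameStandard)},
			Properties: &armcompute.VirtualMachinePublicIPAddressConfigurationProperties{
				PublicIPAllocationMethod: to.Ptr(armcompute.PublicIPAllocationMethodStatic),
				DeleteOption:             to.Ptr(armcompute.DeleteOptionsDelete),
			},
		}
	}
	return &armcompute.NetworkProfile{
		// Required when NICs are defined inline on the VM.
		NetworkAPIVersion: to.Ptr(armcompute.NetworkAPIVersion("2020-11-01")),
		NetworkInterfaceConfigurations: []*armcompute.VirtualMachineNetworkInterfaceConfiguration{{
			Name: to.Ptr("podvm-example-nic"),
			Properties: &armcompute.VirtualMachineNetworkInterfaceConfigurationProperties{
				Primary:          to.Ptr(true),
				DeleteOption:     to.Ptr(armcompute.DeleteOptionsDelete),
				IPConfigurations: []*armcompute.VirtualMachineNetworkInterfaceIPConfiguration{ipConfig},
			},
		}},
	}
}
```

The returned profile would then be set as the NetworkProfile on the VirtualMachineProperties passed to the VM create call, so no separately managed NIC or public IP resources remain after the VM is deleted.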

@bpradipt
Member Author

> I was wondering whether we could have those resources created (and deleted) implicitly in a createVM call instead of explicitly creating and deleting them (using a VirtualMachinePublicIPAddressConfiguration).
>
> See this code sample: https://github.com/mkulke/mkosi-playground/blob/e7bdeed71f8a3820fa265bf1ca74c7fec2e0e6cb/launch-vm/main.go#L158-L176
>
> I'm not sure why we weren't doing that for the NIC in the first place, so I need to test that. If it works like this, we wouldn't have to manage the lifecycle of public IPs and NICs manually, which can be prone to race conditions.

Makes sense. On the public IP front I tried to do something similar but it didn't work for me.

@mkulke
Contributor

mkulke commented Sep 12, 2024

> I was wondering whether we could have those resources created (and deleted) implicitly in a createVM call instead of explicitly creating and deleting them (using a VirtualMachinePublicIPAddressConfiguration).
> See this code sample: https://github.com/mkulke/mkosi-playground/blob/e7bdeed71f8a3820fa265bf1ca74c7fec2e0e6cb/launch-vm/main.go#L158-L176
> I'm not sure why we weren't doing that for the NIC in the first place, so I need to test that. If it works like this, we wouldn't have to manage the lifecycle of public IPs and NICs manually, which can be prone to race conditions.
>
> Makes sense. On the public IP front I tried to do something similar but it didn't work for me.

I see, let me try that. It shouldn't be too hard to convert the code to implicit NIC creation.

@mkulke
Contributor

mkulke commented Sep 13, 2024

I have played around with implicit creation of NICs, and it seems to work for me; at least I didn't encounter problems after some casual testing: https://github.com/confidential-containers/cloud-api-adaptor/compare/main...mkulke:cloud-api-adaptor:mkulke/az-use-implicit-nic-creation?expand=1

@bpradipt
Member Author

@mkulke thanks, let me try with your changes.

@mkulke
Contributor

mkulke commented Sep 16, 2024

> @mkulke thanks, let me try with your changes.

Yeah, that would be interesting. I'm currently observing network problems after more thorough testing. I can't really explain that yet, since the infra looks similar when created implicitly. That's pretty curious, and it would be good to get to the bottom of this problem. But that might not be trivial, and if it's urgent, we could consider merging the explicit management of IP addresses in this PR.

There is a similar resource-leak risk as currently exists for NICs, but since this setting is off by default and should not be turned on casually, it might be tolerable.

@bpradipt
Member Author

@mkulke my initial test using your code resulted in the following error. I cherry-picked the changes on top of 0.9.0 to work on my current setup.

```
RESPONSE 400: 400 Bad Request
ERROR CODE: OutboundConnectivityNotEnabledOnVM
--------------------------------------------------------------------------------
{
  "error": {
    "details": [],
    "code": "OutboundConnectivityNotEnabledOnVM",
    "message": "No outbound connectivity configured for virtual machine /subscriptions/***/resourceGroups/aro-mz4v0ygu/providers/Microsoft.Compute/virtualMachines/podvm-app-47cb4c3b. Please attach standard load balancer or public IP address to VM, create NAT gateway or configure user-defined routes (UDR) in the subnet. Learn more at aka.ms/defaultoutboundaccess."
  }
}
```

Is this what you are seeing? I've yet to debug it, though.

@mkulke
Contributor

mkulke commented Sep 17, 2024

> @mkulke my initial test using your code resulted in the following error. I cherry-picked the changes on top of 0.9.0 to work on my current setup.
>
> ```
> RESPONSE 400: 400 Bad Request
> ERROR CODE: OutboundConnectivityNotEnabledOnVM
> --------------------------------------------------------------------------------
> {
>   "error": {
>     "details": [],
>     "code": "OutboundConnectivityNotEnabledOnVM",
>     "message": "No outbound connectivity configured for virtual machine /subscriptions/***/resourceGroups/aro-mz4v0ygu/providers/Microsoft.Compute/virtualMachines/podvm-app-47cb4c3b. Please attach standard load balancer or public IP address to VM, create NAT gateway or configure user-defined routes (UDR) in the subnet. Learn more at aka.ms/defaultoutboundaccess."
>   }
> }
> ```
>
> Is this what you are seeing? I've yet to debug it, though.

Hmm, no, that's not what I am seeing. My VM is created successfully, but there are network connectivity issues afterwards (it works initially, but fails during image pull). The error is interesting, though.

@mkulke
Contributor

mkulke commented Sep 17, 2024

"No outbound connectivity configured for virtual machine /subscriptions/***/resourceGroups/aro-mz4v0ygu/providers/Microsoft.Compute/virtualMachines/podvm-app-47cb4c3b. Please attach standard load balancer or public IP address to VM, create NAT gateway or configure user-defined routes (UDR) in the subnet. Learn more at aka.ms/defaultoutboundaccess."

I tried to reproduce that error. If the VNet we are attaching the VM to does not have a NAT gateway, Azure will refuse to start the VM. This is expected behaviour; see this link.

@mkulke
Contributor

mkulke commented Sep 17, 2024

> I'm currently observing network problems after more thorough testing. I can't really explain that yet, since the infra looks similar when created implicitly. That's pretty curious, and it would be good to get to the bottom of this problem.

I found the origin of my issues: it turned out to be specific to the network I was testing in. The implicitly created NICs were subject to outbound traffic restrictions, while the explicitly created NICs were not.

@mkulke
Contributor

mkulke commented Sep 17, 2024

@bpradipt I pushed some changes to that branch; there's a commit that (always) adds a public IP, so the above error should be gone even if you have no NAT gateway on your subnet. Please test if you have time. If that works for you, I'd open a discrete PR with the implicit NIC creation, and we can base a -use-public-ip toggle on that branch. That would get rid of a lot of brittle cleanup/management logic.

@mkulke
Contributor

mkulke commented Sep 18, 2024

> ERROR CODE: OutboundConnectivityNotEnabledOnVM

I dove a bit more into this; it turns out this is actually expected, albeit a bit surprising. The reason we have outbound connectivity at the moment is that when we create a NIC and a VM separately, the NIC gets "default outbound access" via a transparent public IP. This will not be the case if we create the NIC as part of a VM.

Implicitly assigning a public IP is not great security-wise, and hence this behaviour is being retired. So either way, we have to make sure that pod VMs are able to pull images from the internet (or not, depending on whether a user wants that in their deployment) by using explicit network configuration.
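
As an illustration of such explicit network configuration (not part of this PR; the names, IDs, and armnetwork module version are placeholder assumptions), one option is to associate a NAT gateway with the pod VM subnet so that inline-created NICs keep outbound access for image pulls:

```go
package main

import (
	"context"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v4"
)

func main() {
	ctx := context.Background()
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}
	subnets, err := armnetwork.NewSubnetsClient("<subscription-id>", cred, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Read the current subnet so the update below keeps its existing settings.
	current, err := subnets.Get(ctx, "<resource-group>", "<vnet>", "<subnet>", nil)
	if err != nil {
		log.Fatal(err)
	}
	subnet := current.Subnet
	if subnet.Properties == nil {
		subnet.Properties = &armnetwork.SubnetPropertiesFormat{}
	}

	// Point the subnet at an existing NAT gateway; VMs in the subnet then have
	// explicit outbound connectivity without relying on default outbound access.
	subnet.Properties.NatGateway = &armnetwork.SubResource{
		ID: to.Ptr("/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/natGateways/<nat-gw>"),
	}

	poller, err := subnets.BeginCreateOrUpdate(ctx, "<resource-group>", "<vnet>", "<subnet>", subnet, nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := poller.PollUntilDone(ctx, nil); err != nil {
		log.Fatal(err)
	}
}
```

Per the Azure error message quoted earlier, a standard load balancer or a per-VM public IP (as in the sketch further up) would be alternative ways to provide the outbound path.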
