
Conversation


@lknite lknite commented Nov 27, 2024

Describe the change you are making here!

When running within Kubernetes workflows/pipelines, the HTTP server the VM uses to communicate completion back cannot be reached directly. Instead, with Kubernetes you 'expose' a pod via a LoadBalancer IP or an Ingress. The exposed IP is available, but it must be passed in somehow in order to be used.

This pull request adds a variable, HTTPCallbackAddress, which, if provided, can be used by a provider to pass the needed IP to the VM.

Another pull request has been submitted on packer-plugin-proxmox as a first candidate once this change is in place (hashicorp/packer-plugin-proxmox#305), which will resolve hashicorp/packer-plugin-proxmox#304.

There is also template source code for those looking to create a provider. That templated code should eventually be updated as well; I will look to do that later on.

Please include tests. Check out existent tests in the code for example.

The only test would be to verify that the variable ends up in config.hcl2spec.go. I suspect there already exist tests which cover the mapstructure-to-hcl2 generation.

If your PR resolves any open issue(s), please indicate them like this so they will be closed when your PR is merged:

Closes #264

@lknite lknite requested a review from a team as a code owner November 27, 2024 03:31

hashicorp-cla-app bot commented Nov 27, 2024

CLA assistant check
All committers have signed the CLA.

@lbajolet-hashicorp
Contributor

Hey @lknite,

Regarding this PR, I have a few things to discuss before we can merge it, if you don't mind. The code is very simple, but before we introduce this to the config/step here, I want to make sure we pick the right name/behaviour, as once it's in there, we cannot remove it if we've missed something.

  1. Regarding the name: I don't like the Callback part here. The idea of that configuration variable is essentially to override all the other methods used to populate HTTPIP, is that correct? It's no more a callback URI than any of the other potentially inferred addresses, so I would drop that word from the variable's name to make it clearer. Maybe consider something like http_server_static_host, which imho is clearer.

  2. In terms of implementation, I am not certain the SDK is where we want to introduce this, as I'm not sure this is something we want to be broadly available. For the plugins that support connecting to an HTTP server that Packer exposes, the general expectation is that they run on the same machine. While network configuration can be complex, it's generally something the plugin dealing with the HTTP server is able to infer from its topology and from how the VM is started for the build. Adding an extra global configuration knob for those attributes feels a bit premature, especially since there's already a fair amount of magic in the existing attributes.

In the current state, I would not accept this today, as I don't have enough confidence that there isn't a weird edge case that could throw off the proposed implementation. Since Proxmox is the plugin you are interested in, you can probably implement this in a self-contained manner in that plugin for now (by adding an attribute to that plugin only), and later we can discuss whether or not to make this a global property.

Thanks for the PR, and sorry we won't merge it right now on the SDK at least.

@lknite
Author

lknite commented Jan 20, 2025

Calling that variable "static_host" is pretty much the opposite of how Kubernetes does things; trying to say this in the most polite way possible, there may not be enough Kubernetes knowledge on both sides of this one. I used 'callback' to differentiate the variable from the others, and because it works more like a callback in the way OAuth uses one in its workflow when it's done.

In all my years of working with open source, this was perhaps my proudest contribution. I spent whole days, then weeks of effort trying to get image-builder working within the Kubernetes environment, talking with others who had tried and given up, who were eager for me to succeed and looking forward to the update so they could use it, and with maintainers who weren't really interested. I persisted and kept digging until I found it: three tiny changes in three repos. Changes which enable use via Kubernetes without interfering with anything else. The only step remaining: get buy-in and support from three repos simultaneously.

I can tell there has been a miscommunication, because you believe this goal can be accomplished leaving your repo out of the equation. Though I haven't looked at this code for about 90 days, I'm pretty sure it wasn't really possible without all three. It may be possible, but not with two tiny changes; without this tiny change, the other two become major workarounds.

I'm not sure I have the energy to drive this any further if there's a total lack of interest... but there really should be interest, on your side, in wanting this to work within Kubernetes, one would think. In any case, it's your project; if there's no interest, so be it.

If there is some interest, we can address the concerns one at a time.

The concern that there might be weird edge cases is difficult for me to address. Maybe looking at the pull requests in the other two repos would help make the overall picture clearer.

If you let this one go, I will too. If you reply back, I can help clarify things. This pull request is the type that, if it isn't brought in with this effort, someone else will make the effort again later, or the tool will be replaced by something which works within the Kubernetes environment. So in the end, I suppose it doesn't matter. I can run the image-builder app via a Docker container on a Linux VM, so I have a workaround. I was hoping to redesign the image-builder project's testing platform to run via Kubernetes, but I can't do so without the ability to run within Kubernetes; I don't architect things outside of Kubernetes anymore. I'm sure you are busy, and it's ok whatever you decide. I certainly wouldn't want to encourage a pull request if it's a bad idea.

@lknite
Author

lknite commented Jan 20, 2025

I did have one additional thought: it may be better to rearchitect things to always use the HTTPCallbackAddress strategy and rip out the existing one, because it is becoming "aged", expecting 0.0.0.0 to be usable in a world of cloud solutions and all. I didn't want to do all that because it would be a fairly massive overhaul, but the strategy in this pull request could be a seed toward that direction. Along that line, I think Callback is more descriptive, with the only other choice perhaps being Webhook. Alternatively, I could see some other solution being used in the future, such as NATS for messaging or a Redis queue; once things are spun up in Kubernetes, additional micro-services become easily available.

I just googled 'callback vs webhook'; maybe webhook is better?

Additional note: you mentioned that having a global option via the SDK feels a bit early; Kubernetes was released in June 2014... but I know what you mean. I don't want to pressure you in any way, but I am here to interact with if doing so would be helpful.

@lknite
Author

lknite commented Jan 20, 2025

I mentioned wanting to redesign the image-builder testing process; just so that's more than words, here's the issue tracking that potential idea: kubernetes-sigs/image-builder#1585 (comment)

@lknite
Author

lknite commented Jan 23, 2025

I'll post to the image-builder channel to see if there might be some other folks there interested in this pull request and able to add to the conversation.

@lknite
Author

lknite commented Jan 23, 2025

You mentioned the new solution replacing the old; let me clarify. The existing solution still needs to be in place to bind to 0.0.0.0; it's just that, within a Kubernetes environment, the IP the VM uses to reach back to that server must be specified, because using the IP of the system where the 0.0.0.0 bind occurred won't work. So everything which currently exists needs to stay; we just need to be able to specify the callback, which will be either an IP address or an FQDN.

The existing solution just uses the IP of the system where the bind occurs; somehow we have to check whether an address was specified and use that for the callback, without changing the 0.0.0.0 bind. Inspecting the code, it looks like you can specify an IP, but doing so replaces the 0.0.0.0 with the IP specified, and that's not what we want.

Someone may want to use the existing option to change the 0.0.0.0 to something they specify, and we wouldn't want to break that if they are using it, so I created a new environment variable. In two other git repos I added an "if" statement that checks whether the callback is specified; if so, that address is passed to the VM so it can communicate back when it's finished, otherwise it has no effect.
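
The "if" statement described above can be sketched roughly as follows. This is an illustrative sketch, not the actual plugin code; `resolveCallbackAddress` and its parameter names are hypothetical:

```go
package main

import "fmt"

// resolveCallbackAddress returns the address the VM should use to
// reach back to Packer's HTTP server. If the operator supplied an
// explicit callback address (e.g. a Kubernetes LoadBalancer IP or
// Ingress FQDN), that wins; otherwise fall back to the host IP the
// plugin inferred. The bind address (0.0.0.0) is unaffected either way.
func resolveCallbackAddress(callbackAddress, inferredHostIP string) string {
	if callbackAddress != "" {
		return callbackAddress
	}
	return inferredHostIP
}

func main() {
	// Inside a cluster: the operator exposed the pod via an Ingress FQDN.
	fmt.Println(resolveCallbackAddress("packer.example.com", "10.42.0.17"))
	// Outside a cluster: nothing specified, existing behaviour is kept.
	fmt.Println(resolveCallbackAddress("", "192.168.1.10"))
}
```

The key design point is that the new value only affects the address handed to the VM; when it is unset, behaviour is byte-for-byte identical to today.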

Once this is implemented, the other two git repos follow, and we can see it working, I would then, over time, begin visiting the other image-builder-related projects one at a time, adding the two if statements so that they can work via Kubernetes as well.

@lknite
Author

lknite commented Jan 23, 2025

Regarding the name, I hesitate on 'static', though I agree we should decide on something together, as once it's in there it might stick around for a while. Some thoughts:

When something is initially deployed via Kubernetes it is unreachable by the world; it must first be exposed. This is generally done through one of two methods. In the first, a LoadBalancer IP is requested and provided automatically by Kubernetes; we won't know what it is. With cloud providers, each LoadBalancer IP costs money, and so the second option is often used: a single LoadBalancer IP, with a reverse proxy exposing everything via FQDN. All the FQDNs resolve to the same single LoadBalancer IP.

The second option requires additional setup but is very common: cert-manager to generate certificates automatically, external-dns to register the FQDN with DNS, and a load balancer such as nginx to handle the reverse proxy. My preference would be to use an FQDN rather than a LoadBalancer IP, but folks might prefer to use a LoadBalancer IP.

If a LoadBalancer IP is used, it'll pretty much always be the same, but that shouldn't be expected. It's possible to force it to stay the same, but that's kind of falling back on "before Kubernetes existed", before infrastructure as code. Kubernetes lets you define the idea of how you want something deployed, then it fills in the gaps: generating the certs, getting an IP and registering it in DNS. You just deploy, then access via the FQDN, without needing to know anything in between.

Long story short, I would expect a LoadBalancer IP to change and would probably use an FQDN, which over time might resolve to different IPs. If FQDN weren't an option I'd suggest a variable name like "loadbalancerip", but since FQDN is an option, maybe something more specific to its purpose like "VM_STATUS_UPDATE_TEMPORARY_MESSAGING_SERVER", or more practically maybe "VmMsgServer"? I feel like the VM is "calling home" to report it's ready, another reason I initially suggested Callback.

@lknite
Author

lknite commented Jan 24, 2025

If I could have someone on the Packer team make the change without my help, I would think the best solution would be to adjust things so the local IP is never used automatically in any scenario: a "callbackurl", or whatever it is called, would always have to be set and used, and by default, if no one sets it, it would be set to the local IP, but in all cases the 'callbackurl' would be used. This pull request is the next best thing, and the least impactful.

@lknite
Author

lknite commented Jan 24, 2025

I've been talking with a couple of image-builder maintainers today, and they are now fully aware of the problem, that it affects multiple image providers (they've been seeing the same issue reported now and then), and of what it is the pull requests solve. Now they are thinking about the issue, essentially going through the same process I did. Let's see if they come up with the same solution or an alternative. At least a couple of providers have created a custom workaround. They are revisiting security, etc. I've asked them to update here with a note that they are aware of the issue, so it's not just me. Let's give them some time...

@lbajolet-hashicorp
Contributor

lbajolet-hashicorp commented Feb 13, 2025

Hey @lknite,

Coming back to this: to be clear, when I mentioned this being renamed to static, I meant it in the sense that you provide a static endpoint to connect to, not that the IP address itself is static. From a configuration standpoint, my understanding is that you run Packer in a pod somewhere on your cluster and use it to expose an HTTP server that your temporary build VM will connect to. I assume this Packer/HTTP server pod/port is exposed to the rest of your cluster through a Service, which lets you use an FQDN to connect to it; while the IP is not predictable, the Service FQDN is, and it may even work across namespaces with the right URI/configuration.
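
As a small illustration of why the Service FQDN is the predictable part: a Kubernetes Service gets a cluster-internal DNS name of the form `<service>.<namespace>.svc.<cluster-domain>`. The sketch below assumes the default `cluster.local` cluster domain; the service and namespace names are hypothetical:

```go
package main

import "fmt"

// serviceFQDN builds the cluster-internal DNS name for a Kubernetes
// Service. The pod IP behind the Service is not predictable, but this
// name is, even across namespaces. "cluster.local" is the default
// cluster domain; clusters can be configured with a different one.
func serviceFQDN(service, namespace string) string {
	return fmt.Sprintf("%s.%s.svc.cluster.local", service, namespace)
}

func main() {
	fmt.Println(serviceFQDN("packer-http", "build-pipelines"))
	// prints "packer-http.build-pipelines.svc.cluster.local"
}
```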

That said, I can see why static may not be the ideal name here; in that case, might I suggest another name like http_host_override? That would make it clear that the goal is to expose an address the VM will connect to during setup/provisioning, one which is not inferred by the plugin but instead provided through configuration.
I don't necessarily agree with the callback/webhook terminology here, as to me both imply that this is an address connected to in response to an event, which is not the case here; instead it is the address you connect to in order to get whichever file Packer exposes when running a build.
That is a terminology discussion, though, and not necessarily my main point of contention with this contribution.

I have clumsily tried to explain my reasoning, so allow me to attempt once more. Picking the connection address for that HTTP server is a plugin's responsibility, and populating the HTTPIP field (through the http_ip state key) is also its responsibility. My main concern here is mostly one of consistency.
By introducing this attribute to the SDK, it immediately becomes available to the configuration of every plugin that imports that configuration object, as soon as they update their version of the SDK. This will likely cause confusion for users, as plugins may or may not honour this change in their codebase, and there are no safety nets that force plugins to use all of these attributes. I worry that users will see this in the documentation because of that transitive relationship, then notice that nothing happens when they use it, and open issues on the related plugin (or Packer itself), which increases the burden on plugin maintainers (i.e. mostly us, but others in the community too).

Most of the SDK's attributes here describe how the HTTP server is exposed, but not necessarily how to connect to it, and the steps that deal with how this address is chosen are specific to the plugin that decides to support it. Not all plugins support Packer exposing an HTTP server and the instance being created connecting to it; typically cloud environments like AWS/GCE/Azure don't support that HTTP server in their configuration, and most of the plugins that do support it are traditionally locally-served (e.g. qemu, hyper-v, etc.).
As such, I would think this attribute should be defined and used within the context of the plugin that sets it, not as a global SDK attribute, since how to handle it is exclusively the plugin's decision.

To my point, regarding your Proxmox PoC PR: I believe you can add the attribute to the plugin's configuration and handle it completely independently of this SDK change, which is what I would recommend here, so as not to have this cascading effect on all the plugins that update their version of the SDK.
If you want me to take a jab at this, please let me know; I may be wrong in my assessment of the situation, so feel free to either try it yourself or ask me to look at it.

Please let me know what you make of this. To be clear, I am not closed to the idea of supporting a workflow like the one you want to introduce; I think it is a good idea overall for users like you who wish to run Packer in more complex environments than we have traditionally supported.
But I am wary of the scope of this change as part of the SDK for now, so my suggestion is to first democratise it in the relevant plugins; then, if there's a good rationale to support it everywhere, the argument can be heard and we can revisit adding this to the SDK later.

@lknite
Author

lknite commented Mar 1, 2025

I want what concerns you to happen. If users file an issue with their plugin provider asking for the feature, that could help all plugins eventually offer it, making them all work with Kubernetes. After additional testing with the Proxmox provider, I would probably create a jira for the VMware provider next, and then copy/paste it to all the other CAPI providers, letting them know of the new variable and how it is intended to be used, and suggesting they adopt the strategy so their provider can also work via pipelines running within Kubernetes.

I also want to save people from suffering, and would like those who follow after me looking for this solution not to reinvent the wheel. The Proxmox folks offered a workaround, and I suppose we can settle on a workaround.

I once read a book called Critical Path by Buckminster Fuller. In it he describes a way of thinking that he wanted to pass along. If he was thirsty, he could solve that by going to the river and taking a drink; if he had a family, he could solve it by taking along a bottle; if he needed to solve it for a neighborhood, a better solution would be needed, and plumbing enters the picture; for a village or city, a water purification plant is needed; and if one city takes all the water so that the next city downstream no longer has any, a war may ensue, so the water problem needs to be solved in a way that includes both cities. Naturally he concluded that, since the mind comes up with different answers depending on how the problem is framed, framing the solution from the point of view of the whole planet, solving the water problem for all, would cause the mind to create a solution for everyone. I framed my life around this book and tend to think of solutions at multiple levels. Regarding this issue, my desire is to solve it for everyone, and in that pursuit I created a solution which seems like it would work for all the providers; I never tried to solve it for just myself. If it were just for myself, I could run it in a Docker container. But even then, I wanted to create a testing infrastructure that would work for all of the Cluster API providers. Given that my proposed testing idea runs via Kubernetes, to test them all I'd need a workaround for all of them, and I guess I can approach the problem looking for a workaround for all of them.

It's open source, and it's not my project; I wouldn't have to deal with any of the effects you've mentioned, but you would. I'm only here now for this one thing at the moment; whatever you'd like to do. I'm starting to work on other things and am not as available as I was to help reorchestrate the Cluster API testing. The Proxmox provider is being rewritten to include additional networking features, not related to this issue, just improved options overall. When that releases, this year or next, I will probably revisit helping them architect a testing strategy which can work for all the providers.

What I expect is that more and more folks will run this tool via pipelines within Kubernetes. Each time things fail, they'll go looking for a resolution: either they'll find this variable, see that their provider supports it, and move on without much thought, or they'll come across documentation indicating they have to use a workaround that's unique to each provider. A few will look to create a solution rather than settle for a workaround, especially if they have reason to work with more than one provider.

I'm ok either way; maybe the next person will create something even better than what I've put together. My attention is starting to get pulled away from Cluster API as a whole, which, at the time this pull request was submitted, was all I was working on. I'm not in the code anymore like I was, but this one should be good to go if other people support it and think it's a good idea.

@lknite
Author

lknite commented Mar 1, 2025

Maybe http_loadbalancer would be a signal to Kubernetes folks and cause it to show up in a Google search, while simultaneously perhaps discouraging others from using it because they would see it and think, "why would I need a load balancer?", and disregard it?

Open to ideas on what it should be called.

Override sounds good.
