http://creativecommons.org/licenses/by/3.0/legalcode
-->

<!-- cSpell:ignore Sylva Schiff Kanod argocd GitOps -->
# HostClaim: multi-tenancy and hybrid clusters

## Status
  of such a framework will be addressed in another design document.
* Pivoting client cluster resources (managed clusters that are not the
  initial cluster).
* Using BareMetalHosts defined in other clusters. The HostClaim concept
  supports this use case with some extensions. The design is outlined in the
  Alternatives section but is beyond the scope of this document.

## Proposal

### User Stories

#### Deployment of Simple Workloads

As a user I would like to execute a workload on an arbitrary server.

The OS image is available in qcow format on a remote server at `url_image`.
It supports cloud-init and a script can launch the workload at boot time
* When I destroy the host, the association is broken and another user can take
  over the server.

#### Multi-tenancy

As an infrastructure administrator I would like to host several isolated
clusters.

All the servers in the data-center are registered as BareMetalHosts in one or
several namespaces under the control of the infrastructure manager. Namespaces
and are destroyed unless they are tagged for node reuse. The BareMetalHosts are
recycled and are bound to new HostClaims, potentially belonging to other
clusters.

#### Hybrid Clusters

As a cluster administrator I would like to build a cluster with different
kinds of nodes.

This scenario assumes that:

Controllers for disposable resources such as virtual machines typically do not
use hostSelectors. Controllers for a "bare-metal as a service" service
may use selectors.

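As an illustration, a bare-metal HostClaim could restrict the hosts it may be
bound to with a selector similar to the existing Metal3Machine hostSelector.
The exact field layout below is a sketch under that assumption, not the final
API:

```yaml
apiVersion: metal3.io/v1alpha1   # illustrative group/version
kind: HostClaim
metadata:
  name: worker-0
  namespace: tenant-a
spec:
  kind: baremetal
  hostSelector:
    # Only BareMetalHosts carrying these labels are acceptable.
    matchLabels:
      rack: "r07"
      hardware-profile: worker
```
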
#### Manager Cluster Bootstrap

As a cluster administrator I would like to install a new baremetal cluster
from a transient cluster.

The bootstrap process can be performed as usual from an ephemeral cluster
(e.g., a KinD cluster). The constraint that all resources must be in the same

## Design Details

### Implementation Details/Notes/Constraints

### Risks and Mitigations

#### Security Impact of Making BareMetalHost Selection Cluster-wide

The main difference between Metal3 machines and HostClaims is the
selection process where a HostClaim can be bound to a BareMetalHost
in another namespace. We must make sure that this behavior is expected
by the owner of BareMetalHost resources, especially when we upgrade the
Metal3 cluster-api provider to a version supporting HostClaim.

The solution is to enforce that BareMetalHosts that can be bound to a
HostClaim carry a label (proposed name: `hosts.metal3.io/namespaces`)
restricting authorized HostClaims to specific namespaces. The value could be
either `*` for no constraint, or a comma-separated list of namespace names.
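
For illustration, a BareMetalHost whose owner accepts claims from a single
tenant namespace could be labeled as follows (a minimal sketch; only the label
is what this proposal adds, the remaining fields are ordinary BareMetalHost
fields):

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-42
  namespace: infra
  labels:
    # Authorizes binding only from HostClaims in the "tenant-a" namespace.
    # Per this proposal, "*" would remove the restriction and several
    # namespaces could be listed.
    hosts.metal3.io/namespaces: tenant-a
spec:
  online: true
  bmc:
    address: redfish://192.168.111.1:8000/redfish/v1/Systems/42
    credentialsName: server-42-bmc-secret
```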

#### Tenants Trying to Bypass the Selection Mechanism

The fact that a HostClaim is bound to a specific BareMetalHost will appear
as a label in the HostClaim and the HostClaim controller will use it to find
the associated BareMetalHost. This label could be modified by a malicious
tenant.

But the BareMetalHost also has a consumer reference. The label is only an
indication of the binding. If the consumer reference is invalid (different
from the HostClaim label), the label MUST be erased and the HostClaim
controller MUST NOT accept the binding.
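
The sketch below shows a consistent binding. The label key on the HostClaim is
illustrative only; the consumer reference on the BareMetalHost is the
authoritative side that the controller checks:

```yaml
apiVersion: metal3.io/v1alpha1   # illustrative group/version
kind: HostClaim
metadata:
  name: worker-0
  namespace: tenant-a
  labels:
    # Indication only; erased by the controller if it does not match the
    # consumer reference of the BareMetalHost.
    hostclaims.metal3.io/baremetalhost: infra.server-42
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-42
  namespace: infra
spec:
  # Authoritative binding checked by the HostClaim controller.
  consumerRef:
    kind: HostClaim
    name: worker-0
    namespace: tenant-a
```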

#### Performance Impact

The proposal introduces a new resource with an associated controller between
the Metal3Machine and the BareMetalHost. There will be some duplication
of information between the BareMetalHost and the HostClaim status. The impact
for each node should still be limited, especially when compared to the cost of
each Ironic action.

Because we plan to have several controllers for different kinds of compute
resources, one can expect a few controllers working on the same custom
resource. This may create additional pressure on the Kubernetes API server.
It is possible to limit the amount of exchanged information for a specific
controller using server-side filters on the watch/list. To use this feature
on current Kubernetes versions, the HostClaim kind field must be copied into
a label.
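
A minimal sketch of how this could look, assuming the kind is mirrored into a
label (the label key is illustrative):

```yaml
apiVersion: metal3.io/v1alpha1   # illustrative group/version
kind: HostClaim
metadata:
  name: worker-0
  namespace: tenant-a
  labels:
    # Mirrors spec.kind so that each controller can restrict its watch/list
    # to the compute resource kind it manages, e.g. with
    #   kubectl get hostclaims -l hostclaims.metal3.io/kind=baremetal
    hostclaims.metal3.io/kind: baremetal
spec:
  kind: baremetal
```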

#### Impact on Other Cluster API Components

There should be none: other components should mostly rely on Machine and
Cluster objects. Some tools may look at Metal3Machine conditions, where some
condition names may be modified, but the semantics of the Ready condition
will be preserved.

### Work Items

### Dependencies

### Test Plan

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Drawbacks

## Alternatives

### Multi-Tenancy Without HostClaim

We assume that we have a Kubernetes cluster managing a set of clusters for
cluster administrators (referred to as tenants in the following). Multi-tenancy
is a way to ensure that tenants only have control over their own clusters.

There are at least two other ways of implementing multi-tenancy without
HostClaim. These methods proxy the entire definition of the cluster
or proxy the BareMetalHost itself.

#### Isolation Through Overlays

A solution for multi-tenancy is to hide all cluster resources from the end
user. In this approach, clusters and BareMetalHosts are defined within a single
namespace, but the cluster creation process ensures that resources
from different clusters do not overlap.

This approach was explored in the initial versions of the Kanod project.
Clusters must be described by the tenant in a git repository and the
descriptions are imported by a GitOps framework (argocd). The definitions are
processed by an argocd plugin that translates the YAML expressing the user's
intent into Kubernetes resources, and the naming of resources created by
this plugin ensures isolation.

Instead of using a translation plugin, it would be better to use a set of
custom resources. However, it is important to ensure that they are defined in
separate namespaces.

This approach has several drawbacks:

* The plugin or the controllers for the abstract clusters are complex
  applications if we want to support many options, and they become part of
  the trusted computing base of the cluster manager.
* It introduces a new level of access control that is distinct from the
  Kubernetes model. If we want tooling or observability around the created
  resources, we would need custom tools that adhere to this new policy, or we
  would need to reflect everything we want to observe in the new custom
  resources.
* This approach does not solve the problem of hybrid clusters.

#### Ephemeral BareMetalHost

Another solution is to have separate namespaces for each cluster but
import BareMetalHosts in those namespaces on demand when new compute resources
are needed.

The cluster requires a resource that acts as a source of BareMetalHosts, which
can be parameterized on server requirements and the number of replicas. The
concept of
[BareMetalPool](https://gitlab.com/Orange-OpenSource/kanod/baremetalpool)
in Kanod is similar to ReplicaSets for pods. This concept is also used in
[this proposal](https://github.com/metal3-io/metal3-docs/pull/268) for a
Metal3Host resource. The number of replicas must be synchronized with the
requirements of the cluster. It may be updated by a
[separate controller](https://gitlab.com/Orange-OpenSource/kanod/kanod-poolscaler)
checking the requirements of machine deployments and control-planes.
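
The shape of such a pool resource could resemble the sketch below; the group,
kind, and field names are purely illustrative and are not Kanod's actual API:

```yaml
# Hypothetical pool of BareMetalHosts, analogous to a ReplicaSet for pods.
apiVersion: pools.example.org/v1alpha1
kind: BareMetalPool
metadata:
  name: tenant-a-pool
  namespace: tenant-a
spec:
  # Kept in sync with the machine deployments and control-plane,
  # for example by a pool scaler controller.
  replicas: 3
  # Server requirements used to pick hosts from the infrastructure.
  hostSelector:
    matchLabels:
      hardware-profile: worker
```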

The main security risk is that when a cluster releases a BareMetalHost, it may
keep the credentials that provide full control over the server.
This can be resolved if those credentials are temporary. In Kanod,
BareMetalPools obtain new servers from a REST API implemented by a
[BareMetalHost broker](https://gitlab.com/Orange-OpenSource/kanod/brokerdef).
The broker implementation uses either the fact that Redfish is an HTTP API to
implement a proxy, or the capability of Redfish to create new users with a
Redfish `operator` role, to implement BareMetalHost resources with a limited
lifespan.

A pool is implemented as an API that is protected by a set of credentials that
identify the user.

The advantages of this approach are:

* Support for the pivot operation, even for tenant clusters, as it provides a
  complete bare-metal-as-a-service solution.
* Cluster administrators have full access to the BMC and can configure servers
  according to their needs using custom procedures that are not exposed by
  standard Metal3 controllers.
* Network isolation can be established before the BareMetalHost is created in
  the scope of the cluster. There is no transfer of servers from one network
  configuration to another, which could invalidate parts of the introspection.

The disadvantages of the BareMetalPool approach are:

* The implementation of the broker with its dedicated server is quite complex.
* To have full dynamism over the pool of servers, a new type of autoscaler is
  needed.
* Unnecessary inspections of servers are performed when they are transferred
  from one cluster (tenant) to another.
* The current implementation of the proxy is limited to the Redfish protocol
  and would require significant work for IPMI.

#### HostClaims as a Right to Consume BareMetalHosts

This approach combines the concept of the remote endpoint of BareMetalPools
with the API-oriented approach of HostClaims, as described above.

In this variation, the HostClaim will be an object in the BareMetalHost
namespace defined with an API endpoint to drive the associated BareMetalHost.
The associated credentials are known to the Metal3Machine controller
because they are associated with the cluster. The Metal3 machine controller
will use this endpoint and the credentials to create the HostClaim. The
HostClaim controller will associate the HostClaim with a BareMetalHost.
Control actions and information about the status of the BareMetalHost will
be exchanged with the Metal3 machine controller through the endpoint.

The main advantage of this approach is that BareMetalHosts do not need to be
on the same cluster.

The main drawbacks are:

* It only addresses the multi-tenancy scenario. The hybrid scenario is not
  solved, and the usage of HostClaim outside Cluster API is not addressed
  either. The main reason is that there is no counterpart of the
  BareMetalHost in the namespace of the tenant.
* The end user will have a very limited direct view of the compute resources
  they are using, even when the BareMetalHosts are on the same cluster.

Extending HostClaims with a remote variant fulfills the same requirements
but keeps an explicit object in the namespace of the cluster definition
representing the API offered by this approach.

A remote HostClaim is a HostClaim with kind set to `remote` and at
least two arguments:

* One points to a URL and a set of credentials to access the endpoint on a
  remote cluster,
* The second is the kind of the copied HostClaim created on the remote
  cluster.

The two HostClaims are synchronized: the specification of the source HostClaim
is copied to the remote one (except the kind part). The status of the target
HostClaim is copied back to the source. Most of the metadata is also copied
from the target to the source. The exact implementation of this extension is
beyond the scope of this proposal.
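
A remote HostClaim could then look like the following sketch; the field names
under `remote` are illustrative only:

```yaml
apiVersion: metal3.io/v1alpha1   # illustrative group/version
kind: HostClaim
metadata:
  name: worker-0
  namespace: cluster-a
spec:
  kind: remote
  remote:
    # Endpoint and credentials of the cluster hosting the BareMetalHosts.
    url: https://infra.example.com:6443
    credentialsName: infra-cluster-credentials
    # Kind of the HostClaim copy created on the remote cluster.
    targetKind: baremetal
```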

### Hybrid Clusters Without HostClaim

#### Control-Planes as a Service

The Kubernetes control-plane can be considered as an application with a
single endpoint. Some Cluster API control-plane providers implement a factory
for new control-planes directly, without relying on the infrastructure
provider. Usually, this control-plane is hosted in the management cluster
as a regular Kubernetes application. [Kamaji](https://kamaji.clastix.io/) and
[k0smotron](https://docs.k0smotron.io/stable/) implement this approach.

The cluster is hybrid because the control-plane pods are not hosted on standard
nodes, but workers are usually all implemented by a single infrastructure
provider and are homogeneous.

The approach solves the problem of sharing resources for control-planes but
does not address the creation of clusters with distinct needs for workers.
Only one kind of worker is supported.

#### Many Infrastructure Cluster Resources per Cluster

It is possible to coerce Cluster API to create mixed clusters using the
fact that the relationship between objects is loose. The approach is
presented in a [blog post](https://metal3.io/blog/2022/07/08/One_cluster_multiple_providers.html).

The goal is to define a cluster over technologies I_1, ... I_n where I_1 is
the technology used for the control-plane.
One Cluster object is defined, but an infrastructure cluster I_iCluster
is defined for each technology I_i (for example a Metal3Cluster for Metal3).
These infrastructure cluster objects use the same control-plane endpoint. The
Cluster object references the I_1Cluster object as `infrastructureRef`.

With some limited assumptions about the providers, the approach works even if
the cluster is unaware of technologies I_2...I_n, and it requires no
modification to Cluster API.
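
A condensed sketch of the pattern, assuming Metal3 provides the control-plane
nodes (I_1) and a second, hypothetical provider supplies extra workers (I_2);
only the fields relevant to the linkage are shown:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: hybrid
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: hybrid-control-plane
  # Only the control-plane technology (I_1) is referenced by the Cluster.
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3Cluster
    name: hybrid-metal3
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Cluster
metadata:
  name: hybrid-metal3
spec:
  controlPlaneEndpoint:
    host: 192.168.111.249
    port: 6443
---
# Second infrastructure cluster (I_2) for another worker technology; the kind
# is hypothetical. It must advertise the same control-plane endpoint.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OtherProviderCluster
metadata:
  name: hybrid-other
spec:
  controlPlaneEndpoint:
    host: 192.168.111.249
    port: 6443
```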

There is no standardization in the definition of machine deployments across
different technologies. For example, Metal3 is the sole infrastructure
provider that employs DataTemplates to capture parameters specific to a
given node.

But the main issue is that many existing providers are opinionated about
networking. Unfortunately, mixing infrastructure providers requires custom
configuration to interconnect the different deployments. A framework that does
not handle networking is a better base for building working hybrid clusters.

#### Bring Your Own Hosts

Bring Your Own Host (BYOH) is a Cluster API provider that uses existing compute
resources running a specific agent used to register the resource and deploy
Kubernetes on the server.

BYOH does not impose many constraints on the compute resource, but the resource
must be launched beforehand and it must know how to access the management
cluster. A solution is to implement, for each kind of targeted compute
resource, a concept of pool launching an image with the agent activated and
the bootstrap credentials. An example for BareMetalHosts could be the notion
of BareMetalPools presented above.

An autoscaler can be designed to keep the size of the pools synchronized with
the needs of the cluster (size of the machine deployments and control-plane
with additional machines for updates).

The main drawbacks of the approach are:

* The approach requires many new resources and controllers. Keeping all of them
  synchronized is complex. BareMetalPools are already a complex approach for
  BareMetalHost multi-tenancy.
* Performing updates or pivots with BYOH is not easy. The way agents are
  stopped or change their target cluster requires modifications to the BYOH
  controllers.

A prototype of this approach was developed in the Kanod project.

## References