Description
Is your feature request related to a problem?/Why is this needed
This is inspired by the discussion in kubernetes-csi/external-provisioner#221. To summarize the issue discussed there: if we want to comply with the CSI spec by telling the SP that the node selected by the scheduler must be able to access the volume (i.e. specify `--strict-topology`), we need to pass only one item in `requisite`. And since `preferred` must be a subset of `requisite`, it carries no extra information here. As a result, we cannot pass more context to the SP (the `AllowedTopologies` of the storage class, the zones in which nodes are deployed, etc.).
On the other hand, if we don't specify `--strict-topology`, Kubernetes assumes the first topology in `preferred` can access the volume, which is not in the CSI spec.
Currently external-provisioner also does not make good use of these two fields. From this table, we can see that `requisite` is always a randomly ordered version of `preferred`, which adds no information at all.
The original "one of" semantics of `requisite` is a bit confusing by itself. One may naturally think an empty "one of" list means always false (e.g. `python3 -c 'print(any([]))'` prints `False`). But the empty list is defined as "no requirements" (always true) in CSI.
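The mismatch is easy to reproduce with Python's built-ins. The sketch below is only an illustration; `one_of` and `all_of` are names made up here, not CSI terms. It shows that an "all of" reading makes the empty list mean "no requirements" without any special case.

```python
# Empty-list behaviour of the two quantifiers, using Python built-ins.
# `one_of` and `all_of` are illustrative names, not part of CSI.

def one_of(checks):
    """'One of' reading: at least one entry must hold."""
    return any(checks)


def all_of(checks):
    """'All of' reading: every entry must hold."""
    return all(checks)


no_requirements = []  # stands in for an empty requisite list

print(one_of(no_requirements))  # False: "one of nothing" is never satisfied
print(all_of(no_requirements))  # True: "all of nothing" is trivially satisfied
```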
Describe the solution you'd like in detail
Change the semantics of the existing `TopologyRequirement` message:
- The volume should be accessible from ALL of the `requisite` topologies, instead of the original "one of".
- The `preferred` topologies are just hints to the SP (e.g. for placing replicas), and not necessarily a subset of `requisite`.
This should be easy to understand, and it greatly simplifies the original x vs n conditions. `requisite` and `preferred` are orthogonal.
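As a rough sketch of the difference, the two readings can be written as predicates over the topologies a volume is actually accessible from. This is a simplified model: a topology is treated as a plain dict compared by equality, the x vs n conditions of the current spec are ignored, and the function names are hypothetical.

```python
from typing import Dict, List

Topology = Dict[str, str]  # e.g. {"region": "R1", "zone": "Z2"}


def satisfies_one_of(accessible: List[Topology], requisite: List[Topology]) -> bool:
    """Original reading: needs a special case so that empty means 'no requirements'."""
    if not requisite:
        return True
    return any(t in accessible for t in requisite)


def satisfies_all_of(accessible: List[Topology], requisite: List[Topology]) -> bool:
    """Proposed reading: the empty list needs no special case."""
    return all(t in accessible for t in requisite)


accessible = [{"zone": "Z2"}]
requisite = [{"zone": "Z2"}, {"zone": "Z3"}]

print(satisfies_one_of(accessible, requisite))  # True: Z2 alone is enough
print(satisfies_all_of(accessible, requisite))  # False: Z3 is not reachable
```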
This is a breaking change, so introduce a new `STRICT_VOLUME_TOPOLOGY_REQUISITE` controller capability to opt into the new semantics.
On the Kubernetes side, to support the new semantics, we should set:
- `requisite`
  - If `WaitForFirstConsumer`, the topology of the scheduler-selected node
  - Else, empty
- `preferred`
  - If available, allowed topologies from storage class
  - Else if enabled by a new flag (say `--aggregated-topology`), aggregated cluster topology, with the scheduler-selected node first
  - Else, empty
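The rules above can be sketched as a single function. This is a minimal illustration, not external-provisioner code: `build_topology_requirement` and all of its parameter names are hypothetical, and topologies are modelled as plain dicts.

```python
from typing import Dict, List, Optional, Tuple

Topology = Dict[str, str]


def build_topology_requirement(
    binding_mode: str,                   # "WaitForFirstConsumer" or "Immediate"
    selected_node: Optional[Topology],   # topology of the scheduler-selected node, if any
    allowed_topologies: List[Topology],  # from the storage class; may be empty
    cluster_topologies: List[Topology],  # aggregated cluster topology
    aggregated_topology: bool,           # the proposed --aggregated-topology flag
) -> Tuple[List[Topology], List[Topology]]:
    """Return (requisite, preferred) under the proposed semantics."""
    # requisite depends only on the binding mode.
    if binding_mode == "WaitForFirstConsumer" and selected_node is not None:
        requisite = [selected_node]
    else:
        requisite = []

    # preferred does not depend on the binding mode.
    if allowed_topologies:
        preferred = list(allowed_topologies)
    elif aggregated_topology:
        rest = [t for t in cluster_topologies if t != selected_node]
        first = [selected_node] if selected_node is not None else []
        preferred = first + rest
    else:
        preferred = []
    return requisite, preferred


requisite, preferred = build_topology_requirement(
    "WaitForFirstConsumer",
    {"zone": "Z2"},
    [],
    [{"zone": "Z1"}, {"zone": "Z2"}, {"zone": "Z3"}],
    True,
)
print(requisite)  # [{'zone': 'Z2'}]
print(preferred)  # [{'zone': 'Z2'}, {'zone': 'Z1'}, {'zone': 'Z3'}]
```

Keeping the two branches independent in the code mirrors the point that the fields are orthogonal.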
The behavior is also clearer and simpler. Again, `requisite` and `preferred` are orthogonal.
For reference, the original `requisite` generated by Kubernetes is the allowed topologies from storage class or aggregated cluster topology, with special cases:
- if `WaitForFirstConsumer` and `--strict-topology` is specified, the topology of the scheduler-selected node
- if `Immediate` binding and `--immediate-topology=false` is specified and allowed topologies from storage class are not available, empty
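For contrast, the description above can be modelled as follows. This only restates the prose in code and is not taken from external-provisioner; `legacy_requisite` and its parameters are hypothetical names, and the sketch assumes that allowed topologies from the storage class take precedence over the aggregated cluster topology.

```python
from typing import Dict, List, Optional

Topology = Dict[str, str]


def legacy_requisite(
    binding_mode: str,                   # "WaitForFirstConsumer" or "Immediate"
    selected_node: Optional[Topology],   # topology of the scheduler-selected node, if any
    allowed_topologies: List[Topology],  # from the storage class; may be empty
    cluster_topologies: List[Topology],  # aggregated cluster topology
    strict_topology: bool,               # models --strict-topology
    immediate_topology: bool,            # models --immediate-topology
) -> List[Topology]:
    """Model of the original requisite as described in the text above."""
    # Special case 1: WaitForFirstConsumer with --strict-topology.
    if (binding_mode == "WaitForFirstConsumer" and strict_topology
            and selected_node is not None):
        return [selected_node]
    # Assumption: storage class topologies win over the aggregated list.
    if allowed_topologies:
        return list(allowed_topologies)
    # Special case 2: Immediate binding with --immediate-topology=false.
    if binding_mode == "Immediate" and not immediate_topology:
        return []
    return list(cluster_topologies)
```

Compared with the proposed rules, the result here depends on both the binding mode and two separate flags.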
I would expect minimal code changes to existing CSI drivers that work with Kubernetes:
- Those using `--strict-topology=true` and `--immediate-topology=false` should now use `--aggregated-topology=false`. The `requisite` is not changed, and `preferred` contains more information about allowed topologies, which should do no harm.
- Those using `--strict-topology=false` and `--immediate-topology=true` should now look at `requisite` for the hard requirement. But if they continue to treat the first item in `preferred` as the hard requirement, they should also continue to work with Kubernetes.
- For other cases, I don't think they are useful.
Describe alternatives you've considered
Add a requirement that the SP MUST ensure the volume is accessible from the first item of `preferred`.
This makes an already complex requirement more complex. It looks like a breaking change for SPs, but Kubernetes already expects SPs to implement this; otherwise Kubernetes will fail to schedule the pod. So this approach is breaking for the spec, but not for implementations.
Also consider: what if a distributed workload needs to access the same volume from more than one topology?
Additional context
I'm not sure about the impact on the Mesos implementation.
Here is my draft of the new `TopologyRequirement`:

```protobuf
// Current description applies if the SP has
// STRICT_VOLUME_TOPOLOGY_REQUISITE capability. See version 1.9.0 for
// the description if such capability is not present.
message TopologyRequirement {
  // Specifies the list of topologies the provisioned volume MUST be
  // accessible from.
  // This field is OPTIONAL. If TopologyRequirement is specified either
  // requisite or preferred or both MUST be specified.
  //
  // The SP MUST make the provisioned volume available to
  // all topologies from the list of requisite topologies. If it is
  // unable to do so, the SP MUST fail the CreateVolume call.
  //
  // The volume MAY be accessible from additional topologies. If it
  // is, the SP SHOULD prefer the topologies in preferred list.
  //
  // For example, if a volume should be accessible from a single zone,
  // and requisite =
  //   {"region": "R1", "zone": "Z2"}
  // then the provisioned volume MUST be accessible from the "region"
  // "R1" and the "zone" "Z2". The SP MAY select the second zone
  // independently, e.g. "R1/Z4".
  repeated Topology requisite = 1;

  // Specifies the list of topologies the CO would prefer the volume to
  // be accessible from (in order of preference).
  //
  // This field is OPTIONAL. If TopologyRequirement is specified either
  // requisite or preferred or both MUST be specified.
  //
  // An SP MUST attempt to make the provisioned volume available using
  // the preferred topologies in order from first to last.
  //
  // If requisite is specified, the topologies in preferred list MAY
  // also be present in the list of requisite topologies. In such case,
  // the SP MAY use this hint to determine where the primary replica is
  // placed.
  //
  // If the topologies in preferred list are not present in the list of
  // requisite topologies, the SP MAY use them as hints about future
  // access patterns, and MAY place additional replicas in those
  // topologies. The SP MAY use an opaque parameter in
  // CreateVolumeRequest to determine the number of replicas.
  //
  // Example:
  // requisite =
  //   {"zone": "Z2"},
  //   {"zone": "Z3"}
  // preferred =
  //   {"zone": "Z3"}
  //   {"zone": "Z2"}
  //   {"zone": "Z4"}
  // then the SP MUST make the provisioned volume accessible from
  // "zone" "Z3" and "Z2". The SP MAY place the primary replica in
  // "zone" "Z3". The SP MAY place additional replicas in "zone" "Z4".
  repeated Topology preferred = 2;
}
```
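To make the draft concrete, here is one way an SP might interpret such a request. It is a sketch under assumptions, not a reference implementation: `plan_placement` and `Placement` are hypothetical names, topologies are plain dicts, and the choice of the first `preferred` entry that also appears in `requisite` as the primary is just one policy the draft permits.

```python
from typing import Dict, List, NamedTuple, Optional

Topology = Dict[str, str]


class Placement(NamedTuple):
    must_access: List[Topology]    # hard requirement: all requisite topologies
    primary: Optional[Topology]    # hint for where to put the primary replica
    replica_hints: List[Topology]  # hints for additional replicas


def plan_placement(requisite: List[Topology], preferred: List[Topology]) -> Placement:
    # The draft requires at least one of the two fields to be specified.
    if not requisite and not preferred:
        raise ValueError("either requisite or preferred (or both) must be specified")
    # First preferred topology that is also requisite hints at the primary.
    primary = next((t for t in preferred if t in requisite), None)
    # Preferred topologies outside requisite are optional replica locations.
    replica_hints = [t for t in preferred if t not in requisite]
    return Placement(list(requisite), primary, replica_hints)


plan = plan_placement(
    requisite=[{"zone": "Z2"}, {"zone": "Z3"}],
    preferred=[{"zone": "Z3"}, {"zone": "Z2"}, {"zone": "Z4"}],
)
print(plan.must_access)    # [{'zone': 'Z2'}, {'zone': 'Z3'}]
print(plan.primary)        # {'zone': 'Z3'}
print(plan.replica_hints)  # [{'zone': 'Z4'}]
```

The input is the example from the draft's comments, and the result matches what those comments describe: Z2 and Z3 are mandatory, Z3 may host the primary replica, and Z4 may host an additional replica.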
The rationale for introducing `--aggregated-topology` to replace the original `--strict-topology` and `--immediate-topology`:
Both `--strict-topology` and `--immediate-topology` were introduced to resolve a similar issue: avoiding a long list of requirements. One applies to `WaitForFirstConsumer` and the other to `Immediate`. Under this proposal, the `preferred` list no longer depends on the binding timing, so there is no reason to configure it per binding mode.
P.S. I'm waiting for #552 to be approved to be able to run `make` on my MacBook, so I can open a PR for this proposal.