Handle node client certs #26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
return err
}
approvalMsg += " (no SAN validation)"
return fmt.Errorf("failed to list machines: %v", err)
Do we want to do this now? Testing this PR would then be contingent on both the bootstrap self-approve script and the PR to disable the auto-approve. Just wondering.
If it passes CI we know that we don't need the failover, right?
Also, you could get a transient error and hit the failover path because the code is too blunt. I think we need to remove it before GA, so we might as well do it now.
If it passes CI we know that we don't need the failover, right?
Yeah, otherwise removing it might stop us from bootstrapping with no machine API available. So let's see what CI says. I am +1 on getting rid of the failover here, though.
/retest
3 similar comments
/retest
/retest
/retest
Once we got past the AWS issues this ran fine. Now I just need to add more unit tests.
Well at least the serving certs bit is working fine. The client cert is failing on RBAC:
}

func findMatchingMachineFromInternalDNS(nodeName string, machines []v1beta1.Machine) (v1beta1.Machine, bool) {
	for _, machine := range machines {
I'm not sure this is something we can guarantee is available for every cloud. Not sure NodeInternalDNS is exposed on the API in Azure/GCP; it might be only exposed via metadata, which makes it unreachable and a no-go for the machine API controllers in those clouds.
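For reference, the matching helper under discussion can be sketched as below. This is a simplified, self-contained version: the `Machine` and `NodeAddress` types here are stand-ins for the machine API's `v1beta1.Machine` and `corev1.NodeAddress`, and the address scan mirrors what `findMatchingMachineFromInternalDNS` in the diff does. An InternalIP variant would be structurally identical.

```go
package main

import "fmt"

// NodeAddress and Machine are simplified stand-ins for the real machine API
// types (v1beta1.Machine / corev1.NodeAddress) used by this PR.
type NodeAddress struct {
	Type    string // e.g. "InternalDNS", "InternalIP"
	Address string
}

type Machine struct {
	Name      string
	Addresses []NodeAddress
}

// findMatchingMachineFromInternalDNS returns the machine whose InternalDNS
// address equals the node name taken from the CSR, and whether one was found.
func findMatchingMachineFromInternalDNS(nodeName string, machines []Machine) (Machine, bool) {
	for _, machine := range machines {
		for _, address := range machine.Addresses {
			if address.Type == "InternalDNS" && address.Address == nodeName {
				return machine, true
			}
		}
	}
	return Machine{}, false
}

func main() {
	machines := []Machine{
		{Name: "worker-0", Addresses: []NodeAddress{{Type: "InternalDNS", Address: "ip-10-0-1-2.ec2.internal"}}},
	}
	m, ok := findMatchingMachineFromInternalDNS("ip-10-0-1-2.ec2.internal", machines)
	fmt.Println(m.Name, ok) // prints: worker-0 true
}
```

The concern above is that on clouds where the machine API controllers cannot populate an InternalDNS address, this scan would always return no match.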
Any other way to very specifically find the machine?
Would the privateIP be sufficient?
Medium term we should include the node/machine providerID here, but this is not supported today.
The CSR only includes the (future) node name so we have to base the check on that...
Couldn't we get the privateIP here (in future the providerID as well) https://github.com/openshift/cluster-machine-approver/pull/26/files#diff-a482a1a664767249971b0b71ca9baaf0R194 and pass it down to findMatchingMachineBy*()? I see the node does not exist yet at that point.
csr_check.go (outdated)
return fmt.Errorf("machine for node %s already has node ref", nodeName)
}

if !isWorkerMachine(nodeMachine) {
I understand we don't want masters to go through this; however, is this effectively dropping machines that do not contain https://github.com/openshift/cluster-machine-approver/pull/26/files#diff-a482a1a664767249971b0b71ca9baaf0R28? How about any additional user machines/machineSets? I wonder if we might want to have a specific label for this, e.g. machine.openshift.io/CSR: true,
or possibly better the other way around, and have masters opt out with machine.openshift.io/CSR-need-approval: false.
WDYT?
I wanted the strictest whitelist-based approach. This approval flow is only meant for workers, so this check seemed appropriate. I am not for or against adding a new label; we just need to discuss that type of change with the other stakeholders.
csr_check.go (outdated)
@@ -88,39 +106,39 @@ func validateCSRContents(req *certificatesv1beta1.CertificateSigningRequest, csr
// authorizeCSR authorizes the CertificateSigningRequest req for a node's server certificate.
This comment needs to be updated
certificatesv1beta1.UsageKeyEncipherment,
certificatesv1beta1.UsageDigitalSignature,
certificatesv1beta1.UsageClientAuth,
certificatesv1beta1.UsageServerAuth,
We should have variants that have fewer usages, and wrong ones (that replace client with server), for good measure.
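As context for the suggested negative test variants: the usage check needs an exact, order-insensitive set comparison, otherwise a CSR with fewer usages, or with server auth swapped for client auth, could slip through. A minimal self-contained sketch; plain strings stand in for certificatesv1beta1.KeyUsage, and hasExactUsages is a hypothetical helper name, not the PR's actual function.

```go
package main

import "fmt"

// hasExactUsages reports whether the CSR's usages are exactly the expected
// set, ignoring order. It counts expected usages and consumes them as the
// CSR's usages are scanned, so duplicates are also handled correctly.
func hasExactUsages(usages, expected []string) bool {
	if len(usages) != len(expected) {
		return false
	}
	remaining := map[string]int{}
	for _, u := range expected {
		remaining[u]++
	}
	for _, u := range usages {
		if remaining[u] == 0 {
			return false
		}
		remaining[u]--
	}
	return true
}

func main() {
	expected := []string{"key encipherment", "digital signature", "client auth"}
	// Same usages in a different order: exact match.
	fmt.Println(hasExactUsages([]string{"digital signature", "client auth", "key encipherment"}, expected)) // prints: true
	// Server auth in place of client auth: must be rejected.
	fmt.Println(hasExactUsages([]string{"digital signature", "server auth", "key encipherment"}, expected)) // prints: false
}
```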
csr_check.go (outdated)
return fmt.Errorf("machine for node %s is not a worker", nodeName)
}

start := nodeMachine.CreationTimestamp.Add(-maxMachineDelta) // TODO maybe this should be really small to account only for clock drift
Reduce as discussed on call: 5-10 sec
csr_check.go (outdated)
}

func isWorkerMachine(machine v1beta1.Machine) bool {
	return machine.Labels[machineRoleKey] == machineRoleTypeValue && machine.Labels[machineTypeKey] == machineRoleTypeValue
Change to the to-be-agreed-upon label
@enj looks good, squash the commits and I'll tag.
/lgtm
… client CSRs Because of the existence of this ClusterRoleBinding, kube-controller-manager approves all node client CSR requests, thereby allowing wrongful actors to get node-level client privileges. cluster-machine-approver [1] takes on the role of approving only certain node client CSR requests. [1]: openshift/cluster-machine-approver#26
/test all
/refresh
csr_check.go (outdated)
// 2. Node does not exist
// 3. Use machine API internal DNS to locate matching machine based on node name
// 4. Machine must not have a node ref
// 5. Machine must opt-in to CSR auto approve via label
Can we weigh the pros/cons of a label versus an annotation?
csr_check.go (outdated)
}

func machineWantsCSRAutoApprove(machine v1beta1.Machine) bool {
	return machine.Labels[machineCSRAutoApproveKey] == machineCSRAutoApproveValue
Just want to know the pros/cons of a label versus an annotation. I tend to think of labels as grouping, and annotations as behavioral.
/hold
apiVersion: v1
kind: ConfigMap
metadata:
  name: machine-approver-config
  namespace: openshift-cluster-machine-approver
data:
  config.yaml: |-
    nodeClientCert:
      disabled: true
Signed-off-by: Monis Khan <mkhan@redhat.com>
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: enj, mrogers950
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/cherrypick release-4.1
@enj: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
3 similar comments
/retest
/retest
/retest
/retest
Please review the full test history for this PR and help us cut down flakes.
…lete PR for cluster-machine-approver [1] is taking over the approval of CSRs for client certificates for Machines that end up as Nodes in OpenShift clusters. But during bootstrapping, cluster-machine-approver is not available; therefore this service is required to approve CSRs until we have successfully bootstrapped the control plane, after which cluster-machine-approver or users take over the role of approving any new CSRs. Currently, all CSRs are automatically approved without any condition; this PR scopes approval to the bootstrapping phase only, securing the endpoint for later use. [1]: openshift/cluster-machine-approver#26
/retest
@enj: new pull request created: #27 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
nodeBootstrapperUsername = "system:serviceaccount:openshift-machine-config-operator:node-bootstrapper"

maxMachineClockSkew = 10 * time.Second
maxMachineDelta = 10 * time.Minute
May I ask where this choice of number comes from? In our CI system (nested virt env), we seem to be quite a bit late:
E0701 18:15:26.348374 1 main.go:174] CSR csr-xl8fs creation time 2019-07-01 18:15:26 +0000 UTC not in range (2019-07-01 17:40:40 +0000 UTC, 2019-07-01 17:50:50 +0000 UTC)
We are not performing any cryptographic verification. Instead we are making a best effort check based on the state of the machine API. Thus we need to keep the window very short to prevent the chance of an attacker guessing the correct name of a new node. Increasing the window gives the attacker more time and lowers the security of this approver.
@enj Thanks for explaining. However, the choice of time seems very arbitrary to me. Please correct me if I'm wrong, since I'm no security expert, but isn't 10 mins a very large window already for an attacker? Does this check really add any real security to the system?
We also rate limit based on the total number of outstanding CSRs. I would not call it great security, but we make it hard enough to exploit for the time being.
Note that the solution is not to increase the window. The correct solution is to cryptographically tie the CSR with the new node that is being provisioned. Then an attacker cannot create a CSR to steal credentials. It would also allow nodes to create more than one CSR (it allows them to not store state about previously requested CSRs).
Thanks again for your quick reply and detailed explanation.
I'm actually questioning whether the window is needed at all. Is 10 mins short enough that attackers cannot exploit it? If not, I would rather suggest this check be removed, as it's causing us problems and not helping in that case. The real solution you mention sounds good, and it won't affect us negatively.
You have to know when that 10 min window occurs - it is actually pretty hard to know without access to the machine API or the underlying cloud API. That basically leaves brute-force methods, which the rate limiting does a good job of preventing. Without the window you could just lay down a handful of guess CSRs and see if you get lucky. The window forces a much more involved attack; basically it forces an active attacker instead of a passive one. The implementations of the machine API need to properly coordinate the creation of the machine object and the related CSR object.
Feel free to bring this up in the ABC arch call for further discussion if desired.
Thanks again for another very thorough and detailed answer. Much appreciated!
You'll have to understand that this breaks our (OpenShift Installer on libvirt) CI right now. It's a nested environment, so understandably things are slower than normal, despite the fact that the VM has 8 cores, 32G RAM, and an SSD drive assigned to it. Apart from arguing for removal or extension of the time window check, if you have any other suggestions on how to fix or work around this issue, please do let me know, because this is very high priority for us (multiple teams).
I'm afraid I don't know what ABC is. I'm still relatively new here but I'd love to join if we could discuss it there. :)
Signed-off-by: Monis Khan <mkhan@redhat.com>