Skip to content

Handle node client certs #26

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 16, 2019

Conversation

enj
Copy link
Contributor

@enj enj commented May 10, 2019

Signed-off-by: Monis Khan mkhan@redhat.com

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 10, 2019
return err
}
approvalMsg += " (no SAN validation)"
return fmt.Errorf("failed to list machines: %v", err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do this now? testing this PR would then be contingent on both the bootstrap self-approve script and PR to disable the auto-approve. just wondering

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it passes CI we know that we don't need the failover, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you could get a transient error and hit the failover path because the code is too blunt. I think we need to remove it before GA so might as well do it now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it passes CI we know that we don't need the failover, right?

Yeah, otherwise removing it might stop us from bootstrapping with no machine API available. So let's see what CI says.. I am +1 on getting rid of the failover here though.

@enj
Copy link
Contributor Author

enj commented May 10, 2019

/retest

3 similar comments
@enj
Copy link
Contributor Author

enj commented May 10, 2019

/retest

@enj
Copy link
Contributor Author

enj commented May 11, 2019

/retest

@enj
Copy link
Contributor Author

enj commented May 11, 2019

/retest

@enj
Copy link
Contributor Author

enj commented May 11, 2019

Once we got past the AWS issues this ran fine. Now I just need to add more unit tests.

@enj
Copy link
Contributor Author

enj commented May 11, 2019

Well at least the serving certs bit is working fine. The client cert is failing on RBAC:

I0511 12:39:05.629859       1 main.go:122] CSR csr-5696g not authorized: failed to check if node ip-10-0-146-108.ec2.internal already exists: nodes "ip-10-0-146-108.ec2.internal" is forbidden: User "system:serviceaccount:openshift-cluster-machine-approver:machine-approver-sa" cannot get resource "nodes" in API group "" at the cluster scope

}

func findMatchingMachineFromInternalDNS(nodeName string, machines []v1beta1.Machine) (v1beta1.Machine, bool) {
for _, machine := range machines {
Copy link
Member

@enxebre enxebre May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is something we can guarantee being available for every cloud. Not sure NodeInternalDNS is exposed on the API in Azure/GCP. It might be only exposed via metadata which makes it unreachable and a no go for the machine API controllers in those clouds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any other way to very specifically find the machine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would the privateIP be sufficient?
medium term we should include the node/machine providerID here but this is not supported today

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CSR only includes the (future) node name so we have to base the check on that...

Copy link
Member

@enxebre enxebre May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't we get the privateIP here (in future providerID as well) https://github.com/openshift/cluster-machine-approver/pull/26/files#diff-a482a1a664767249971b0b71ca9baaf0R194 and pass it down to findMatchingMachineBy*()? I see the node does not exist yet at that point

csr_check.go Outdated
return fmt.Errorf("machine for node %s already has node ref", nodeName)
}

if !isWorkerMachine(nodeMachine) {
Copy link
Member

@enxebre enxebre May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we don't want masters to go through this however is this effectively dropping machines which does not contain https://github.com/openshift/cluster-machine-approver/pull/26/files#diff-a482a1a664767249971b0b71ca9baaf0R28? How about any additional user machines/machineSets? I wonder if we might want to have an specific label for this e.g machine.openshift.io/CSR: true or possible better the other way around and have masters to opt out with machine.openshift.io/CSR-need-approval: false WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted the most strict whitelist based approach. This approval flow is only meant for workers so this check seemed appropriate. I am not for or against adding a new label, we just need to discuss that type of change with the other stakeholders.

csr_check.go Outdated
@@ -88,39 +106,39 @@ func validateCSRContents(req *certificatesv1beta1.CertificateSigningRequest, csr
// authorizeCSR authorizes the CertificateSigningRequest req for a node's server certificate.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment needs to be updated

certificatesv1beta1.UsageKeyEncipherment,
certificatesv1beta1.UsageDigitalSignature,
certificatesv1beta1.UsageClientAuth,
certificatesv1beta1.UsageServerAuth,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have variants that have less usages and wrong ones (that replace client with server) for good measure.

csr_check.go Outdated
return fmt.Errorf("machine for node %s is not a worker", nodeName)
}

start := nodeMachine.CreationTimestamp.Add(-maxMachineDelta) // TODO maybe this should be really small to account only for clock drift

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduce as discussed on call: 5-10 sec

csr_check.go Outdated
}

func isWorkerMachine(machine v1beta1.Machine) bool {
return machine.Labels[machineRoleKey] == machineRoleTypeValue && machine.Labels[machineTypeKey] == machineRoleTypeValue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to the to-be-agreed-upon label

@mrogers950
Copy link

@enj looks good, squash the commits and I'll tag.

@enj enj force-pushed the enj/i/node_client branch from df72578 to f63854b Compare May 13, 2019 19:59
@mrogers950
Copy link

/lgtm

abhinavdahiya added a commit to abhinavdahiya/machine-config-operator that referenced this pull request May 13, 2019
… client CSRs

Because of the existense of this ClusterRoleBinding, kube-controller-manager approves all node client CSRs requests and therefore
allowing wrongful actors to get node level client priviledges. cluster-machine-approver [1] takes on the role of approving only certain node client CSR
requests.

[1]: openshift/cluster-machine-approver#26
@enj
Copy link
Contributor Author

enj commented May 13, 2019

/test all

@enj
Copy link
Contributor Author

enj commented May 13, 2019

/refresh

csr_check.go Outdated
// 2. Node does not exist
// 3. Use machine API internal DNS to locate matching machine based on node name
// 4. Machine must not have a node ref
// 5. Machine must opt-in to CSR auto approve via label
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we weigh pros/cons of label versus annotation?

csr_check.go Outdated
}

func machineWantsCSRAutoApprove(machine v1beta1.Machine) bool {
return machine.Labels[machineCSRAutoApproveKey] == machineCSRAutoApproveValue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just want to know pros/cons of label versus annotation.

i tend to think of labels as grouping, and annotations as behavioral.

@enj
Copy link
Contributor Author

enj commented May 14, 2019

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 14, 2019
@enj
Copy link
Contributor Author

enj commented May 15, 2019

apiVersion: v1
kind: ConfigMap
metadata:
  name: machine-approver-config
  namespace: openshift-cluster-machine-approver
data:
  config.yaml: |-
    nodeClientCert:
      disabled: true
I0515 18:57:00.248954       1 config.go:23] machine approver config: {NodeClientCert:{Disabled:true}}
I0515 18:57:00.249006       1 main.go:183] Starting Machine Approver
I0515 18:57:25.336449       1 main.go:107] CSR csr-cwcjt added
I0515 18:57:25.344832       1 main.go:132] CSR csr-cwcjt not authorized: CSR csr-cwcjt for node client cert rejected as the flow is disabled
I0515 18:57:25.344856       1 main.go:164] Error syncing csr csr-cwcjt: CSR csr-cwcjt for node client cert rejected as the flow is disabled

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2019
Signed-off-by: Monis Khan <mkhan@redhat.com>
@enj enj force-pushed the enj/i/node_client branch from a89d98d to d7e76c4 Compare May 15, 2019 19:27
@mrogers950
Copy link

/hold cancel
/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 15, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enj, mrogers950

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enj
Copy link
Contributor Author

enj commented May 15, 2019

/cherrypick release-4.1

@openshift-cherrypick-robot

@enj: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mrogers950
Copy link

/retest

3 similar comments
@mrogers950
Copy link

/retest

@enj
Copy link
Contributor Author

enj commented May 15, 2019

/retest

@enj
Copy link
Contributor Author

enj commented May 16, 2019

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

abhinavdahiya added a commit to abhinavdahiya/installer that referenced this pull request May 16, 2019
…lete

PR for cluster-machine-approver [1] is taking over the approval of CSRs for client certificates for Machines that end up
as Nodes in Openshift clusters.

But during bootstrapping, cluster-machine-approver is not available and therefore, this service is required to approve CSRs until
we have successfully bootstrapped the control plane, after which cluster-machine-approver or users take over the role of approving any new CSRs.

Currently, all CSRs are automatically approved without any condition, this PR scopes it to only during bootstrapping phase, securing the endpoint for later use.

[1]: openshift/cluster-machine-approver#26
@enj
Copy link
Contributor Author

enj commented May 16, 2019

/retest

@openshift-cherrypick-robot

@enj: new pull request created: #27

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/installer that referenced this pull request May 16, 2019
…lete

PR for cluster-machine-approver [1] is taking over the approval of CSRs for client certificates for Machines that end up
as Nodes in Openshift clusters.

But during bootstrapping, cluster-machine-approver is not available and therefore, this service is required to approve CSRs until
we have successfully bootstrapped the control plane, after which cluster-machine-approver or users take over the role of approving any new CSRs.

Currently, all CSRs are automatically approved without any condition, this PR scopes it to only during bootstrapping phase, securing the endpoint for later use.

[1]: openshift/cluster-machine-approver#26
runcom pushed a commit to runcom/machine-config-operator that referenced this pull request May 16, 2019
… client CSRs

Because of the existense of this ClusterRoleBinding, kube-controller-manager approves all node client CSRs requests and therefore
allowing wrongful actors to get node level client priviledges. cluster-machine-approver [1] takes on the role of approving only certain node client CSR
requests.

[1]: openshift/cluster-machine-approver#26
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request May 16, 2019
… client CSRs

Because of the existense of this ClusterRoleBinding, kube-controller-manager approves all node client CSRs requests and therefore
allowing wrongful actors to get node level client priviledges. cluster-machine-approver [1] takes on the role of approving only certain node client CSR
requests.

[1]: openshift/cluster-machine-approver#26
stbenjam pushed a commit to openshift-metal3/kni-installer that referenced this pull request May 29, 2019
…lete

PR for cluster-machine-approver [1] is taking over the approval of CSRs for client certificates for Machines that end up
as Nodes in Openshift clusters.

But during bootstrapping, cluster-machine-approver is not available and therefore, this service is required to approve CSRs until
we have successfully bootstrapped the control plane, after which cluster-machine-approver or users take over the role of approving any new CSRs.

Currently, all CSRs are automatically approved without any condition, this PR scopes it to only during bootstrapping phase, securing the endpoint for later use.

[1]: openshift/cluster-machine-approver#26
nodeBootstrapperUsername = "system:serviceaccount:openshift-machine-config-operator:node-bootstrapper"

maxMachineClockSkew = 10 * time.Second
maxMachineDelta = 10 * time.Minute
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask where this choice of number comes from? In our CI system (nested virt env), we seem to be quite a bit late:

E0701 18:15:26.348374       1 main.go:174] CSR csr-xl8fs creation time 2019-07-01 18:15:26 +0000 UTC not in range (2019-07-01 17:40:40 +0000 UTC, 2019-07-01 17:50:50 +0000 UTC)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not performing any cryptographic verification. Instead we are making a best effort check based on the state of the machine API. Thus we need to keep the window very short to prevent the chance of an attacker guessing the correct name of a new node. Increasing the window gives the attacker more time and lowers the security of this approver.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@enj Thanks for explaining. However the time of choice, seems very arbitrary to me. Please correct me if I'm wrong since I'm no security export but isn't 10mis a very large window already for an attacker? Does this check really add any real security to the system?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also rate limit based on the total amount of outstanding CSRs. I would not call it great security, but we make it hard enough to exploit for the time being.

Note that the solution is not to increase the window. The correct solution is to cryptographically tie the CSR with the new node that is being provisioned. Then an attacker cannot create a CSR to steal credentials. It would also allow nodes to create more than one CSR (it allows them to not store state about previously requested CSRs).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for your quick reply and detailed explanation.

I'm actually more suggesting if the window is needed at all. Is the 10 mins time short enough for attackers to not be able to exploit? If it's not, I would rather suggest this check be removed as it's causing us problems and not helping in that case. The real solution you mention, sounds good and it won't affect us negatively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to know when that 10 min windows occurs - it is actually pretty hard to know without access to the machine API or the underlying cloud API. That basically leaves brute force methods, which the rate limiting does a good job of preventing. Without the window you could just lay down a handful of guess CSRs and see if you get lucky. The windows forces a much more involved attack. Basically it forces an active attacker instead of a passive one. The implementations of the machine API need to properly coordinate the creation of the machine object and the related CSR object.

Feel free to bring this up in the ABC arch call for further discussion if desired.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for another very thorough and detailed answer. Much appreciated!

You'll have to understand that this breaks our (Openshift Installer on libvirt) CI right now. It's a nested environment so understandably things are slower than normal, despite the fact that the VM has 8 core, 32G RAM and and SSD drive assigned to it. Apart from arguing for removal or extension of the time window check, if you have any other suggestions on how to fix or workaround this issue, please do let me know cause this is very high priority for us (multiple teams).

I'm afraid I don't know what ABC is. I'm still relatively new here but I'd love to join if we could discuss it there. :)

@enxebre enxebre mentioned this pull request Jul 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants