Description
We have noticed a situation where, occasionally, a single instance of an app will start raising bad_certificate exceptions when calling other apps in Cloud Foundry.
The error is caused by the go-router rejecting the client certificate, and it only affects apps that do not explicitly configure mTLS themselves, i.e. the ones doing mTLS with the go-router using the instance identity certificate.
We have thoroughly investigated this incident. Looking inside the app instance container shows a proper key/certificate pair with proper validity. Restarting the app instance eliminates the issue (at least for the following 24 hours). The errors do not start after the certificate expires, but after the certificate is renewed. Touching the certificate and/or key file in the mounted volume (on the diego-cell, as root) triggers the file watcher and also resolves the issue.
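For context, here is a minimal sketch of the kind of file watching presumably involved (this is not the provider's actual code; the class name and the credential path are illustrative): a WatchService registered on the credential directory fires an ENTRY_MODIFY event whenever either file is rewritten, which is also why manually touching the files re-triggers the reload.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

final class CredentialDirectoryWatcher {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Illustrative path; the instance identity credentials live in a mounted volume.
        Path credentialDirectory = Path.of("/etc/cf-instance-credentials");

        try (WatchService watchService = FileSystems.getDefault().newWatchService()) {
            credentialDirectory.register(watchService, StandardWatchEventKinds.ENTRY_MODIFY);

            while (true) {
                WatchKey key = watchService.take(); // blocks until a file in the directory changes
                for (WatchEvent<?> event : key.pollEvents()) {
                    // Every modification (including a manual `touch`) lands here and
                    // would kick off a keystore reload on a watcher thread.
                    System.out.println("Modified: " + event.context());
                }
                key.reset();
            }
        }
    }
}
```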
Our conclusion is that this is a race condition. The certificate file is always updated first and the key file follows. Most of the time this update is almost instantaneous, but sometimes there is a millisecond or two of difference. Since the watcher threads do not update the keystore in a synchronised block, we think the following occasionally happens: the first file watcher thread to be triggered reads a mismatching private key file, but finishes after the second file watcher thread (which reads and updates the keystore with a matching key/certificate pair), thereby overwriting the keystore with a mismatched key/certificate pair.
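To make the suspected interleaving concrete, here is a hedged sketch of such a reload routine (again, not the provider's actual code; KeyStoreReloader, the alias, the password and the read helpers are made up for illustration). Two watcher threads can run onFileChanged concurrently, one per file event, and nothing prevents the thread that read a stale key from finishing last:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.KeyFactory;
import java.security.KeyStore;
import java.security.PrivateKey;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;
import java.security.spec.PKCS8EncodedKeySpec;
import java.util.Base64;

final class KeyStoreReloader {

    private final Path certificateFile;
    private final Path privateKeyFile;
    private final KeyStore keyStore;

    KeyStoreReloader(Path certificateFile, Path privateKeyFile, KeyStore keyStore) {
        this.certificateFile = certificateFile;
        this.privateKeyFile = privateKeyFile;
        this.keyStore = keyStore;
    }

    // Invoked from a watcher thread for every file modification. Two threads can
    // run this concurrently: one for the certificate event, one for the key event.
    void onFileChanged() throws GeneralSecurityException, IOException {
        // The thread triggered by the certificate change may read the NEW
        // certificate but the OLD key, if the key file has not been rewritten yet.
        X509Certificate certificate = readCertificate(certificateFile);
        PrivateKey privateKey = readPrivateKey(privateKeyFile);

        // If that thread is delayed here, the thread triggered by the key change
        // can read a matching pair and update the keystore first; the delayed
        // thread then overwrites that good entry with its mismatched pair,
        // producing bad_certificate on subsequent handshakes.
        keyStore.setKeyEntry("instance-identity", privateKey, "changeit".toCharArray(),
                new Certificate[]{certificate});
    }

    private static X509Certificate readCertificate(Path file)
            throws GeneralSecurityException, IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return (X509Certificate) CertificateFactory.getInstance("X.509")
                    .generateCertificate(in);
        }
    }

    // Assumes an unencrypted PKCS#8 PEM key purely to keep the sketch short.
    private static PrivateKey readPrivateKey(Path file)
            throws GeneralSecurityException, IOException {
        String pem = Files.readString(file)
                .replace("-----BEGIN PRIVATE KEY-----", "")
                .replace("-----END PRIVATE KEY-----", "")
                .replaceAll("\\s", "");
        byte[] der = Base64.getDecoder().decode(pem);
        return KeyFactory.getInstance("RSA").generatePrivate(new PKCS8EncodedKeySpec(der));
    }
}
```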
Such a situation also matches the observed behaviour (the raised bad_certificate exceptions).
We have built our own custom security provider to try to log this behaviour (and other potential root causes), but (un)fortunately, all apps using our custom security provider always work. We do notice an occasional thread reading a mismatched key/certificate pair, but it always happens in the proper order: the thread that reads the mismatched key/certificate pair finishes first, and the keystore is then properly updated by the second thread, which reads a properly matched key/certificate pair.
This started occurring roughly 6 months ago. We run two foundations with 1600+ apps each and more than 3000 instances in total on each foundation. On any given day this can affect up to 1% of the instances. The main changes 6 months ago were switching to bionic stemcells and moving to larger, newer VMs for the diego-cells on Azure (we also have AWS foundations, but they are not affected).
Since this has become a real nuisance for us, we could not wait for our logging to produce hard proof of the race condition; it appears to be a typical heisenbug, where the extra log entries and check operations add just enough delay for the threads to always read a properly matched key/certificate pair.
I will open a pull request with a solution for this situation. While a simple synchronized block would already fix the issue by guaranteeing that the second thread always finishes last, I prefer to update the keystore only if the private key matches the certificate, so that even the theoretical case of a request arriving in the microseconds during which the keystore holds a mismatched key/certificate pair cannot occur. A sketch of this idea follows.
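A hedged sketch of that mitigation (the class name, alias and password are illustrative, and this is only one way to express the "update only when the pair matches" idea; the pull request may implement it differently). For RSA instance identity credentials, a key and a certificate belong together exactly when the private key's modulus equals the modulus of the certificate's public key:

```java
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.security.PrivateKey;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.security.interfaces.RSAPrivateKey;
import java.security.interfaces.RSAPublicKey;

final class KeyStoreGuard {

    private KeyStoreGuard() {
    }

    // True only if the RSA private key and the certificate's public key share the
    // same modulus, i.e. they form a valid pair.
    static boolean matches(PrivateKey privateKey, X509Certificate certificate) {
        if (privateKey instanceof RSAPrivateKey
                && certificate.getPublicKey() instanceof RSAPublicKey) {
            return ((RSAPrivateKey) privateKey).getModulus()
                    .equals(((RSAPublicKey) certificate.getPublicKey()).getModulus());
        }
        return false; // other key types would need their own comparison
    }

    // Only replace the keystore entry when the pair is consistent. A mismatch means
    // the rotation was caught mid-way; the watcher event for the other file will
    // deliver a consistent pair shortly afterwards, so this update can be skipped.
    static void updateIfConsistent(KeyStore keyStore, String alias,
                                   PrivateKey privateKey, X509Certificate certificate)
            throws KeyStoreException {
        if (matches(privateKey, certificate)) {
            keyStore.setKeyEntry(alias, privateKey, "changeit".toCharArray(),
                    new Certificate[]{certificate});
        }
    }
}
```

With such a guard, a thread that happens to read a mismatched pair simply does nothing, and the keystore only ever transitions from one consistent key/certificate pair to the next.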