-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
retry CPI calls in case of SSL_read errors as well #689
retry CPI calls in case of SSL_read errors as well #689
Conversation
We have the exact same issues, during delete_vm, set_vm_metadata and more calls, and it is blocking us from doing further upgrades. |
I guess I now have a more clear picture of what's happening. Weird broken connections can occur unexpectedly anytime. There is a difference how OpenSSL 1.x dealt with such broken connections resulting in unexpected EOFs and how OpenSSL 3.x deals with such issues. It seems that OpenSSL 3.x has a much more strict behavior about unexpected EOF. In a recent PR towards the bosh director, the OpenSSL version was bumped to 3.2.0, which seems to have woken up the giant which was always there sleeping, atleast with Azure. Other languges such as Python already have workarounds to deal with this issue intrinsically by default, but unfortunately that's not the case with ruby. There are further detailed explanations regarding the issue and how people have dealt with such issues:
In short, I feel retries should help in general in such cases and we should first try implementing retries. If not, people also suggest raising |
@@ -492,7 +492,7 @@ def apply_retry_policy(retry_data) | |||
if retry_data[:error].is_a?(OpenSSL::SSL::SSLError) || retry_data[:error].is_a?(OpenSSL::X509::StoreError) | |||
error_message = retry_data[:error].inspect | |||
|
|||
if error_message.include?(Bosh::AzureCloud::Helpers::ERROR_OPENSSL_RESET) | |||
if [Bosh::AzureCloud::Helpers::ERROR_OPENSSL_RESET, Bosh::AzureCloud::Helpers:ERROR_OPENSSL_EOF_READ].any? { |error| error_message.include?(error) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anshrupani I believe you are missing an additional colon between Helpers
and ERROR_OPENSSL_EOF_READ
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you're right, thanks. we're still testing the PR on our environment though
This issue made using the Azure CPI near impossible for us, as we'd run into constant SSL_read errors. When running a release with these changes, things have been working much better so far, turning the errors into retries (with the CPI log showcasing that this is happening quite a bit more than we'd like, but at least it doesn't kill the entire deployment anymore). |
Do you know what the concerns might be @anshrupani? |
f9b9903
to
9d40da2
Compare
I just remove this whole retry logic from this fix and it is still working. Let's test it again tomorrow and then get that in. |
Openssl.org mentions possibility of truncation attacks when this option is set and the protocol running over TLS doesn't handle such vulnerabilities. However, we should also consider that before OpenSSL3, having this option set was anyhow the case by default while we used Net::HTTP, and that other languages still have this option set by default for OpenSSL >= 3. |
fix failing unit tests related to PR #689
# Foundational Infrastructure: VM deployment lifecycle (BOSH) Contributions ### PRs Commented on/Reviewed: - 2024-05-02T00:33:35Z: [Refactor the `client` package](cloudfoundry/bosh-s3cli#38) ### Issues that may be relevant: - 2023-04-28T11:54:45Z: [BOSH Agent not reachable due to issues with blobstore settings](cloudfoundry/bosh-aws-cpi-release#149) - 2023-05-15T09:53:53Z: [Update CONTRIBUTING.md](cloudfoundry/bosh#2443) - 2023-05-16T08:07:09Z: [support instance tags and include relevant specs](cloudfoundry/bosh#2444) - 2023-05-24T05:01:42Z: [Prevent gcc version check for non xenial stemcells](cloudfoundry/bosh-azure-cpi-release#685) - 2023-05-31T17:49:02Z: [Support instance group tags](cloudfoundry/bosh#2448) - 2023-07-24T14:07:16Z: [Efficient development with the bosh-agent](cloudfoundry/bosh-agent#312) - 2023-08-14T14:17:13Z: [Enable support for verify-ca SSL mode for cases when hostname verification is not possible/required](cloudfoundry/bosh#2461) - 2023-08-14T14:20:56Z: [Enable support for verify-ca SSL mode for cases when hostname verification is not possible/required](cloudfoundry/bosh#2462) - 2023-08-17T11:42:28Z: [Use config properties with hyphens to prevent verification clashes in the agent](cloudfoundry/bosh-azure-storage-cli#9) - 2023-08-17T11:53:04Z: [Unifying azure storage cli config across director and agent to prevent conflicts](cloudfoundry/bosh#2465) - 2023-11-14T12:56:35Z: [Truncate downloaded blobs](cloudfoundry/bosh-azure-storage-cli#13) - 2023-11-16T10:19:43Z: [Adding disaster recovery documentation for bosh in case of AZ outages](cloudfoundry/docs-bosh#815) - 2023-12-18T14:27:56Z: [set server-side timeout for GET requests](cloudfoundry/bosh-azure-storage-cli#14) - 2024-02-09T16:27:17Z: [retry CPI calls in case of SSL_read errors as well](cloudfoundry/bosh-azure-cpi-release#689) - 2024-02-16T12:56:19Z: [fix failing unit tests](cloudfoundry/bosh-azure-cpi-release#690) - 2024-04-24T12:44:52Z: [Revert ssl read workaround](cloudfoundry/bosh-azure-cpi-release#694) - 2024-04-24T15:23:13Z: [support swift tempURLs](cloudfoundry/bosh-s3cli#37) - 2024-05-02T13:58:07Z: [fix swift tempURLs and add unit/integration tests](cloudfoundry/bosh-s3cli#39) ### Code contributions: - 2022-11-15T15:17:30Z: [Initial client offering Put functionality (cloudfoundry#2)](cloudfoundry/bosh-azure-storage-cli@9deb735) - 2022-11-16T07:40:28Z: [initial get functionality (cloudfoundry#3)](cloudfoundry/bosh-azure-storage-cli@77df9a7) - 2022-11-17T15:17:26Z: [initial exists functionality (cloudfoundry#6)](cloudfoundry/bosh-azure-storage-cli@e82b78b) - 2023-05-23T05:52:41Z: [prevent gcc version check for fast_jsonparser for bionic and jammy stemcells](cloudfoundry/bosh-azure-cpi-release@dfc9402) - 2023-05-24T05:05:08Z: [prevent gcc version check for fast_jsonparser for bionic and jammy stemcells](cloudfoundry/bosh-azure-cpi-release@dfc9402) - 2023-07-24T13:57:00Z: [Efficient development with the bosh-agent](cloudfoundry/bosh-agent@3bba88a) - 2023-07-24T14:09:00Z: [Efficient development with the bosh-agent](cloudfoundry/bosh-agent@3bba88a) - 2023-08-14T14:06:32Z: [Enable support for verify-full SSL mode for cases when hostname verification is not possible/required](cloudfoundry/bosh@45a062f) - 2023-08-17T11:39:50Z: [Use config properties with hyphens to prevent verification clashes in the agent](cloudfoundry/bosh-azure-storage-cli@1c46f43) - 2023-08-17T11:45:44Z: [Unifying azure storage cli config across director and agent to prevent conflicts](cloudfoundry/bosh@d5cc637) - 2023-11-16T10:12:44Z: [Adding disaster recovery documentation for bosh in case of AZ outages](cloudfoundry/docs-bosh@3427377) - 2023-12-06T20:07:15Z: [updated docs](cloudfoundry/docs-bosh@7be327e) - 2023-12-06T20:48:22Z: [updated docs](cloudfoundry/docs-bosh@7be327e) - 2023-12-18T14:21:43Z: [set server-side timeout for GET requests](cloudfoundry/bosh-azure-storage-cli@759862b) - 2023-12-18T15:09:12Z: [include comment for reasoning](cloudfoundry/bosh-azure-storage-cli@749bf5a) - 2023-12-19T09:10:09Z: [add timeout for PUT requests as well](cloudfoundry/bosh-azure-storage-cli@22372ec) - 2023-12-21T12:49:26Z: [include further changes and comments](cloudfoundry/docs-bosh@95ad588) - 2023-12-21T14:02:00Z: [include further changes and comments](cloudfoundry/docs-bosh@95ad588) - 2024-02-09T16:16:21Z: [retry CPI calls in case of SSL_read errors as well](cloudfoundry/bosh-azure-cpi-release@f291148) - 2024-02-13T12:49:12Z: [fixed typo](cloudfoundry/bosh-azure-cpi-release@d15f44a) - 2024-02-15T12:27:32Z: [make the changes more readable](cloudfoundry/bosh-azure-cpi-release@f7dd849) - 2024-02-16T12:54:10Z: [fix failing unit tests](cloudfoundry/bosh-azure-cpi-release@28f4ab3) - 2024-04-17T14:08:21Z: [support swift tempURLs](cloudfoundry/bosh-s3cli@c6fc710) - 2024-04-24T12:07:12Z: [Revert "fix failing unit tests"](cloudfoundry/bosh-azure-cpi-release@7a2e45f) - 2024-04-24T12:07:31Z: [Revert "Remove retry logic"](cloudfoundry/bosh-azure-cpi-release@1a6dad7) - 2024-04-24T12:39:47Z: [Revert "make the changes more readable"](cloudfoundry/bosh-azure-cpi-release@913ec21) - 2024-04-24T12:40:07Z: [Revert "Ignore EOF Error during ssl_read"](cloudfoundry/bosh-azure-cpi-release@7e8817c) - 2024-04-24T12:40:22Z: [Revert "fixed typo"](cloudfoundry/bosh-azure-cpi-release@b3f1c99) - 2024-04-24T12:40:33Z: [Revert "retry CPI calls in case of SSL_read errors as well"](cloudfoundry/bosh-azure-cpi-release@0b1b2bc) - 2024-05-02T13:43:27Z: [fix swift tempURLs and add unit/integration tests](cloudfoundry/bosh-s3cli@d3389dd) - 2024-05-02T14:01:48Z: [fix swift tempURLs and add unit/integration tests](cloudfoundry/bosh-s3cli@d3389dd)
Recently, intermittently broken connections, most probably due to IaaS network issues have been met with SSL_read errors as follows with various CPI calls (atleast noticed for delete_vm and attach_disk calls):
ERROR -- DirectorJobRunner: Unknown CPI error 'Unknown' with message 'SSL_read: unexpected eof while reading' in 'delete_vm' CPI method
On a bit of digging, it seems that almost all languages except ruby have implemented workarounds to meet with this common OpenSSL3 error (see this issue). At the same time, the azure-cpi has implemented retries for OpenSSL::SSL::SSLError class, however, this is only true for 'SSL_connect' errors (see the previous implementation in the PR).
With this PR, to circumvent such network issues, we also retry in case of 'SSL_read' errors.