Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retry CPI calls in case of SSL_read errors as well #689

Merged
merged 5 commits into from
Feb 16, 2024

Conversation

anshrupani
Copy link
Contributor

Recently, intermittently broken connections, most probably due to IaaS network issues have been met with SSL_read errors as follows with various CPI calls (atleast noticed for delete_vm and attach_disk calls):

ERROR -- DirectorJobRunner: Unknown CPI error 'Unknown' with message 'SSL_read: unexpected eof while reading' in 'delete_vm' CPI method

On a bit of digging, it seems that almost all languages except ruby have implemented workarounds to meet with this common OpenSSL3 error (see this issue). At the same time, the azure-cpi has implemented retries for OpenSSL::SSL::SSLError class, however, this is only true for 'SSL_connect' errors (see the previous implementation in the PR).

With this PR, to circumvent such network issues, we also retry in case of 'SSL_read' errors.

@metskem
Copy link

metskem commented Feb 13, 2024

We have the exact same issues, during delete_vm, set_vm_metadata and more calls, and it is blocking us from doing further upgrades.

@anshrupani
Copy link
Contributor Author

I guess I now have a more clear picture of what's happening.

Weird broken connections can occur unexpectedly anytime. There is a difference how OpenSSL 1.x dealt with such broken connections resulting in unexpected EOFs and how OpenSSL 3.x deals with such issues. It seems that OpenSSL 3.x has a much more strict behavior about unexpected EOF. In a recent PR towards the bosh director, the OpenSSL version was bumped to 3.2.0, which seems to have woken up the giant which was always there sleeping, atleast with Azure. Other languges such as Python already have workarounds to deal with this issue intrinsically by default, but unfortunately that's not the case with ruby. There are further detailed explanations regarding the issue and how people have dealt with such issues:

  1. https://stackoverflow.com/questions/76183622/since-a-ruby-container-upgrade-we-expirience-a-lot-of-opensslsslsslerror
  2. OpenSSL::SSL::SSLError: SSL_read: unexpected eof while reading redis/redis-rb#1106

In short, I feel retries should help in general in such cases and we should first try implementing retries. If not, people also suggest raising OpenSSL::SSL::OP_IGNORE_UNEXPECTED_EOF flag as a workaround, but that comes with other possible concerns.

@@ -492,7 +492,7 @@ def apply_retry_policy(retry_data)
if retry_data[:error].is_a?(OpenSSL::SSL::SSLError) || retry_data[:error].is_a?(OpenSSL::X509::StoreError)
error_message = retry_data[:error].inspect

if error_message.include?(Bosh::AzureCloud::Helpers::ERROR_OPENSSL_RESET)
if [Bosh::AzureCloud::Helpers::ERROR_OPENSSL_RESET, Bosh::AzureCloud::Helpers:ERROR_OPENSSL_EOF_READ].any? { |error| error_message.include?(error) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anshrupani I believe you are missing an additional colon between Helpers and ERROR_OPENSSL_EOF_READ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you're right, thanks. we're still testing the PR on our environment though

@lodener
Copy link
Contributor

lodener commented Feb 13, 2024

This issue made using the Azure CPI near impossible for us, as we'd run into constant SSL_read errors.

When running a release with these changes, things have been working much better so far, turning the errors into retries (with the CPI log showcasing that this is happening quite a bit more than we'd like, but at least it doesn't kill the entire deployment anymore).

@anshrupani anshrupani marked this pull request as ready for review February 15, 2024 08:15
@jpalermo
Copy link
Member

If not, people also suggest raising OpenSSL::SSL::OP_IGNORE_UNEXPECTED_EOF flag as a workaround, but that comes with other possible concerns.

Do you know what the concerns might be @anshrupani?

@mvach
Copy link
Contributor

mvach commented Feb 15, 2024

I just remove this whole retry logic from this fix and it is still working. Let's test it again tomorrow and then get that in.

@anshrupani
Copy link
Contributor Author

If not, people also suggest raising OpenSSL::SSL::OP_IGNORE_UNEXPECTED_EOF flag as a workaround, but that comes with other possible concerns.

Do you know what the concerns might be @anshrupani?

Openssl.org mentions possibility of truncation attacks when this option is set and the protocol running over TLS doesn't handle such vulnerabilities. However, we should also consider that before OpenSSL3, having this option set was anyhow the case by default while we used Net::HTTP, and that other languages still have this option set by default for OpenSSL >= 3.

@ramonskie ramonskie merged commit 3cc59df into cloudfoundry:master Feb 16, 2024
1 check passed
ramonskie added a commit that referenced this pull request Feb 16, 2024
fix failing unit tests related to PR #689
max-soe added a commit to max-soe/community that referenced this pull request Jun 20, 2024
# Foundational Infrastructure: VM deployment lifecycle (BOSH) Contributions
### PRs Commented on/Reviewed:
- 2024-05-02T00:33:35Z: [Refactor the `client` package](cloudfoundry/bosh-s3cli#38)
### Issues that may be relevant:
- 2023-04-28T11:54:45Z: [BOSH Agent not reachable due to issues with blobstore settings](cloudfoundry/bosh-aws-cpi-release#149)
- 2023-05-15T09:53:53Z: [Update CONTRIBUTING.md](cloudfoundry/bosh#2443)
- 2023-05-16T08:07:09Z: [support instance tags and include relevant specs](cloudfoundry/bosh#2444)
- 2023-05-24T05:01:42Z: [Prevent gcc version check for non xenial stemcells](cloudfoundry/bosh-azure-cpi-release#685)
- 2023-05-31T17:49:02Z: [Support instance group tags](cloudfoundry/bosh#2448)
- 2023-07-24T14:07:16Z: [Efficient development with the bosh-agent](cloudfoundry/bosh-agent#312)
- 2023-08-14T14:17:13Z: [Enable support for verify-ca SSL mode for cases when hostname verification is not possible/required](cloudfoundry/bosh#2461)
- 2023-08-14T14:20:56Z: [Enable support for verify-ca SSL mode for cases when hostname verification is not possible/required](cloudfoundry/bosh#2462)
- 2023-08-17T11:42:28Z: [Use config properties with hyphens to prevent verification clashes in the agent](cloudfoundry/bosh-azure-storage-cli#9)
- 2023-08-17T11:53:04Z: [Unifying azure storage cli config across director and agent to prevent conflicts](cloudfoundry/bosh#2465)
- 2023-11-14T12:56:35Z: [Truncate downloaded blobs](cloudfoundry/bosh-azure-storage-cli#13)
- 2023-11-16T10:19:43Z: [Adding disaster recovery documentation for bosh in case of AZ outages](cloudfoundry/docs-bosh#815)
- 2023-12-18T14:27:56Z: [set server-side timeout for GET requests](cloudfoundry/bosh-azure-storage-cli#14)
- 2024-02-09T16:27:17Z: [retry CPI calls in case of SSL_read errors as well](cloudfoundry/bosh-azure-cpi-release#689)
- 2024-02-16T12:56:19Z: [fix failing unit tests](cloudfoundry/bosh-azure-cpi-release#690)
- 2024-04-24T12:44:52Z: [Revert ssl read workaround](cloudfoundry/bosh-azure-cpi-release#694)
- 2024-04-24T15:23:13Z: [support swift tempURLs](cloudfoundry/bosh-s3cli#37)
- 2024-05-02T13:58:07Z: [fix swift tempURLs and add unit/integration tests](cloudfoundry/bosh-s3cli#39)
### Code contributions:
- 2022-11-15T15:17:30Z: [Initial client offering Put functionality (cloudfoundry#2)](cloudfoundry/bosh-azure-storage-cli@9deb735)
- 2022-11-16T07:40:28Z: [initial get functionality (cloudfoundry#3)](cloudfoundry/bosh-azure-storage-cli@77df9a7)
- 2022-11-17T15:17:26Z: [initial exists functionality (cloudfoundry#6)](cloudfoundry/bosh-azure-storage-cli@e82b78b)
- 2023-05-23T05:52:41Z: [prevent gcc version check for fast_jsonparser for bionic and jammy stemcells](cloudfoundry/bosh-azure-cpi-release@dfc9402)
- 2023-05-24T05:05:08Z: [prevent gcc version check for fast_jsonparser for bionic and jammy stemcells](cloudfoundry/bosh-azure-cpi-release@dfc9402)
- 2023-07-24T13:57:00Z: [Efficient development with the bosh-agent](cloudfoundry/bosh-agent@3bba88a)
- 2023-07-24T14:09:00Z: [Efficient development with the bosh-agent](cloudfoundry/bosh-agent@3bba88a)
- 2023-08-14T14:06:32Z: [Enable support for verify-full SSL mode for cases when hostname verification is not possible/required](cloudfoundry/bosh@45a062f)
- 2023-08-17T11:39:50Z: [Use config properties with hyphens to prevent verification clashes in the agent](cloudfoundry/bosh-azure-storage-cli@1c46f43)
- 2023-08-17T11:45:44Z: [Unifying azure storage cli config across director and agent to prevent conflicts](cloudfoundry/bosh@d5cc637)
- 2023-11-16T10:12:44Z: [Adding disaster recovery documentation for bosh in case of AZ outages](cloudfoundry/docs-bosh@3427377)
- 2023-12-06T20:07:15Z: [updated docs](cloudfoundry/docs-bosh@7be327e)
- 2023-12-06T20:48:22Z: [updated docs](cloudfoundry/docs-bosh@7be327e)
- 2023-12-18T14:21:43Z: [set server-side timeout for GET requests](cloudfoundry/bosh-azure-storage-cli@759862b)
- 2023-12-18T15:09:12Z: [include comment for reasoning](cloudfoundry/bosh-azure-storage-cli@749bf5a)
- 2023-12-19T09:10:09Z: [add timeout for PUT requests as well](cloudfoundry/bosh-azure-storage-cli@22372ec)
- 2023-12-21T12:49:26Z: [include further changes and comments](cloudfoundry/docs-bosh@95ad588)
- 2023-12-21T14:02:00Z: [include further changes and comments](cloudfoundry/docs-bosh@95ad588)
- 2024-02-09T16:16:21Z: [retry CPI calls in case of SSL_read errors as well](cloudfoundry/bosh-azure-cpi-release@f291148)
- 2024-02-13T12:49:12Z: [fixed typo](cloudfoundry/bosh-azure-cpi-release@d15f44a)
- 2024-02-15T12:27:32Z: [make the changes more readable](cloudfoundry/bosh-azure-cpi-release@f7dd849)
- 2024-02-16T12:54:10Z: [fix failing unit tests](cloudfoundry/bosh-azure-cpi-release@28f4ab3)
- 2024-04-17T14:08:21Z: [support swift tempURLs](cloudfoundry/bosh-s3cli@c6fc710)
- 2024-04-24T12:07:12Z: [Revert "fix failing unit tests"](cloudfoundry/bosh-azure-cpi-release@7a2e45f)
- 2024-04-24T12:07:31Z: [Revert "Remove retry logic"](cloudfoundry/bosh-azure-cpi-release@1a6dad7)
- 2024-04-24T12:39:47Z: [Revert "make the changes more readable"](cloudfoundry/bosh-azure-cpi-release@913ec21)
- 2024-04-24T12:40:07Z: [Revert "Ignore EOF Error during ssl_read"](cloudfoundry/bosh-azure-cpi-release@7e8817c)
- 2024-04-24T12:40:22Z: [Revert "fixed typo"](cloudfoundry/bosh-azure-cpi-release@b3f1c99)
- 2024-04-24T12:40:33Z: [Revert "retry CPI calls in case of SSL_read errors as well"](cloudfoundry/bosh-azure-cpi-release@0b1b2bc)
- 2024-05-02T13:43:27Z: [fix swift tempURLs and add unit/integration tests](cloudfoundry/bosh-s3cli@d3389dd)
- 2024-05-02T14:01:48Z: [fix swift tempURLs and add unit/integration tests](cloudfoundry/bosh-s3cli@d3389dd)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

7 participants