Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove a connection if it receives any messages while in the pool #273

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

keathley
Copy link
Collaborator

Inspired by #272 I started looking into issues where idle connections are closed but not removed from the pool. Deep down I still feel like we need a packet capture to really suss out what's going on. Even if the pool hadn't processed a message telling us the tcp socket was closed, we should get an einval error when we attempted to put the connection into passive mode (at least in my testing on macOS).

That said, it occurred to me that a connection sitting in the pool should not be receiving data (since its finished the requests it needed to make and is back in the pool). Thus we can treat any message intended for that connection as an error and remove it from the pool.

Let me know what you think @sneako or if I missed something here.

@sneako
Copy link
Owner

sneako commented May 23, 2024

Hey @keathley ! Thank you so much for taking a lot at this. That definitely makes a lot of sense to me and seems pretty obvious in retrospect.

I can deploy this PR to a production environment that makes extremely heavy use of Finch H1 pools and test out this hypothesis.

I'll report back with results ASAP!

@sneako
Copy link
Owner

sneako commented May 24, 2024

I deployed this to a single node in a cluster where I see occasional :closed errors.

The good news is that I am not noticing any negative side effects of this change.

On the other hand, there doesn't seem to be any difference in the rate of :closed errors. Below are two charts, one showing the :closed error count for the entire cluster, and the other showing the error count just on the test node.

The vertical gray bar just after 13:00 marks when I deployed this change to the test node.

image

Now, each of these nodes was making > 7k Finch H1 requests per second during this test, so this :closed error rate is really quite negligible for me, but I assume the error rates some users are experiencing are much higher, so maybe this change will have more of an impact for those users.

@sneako
Copy link
Owner

sneako commented Jun 3, 2024

Definitely interested to see if this change helps anyone else who is seeing a higher rate of :closed errors. If anyone experiencing this issue could test out this PR and report back that would be greatly appreciated. As stated above, I can attest that no negative side effects were observed when I deployed this change to prod, so it should be a completely safe test to run for anyone interested.

@elijahkim
Copy link

Hi! Coming from wojtekmach/req#312 (comment). I tried running this branch in a preview environment. I made an API request, waited a bit, and made another one which resulted in this error. I'm going to obfuscate some stuff but this is what I did.

Interactive Elixir (1.16.1) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> API.print("ID")
{:ok, <<37, 80,  ...>>}
iex(2)> NaiveDateTime.utc_now()
~N[2024-06-04 22:52:06.582967]
iex(3)> API.print("ID")

22:59:12.668 [error] retry: got exception, will retry in 1000ms, 3 attempts left

22:59:12.668 [error] ** (Req.TransportError) socket closed
{:ok, <<37, 80,  ...>>}
iex(4)> NaiveDateTime.utc_now()
~N[2024-06-04 22:59:30.035642]
iex(5)> API.print("ID")
{:ok, <<37, 80, ...>>}
iex(6)> NaiveDateTime.utc_now()
~N[2024-06-04 23:03:43.808728]
iex(7)> API.print("ID")

23:10:13.518 [error] retry: got exception, will retry in 1000ms, 3 attempts left

23:10:13.518 [error] ** (Req.TransportError) socket closed
{:ok, <<37, 80,  ...>>}
iex(8)> NaiveDateTime.utc_now()
~N[2024-06-04 23:10:21.185392]
iex(9)> API.print("ID")
{:ok, <<37, 80,  ...>>}
iex(10)> Hiive.Persona.Inquiries.print("ID")
{:ok, <<37, 80,  ...>>}

subsequent calls without waiting don't fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants