Dan/bump pq #2858
Conversation
To explain the approach here: the automatic reconnect and retry logic in lib/pq is somewhat dangerous as-is (lib/pq#939), and attempts to bring back the previous retrying behavior petered out because of safety concerns (lib/pq#871). So the best path forward seems to be to assume the retrying behavior is not coming back in lib/pq, or at least not soon.

I also think it's fine for Automate to return one 500 after a database disconnect/reconnect. If you agree, then the sensible thing to do is to make our tests a little more forgiving of the occasional 500 that we would expect to see from the database resets.
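For context on why the retry isn't coming back: the driver can't tell whether a failed statement is safe to re-run. A conservative application-level pattern is to retry only work the caller knows is idempotent. The sketch below is illustrative Go only, not code from this PR; `retryIdempotent` is a hypothetical helper.

```go
package pgutil

import (
	"context"
	"database/sql"
)

// retryIdempotent runs fn and retries exactly once on failure, on the theory
// that the most common failure right after a Postgres restart is drawing a
// stale connection from the pool. Callers must only pass idempotent work;
// that judgment is exactly what the driver can't make for us, which is why
// blanket retrying was dropped from lib/pq.
func retryIdempotent(ctx context.Context, db *sql.DB, fn func(context.Context, *sql.DB) error) error {
	if err := fn(ctx, db); err != nil {
		// One retry only: a second failure more likely means the database is
		// actually down rather than that we hit a dead pooled connection.
		return fn(ctx, db)
	}
	return nil
}
```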
Force-pushed from 7922677 to 0e7ecc8
integration/tests/cluster.sh (outdated)
# Inspec tests are less tolerant of transient 500s; we expect a few of
# those as bad postgres connections get purged from the various services.
# Restart everything before running the tests
docker exec -t "$_frontend1_container_name" "$cli_bin" restart-services
I'm not sure I understand what's happening here. We've just done a deploy. Why does restarting everything help?
The frontends come up with broken Postgres connections, so the InSpec tests get exactly two 500s; see for example https://buildkite.com/chef/chef-automate-master-verify-private/builds/9454#d349a780-69be-4a0f-ac17-10f6ae8f2475. Retrying 500s in the InSpec tests would be the better solution, but it's a lot harder.
For anyone who might stumble on this later, the code comment above is the correct answer: an HAProxy somewhere has the idle timeout set to 5 minutes, which kills the connections on frontend1 during the more than five minutes it takes to deploy frontend2.
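If someone does want to pursue the retry approach later: the real checks are InSpec/shell, but here is a rough Go sketch of the idea, tolerating a bounded number of 500s while the stale connections get purged. The helper name, the error limit, and the one-second backoff are all made up for illustration.

```go
package integration

import (
	"fmt"
	"net/http"
	"time"
)

// getTolerating500s polls url until it gets a non-500 response, allowing up
// to maxServerErrors 500s first. After a deploy we expect at most a couple of
// 500s while each service replaces its dead Postgres connections.
func getTolerating500s(url string, maxServerErrors int) (*http.Response, error) {
	serverErrors := 0
	for {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusInternalServerError {
			return resp, nil
		}
		resp.Body.Close()
		serverErrors++
		if serverErrors > maxServerErrors {
			return nil, fmt.Errorf("got %d 500s in a row from %s", serverErrors, url)
		}
		time.Sleep(1 * time.Second)
	}
}
```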
Force-pushed from d8bf7e7 to 0d35487
@@ -5,3 +5,5 @@ status_port = 10146

[timeouts]
connect = 5
# 12 hours
idle = 43200
Should we wire this up to the config so it's easy to change?
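If it does get wired up, a minimal sketch of the shape it could take is below. This is illustrative Go only: the struct, the field names, and the use of BurntSushi/toml are assumptions, not how Automate's config is actually plumbed.

```go
package config

import (
	"time"

	"github.com/BurntSushi/toml"
)

// Timeouts mirrors the hard-coded [timeouts] section above: connect and idle
// are in seconds (43200 seconds = 12 hours).
type Timeouts struct {
	Connect int `toml:"connect"`
	Idle    int `toml:"idle"`
}

type serviceConfig struct {
	Timeouts Timeouts `toml:"timeouts"`
}

// LoadTimeouts reads the service config and returns the two values as
// durations so callers can change them without touching the template.
func LoadTimeouts(path string) (connect, idle time.Duration, err error) {
	var cfg serviceConfig
	if _, err = toml.DecodeFile(path, &cfg); err != nil {
		return 0, 0, err
	}
	return time.Duration(cfg.Timeouts.Connect) * time.Second,
		time.Duration(cfg.Timeouts.Idle) * time.Second, nil
}
```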
Force-pushed from 2f1161c to c3eeb38
Full set of changes: lib/pq@83612a5...3427c32

Highlights include:
- A fix for a potential deadlock related to commit and rollback
- A QuoteLiteral function that can replace our custom implementation

Signed-off-by: Steven Danna <steve@chef.io>
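On the QuoteLiteral point, usage would look roughly like the sketch below. The query and function here are invented for illustration, and parameterized queries remain preferable wherever they work.

```go
package store

import (
	"fmt"

	"github.com/lib/pq"
)

// buildTagQuery shows the new helpers: pq.QuoteIdentifier for identifiers and
// pq.QuoteLiteral for string literals, instead of a hand-rolled escaper.
func buildTagQuery(table, tag string) string {
	return fmt.Sprintf("SELECT * FROM %s WHERE tag = %s",
		pq.QuoteIdentifier(table), pq.QuoteLiteral(tag))
}
```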
Signed-off-by: Daniel DeLeo <dan@chef.io>
Force-pushed from f39f569 to 7f00294
Force-pushed from ed041a1 to df2e8ff
Force-pushed from df2e8ff to 730b0dd
🔩 Description: What code changed, and why?
Giving it another go to see what happens. I probably won't actively work on it much if it turns out not to be easy, so feel free to take over if you happen upon this PR and need it done.
⛓️ Related Resources
👍 Definition of Done
👟 How to Build and Test the Change
✅ Checklist
📷 Screenshots, if applicable