Add ability to remove overflow workers after a delay #83
zugolosian wants to merge 12 commits into devinus:master
Conversation
When workers are expensive to start and transactions are quick, terminating every overflow worker as soon as it is checked back in is very costly. This change allows delaying the termination of overflow workers during peak load, which alleviates worker churn.
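For illustration only, here is a hedged sketch of how such an option might look in a pool specification; `overflow_ttl` is an assumed option name, and the other arguments follow poolboy's usual `start_link` style.

```erlang
%% Hypothetical pool configuration: size and max_overflow are standard
%% poolboy options; overflow_ttl (milliseconds) is the assumed name of the
%% proposed option that keeps idle overflow workers alive for a grace
%% period instead of terminating them on every checkin.
PoolArgs = [{name, {local, storage_pool}},
            {worker_module, storage_worker},
            {size, 10},
            {max_overflow, 20},
            {overflow_ttl, 30000}],
WorkerArgs = [],
{ok, Pool} = poolboy:start_link(PoolArgs, WorkerArgs).
```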
+1

+1 poolboy behaves badly when spawning/killing new workers is rather expensive. I'm using poolboy with pgapp[1], and the Erlang node becomes unstable when the pool overflows.
@yjh0502 I'd be interested to know if the fork I made works for your use case.

I don't work with Erlang any longer (sadly), but would happily switch epgsql to use a fork of poolboy that handles things this way if it's something that gets maintained long term.

I'm going to branch off of this and merge it with some improvements.

Awesome, thanks!
@zugolosian I've been giving this a lot of review and I'm not sure the logic is quite right. Could you walk me through some of it? I'm specifically worried about: https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR360

Why does the overflow stay the same here: https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR365
@devinus We should start a timer, add it to the table of workers to reap, and put the worker back in the list of available workers. You were right to be confused, though. It should look more like:

I've added some more tests and made the above changes, along with the other ones you'd made in your branch.
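The snippet referenced above isn't preserved in this thread. As an illustration only, here is a hedged sketch of the checkin path being described (start a reap timer, remember it, return the worker to the idle list), with field names such as `overflow_ttl` and `reap_timers` assumed rather than taken from the fork:

```erlang
%% Illustrative sketch, not the fork's actual code. When an overflow worker
%% is checked in, start a reap timer, record it, and keep the worker in the
%% idle list instead of terminating it immediately.
handle_checkin(Pid, State = #state{workers = Workers,
                                   overflow = Overflow,
                                   overflow_ttl = Ttl,
                                   reap_timers = Timers}) when Overflow > 0 ->
    TRef = erlang:send_after(Ttl, self(), {reap_worker, Pid}),
    State#state{workers = [Pid | Workers],
                reap_timers = maps:put(Pid, TRef, Timers)};
handle_checkin(Pid, State = #state{workers = Workers}) ->
    %% No overflow: just return the worker to the idle list as usual.
    State#state{workers = [Pid | Workers]}.
```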
@zugolosian To sort of walk through the scenarios:
|
|
|
@devinus Let me know if you have any further questions. Cheers

FYI, we've been running the above fork in production for 2 months now without issue.

@zugolosian Is it efficient to start and cancel a timer on every overflow worker checkout?
That's a good idea; I never even considered it. I think in most cases people using a TTL will be connecting to an external resource, so the cost of starting and stopping a timer won't be significant. Another thing is that if you reap workers at an interval, you immediately limit the granularity of your TTL to the reap interval, which you'd have to expose to the user. I also suspect that with a large number of overflow workers you could get into a state where poolboy spends all its time reaping. I don't know how efficient it is to check the age of all overflow workers at an interval versus waiting for timers to expire, but a timer seems like it will have more predictable behaviour to me. Is there a use case where you see the timer implementation being a problem?
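For what it's worth, cancelling a one-shot timer is a cheap, constant-time operation. A hedged sketch of the checkout side, using the same assumed field names as the earlier sketch:

```erlang
%% Illustrative sketch: when an overflow worker is checked out again,
%% cancel its pending reap timer so it is not terminated while in use.
%% A reap message that already landed in the mailbox still has to be
%% flushed; that case is covered in a later sketch.
cancel_reap(Pid, State = #state{reap_timers = Timers}) ->
    case maps:take(Pid, Timers) of
        {TRef, Rest} ->
            erlang:cancel_timer(TRef),
            State#state{reap_timers = Rest};
        error ->
            State
    end.
```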
@zugolosian @devinus From my point of view it is much more efficient to process worker TTLs in a single pass when the numbers are large. In your case, a large number of workers spawns a large number of timers, each sending a rip_worker message to the poolboy process, which can flood its mailbox. As you can see from zugolosian@8f8f5fe#diff-96c7a4d851dcecda493caf816793b18fR377. Should I create a PR to devinus/poolboy?
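For comparison, a hedged sketch of the interval-based alternative being suggested: one recurring sweep that terminates overflow workers idle longer than the TTL, instead of one timer per worker. Names like `last_used` and `sweep_interval` are assumptions for illustration.

```erlang
%% Illustrative sketch of periodic reaping. last_used maps each idle
%% overflow worker to the monotonic time at which it was checked in.
handle_info(sweep, State = #state{last_used = LastUsed,
                                  overflow_ttl = Ttl,
                                  sweep_interval = Interval,
                                  supervisor = Sup}) ->
    Now = erlang:monotonic_time(millisecond),
    Expired = [Pid || {Pid, T} <- maps:to_list(LastUsed), Now - T >= Ttl],
    %% Dismiss every expired overflow worker in a single pass.
    %% (A real implementation would also remove them from the idle list.)
    lists:foreach(fun(Pid) -> supervisor:terminate_child(Sup, Pid) end, Expired),
    erlang:send_after(Interval, self(), sweep),
    {noreply, State#state{last_used = maps:without(Expired, LastUsed)}}.
```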
Added function to return richer pool status info
Added tests for full_status
* Ensure a reap message already in the mailbox is flushed when cancelling a reap (see the sketch after this list)
* Reap shouldn't be touching monitors; they are just for the owner of checked-out workers
Don't reap workers that are checked out again
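The first commit note above refers to a standard Erlang pitfall: after erlang:cancel_timer/1 the timeout message may already have been delivered to the mailbox, so it has to be flushed. A hedged sketch of that idiom, with the message shape {reap_worker, Pid} assumed as in the earlier sketches:

```erlang
%% Cancel a reap timer and discard the reap message if it was already
%% delivered before the cancel took effect.
cancel_and_flush(Pid, TRef) ->
    case erlang:cancel_timer(TRef) of
        false ->
            %% Timer already fired; the message may be waiting in the mailbox.
            receive
                {reap_worker, Pid} -> ok
            after 0 -> ok
            end;
        _TimeLeftMs ->
            ok
    end.
```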
Hi,
I'm aware you've rejected similar proposals in the past, but my use case is slightly different. Here goes:
We're using poolboy to pool Python workers in combination with Erlang ports. The reason for this is that we write software that manages network-attached storage. When network-attached storage becomes unresponsive, as it does from time to time, you need to kill the process for the kernel to clean up properly. We chose Python workers because, among other things, their path manipulation libraries are very good.
When load testing with poolboy we found that as soon as the pool went into overflow, CPU usage on the machine went through the roof because of high worker churn. Having our workers start in an unconnected state isn't as useful as having a pool that optionally keeps workers around for a while under peak load, because starting Python workers is always expensive; to provide good latency when not under load we'd otherwise have to dynamically spin up small non-overflow pools or something similar. Also, we run more than one type of worker on our servers, and implementing the "disconnected" logic in many places when it could be done in one seemed a bit silly.
I'm interested in hearing your thoughts on this.
Thanks