
Add ability to remove overflow workers after a delay #83

Closed
zugolosian wants to merge 12 commits into devinus:master from zugolosian:master

Conversation

@zugolosian

Hi,

I'm aware you've rejected similar proposals in the past, but my use case is slightly different. Here goes:

We're using poolboy to pool Python workers in combination with erlport. The reason for this is that we write software that manages network-attached storage, and when network-attached storage becomes unresponsive, as it does from time to time, you need to kill the process for the kernel to clean up properly. We chose Python workers because, among other things, their path manipulation libraries are very good.

When load testing with poolboy, we found that as soon as the pool went into overflow, CPU on the machine went through the roof because of high worker churn. Having our workers start in an unconnected state isn't as useful as having a pool that optionally keeps workers around for a while under peak load, because starting Python workers is always expensive. That means that to provide good latency when not under load we'd have to dynamically spin up small non-overflow pools or something similar. Also, we run more than one type of worker on our servers, and implementing the "disconnected" logic in many places when it could be done in one seemed a bit silly.

I'm interested in hearing your thoughts on this.

Thanks

David Leach added 2 commits November 20, 2015 22:49
When workers are expensive to start and transactions are quick, killing
all workers that are checked in when in overflow is very expensive.
This change allows delaying the termination of overflow workers when
there is peak load and alleviates worker churn.
@kevinbombadil

+1

@yjh0502

yjh0502 commented Jan 2, 2016

+1. poolboy behaves badly when spawning/killing new workers is rather expensive. I'm using poolboy with pgapp [1], and the Erlang node becomes unstable when the pool overflows.

[1] https://github.com/epgsql/pgapp

@zugolosian
Author

@yjh0502 I'd be interested to know if the fork I made works for your use case.

@davidw
Contributor

davidw commented Jan 6, 2016

I don't work with Erlang any longer (sadly), but would happily switch epgsql to use a fork of poolboy that handles things this way if it's something that gets maintained long term.

@devinus
Owner

devinus commented Jan 6, 2016

I'm going to branch off of this and merge it with some improvements.

@zugolosian
Author

Awesome, thanks!

@devinus
Owner

devinus commented Jan 17, 2016

@zugolosian I've been giving this a lot of review and I'm not sure the logic is quite right. Could you walk me through some of it?

I'm specifically worried about: https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR360

Why does the overflow stay the same here: https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR365

@zugolosian
Author

@devinus
https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR360
When we're in overflow and we have set an overflow ttl:
{empty, Empty} when Overflow > 0, OverflowTtl > 0 ->

We should start a timer, add it to the table of workers to reap, and put the worker back in the list of available workers.
Overflow (https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR365) shouldn't be decremented until the workers have been stopped. The workers will be stopped if they're not checked out within the overflow_ttl time period. We want anyone using the pool status to see that we're in a state of overflow, but also that there are workers available for immediate use.

You were right to be confused, though. It should look more like:

        {empty, Empty} when Overflow > 0, OverflowTtl > 0 ->
            {ok, Tref} =
                timer:send_after(OverflowTtl, self(), {reap_worker, Pid}),
            true = ets:insert(WorkersToReap, {Pid, Tref}),
            Workers = case Strategy of
                          lifo -> [Pid | State#state.workers];
                          fifo -> State#state.workers ++ [Pid]
                      end,
            State#state{workers = Workers, waiting = Empty};

I've added some more tests and made the above changes and the other ones you'd made in your branch.
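
For reference, here is a rough sketch of the matching reap path. It is illustrative only, not the exact diff: the workers_to_reap state field and the dismiss_worker/1 helper are placeholder names. When the {reap_worker, Pid} message from the timer:send_after/3 call above fires and the worker is still idle, the worker is removed from the available list and the reap table, stopped, and only then is overflow decremented:

    %% Illustrative sketch; field and helper names are placeholders.
    handle_info({reap_worker, Pid}, State = #state{workers = Workers,
                                                   overflow = Overflow,
                                                   workers_to_reap = WorkersToReap}) ->
        case lists:member(Pid, Workers) of
            true ->
                %% Still idle: drop it from the available list and the reap
                %% table, stop it, and only now decrement overflow.
                true = ets:delete(WorkersToReap, Pid),
                ok = dismiss_worker(Pid),  %% placeholder for stopping via the pool supervisor
                {noreply, State#state{workers = lists:delete(Pid, Workers),
                                      overflow = Overflow - 1}};
            false ->
                %% The worker was checked out again before the timer fired,
                %% so there is nothing to reap.
                {noreply, State}
        end.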

@devinus
Owner

devinus commented Jan 25, 2016

@zugolosian To sort of walk through the scenarios (a checkout-side sketch follows the list):

  • Checkout happens and overflow TTL exists, unset any timer before returning worker
  • Checkin happens and overflow TTL exists and we're in overflow but beyond max overflow, the worker should be dropped
  • Checkin happens and overflow TTL exists and we're not beyond max overflow, the timer should be set, overflow increased, and worker returned to pool
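
A minimal sketch of the checkout-side cancellation from the first scenario, assuming the same WorkersToReap table and {reap_worker, Pid} message as the snippet earlier in the thread (the function name is a placeholder). It also flushes a reap message that may already be sitting in the mailbox, which one of the later commits calls out:

    %% Placeholder helper: cancel a pending reap timer for a worker that is
    %% about to be checked out, and flush any {reap_worker, Pid} message
    %% that has already been delivered.
    cancel_reap_timer(Pid, WorkersToReap) ->
        case ets:lookup(WorkersToReap, Pid) of
            [{Pid, Tref}] ->
                timer:cancel(Tref),
                receive
                    {reap_worker, Pid} -> ok
                after 0 -> ok
                end,
                true = ets:delete(WorkersToReap, Pid),
                ok;
            [] ->
                ok
        end.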

@zugolosian
Author

@devinus

* Checkout happens and overflow TTL exists, unset any timer before returning worker
Correct.
* Checkin happens and overflow TTL exists and we're in overflow but beyond max overflow, the worker should be dropped
We should never be in a state where we're beyond max overflow; the logic in the checkouts doesn't allow that to happen.
* Checkin happens and overflow TTL exists and we're not beyond max overflow, the timer should be set, overflow increased, and worker returned to pool
Overflow is only increased when an overflow worker has been started as part of the checkout. We're not starting any extra workers. For example (a matching pool configuration is sketched after this list):

  • We've got a pool size of 1, a max_overflow of 1, and an overflow_ttl of 300 milliseconds
  • We check out a worker; overflow is zero
  • We check out another; overflow is now 1
  • We check in a worker. We start an erlang:send_after to schedule the worker to be reaped in 300 milliseconds. It is added back to the pool of workers and overflow is not touched; it is still set to 1.
    • If someone attempts a checkout within 300 milliseconds they will get the worker that's in the pool and the timer will be canceled. When the worker is checked in again a new timer is started.
    • If no one attempts a checkout within 300 milliseconds the worker will be dropped, and only then will the overflow number be decreased.
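
For concreteness, the walkthrough above corresponds to a pool configured roughly like this, where overflow_ttl is the option this change introduces (the pool and worker module names are placeholders):

    %% Hypothetical configuration matching the walkthrough above.
    PoolArgs = [{name, {local, example_pool}},
                {worker_module, example_worker},
                {size, 1},
                {max_overflow, 1},
                {overflow_ttl, 300}],  %% milliseconds
    WorkerArgs = [],
    {ok, Pool} = poolboy:start_link(PoolArgs, WorkerArgs).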

@zugolosian
Author

@devinus Let me know if you have any further questions.

Cheers

@zugolosian
Author

FYI, we've been running the above fork in production for 2 months now without issue.

@comtihon

comtihon commented Jun 6, 2016

@zugolosian is it efficient to start and cancel a timer on every overflow worker checkout?
I mean, under heavy load an overflow worker can process lots of requests one after another, so there will be a lot of unneeded timer creation and cancellation.
Maybe it would be better to remember the last time of use for each overflow worker and kill old unused workers with a per-second timer check?
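
A minimal sketch of that alternative (purely illustrative, not the code in any fork): each idle overflow worker carries a last-used timestamp, and a single recurring check reaps the ones that have been idle longer than the TTL.

    %% Illustrative only: IdleWorkers is assumed to be a list of
    %% {Pid, LastUsedMs} pairs; poolboy itself would dismiss workers
    %% through its supervisor rather than with exit/2.
    reap_idle(IdleWorkers, OverflowTtl) ->
        Now = erlang:monotonic_time(millisecond),
        {Expired, Fresh} =
            lists:partition(fun({_Pid, LastUsed}) ->
                                    Now - LastUsed >= OverflowTtl
                            end, IdleWorkers),
        [exit(Pid, shutdown) || {Pid, _} <- Expired],
        %% The caller reschedules the next check, e.g. with
        %% erlang:send_after(ReapInterval, self(), reap_idle).
        Fresh.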

@zugolosian
Author

@comtihon

That's a good idea, I never even considered it.

I think in most cases people using a TTL will be connecting to an external resource, and the cost of starting and stopping a timer won't be significant. Another thing is that if you reap workers at an interval, you immediately limit the granularity of your TTL to the reap interval, which you'd have to expose to the user.

I also suspect that if you have a large number of overflow workers you could get into a state where poolboy spends all its time reaping. I don't know how efficient it would be to check the age of all overflow workers at an interval versus waiting for timers to expire. Timers seem like they will have more predictable behaviour to me.

Is there a use case where you see the timer implementation being a problem?

@comtihon

comtihon commented Jun 7, 2016

@zugolosian @devinus
I changed the algorithm to one that is more effective from my point of view.
As for the reap interval, it is one second, or equal to overflow_ttl, as you can see at zugolosian@8f8f5fe#diff-96c7a4d851dcecda493caf816793b18fR144.
This reduces the limit on the granularity of the TTL.

From my point of view it is much more efficient to process the workers' TTL in one call when the numbers are large. In your case, a large number of workers spawns a large number of timers, which send a large number of reap_worker messages to the poolboy process. That can flood it.

As you can see from zugolosian@8f8f5fe#diff-96c7a4d851dcecda493caf816793b18fR377, I iterate only over idle workers and only up to the overflow number, so I consider this to be efficient.

Should I create a PR to devinus/poolboy?
Here is the PR to zugolosian/poolboy: https://github.com/zugolosian/poolboy/pull/1

David Leach and others added 8 commits August 6, 2016 18:50
Added function to return richer pool status info
* Ensure reap message already in mailbox is flushed when cancelling reap
* Reap shouldn't be touching monitors, they are just for owner of checked out workers
Don't reap workers that are checked out again
