
Add ability to remove overflow workers after a delay #83

Closed
zugolosian wants to merge 12 commits into devinus:master from zugolosian:master

Conversation

@zugolosian

Hi,

I'm aware you've rejected similar proposals in the past, but my use case is slightly different. Here goes:

We're using poolboy to pool Python workers in combination with erlport. The reason for this is that we write software that manages network-attached storage, and when network-attached storage becomes unresponsive, as it does from time to time, you need to kill the process for the kernel to clean up properly. We chose Python workers because, among other things, their path manipulation libraries are very good.

When load testing with poolboy, we found that as soon as the pool went into overflow, CPU on the machine went through the roof because of high worker churn. Having our workers start in an unconnected state isn't as useful as having a pool that optionally keeps workers around for a while under peak load, because starting Python workers is always expensive. That means that to provide good latency when not under load we'd have to dynamically spin up small non-overflow pools or something similar. Also, we run more than one type of worker on our servers, and implementing the "disconnected" logic in many places when it could be done in one seemed a bit silly.

I'm interested in hearing your thoughts on this.

Thanks

David Leach added 2 commits November 20, 2015 22:49
When workers are expensive to start and transactions are quick, killing
all workers that are checked in when in overflow is very expensive.
This change allows delaying the termination of overflow workers when
there is peak load and alleviates worker churn.
@kevinbombadil

+1

@yjh0502

yjh0502 commented Jan 2, 2016

+1. poolboy behaves badly when spawning/killing new workers is rather expensive. I'm using poolboy with pgapp [1], and the Erlang node becomes unstable when the pool overflows.

[1] https://github.com/epgsql/pgapp

@zugolosian
Author

@yjh0502 I'd be interested to know if the fork I made works for your use case.

@davidw
Contributor

davidw commented Jan 6, 2016

I don't work with Erlang any longer (sadly), but would happily switch epgsql to use a fork of poolboy that handles things this way if it's something that gets maintained long term.

@devinus
Owner

devinus commented Jan 6, 2016

I'm going to branch off of this and merge it with some improvements.

@zugolosian
Author

Awesome, thanks!

@devinus
Owner

devinus commented Jan 17, 2016

@zugolosian I've been giving this a lot of review and I'm not sure the logic is quite right. Could you walk me through some of it?

I'm specifically worried about: https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR360

Why does the overflow stay the same here: https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR365

@zugolosian
Author

@devinus
https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR360
When we're in overflow and we have set an overflow ttl:
{empty, Empty} when Overflow > 0, OverflowTtl > 0 ->

We should start a timer, add it to the table of workers to reap, and put the worker back in the list of available workers.
Overflow (https://github.com/devinus/poolboy/pull/83/files#diff-96c7a4d851dcecda493caf816793b18fR365) shouldn't be decremented until the workers have been stopped. The workers will be stopped if they're not checked out within the overflow_ttl time period. We want anyone using the pool status to see that we're in a state of overflow, but also that there are workers available for immediate use.

You were right to be confused, though. It should look more like:

        {empty, Empty} when Overflow > 0, OverflowTtl > 0 ->
            {ok, Tref} =
                timer:send_after(OverflowTtl, self(), {reap_worker, Pid}),
            true = ets:insert(WorkersToReap, {Pid, Tref}),
            Workers = case Strategy of
                          lifo -> [Pid | State#state.workers];
                          fifo -> State#state.workers ++ [Pid]
                      end,
            State#state{workers = Workers, waiting = Empty};

I've added some more tests and made the above changes and the other ones you'd made in your branch.
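
For reference, here is a rough sketch of the matching reap path. It is illustrative only, not the exact diff: the workers_to_reap state field and the dismiss_worker/1 helper are placeholder names. When the {reap_worker, Pid} message from the timer:send_after/3 call above fires and the worker is still idle, the worker is removed from the available list and the reap table, stopped, and only then is overflow decremented:

    %% Illustrative sketch; field and helper names are placeholders.
    handle_info({reap_worker, Pid}, State = #state{workers = Workers,
                                                   overflow = Overflow,
                                                   workers_to_reap = WorkersToReap}) ->
        case lists:member(Pid, Workers) of
            true ->
                %% Still idle: drop it from the available list and the reap
                %% table, stop it, and only now decrement overflow.
                true = ets:delete(WorkersToReap, Pid),
                ok = dismiss_worker(Pid),  %% placeholder for stopping via the pool supervisor
                {noreply, State#state{workers = lists:delete(Pid, Workers),
                                      overflow = Overflow - 1}};
            false ->
                %% The worker was checked out again before the timer fired,
                %% so there is nothing to reap.
                {noreply, State}
        end.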

@devinus
Owner

devinus commented Jan 25, 2016

@zugolosian To sort of walk through the scenarios (a checkout-side sketch follows the list):

  • Checkout happens and overflow TTL exists, unset any timer before returning worker
  • Checkin happens and overflow TTL exists and we're in overflow but beyond max overflow, the worker should be dropped
  • Checkin happens and overflow TTL exists and we're not beyond max overflow, the timer should be set, overflow increased, and worker returned to pool
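
A minimal sketch of the checkout-side cancellation from the first scenario, assuming the same WorkersToReap table and {reap_worker, Pid} message as the snippet earlier in the thread (the function name is a placeholder). It also flushes a reap message that may already be sitting in the mailbox, which one of the later commits calls out:

    %% Placeholder helper: cancel a pending reap timer for a worker that is
    %% about to be checked out, and flush any {reap_worker, Pid} message
    %% that has already been delivered.
    cancel_reap_timer(Pid, WorkersToReap) ->
        case ets:lookup(WorkersToReap, Pid) of
            [{Pid, Tref}] ->
                timer:cancel(Tref),
                receive
                    {reap_worker, Pid} -> ok
                after 0 -> ok
                end,
                true = ets:delete(WorkersToReap, Pid),
                ok;
            [] ->
                ok
        end.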

@zugolosian
Author

@devinus

* Checkout happens and overflow TTL exists, unset any timer before returning worker
Correct.
* Checkin happens and overflow TTL exists and we're in overflow but beyond max overflow, the worker should be dropped
We should never be in a state where we're beyond max overflow; the logic in the checkouts doesn't allow that to happen.
* Checkin happens and overflow TTL exists and we're not beyond max overflow, the timer should be set, overflow increased, and worker returned to pool
Overflow is only increased when an overflow worker has been started as part of the checkout. We're not starting any extra workers. For example (a matching pool configuration is sketched after this list):

  • We've got a pool size of 1, a max_overflow of 1, and an overflow_ttl of 300 milliseconds
  • We check out a worker; overflow is zero
  • We check out another; overflow is now 1
  • We check in a worker. We start an erlang:send_after to schedule the worker to be reaped in 300 milliseconds. It is added back to the pool of workers and overflow is not touched; it is still set to 1.
    • If someone attempts a checkout within 300 milliseconds they will get the worker that's in the pool and the timer will be canceled. When the worker is checked in again a new timer is started.
    • If no one attempts a checkout within 300 milliseconds the worker will be dropped, and only then will the overflow number be decreased.
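
For concreteness, the walkthrough above corresponds to a pool configured roughly like this, where overflow_ttl is the option this change introduces (the pool and worker module names are placeholders):

    %% Hypothetical configuration matching the walkthrough above.
    PoolArgs = [{name, {local, example_pool}},
                {worker_module, example_worker},
                {size, 1},
                {max_overflow, 1},
                {overflow_ttl, 300}],  %% milliseconds
    WorkerArgs = [],
    {ok, Pool} = poolboy:start_link(PoolArgs, WorkerArgs).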

@zugolosian
Author

@devinus Let me know if you have any further questions.

Cheers

@zugolosian
Author

FYI, we've been running the above fork in production for 2 months now without issue.

@comtihon

comtihon commented Jun 6, 2016

@zugolosian is it efficient to start and cancel a timer on every overflow worker checkout?
I mean, under heavy load an overflow worker can process lots of requests one after another, so there will be a lot of unneeded timer creation and cancellation.
Maybe it would be better to remember the last time of use for each overflow worker and kill old unused workers with a per-second timer check?
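
A minimal sketch of that alternative (purely illustrative, not the code in any fork): each idle overflow worker carries a last-used timestamp, and a single recurring check reaps the ones that have been idle longer than the TTL.

    %% Illustrative only: IdleWorkers is assumed to be a list of
    %% {Pid, LastUsedMs} pairs; poolboy itself would dismiss workers
    %% through its supervisor rather than with exit/2.
    reap_idle(IdleWorkers, OverflowTtl) ->
        Now = erlang:monotonic_time(millisecond),
        {Expired, Fresh} =
            lists:partition(fun({_Pid, LastUsed}) ->
                                    Now - LastUsed >= OverflowTtl
                            end, IdleWorkers),
        [exit(Pid, shutdown) || {Pid, _} <- Expired],
        %% The caller reschedules the next check, e.g. with
        %% erlang:send_after(ReapInterval, self(), reap_idle).
        Fresh.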

@zugolosian
Author

@comtihon

That's a good idea, I never even considered it.

I think in most cases people using a TTL will be connecting to an external resource, and the cost of starting and stopping a timer won't be significant. Another thing is that if you reap workers at an interval, you immediately limit the granularity of your TTL to the reap interval, which you'd have to expose to the user.

I also suspect that if you have a large number of overflow workers you could get into a state where poolboy spends all its time reaping. I don't know how efficient it would be to check the age of all overflow workers at an interval versus waiting for timers to expire. Timers seem like they will have more predictable behaviour to me.

Is there a use case where you see the timer implementation being a problem?

@comtihon

comtihon commented Jun 7, 2016

@zugolosian @devinus
I changed the algorithm to one that is more effective from my point of view.
As for the reap interval, it is one second, or equal to overflow_ttl, as you can see at zugolosian@8f8f5fe#diff-96c7a4d851dcecda493caf816793b18fR144.
This reduces the limit on the granularity of the TTL.

From my point of view it is much more efficient to process the workers' TTL in one call when the numbers are large. In your case, a large number of workers spawns a large number of timers, which send a large number of reap_worker messages to the poolboy process. That can flood it.

As you can see from zugolosian@8f8f5fe#diff-96c7a4d851dcecda493caf816793b18fR377, I iterate only over idle workers and only up to the overflow number, so I consider this to be efficient.

Should I create a PR to devinus/poolboy?
Here is the PR to zugolosian/poolboy: https://github.com/zugolosian/poolboy/pull/1

David Leach and others added 8 commits August 6, 2016 18:50
Added function to return richer pool status info
* Ensure reap message already in mailbox is flushed when cancelling reap
* Reap shouldn't be touching monitors, they are just for owner of checked out workers
Don't reap workers that are checked out again
