Long story short
The operator pod gets killed after hitting the thread limit: the process keeps spawning new threads, and under load the pod is killed roughly every 2 hours.
Description
RuntimeError: can't start new thread
Traceback (most recent call last):
File "/usr/local/bin/kopf", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/kopf/cli.py", line 30, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/kopf/cli.py", line 61, in run
peering_name=peering_name,
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/queueing.py", line 275, in run
_reraise(loop, list(done1) + list(done2) + list(done3) + list(done4))
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/queueing.py", line 303, in _reraise
task.result() # can raise the regular (non-cancellation) exceptions.
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/queueing.py", line 81, in watcher
async for event in watching.infinite_watch(resource=resource, namespace=namespace):
File "/usr/local/lib/python3.7/site-packages/kopf/clients/watching.py", line 131, in infinite_watch
async for event in streaming_watch(resource=resource, namespace=namespace):
File "/usr/local/lib/python3.7/site-packages/kopf/clients/watching.py", line 93, in streaming_watch
async for event in streaming_aiter(stream, loop=loop):
File "/usr/local/lib/python3.7/site-packages/kopf/clients/watching.py", line 62, in streaming_aiter
yield await loop.run_in_executor(executor, streaming_next, src)
File "/usr/local/lib/python3.7/asyncio/base_events.py", line 747, in run_in_executor
executor.submit(func, *args), loop=self)
File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 172, in submit
self._adjust_thread_count()
File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 193, in _adjust_thread_count
t.start()
File "/usr/local/lib/python3.7/threading.py", line 852, in start
_start_new_thread(self._bootstrap, ())
root@s:/# ps huH p 8 | wc -l
624
root@s:/# ps -o nlwp 8
NLWP
624
root@s:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 08:46 ? 00:00:00 /bin/sh -c kopf run --standalone /handlers.py
root 8 1 0 08:46 ? 00:00:13 /usr/local/bin/python /usr/local/bin/kopf run --standalone /handlers.py
root 634 0 0 11:00 pts/0 00:00:00 bash
root 649 634 0 11:01 pts/0 00:00:00 ps -ef
root@serviceendpoint-6b69949674-tvbj6:/# cat /proc/sys/kernel/threads-max
6180721
root@s:/# ps -o nlwp 8
NLWP
631
root@s:/# ps -o nlwp 8
NLWP
652
The exact command to reproduce the issue
The full output of the command that failed
Environment
Python packages installed
Hello. Thanks for reporting.
Can you please clarify the version? Are you sure it is version 0.23.2?
I see this line:
File "/usr/local/lib/python3.7/site-packages/kopf/clients/watching.py", line 62, in streaming_aiter
yield await loop.run_in_executor(executor, streaming_next, src)
This sync approach was removed in #227 (line link), which was released as 0.23 (followed by 0.23.1 & 0.23.2), and replaced with aiohttp-based cycles, which are natively async and have remained so since 0.23.
One idea of how this happens:

Kopf uses asyncio's threaded executor for synchronous handlers (those declared with def, not async def).
Python's threaded executor adds one more thread on every use until max_workers is reached (link): not all threads at once, but one by one.
The default max_workers is os.cpu_count() * 5 (link). So, on a MacBook, it can be 8 * 5 = 40 (perhaps due to hyper-threading CPUs). On huge K8s nodes, it can be up to 40 * 5 = 200 threads (assuming 40 regular cores), or 2 * 40 * 5 = 400 (40 hyper-threading cores), or more.
It keeps the threaded workers running even when they are not used, and it does not reuse the idle workers for the next tasks (or maybe it does, but it still adds extra idle workers until the limit is reached).
Maybe, at some level, this reaches the RAM limits of the pod and the process dies.
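The executor behavior described above can be reproduced in a few lines of plain Python (a standalone illustration using the standard library, not Kopf code):

```python
import concurrent.futures
import threading
import time

# A small pool; the real default is os.cpu_count() * 5, but it is pinned
# here so the demonstration is deterministic.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=5)

baseline = threading.active_count()

# Each submit() may start one more worker thread (see
# ThreadPoolExecutor._adjust_thread_count) until max_workers is reached.
for _ in range(20):
    pool.submit(time.sleep, 0.01)

time.sleep(0.5)  # let the tasks finish; the workers stay alive, idle
grown = threading.active_count() - baseline

# The pool never exceeds max_workers, but the idle workers are not torn down.
assert 1 <= grown <= 5
pool.shutdown(wait=True)
```

The key observation is the last assertion: even after all tasks complete, the worker threads remain alive until the executor is shut down, which is why an operator's thread count ratchets upward instead of staying flat.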
This can be controlled by using an already existing but undocumented config (link):
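The config snippet that followed here is not preserved in this copy of the thread, and Kopf's exact config attribute is not shown. As a general illustration of the same idea in plain asyncio (capping the loop's default executor rather than using Kopf's own config API), a sketch might look like this:

```python
import asyncio
import concurrent.futures
import threading

async def main() -> int:
    loop = asyncio.get_running_loop()
    # Swap the implicit default executor (sized os.cpu_count() * 5 in
    # Python 3.7) for a small fixed pool; every run_in_executor(None, ...)
    # call then shares at most these 10 threads.
    loop.set_default_executor(
        concurrent.futures.ThreadPoolExecutor(max_workers=10)
    )
    # Fire many "synchronous handler" calls; the thread count stays bounded.
    await asyncio.gather(*[
        loop.run_in_executor(None, threading.active_count)
        for _ in range(50)
    ])
    return threading.active_count()

threads = asyncio.run(main())
assert threads <= 11  # at most 10 workers plus the main thread
```

This mirrors the outcome described below: a fixed worker pool keeps the process at a flat, predictable thread count regardless of how many sync handlers run.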
I was able to catch this on one of our operators which is supposed to do nothing during that time (but contained synchronous handlers for @kopf.on.event of pods): the thread count was growing overnight when it should have been flat. I will try the trick above and see if it helps in the next few nights.
It did help. The thread count does not grow once the operator is started and used: it stays at 12 (10 were configured for the executor, 1 for the main thread, 1 for something of Python's, perhaps).
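A flat thread count like the 12 reported here can be verified from inside the process itself (a generic standard-library check, not Kopf-specific), complementing the external ps -o nlwp approach shown earlier:

```python
import threading

def report_threads() -> int:
    """Print each live thread's name and return the total count."""
    for t in threading.enumerate():
        print(f"{t.name}: daemon={t.daemon}")
    return threading.active_count()

count = report_threads()
assert count >= 1  # at least the main thread is always present
```

Logging this periodically (e.g. from a timer handler) makes a slow thread leak visible long before the pod hits its limit.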
So far, the issue is a matter of asyncio's thread executor setup, which can be configured at the operator level as needed. I suggest that the framework should have no defaults or assumptions of its own about the execution environment.
Tested with Kopf 0.25.
@amolkavitkar Can you please check if this solution helps you? (Also, please check the version.)