feat: Allow agent to handle signals for graceful termination #176
base: main
Conversation
    self._shutting_down = True
    print("Shutting down, canceling all tasks...")
    self.cancel_tasks()
This cancels in sequence rather than in parallel, but it should be OK because we only run one task at a time, AFAIK?
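For illustration, a minimal sketch of what sequential cancellation could look like; the actual cancel_tasks in isolate may differ:

    # Hypothetical sketch, not the actual isolate code: cancelling tasks one by
    # one means any per-task shutdown wait happens back to back rather than
    # concurrently. That is acceptable while the server runs one task at a time.
    def cancel_tasks(self) -> None:
        for task in list(self._tasks):
            task.cancel()  # may block while the task's agent is asked to exit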
    # There seems to be a weird bug with grpcio that makes subsequent requests fail with
    # concurrent rpc limit exceeded if we set maximum_concurrent_rpcs to 1. Setting it to 2
    # fixes it, even though in practice, we only run one request at a time.
    server = aio.server(
This is the main change, so all requests run in the main thread.
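Roughly, the server construction becomes something like this (a sketch assuming only the executor and maximum_concurrent_rpcs change; the real diff may differ):

    from grpc import aio

    # asyncio-based server: RPC handlers run as coroutines on the main thread's
    # event loop instead of on ThreadPoolExecutor worker threads, so the main
    # thread stays in control and can react to signals.
    # maximum_concurrent_rpcs=2 works around the grpcio quirk described in the
    # comment above, even though only one request runs at a time in practice.
    server = aio.server(maximum_concurrent_rpcs=2)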
| print("Terminating the agent process...") | ||
| process.terminate() | ||
| process.wait(timeout=PROCESS_SHUTDOWN_TIMEOUT) | ||
| print("Agent process shutdown gracefully") |
Might be excessive logging, I can take it out. But it was useful for debugging.
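For context, the terminate-and-wait pattern being discussed, with a hypothetical force-kill fallback sketched in (the fallback is an assumption for illustration, not necessarily what the PR does):

    import subprocess

    PROCESS_SHUTDOWN_TIMEOUT = 5  # seconds

    def shutdown_agent(process: subprocess.Popen) -> None:
        print("Terminating the agent process...")
        process.terminate()  # SIGTERM, giving the agent a chance to clean up
        try:
            process.wait(timeout=PROCESS_SHUTDOWN_TIMEOUT)
            print("Agent process shutdown gracefully")
        except subprocess.TimeoutExpired:
            # Assumed fallback: force-kill if the agent does not exit in time.
            process.kill()
            process.wait()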
| """An internal problem caused by (most probably) the agent.""" | ||
|
|
||
|
|
||
| PROCESS_SHUTDOWN_TIMEOUT = 5 # seconds |
What's a good default?
I think we can default to 60?
The question then becomes how we can configure this from outside.
I could add an optional field here to allow callers to override the default:
isolate/src/isolate/server/definitions/server.proto
Lines 24 to 29 in 64144e9
    message BoundFunction {
      repeated EnvironmentDefinition environments = 1;
      SerializedObject function = 2;
      optional SerializedObject setup_func = 3;
      bool stream_logs = 4;
    }
Defaulted to 60 and can be overridden via env var, I think this should work for now. Realized we can't add it to the request because it's possible for the agent to be re-used across functions.
| """An internal problem caused by (most probably) the agent.""" | ||
|
|
||
|
|
||
| PROCESS_SHUTDOWN_TIMEOUT = 5 # seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can default to 60?
The question becomes how can we configure this from outside
This is great!
    PROCESS_SHUTDOWN_TIMEOUT = 5  # seconds
    PROCESS_SHUTDOWN_TIMEOUT_SECONDS = float(
        os.getenv("ISOLATE_SHUTDOWN_GRACE_PERIOD", "60")
very good solution
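For example, since the constant is read from the environment once at module import time, a deployment can raise the grace period by exporting the variable before the server module is loaded (a minimal illustration):

    import os

    # Must be set before the server module is imported.
    os.environ["ISOLATE_SHUTDOWN_GRACE_PERIOD"] = "120"

    timeout = float(os.getenv("ISOLATE_SHUTDOWN_GRACE_PERIOD", "60"))
    assert timeout == 120.0  # agents now get 120s to exit instead of the 60s default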
    if self.future and not self.future.running():
        self.future.cancel()
But if we don't cancel it, then what happens?
I think in almost all cases, nothing. But there could be rare race conditions where a future that hasn't started yet begins executing after this point, leading to an orphaned agent process (more likely when the server is handling multiple tasks).
The chances are quite low, but it's more correct to always cancel, IMO.
I guess we can just add a log about this scenario then.
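Something along these lines would cover it (a sketch of the suggested logging, not the final code):

    if self.future and not self.future.running():
        if not self.future.cancel():
            # Rare race: the future started running between the running() check
            # and cancel(); its agent process may outlive this shutdown.
            print("Could not cancel a pending task; it may leave an orphaned agent process")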
    @pytest.fixture
    def isolate_server_subprocess(monkeypatch):
nice
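For reference, a rough sketch of what such a fixture might look like; the module path, env var value, and timeouts here are illustrative rather than the actual test code:

    import signal
    import subprocess
    import sys

    import pytest

    @pytest.fixture
    def isolate_server_subprocess(monkeypatch):
        # Illustrative sketch: run the server in its own process so the test can
        # send it real signals and assert on graceful shutdown behaviour.
        monkeypatch.setenv("ISOLATE_SHUTDOWN_GRACE_PERIOD", "5")
        process = subprocess.Popen([sys.executable, "-m", "isolate.server.server"])
        yield process
        if process.poll() is None:
            process.send_signal(signal.SIGTERM)
            process.wait(timeout=10)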
This reverts commit 42d04ec.
The main issue with implementing SIGTERM handling was that with the default grpc implementation, requests run in child threads, and in Python only the main thread can register signal handlers.
I initially looked at spawning a subprocess per request from the agent, or switching the grpc server to a ProcessPoolExecutor, but both had issues. However, there's a cleaner way: switch to the asyncio grpc server, which runs all requests in the main thread.
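As a rough illustration of why this works (a sketch, not the PR's exact code): with every request handled as a coroutine on the main thread's event loop, the shutdown handler can be registered directly on that loop.

    import asyncio
    import signal
    from grpc import aio

    async def serve() -> None:
        server = aio.server(maximum_concurrent_rpcs=2)
        # ... register servicers and add ports here ...
        await server.start()

        def handle_sigterm() -> None:
            # Python only allows installing signal handlers from the main thread;
            # with the asyncio server that is exactly where the handlers run.
            asyncio.ensure_future(server.stop(60))  # graceful stop with a grace period

        asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, handle_sigterm)
        await server.wait_for_termination()

    asyncio.run(serve())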
Tested manually and added integration tests.
See the previous attempt here: https://github.com/fal-ai/isolate/pull/174/files. The shutdown test there is quite good, so I re-used and extended it.