
Conversation

@sanjayprabhu sanjayprabhu commented Oct 3, 2025

The main issue with implementing SIGTERM handling was that with the default gRPC implementation, requests run in child threads, and in Python only the main thread can register signal handlers.

I initially looked at spawning a subprocess per request from the agent, or switching the gRPC server to use a ProcessPoolExecutor, but both had issues. There's a cleaner way: switch to the asyncio gRPC server, which runs all requests in the main thread.
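
As an illustration of the idea (a minimal sketch, not the actual PR code; the servicer registration and port are placeholders): with the asyncio server, every RPC handler runs on the main thread's event loop, so the same thread that receives SIGTERM also owns the in-flight work.

import asyncio
import signal

from grpc import aio


async def serve() -> None:
    # asyncio gRPC server: all RPC handlers run on this event loop,
    # i.e. in the main thread.
    server = aio.server()
    # add_IsolateServicer_to_server(servicer, server)  # placeholder registration
    server.add_insecure_port("[::]:50051")  # placeholder port
    await server.start()

    shutdown_event = asyncio.Event()
    # Registering the handler here works because this is the main thread.
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, shutdown_event.set)

    await shutdown_event.wait()
    print("Termination signal received, shutting down...")
    await server.stop(grace=5)


if __name__ == "__main__":
    asyncio.run(serve())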

Tested manually and added integration tests.

{"logged_at": "2025-10-03T00:22:49.736637+00:00", "isolate_source": "BRIDGE",
 "level": "TRACE", "message": "Isolate info: server 0.20.0, agent 0.20.0"}
{"logged_at": "2025-10-03T00:22:49.756782+00:00", "isolate_source": "BRIDGE",
 "level": "TRACE", "message": "Starting the execution of the function functio
n."}
^CTermination signal received, shutting down...
Shutting down, canceling all tasks...
Task 182833ca-e51d-4ef8-918a-33eb74c14dca finished with error: GRPCException(
'<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCod
e.CANCELLED\n\tdetails = "Channel closed!"\n\tdebug_error_string = "UNKNOWN:E
rror received from peer  {grpc_message:"Channel closed!", grpc_status:1}"\n>'
)
Terminating the agent process...
Agent process shutdown gracefully
All tasks canceled.
Server shut down

> ls sigterm_received
sigterm_received

See the previous attempt here: https://github.com/fal-ai/isolate/pull/174/files. The shutdown test there is quite good, so I re-used and extended it.


self._shutting_down = True
print("Shutting down, canceling all tasks...")
self.cancel_tasks()
Author

This cancels the tasks in sequence rather than in parallel, but it should be OK because we only run one at a time AFAIK?

# There seems to be a weird bug with grpcio that makes subsequent requests fail with
# "concurrent rpc limit exceeded" if we set maximum_concurrent_rpcs to 1. Setting it to 2
# fixes it, even though in practice, we only run one request at a time.
server = aio.server(
Author

This is the main change: with the asyncio server, all requests run in the main thread.
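
For reference, a hedged sketch of what that construction might look like (only maximum_concurrent_rpcs is taken from the comment above; it is a real parameter of grpc.aio.server):

from grpc import aio

# 2 instead of 1 to work around the grpcio quirk described above; in practice
# only one request runs at a time.
server = aio.server(maximum_concurrent_rpcs=2)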

print("Terminating the agent process...")
process.terminate()
process.wait(timeout=PROCESS_SHUTDOWN_TIMEOUT)
print("Agent process shutdown gracefully")
Author

This might be excessive logging; I can take it out, but it was useful for debugging.

"""An internal problem caused by (most probably) the agent."""


PROCESS_SHUTDOWN_TIMEOUT = 5 # seconds
Author

What's a good default?

Member

I think we can default to 60?

The question then becomes how we can configure this from the outside.

Author

I could add an optional field here to allow callers to override the default:

message BoundFunction {
    repeated EnvironmentDefinition environments = 1;
    SerializedObject function = 2;
    optional SerializedObject setup_func = 3;
    bool stream_logs = 4;
}

Author

I defaulted to 60 seconds, which can be overridden via an env var; I think this should work for now. I realized we can't add it to the request because the agent can be re-used across functions.
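
A sketch of that override, using the names that appear later in this PR's diff:

import os

# Grace period before forcing the agent down; defaults to 60 seconds and can be
# configured from outside via an environment variable.
PROCESS_SHUTDOWN_TIMEOUT_SECONDS = float(
    os.getenv("ISOLATE_SHUTDOWN_GRACE_PERIOD", "60")
)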

@chamini2 chamini2 self-requested a review October 3, 2025 15:05
"""An internal problem caused by (most probably) the agent."""


PROCESS_SHUTDOWN_TIMEOUT = 5 # seconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can default to 60?

The question becomes how can we configure this from outside

@sanjayprabhu sanjayprabhu marked this pull request as ready for review October 3, 2025 23:54
Member

@chamini2 chamini2 left a comment

This is great!


PROCESS_SHUTDOWN_TIMEOUT = 5 # seconds
PROCESS_SHUTDOWN_TIMEOUT_SECONDS = float(
    os.getenv("ISOLATE_SHUTDOWN_GRACE_PERIOD", "60")
Member

very good solution

Comment on lines +184 to +185
if self.future and not self.future.running():
    self.future.cancel()
Member

But if we don't cancel it, then what happens?

Author

I think in almost all cases, nothing. But there could be a rare race condition where a future that hasn't started yet begins executing after this point, leading to an orphaned agent process (more likely when the server is handling multiple tasks).

The chances are quite low, but it's more correct to always cancel, IMO.
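
A standalone illustration of the difference (hypothetical example, not the PR code):

import time
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=1) as pool:
    running = pool.submit(time.sleep, 1)  # starts immediately
    queued = pool.submit(time.sleep, 1)   # still pending behind `running`

    # A running future cannot be cancelled: cancel() returns False and the
    # work keeps going until it finishes on its own.
    print(running.cancel())  # False
    # A pending future is cancelled before it ever starts, so it cannot later
    # kick off work (e.g. an agent process) that nobody is tracking.
    print(queued.cancel())   # True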

Member

I guess we can just log that scenario then.



@pytest.fixture
def isolate_server_subprocess(monkeypatch):
Member

nice
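
For context, a hedged sketch of what a subprocess-based server fixture could look like; the entrypoint and env var wiring below are assumptions, not the PR's actual fixture:

import signal
import subprocess
import sys

import pytest


@pytest.fixture
def isolate_server_subprocess(monkeypatch):
    # Hypothetical: shorten the grace period so shutdown tests finish quickly.
    # (The subprocess inherits os.environ, which monkeypatch.setenv modifies.)
    monkeypatch.setenv("ISOLATE_SHUTDOWN_GRACE_PERIOD", "5")
    # Hypothetical entrypoint; the real fixture may launch the server differently.
    proc = subprocess.Popen([sys.executable, "-m", "isolate.server"])
    try:
        yield proc
    finally:
        if proc.poll() is None:
            proc.send_signal(signal.SIGTERM)
            proc.wait(timeout=10)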

@sanjayprabhu sanjayprabhu force-pushed the exit_handler branch 2 times, most recently from a6885d6 to 201d4b9 Compare October 7, 2025 00:33