Fix deadlock in `CallbackList`. #21701

hertschuh · 2025-09-29T18:27:34Z

A memory leak related to the executor in CallbackList was fixed in #20779

However, calling Executor.shutdown within __del__ is intrisincally unsafe and can create deadlocks because the garbage collector can be called in different contexts.

This new approach uses the on_train/test/predict_begin and on_train/test/predict_end callbacks to detect when we're done with the executor.

it counts the number of "begin"s and "end"s to handle the case of evaluate within fit (we do not shutdown the executor at the end of evaluate but instead keep it around for the rest of the training)
it also handles CallbackList being reused between calls to fit, evaluate or predict even though Keras doesn't reuse.

Also renamed _clear_futures to _flush_futures to make it clear futures are not discarded, but exectuted.

gemini-code-assist · 2025-09-29T18:27:50Z

Summary of Changes

Hello @hertschuh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical deadlock vulnerability within the CallbackList class by overhauling the management of its internal ThreadPoolExecutor. It replaces the problematic __del__ method-based shutdown with a more robust, event-driven approach. By tracking the start and end of training, testing, and prediction phases, the executor is now safely initialized and shut down, preventing deadlocks and ensuring proper resource handling, even in complex nested execution scenarios.

Highlights

Deadlock Fix: Addresses a potential deadlock in CallbackList by removing the unsafe Executor.shutdown call from the __del__ method, which is unreliable for resource management.
Executor Lifecycle Management: Implements a new, robust mechanism for managing the ThreadPoolExecutor's lifecycle using on_train/test/predict_begin and on_train/test/predict_end callbacks.
Nested Call Handling: Introduces an internal counter, _in_begin_end_block_count, to correctly manage the executor across nested operations (e.g., evaluate within fit), ensuring the executor is only shut down when all related blocks are complete.
Lazy Executor Initialization: The ThreadPoolExecutor is now initialized lazily on the first _on_begin call when needed, rather than being created during the initial configuration phase.
Method Renaming for Clarity: The method _clear_futures has been renamed to _flush_futures to more accurately reflect its purpose of executing and awaiting futures rather than simply discarding them.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request effectively resolves a potential deadlock in CallbackList by replacing the unsafe __del__ implementation with a reference counting mechanism to manage the ThreadPoolExecutor's lifecycle. The new approach using _on_begin and _on_end hooks is robust and correctly handles nested calls and reuse. The renaming of _clear_futures to _flush_futures also improves clarity. I've added a few suggestions to improve docstrings for the new and renamed methods to enhance maintainability, in line with the project's style guide.

keras/src/callbacks/callback_list.py

codecov-commenter · 2025-09-29T18:34:12Z

Codecov Report

❌ Patch coverage is 91.30435% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.59%. Comparing base (cc56474) to head (9b129f4).

Files with missing lines	Patch %	Lines
keras/src/callbacks/callback_list.py	91.30%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #21701   +/-   ##
=======================================
  Coverage   82.59%   82.59%           
=======================================
  Files         572      572           
  Lines       58314    58327   +13     
  Branches     9130     9131    +1     
=======================================
+ Hits        48166    48177   +11     
- Misses       7817     7818    +1     
- Partials     2331     2332    +1

Flag	Coverage Δ
keras	`82.40% <91.30%> (+<0.01%)`	⬆️
keras-jax	`63.30% <91.30%> (+<0.01%)`	⬆️
keras-numpy	`57.65% <65.21%> (+<0.01%)`	⬆️
keras-openvino	`34.32% <13.04%> (-0.01%)`	⬇️
keras-tensorflow	`64.04% <52.17%> (+<0.01%)`	⬆️
keras-torch	`63.63% <91.30%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

A memory leak related to the executor in `CallbackList` was fixed in keras-team#20779 However, calling `Executor.shutdown` within `__del__` is intrisincally unsafe and can create deadlocks because the garbage collector can be called in different contexts. This new approach uses the `on_train/test/predict_begin` and `on_train/test/predict_end` callbacks to detect when we're done with the executor. - it counts the number of "begin"s and "end"s to handle the case of `evaluate` within `fit` (we do not shutdown the executor at the end of `evaluate` but instead keep it around for the rest of the training) - it also handles `CallbackList` being reused between calls to `fit`, `evaluate` or `predict` even though Keras doesn't reuse. Also renamed `_clear_futures` to `_flush_futures` to make it clear futures are not discarded, but exectuted.

Original PR #21701 by hertschuh Original: keras-team/keras#21701

Merged from original PR #21701 Original: keras-team/keras#21701

google-ml-butler bot added the size:M label Sep 29, 2025

google-ml-butler bot assigned gbaned Sep 29, 2025

gemini-code-assist bot reviewed Sep 29, 2025

View reviewed changes

keras/src/callbacks/callback_list.py Outdated Show resolved Hide resolved

keras/src/callbacks/callback_list.py Outdated Show resolved Hide resolved

keras/src/callbacks/callback_list.py Show resolved Hide resolved

hertschuh force-pushed the callback_list_shutdown branch from 66b1ea8 to 9b129f4 Compare September 29, 2025 19:06

hertschuh requested a review from divyashreepathihalli September 29, 2025 20:40

google-ml-butler bot added the awaiting review label Sep 29, 2025

divyashreepathihalli approved these changes Sep 30, 2025

View reviewed changes

google-ml-butler bot added kokoro:force-run ready to pull Ready to be merged into the codebase labels Sep 30, 2025

kokoro-team removed the kokoro:force-run label Sep 30, 2025

hertschuh merged commit 5ae5503 into keras-team:master Oct 1, 2025
11 checks passed

google-ml-butler bot removed awaiting review ready to pull Ready to be merged into the codebase labels Oct 1, 2025

hertschuh deleted the callback_list_shutdown branch October 1, 2025 00:01

ryantqiu pushed a commit to snorkel-marlin-repos/keras_c79dfd9c that referenced this pull request Oct 1, 2025

Fix deadlock in CallbackList.

c73753f

Original PR #21701 by hertschuh Original: keras-team/keras#21701

ryantqiu mentioned this pull request Oct 1, 2025

Fix deadlock in CallbackList. snorkel-marlin-repos/keras_c79dfd9c#1

Merged

ryantqiu added a commit to snorkel-marlin-repos/keras_c79dfd9c that referenced this pull request Oct 1, 2025

Merge pull request #1: Fix deadlock in CallbackList.

f21a935

Merged from original PR #21701 Original: keras-team/keras#21701

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix deadlock in `CallbackList`. #21701

Fix deadlock in `CallbackList`. #21701

Uh oh!

hertschuh commented Sep 29, 2025

Uh oh!

gemini-code-assist bot commented Sep 29, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Sep 29, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Fix deadlock in CallbackList. #21701

Fix deadlock in CallbackList. #21701

Uh oh!

Conversation

hertschuh commented Sep 29, 2025

Uh oh!

gemini-code-assist bot commented Sep 29, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Sep 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Fix deadlock in `CallbackList`. #21701

Fix deadlock in `CallbackList`. #21701

codecov-commenter commented Sep 29, 2025 •

edited

Loading