Skip to content

Conversation

@hertschuh
Copy link
Collaborator

A memory leak related to the executor in CallbackList was fixed in #20779

However, calling Executor.shutdown within __del__ is intrisincally unsafe and can create deadlocks because the garbage collector can be called in different contexts.

This new approach uses the on_train/test/predict_begin and on_train/test/predict_end callbacks to detect when we're done with the executor.

  • it counts the number of "begin"s and "end"s to handle the case of evaluate within fit (we do not shutdown the executor at the end of evaluate but instead keep it around for the rest of the training)
  • it also handles CallbackList being reused between calls to fit, evaluate or predict even though Keras doesn't reuse.

Also renamed _clear_futures to _flush_futures to make it clear futures are not discarded, but exectuted.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @hertschuh, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical deadlock vulnerability within the CallbackList class by overhauling the management of its internal ThreadPoolExecutor. It replaces the problematic __del__ method-based shutdown with a more robust, event-driven approach. By tracking the start and end of training, testing, and prediction phases, the executor is now safely initialized and shut down, preventing deadlocks and ensuring proper resource handling, even in complex nested execution scenarios.

Highlights

  • Deadlock Fix: Addresses a potential deadlock in CallbackList by removing the unsafe Executor.shutdown call from the __del__ method, which is unreliable for resource management.
  • Executor Lifecycle Management: Implements a new, robust mechanism for managing the ThreadPoolExecutor's lifecycle using on_train/test/predict_begin and on_train/test/predict_end callbacks.
  • Nested Call Handling: Introduces an internal counter, _in_begin_end_block_count, to correctly manage the executor across nested operations (e.g., evaluate within fit), ensuring the executor is only shut down when all related blocks are complete.
  • Lazy Executor Initialization: The ThreadPoolExecutor is now initialized lazily on the first _on_begin call when needed, rather than being created during the initial configuration phase.
  • Method Renaming for Clarity: The method _clear_futures has been renamed to _flush_futures to more accurately reflect its purpose of executing and awaiting futures rather than simply discarding them.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively resolves a potential deadlock in CallbackList by replacing the unsafe __del__ implementation with a reference counting mechanism to manage the ThreadPoolExecutor's lifecycle. The new approach using _on_begin and _on_end hooks is robust and correctly handles nested calls and reuse. The renaming of _clear_futures to _flush_futures also improves clarity. I've added a few suggestions to improve docstrings for the new and renamed methods to enhance maintainability, in line with the project's style guide.

@codecov-commenter
Copy link

codecov-commenter commented Sep 29, 2025

Codecov Report

❌ Patch coverage is 91.30435% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.59%. Comparing base (cc56474) to head (9b129f4).

Files with missing lines Patch % Lines
keras/src/callbacks/callback_list.py 91.30% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #21701   +/-   ##
=======================================
  Coverage   82.59%   82.59%           
=======================================
  Files         572      572           
  Lines       58314    58327   +13     
  Branches     9130     9131    +1     
=======================================
+ Hits        48166    48177   +11     
- Misses       7817     7818    +1     
- Partials     2331     2332    +1     
Flag Coverage Δ
keras 82.40% <91.30%> (+<0.01%) ⬆️
keras-jax 63.30% <91.30%> (+<0.01%) ⬆️
keras-numpy 57.65% <65.21%> (+<0.01%) ⬆️
keras-openvino 34.32% <13.04%> (-0.01%) ⬇️
keras-tensorflow 64.04% <52.17%> (+<0.01%) ⬆️
keras-torch 63.63% <91.30%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

A memory leak related to the executor in `CallbackList` was fixed in keras-team#20779

However, calling `Executor.shutdown` within `__del__` is intrisincally unsafe and can create deadlocks because the garbage collector can be called in different contexts.

This new approach uses the `on_train/test/predict_begin` and `on_train/test/predict_end` callbacks to detect when we're done with the executor.
- it counts the number of "begin"s and "end"s to handle the case of `evaluate` within `fit` (we do not shutdown the executor at the end of `evaluate` but instead keep it around for the rest of the training)
- it also handles `CallbackList` being reused between calls to `fit`, `evaluate` or `predict` even though Keras doesn't reuse.

Also renamed `_clear_futures` to `_flush_futures` to make it clear futures are not discarded, but exectuted.
@google-ml-butler google-ml-butler bot added kokoro:force-run ready to pull Ready to be merged into the codebase labels Sep 30, 2025
@hertschuh hertschuh merged commit 5ae5503 into keras-team:master Oct 1, 2025
11 checks passed
@google-ml-butler google-ml-butler bot removed awaiting review ready to pull Ready to be merged into the codebase labels Oct 1, 2025
@hertschuh hertschuh deleted the callback_list_shutdown branch October 1, 2025 00:01
ryantqiu pushed a commit to snorkel-marlin-repos/keras_c79dfd9c that referenced this pull request Oct 1, 2025
Original PR #21701 by hertschuh
Original: keras-team/keras#21701
ryantqiu added a commit to snorkel-marlin-repos/keras_c79dfd9c that referenced this pull request Oct 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants