Propagate close() from DatasetIterator to data sources#1185
Propagate close() from DatasetIterator to data sources#1185tillahoffmann wants to merge 1 commit intogoogle:mainfrom
close() from DatasetIterator to data sources#1185Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
This change ensures that when `DatasetIterator.close()` is called, the cleanup propagates all the way down to the underlying data source, allowing proper resource cleanup (e.g., closing file handles, database connections, or thread-local resources). Changes: - Add `close()` to `MapDataset` that propagates to parent datasets - Add `close()` to `SourceMapDataset` that calls the data source's `close()` method if available - Update `PrefetchDatasetIterator.close()` to call `_map_parent.close()` - Update `GetElementProducerFn.__call__` to call `it.close()` in finally block for multiprocessing cleanup - Add tests for close propagation
97a22ce to
efd84d6
Compare
|
thanks for the change Till. When do we expect so in order to actually close the source the user will have to manually call close on the source itself, right? not on the |
|
Interesting, good point! The problem I was trying to address is the following (but this is probably not the right solution given the constraint you mentioned). I have thread-local connections to a database in the data source. My issue is that when the iter dataset uses multiple threads, I create a connection per thread. But they don't get cleaned up because the iter dataset does not clean up within each thread. Do you have a hunch on the best approach here? Maybe I just need to track all connections in a list and close them in the main thread. |
This follows up on #1069 to call
closeon the underlying data source if it implements aclosemethod. This ensures resources are cleaned up properly.For multi-threaded prefetch, the
closemethod is called once per thread and on the main thread to clean up thread-local resources. This means thatcloseshould be safe to call multiple times which aligns with Python standard practice, e.g., closing streams, database connections, and sockets in the standard library have this behavior.The docs mention that Grain tries to use data sources as context managers if they implement the protocol, but that behavior is not currently implemented (cf. #936). The code provides the utility function
use_context_if_available, but it is not used.@iindyk, given our discussion in #1069, I've used
closerather than context managers to clean up resources. This achieves the clean up, but does not address the doc discrepancy in #936. I can add the context manager implementation but wanted to discuss first given the preference forcloseover context managers in #1069.📚 Documentation preview 📚: https://google-grain--1185.org.readthedocs.build/