[Feature] Add option to overwrite cache_key #676

Open: wants to merge 13 commits into main

Conversation

@jan-janssen (Member) commented Jun 14, 2025

Summary by CodeRabbit

  • New Features
    • Improved task key generation by supporting custom cache keys when provided, enhancing flexibility for caching behavior.
    • Added option to disable dependency resolution during task execution for more control.
    • Enhanced cache file discovery to include recursive search within cache directories.
    • Ensured automatic creation of necessary directories when saving cache files.
  • Tests
    • Added tests to validate cache key functionality and ensure correct caching behavior with custom keys.

coderabbitai bot (Contributor) commented Jun 14, 2025

## Walkthrough

The function `serialize_funct_h5` in `executorlib/standalone/serialize.py` was updated with an optional `cache_key` parameter that, if provided, is used directly as the `task_key` instead of generating it by serializing and hashing inputs. The `execute_tasks_h5` and `execute_tasks` functions were extended to accept and propagate this `cache_key`. Additionally, `execute_tasks_h5` gained a `disable_dependencies` parameter. Cache file handling now uses a recursive search for cache files. New tests were added to verify cache key functionality and cache file naming. The `dump` function was enhanced to ensure output directories exist before file creation.
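The key selection described above can be sketched in isolation. The helper below is a simplified stand-in, not the actual executorlib implementation (the real `serialize_funct_h5` serializes with cloudpickle and returns a data dict alongside the key):

```python
import hashlib
import pickle

def serialize_funct_sketch(fn, fn_args, fn_kwargs, cache_key=None):
    """Simplified stand-in for serialize_funct_h5's task_key selection."""
    if cache_key is not None:
        # explicit key wins; str() coercion as suggested in the review below
        task_key = str(cache_key)
    else:
        # default path: hash the serialized inputs
        binary = pickle.dumps((fn.__name__, fn_args, fn_kwargs))
        task_key = fn.__name__ + hashlib.md5(binary).hexdigest()
    return task_key

def add(x, y):
    return x + y

key_custom = serialize_funct_sketch(add, (1, 2), {}, cache_key="my_key")
key_hashed = serialize_funct_sketch(add, (1, 2), {})
```

With a custom key the hash path is skipped entirely, so two submissions with the same `cache_key` map to the same cache entry regardless of their inputs.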

## Changes

| File(s)                                               | Change Summary                                                                                                         |
|------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| executorlib/standalone/serialize.py                   | Added optional `cache_key` parameter to `serialize_funct_h5`; uses it as `task_key` if provided, else hashes serialized inputs. |
| executorlib/task_scheduler/file/shared.py             | Added `disable_dependencies` parameter to `execute_tasks_h5`; extracts `cache_key` from `resource_dict` and passes it to `serialize_funct_h5`; uses `get_cache_files` to check cache files. |
| executorlib/task_scheduler/interactive/shared.py      | Added optional `cache_key` parameter to `execute_tasks` and `_execute_task_with_cache`, propagating it to `serialize_funct_h5`; uses `get_cache_files` for cache file existence checks. |
| executorlib/standalone/cache.py                        | Added `get_cache_files` to recursively find cache files; modified `get_cache_data` to use recursive file discovery.     |
| executorlib/task_scheduler/file/hdf.py                 | Modified `dump` to ensure directory exists before opening HDF5 file by creating directories as needed.                  |
| tests/test_cache_fileexecutor_serial.py                | Added `test_executor_mixed_cache_key` to test submitting tasks with explicit `cache_key` in `resource_dict`.              |
| tests/test_singlenodeexecutor_cache.py                 | Added `test_cache_key` to verify cache key usage and filename tagging in cached tasks with explicit `cache_key`.          |

## Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant serialize_funct_h5

    Caller->>serialize_funct_h5: Call with fn, fn_args, fn_kwargs, resource_dict, cache_key (optional)
    alt cache_key provided
        serialize_funct_h5->>serialize_funct_h5: Use cache_key as task_key
    else
        serialize_funct_h5->>serialize_funct_h5: Serialize inputs and hash to generate task_key
    end
    serialize_funct_h5-->>Caller: Return data dict with task_key
```

## Possibly related PRs

## Poem

A cache key hops into the code,
Where hashes once alone bestowed.
Now, if you bring your own delight,
The function uses it outright!
Dependencies may pause or play,
While tasks and caches find their way.
🐇✨



@jan-janssen jan-janssen linked an issue Jun 14, 2025 that may be closed by this pull request
codecov bot commented Jun 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.81%. Comparing base (40296a1) to head (adb0e9f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #676      +/-   ##
==========================================
+ Coverage   96.78%   96.81%   +0.02%     
==========================================
  Files          29       29              
  Lines        1306     1317      +11     
==========================================
+ Hits         1264     1275      +11     
  Misses         42       42              

☔ View full report in Codecov by Sentry.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
executorlib/standalone/serialize.py (1)

65-76: Add unit tests to cover the new branch

Codecov flags the newly-introduced cache_key branch as uncovered. A minimal test should:

  1. Call serialize_funct_h5() with a dummy cache_key in resource_dict.
  2. Assert that the returned task_key equals that value (after str() coercion if the previous comment is addressed).
  3. Verify that the function still behaves identically when cache_key is absent (hash path).

This guards against future regressions and keeps overall coverage steady.
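The three steps above can be sketched as a standalone test against a stand-in with the same branch (the real test would import `serialize_funct_h5` from `executorlib.standalone.serialize` and pass `cache_key` inside `resource_dict`):

```python
import hashlib
import pickle
import unittest

def serialize_stub(fn, fn_args, fn_kwargs, resource_dict):
    # hypothetical stand-in mirroring the new branch; assumes str() coercion
    if "cache_key" in resource_dict:
        return str(resource_dict["cache_key"])
    return fn.__name__ + hashlib.md5(pickle.dumps((fn_args, fn_kwargs))).hexdigest()

def dummy(a, b=0):
    return a + b

class TestCacheKeyBranch(unittest.TestCase):
    def test_explicit_key_used(self):
        # steps 1 and 2: a dummy cache_key becomes the task_key
        self.assertEqual(
            serialize_stub(dummy, (1,), {"b": 2}, {"cache_key": "abc"}), "abc"
        )

    def test_hash_path_unchanged(self):
        # step 3: without cache_key the hash path still applies deterministically
        k1 = serialize_stub(dummy, (1,), {"b": 2}, {})
        k2 = serialize_stub(dummy, (1,), {"b": 2}, {})
        self.assertEqual(k1, k2)
        self.assertTrue(k1.startswith("dummy"))

result = unittest.TextTestRunner().run(
    unittest.TestLoader().loadTestsFromTestCase(TestCacheKeyBranch)
)
```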

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 66-66: executorlib/standalone/serialize.py#L66
Added line #L66 was not covered by tests

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 609d185 and 06f85c8.

📒 Files selected for processing (1)
  • executorlib/standalone/serialize.py (1 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
executorlib/standalone/serialize.py

[warning] 66-66: executorlib/standalone/serialize.py#L66
Added line #L66 was not covered by tests

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
  • GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
  • GitHub Check: notebooks_integration

Comment on lines 65 to 67
if "cache_key" in resource_dict:
    task_key = resource_dict["cache_key"]
else:

🛠️ Refactor suggestion

Validate & normalize cache_key to guarantee a str return type

serialize_funct_h5() is documented (and relied on elsewhere) to return task_key: str, yet the new code passes through whatever object happens to be under resource_dict["cache_key"] unchanged.
If callers accidentally supply an int, UUID, or any other non-string value this will violate the contract and break task_key consumers later in the pipeline (e.g. when used as an HDF5 node name).

-    if "cache_key" in resource_dict:
-        task_key = resource_dict["cache_key"]
+    if "cache_key" in resource_dict:
+        task_key_value = resource_dict["cache_key"]
+        if not isinstance(task_key_value, str):
+            task_key_value = str(task_key_value)  # or raise TypeError
+        task_key = task_key_value

While you’re here, please add a short note to the docstring clarifying the new cache_key behaviour and acceptable types.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-if "cache_key" in resource_dict:
-    task_key = resource_dict["cache_key"]
-else:
+if "cache_key" in resource_dict:
+    task_key_value = resource_dict["cache_key"]
+    if not isinstance(task_key_value, str):
+        task_key_value = str(task_key_value)  # or raise TypeError
+    task_key = task_key_value
+else:
🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 66-66: executorlib/standalone/serialize.py#L66
Added line #L66 was not covered by tests

🤖 Prompt for AI Agents
In executorlib/standalone/serialize.py around lines 65 to 67, the code assigns
task_key directly from resource_dict["cache_key"] without ensuring it is a
string, which can break downstream consumers expecting a string. Modify the code
to explicitly convert the cache_key value to a string before assigning it to
task_key. Additionally, update the serialize_funct_h5() function docstring to
include a note about the cache_key parameter, specifying that it will be
normalized to a string and listing acceptable input types.

@jan-janssen jan-janssen marked this pull request as draft June 14, 2025 01:51
@jan-janssen (Member, Author) commented:

@liamhuber After some initial bugs, the CI now seems to pass. Can you check if it resolves your issue?

@jan-janssen jan-janssen marked this pull request as ready for review June 14, 2025 03:23
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
executorlib/task_scheduler/file/shared.py (1)

122-133: ⚠️ Potential issue

Skip scheduling when dependencies are disabled but Futures are supplied

When disable_dependencies=True and future objects are detected, the code sets an exception on the user’s future but still schedules the task, resulting in an unnecessary subprocess launch that will almost certainly fail.

-                if len(future_wait_key_lst) > 0:
-                    task_dict["future"].set_exception(
-                        ValueError(
-                            "Future objects are not supported as input if disable_dependencies=True."
-                        )
-                    )
-                task_dependent_lst = []
+                if future_wait_key_lst:
+                    task_dict["future"].set_exception(
+                        ValueError(
+                            "Future objects are not supported when disable_dependencies=True."
+                        )
+                    )
+                    future_queue.task_done()
+                    continue  # Do NOT schedule the invalid task
+                task_dependent_lst = []

Avoids wasted work and clearer semantics.

🧹 Nitpick comments (3)
tests/test_cache_fileexecutor_serial.py (1)

45-51: Consider validating that the custom cache key was actually used

The new test checks that the task completes and returns the correct value, but it does not inspect the cache directory to ensure the file was written under the expected key (abc). A quick extra assertion will protect against regressions where cache_key is silently ignored.

@@
     with FileTaskScheduler(execute_function=execute_in_subprocess) as exe:
         fs1 = exe.submit(my_funct, 1, b=2, resource_dict={"cache_key": "abc"})
@@
         self.assertTrue(fs1.done())
+        # Optional: verify the cache folder was created with the explicit key
+        cache_path = os.path.abspath(os.path.join("cache", "abc", "cache.h5out"))
+        self.assertTrue(os.path.exists(cache_path))
tests/test_singlenodeexecutor_cache.py (1)

37-50: Minor pythonic nit: prefer generator expressions inside `sum()`

The intermediate lists created in the two sum([...]) calls are unnecessary; using generator expressions is more memory-efficient and makes the intent clearer.

-        self.assertEqual(sum([c["output"] for c in cache_lst]), sum(result_lst))
+        self.assertEqual(sum(c["output"] for c in cache_lst), sum(result_lst))
@@
-            sum([sum(c["input_args"][0]) for c in cache_lst]), sum(result_lst)
+            sum(sum(c["input_args"][0]) for c in cache_lst), sum(result_lst)
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 47-47: Consider using a generator instead 'sum(c['output'] for c in cache_lst)'

(R1728)


[refactor] 49-49: Consider using a generator instead 'sum(sum(c['input_args'][0]) for c in cache_lst)'

(R1728)

executorlib/task_scheduler/file/shared.py (1)

115-121: Simplify existence check – avoid two os.listdir calls

Listing the whole directory tree on every task iteration is O(n) per call and becomes expensive for large caches. A direct os.path.exists on the expected output file is faster and simpler.

-                if not (
-                    task_key in os.listdir(cache_directory)
-                    and "cache.h5out"
-                    in os.listdir(os.path.join(cache_directory, task_key))
-                ):
+                output_file = os.path.join(cache_directory, task_key, "cache.h5out")
+                if not os.path.exists(output_file):
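The suggested direct check can be exercised standalone (`cache_hit` is a hypothetical helper illustrating the idea, not part of executorlib):

```python
import os
import tempfile

def cache_hit(cache_directory, task_key, file_name="cache.h5out"):
    # one stat call instead of two os.listdir scans per task iteration
    return os.path.exists(os.path.join(cache_directory, task_key, file_name))

with tempfile.TemporaryDirectory() as cache_directory:
    os.makedirs(os.path.join(cache_directory, "abc"))
    open(os.path.join(cache_directory, "abc", "cache.h5out"), "w").close()
    hit = cache_hit(cache_directory, "abc")       # file present
    miss = cache_hit(cache_directory, "missing")  # no such task_key
```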
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06f85c8 and 9bb18a0.

📒 Files selected for processing (5)
  • executorlib/standalone/serialize.py (3 hunks)
  • executorlib/task_scheduler/file/shared.py (2 hunks)
  • executorlib/task_scheduler/interactive/shared.py (6 hunks)
  • tests/test_cache_fileexecutor_serial.py (1 hunks)
  • tests/test_singlenodeexecutor_cache.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • executorlib/standalone/serialize.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/test_singlenodeexecutor_cache.py (3)
executorlib/executor/single.py (1)
  • SingleNodeExecutor (20-180)
executorlib/executor/base.py (1)
  • submit (53-87)
executorlib/standalone/cache.py (1)
  • get_cache_data (16-42)
🪛 Pylint (3.3.7)
tests/test_singlenodeexecutor_cache.py

[refactor] 47-47: Consider using a generator instead 'sum(c['output'] for c in cache_lst)'

(R1728)


[refactor] 49-49: Consider using a generator instead 'sum(sum(c['input_args'][0]) for c in cache_lst)'

(R1728)

🔇 Additional comments (2)
executorlib/task_scheduler/file/shared.py (1)

105-113: cache_key extraction looks good
The key is removed from the resource dictionary before hashing, preventing it from influencing the function hash. 👍

executorlib/task_scheduler/interactive/shared.py (1)

135-158: Duplicate cache key in both cache_key arg and resource_dict

serialize_funct_h5 receives cache_key=cache_key while the original (possibly unchanged) "cache_key" remains inside resource_dict. If the serializer also hashes the dictionary, this duplicate information could change the hash or trigger unexpected behaviour.

Recommend removing the key from task_dict["resource_dict"] before the call:

res_dict = task_dict.get("resource_dict", {}).copy()
cache_key = res_dict.pop("cache_key", cache_key)
...
resource_dict=res_dict,
cache_key=cache_key,

Comment on lines +44 to +47
cache_lst = get_cache_data(cache_directory=cache_directory)
for entry in cache_lst:
    self.assertTrue("same" in entry['filename'])
self.assertEqual(sum([c["output"] for c in cache_lst]), sum(result_lst))

🛠️ Refactor suggestion

Potential false-negative when unrelated cache entries exist

assertTrue("same" in entry["filename"]) will fail if any cache entry does not contain “same”.
Because other tests in this class write to the same cache_directory, this assertion will intermittently fail unless tearDown always removes all prior files and the OS guarantees ordering.

To reduce brittleness:

self.assertTrue(all("same" in e["filename"] for e in cache_lst))

or filter the list for the keys you expect.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 47-47: Consider using a generator instead 'sum(c['output'] for c in cache_lst)'

(R1728)

🤖 Prompt for AI Agents
In tests/test_singlenodeexecutor_cache.py around lines 44 to 47, the assertion
self.assertTrue("same" in entry['filename']) can cause false negatives if any
cache entry does not contain "same" due to other tests writing to the same
cache_directory. To fix this, replace the assertion with one that checks all
entries, such as using self.assertTrue(all("same" in e["filename"] for e in
cache_lst)), or filter cache_lst to only include expected entries before
asserting, ensuring the test only verifies relevant cache entries.

Comment on lines 25 to 27
    cache_key: Optional[str] = None,
    queue_join_on_shutdown: bool = True,
    **kwargs,

🛠️ Refactor suggestion

Global cache_key parameter may clash with per-task keys

execute_tasks now takes a single cache_key that is forwarded to every task, overwriting any per-task value that might be present inside task_dict["resource_dict"].
This is inconsistent with the file-based scheduler (which honours per-task keys) and can lead to collisions when multiple different tasks are queued.

Consider removing the top-level parameter and relying solely on resource_dict["cache_key"], or merge logic like:

cache_key = task_dict.get("resource_dict", {}).pop("cache_key", None) or cache_key
🤖 Prompt for AI Agents
In executorlib/task_scheduler/interactive/shared.py around lines 25 to 27, the
global cache_key parameter is overriding per-task cache_key values, causing
potential collisions. Remove the global cache_key parameter and update the code
to use the cache_key from each task's resource_dict instead, or implement logic
to merge them by prioritizing the per-task cache_key if present, falling back to
the global one only if necessary.

@liamhuber (Member) commented:

> @liamhuber After some initial bugs, the CI now seems to pass. Can you check if it resolves your issue?

@jan-janssen, brilliant, yes, this works 🚀

@liamhuber (Member) commented:

One nice-to-have would be if cache_key could also interpret paths, so that I can manually nest a cached result inside sub-folders of the cache_directory. E.g.

import pandas as pd

from executorlib import SingleNodeExecutor, get_cache_data

def the_task(x, y):
    return x + y

cache_directory="./cache"

with SingleNodeExecutor(cache_directory=cache_directory) as exe:
    f = exe.submit(the_task, 1, 2, resource_dict={"cache_key": "my_key/subdir"})

cache_lst = get_cache_data(cache_directory=cache_directory)
pd.DataFrame(cache_lst)

This nicely creates ./cache/my_key/subdir/cache.h5out, but get_cache_data doesn't see it, and if I re-execute I get ValueError: Unable to synchronously create dataset (name already exists). This makes perfect sense, because I'm sure it's looking directly and exclusively in cache_directory where it correctly does not find anything.

This is merely a nice-to-have though, and if it never materializes I'll still live. It should probably be done in a second PR anyway, one that unifies the optional path-like treatment of cache_directory and cache_key.
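The recursive lookup that would let get_cache_data see nested keys can be sketched with os.walk; the `.h5out` suffix is taken from the example above, and `find_cache_files` is an illustrative name rather than the executorlib API:

```python
import os

def find_cache_files(cache_directory, suffix=".h5out"):
    # Walk the cache directory recursively so nested keys like
    # "my_key/subdir" are discovered alongside flat entries.
    cache_directory_abs = os.path.abspath(cache_directory)
    return [
        os.path.join(dirpath, file_name)
        for dirpath, _, file_names in os.walk(cache_directory_abs)
        for file_name in file_names
        if file_name.endswith(suffix)
    ]
```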

@jan-janssen
Member Author

I hope these changes will be easier once #681 is merged, so I have marked the pull request as draft for now.

@jan-janssen jan-janssen marked this pull request as ready for review June 14, 2025 21:41
@jan-janssen
Member Author

@liamhuber I updated the pull request to support hierarchical keys and added the corresponding tests. Can you quickly validate that these changes are compatible with your interface in pyiron_workflow? Then I will merge them and release a new executorlib version.

Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/test_singlenodeexecutor_cache.py (1)

37-47: Assertion can still false-negative when unrelated cache entries slip in

The loop asserts on every entry, which fails as soon as one unrelated file appears (e.g. from a parallel test run). A safer one-liner:

-        for entry in cache_lst:
-            self.assertTrue("same" in entry['filename'])
+        self.assertTrue(all("same" in e["filename"] for e in cache_lst))

This also sidesteps the need for the explicit loop.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 47-47: Consider using a generator instead 'sum(c['output'] for c in cache_lst)'

(R1728)

🧹 Nitpick comments (3)
executorlib/task_scheduler/file/hdf.py (1)

20-23: Doc-string and path handling slightly out of sync

Nice touch making dump() resilient to missing parent directories.
Two small follow-ups:

  1. The doc-string still claims file_name “as absolute path” even though you now transparently accept relative paths – worth relaxing that wording.
  2. If an empty string is ever passed, os.path.abspath("") resolves to the current working directory; os.makedirs("", …) would then raise FileNotFoundError. A quick guard would avoid surprising callers.
-        file_name_abs = os.path.abspath(file_name)
+        if not file_name:
+            raise ValueError("file_name must be non-empty.")
+        file_name_abs = os.path.abspath(file_name)
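The empty-string edge case is easy to confirm: os.path.abspath("") resolves to the current working directory, so a guard before os.makedirs is cheap insurance. A standalone check, with `safe_parent_dir` as an illustrative name for the guarded path handling suggested above:

```python
import os

# os.path.abspath("") silently resolves to the current working directory,
# which is why an explicit non-empty check is worthwhile.
print(os.path.abspath("") == os.getcwd())  # True

def safe_parent_dir(file_name):
    # Sketch of the guarded path handling; raises early instead of letting
    # os.makedirs fail on a surprising path later.
    if not file_name:
        raise ValueError("file_name must be non-empty.")
    return os.path.dirname(os.path.abspath(file_name))
```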
executorlib/standalone/cache.py (1)

16-31: Graceful handling of a missing cache directory

get_cache_files() now walks the directory tree – great!
Minor edge case: when cache_directory does not exist, os.walk() silently returns nothing and callers see an empty list. Downstream code (e.g. get_cache_data) interprets that as “no cached results”, which is fine but different from the previous behaviour that raised FileNotFoundError.

If you want to preserve the old contract (and catch typos early) consider an explicit existence check:

+    if not os.path.isdir(cache_directory_abs):
+        raise FileNotFoundError(cache_directory_abs)
tests/test_singlenodeexecutor_cache.py (1)

47-50: Prefer generator expressions to avoid temporary lists

Pylint hint R1728 applies here as well – no need to materialise an intermediary list:

-        self.assertEqual(sum([c["output"] for c in cache_lst]), sum(result_lst))
+        self.assertEqual(sum(c["output"] for c in cache_lst), sum(result_lst))

Same for the next assertion. Purely cosmetic but trims memory and keeps the style consistent.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 47-47: Consider using a generator instead 'sum(c['output'] for c in cache_lst)'

(R1728)


[refactor] 49-49: Consider using a generator instead 'sum(sum(c['input_args'][0]) for c in cache_lst)'

(R1728)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e82d27b and adb0e9f.

📒 Files selected for processing (6)
  • executorlib/standalone/cache.py (2 hunks)
  • executorlib/task_scheduler/file/hdf.py (2 hunks)
  • executorlib/task_scheduler/file/shared.py (3 hunks)
  • executorlib/task_scheduler/interactive/shared.py (7 hunks)
  • tests/test_cache_fileexecutor_serial.py (1 hunks)
  • tests/test_singlenodeexecutor_cache.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/test_cache_fileexecutor_serial.py
  • executorlib/task_scheduler/file/shared.py
  • executorlib/task_scheduler/interactive/shared.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/test_singlenodeexecutor_cache.py (3)
executorlib/executor/single.py (1)
  • SingleNodeExecutor (20-184)
executorlib/executor/base.py (1)
  • submit (53-87)
executorlib/standalone/cache.py (1)
  • get_cache_data (33-56)

Comment on lines +47 to +55
for file_name in get_cache_files(cache_directory=cache_directory):
with h5py.File(file_name, "r") as hdf:
file_content_dict = {
key: cloudpickle.loads(np.void(hdf["/" + key]))
for key in group_dict.values()
if key in hdf
}
file_content_dict["filename"] = file_name
file_lst.append(file_content_dict)
Contributor
🛠️ Refactor suggestion

I/O errors during cache scan bubble up unhandled

While iterating over the returned file list, any corrupted / concurrently-written HDF5 file will raise inside h5py.File(…), aborting the whole scan.
Depending on the executor’s tolerance for partial cache availability you may want to continue past unreadable files:

for file_name in get_cache_files(cache_directory):
    try:
        with h5py.File(file_name, "r") as hdf:
            ...
    except (OSError, IOError) as exc:
        # log.debug("Skipping unreadable cache file %s: %s", file_name, exc)
        continue
🤖 Prompt for AI Agents
In executorlib/standalone/cache.py around lines 47 to 55, the code currently
opens HDF5 files without handling I/O errors, causing the entire cache scan to
abort if a file is corrupted or concurrently written. To fix this, wrap the
h5py.File opening and reading logic inside a try-except block that catches
OSError and IOError exceptions. On catching these exceptions, log a debug
message indicating the unreadable file and continue to the next file, allowing
partial cache availability without aborting the scan.
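The skip-unreadable-files pattern generalizes beyond h5py. A stdlib-only sketch of the same resilience, where `reader` stands in for the h5py-based loading logic and the names are hypothetical:

```python
import logging

log = logging.getLogger(__name__)

def read_cache_entries(file_names, reader):
    # Continue past unreadable files instead of aborting the whole scan,
    # so one corrupted cache entry does not hide all the others.
    entries = []
    for file_name in file_names:
        try:
            entries.append(reader(file_name))
        except OSError as exc:
            log.debug("Skipping unreadable cache file %s: %s", file_name, exc)
    return entries
```

h5py raises OSError on corrupted or concurrently-written files, so catching OSError (IOError is an alias in Python 3) covers the case described above.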

Development

Successfully merging this pull request may close these issues.

[Feature] expose a keyword for manual cache key