-
Notifications
You must be signed in to change notification settings - Fork 4
[Feature] More efficient cache directory structure #681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis change refactors the cache file naming and directory structure across multiple modules. It replaces nested directories and fixed filenames with a flat directory structure, using task-key-based filenames with Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant TaskScheduler
participant CacheDirectory
participant Backend
User->>TaskScheduler: Submit task(s) with cache enabled
TaskScheduler->>CacheDirectory: Check for {task_key}_o.h5
alt Cache hit
CacheDirectory-->>TaskScheduler: Return cached result
else Cache miss
TaskScheduler->>CacheDirectory: Write {task_key}_i.h5
TaskScheduler->>Backend: Execute task
Backend->>CacheDirectory: Write {task_key}_r.h5, then rename to {task_key}_o.h5
Backend-->>TaskScheduler: Return result
end
TaskScheduler-->>User: Return result
Possibly related PRs
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (11)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
executorlib/standalone/cache.py (1)
31-33
: Filter withstr.endswith
+isfile
to avoid opening non-files
task_key[-5:] == "_o.h5"
works but is brittle and will happily treat sub-directories namedfoo_o.h5
as files, causingh5py.File(...)
to raise.- file_name = os.path.join(cache_directory, task_key) - if task_key[-5:] == "_o.h5": + if task_key.endswith("_o.h5"): + file_name = os.path.join(cache_directory, task_key) + if not os.path.isfile(file_name): + continue # skip dirs / special filestests/test_cache_backend_execute.py (1)
66-68
: Remove stale nested-directory creationThe new flat cache layout no longer needs
<cache>/<task_key>/
.
Leaving thisos.makedirs
is harmless but clutters the tree and could hide pathing bugs later.- file_name = os.path.join(cache_directory, task_key + "_i.h5") - os.makedirs(os.path.join(cache_directory, task_key), exist_ok=True) + file_name = os.path.join(cache_directory, f"{task_key}_i.h5")executorlib/task_scheduler/file/backend.py (1)
45-46
: Derivefile_name_out
more robustly
os.path.splitext(file_name)[0][:-2]
silently chops two characters, assuming the suffix is always_i
.
If the naming ever changes (e.g._in
), this will mis-slice.-from pathlib import Path -file_name_out = os.path.splitext(file_name)[0][:-2] +from pathlib import Path +file_name_out = Path(file_name).with_suffix("").as_posix().removesuffix("_i")
Path.removesuffix
(Python ≥ 3.11) or manualif base.endswith("_i"):
guards make the intention explicit.
This also improves readability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
executorlib/standalone/cache.py
(1 hunks)executorlib/task_scheduler/file/backend.py
(1 hunks)executorlib/task_scheduler/file/shared.py
(3 hunks)executorlib/task_scheduler/interactive/shared.py
(1 hunks)tests/test_cache_backend_execute.py
(8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
tests/test_cache_backend_execute.py (2)
executorlib/task_scheduler/file/hdf.py (1)
get_runtime
(74-88)executorlib/task_scheduler/file/shared.py (1)
FutureItem
(14-49)
executorlib/task_scheduler/file/backend.py (1)
executorlib/task_scheduler/file/hdf.py (1)
dump
(10-25)
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
- GitHub Check: unittest_mpich (macos-latest, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
- GitHub Check: unittest_openmpi (macos-latest, 3.13)
- GitHub Check: unittest_flux_openmpi
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
- GitHub Check: unittest_flux_mpich
- GitHub Check: unittest_win
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
- GitHub Check: notebooks_integration
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
- GitHub Check: unittest_old
os.makedirs(cache_directory, exist_ok=True) | ||
file_name = os.path.join(cache_directory, task_key + "_o.h5") | ||
if task_key + "_o.h5" not in os.listdir(cache_directory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Prefer os.path.exists()
over directory scans for existence checks
Using task_key + "_o.h5" not in os.listdir(cache_directory)
forces a full directory listing on every call.
os.path.exists(file_name)
(or Path(file_name).exists()
) is:
- O(1) rather than O(n) on large cache dirs.
- Immune to race conditions where a file is created between
listdir()
and the subsequent logic. - Clearer to read.
- file_name = os.path.join(cache_directory, task_key + "_o.h5")
- if task_key + "_o.h5" not in os.listdir(cache_directory):
+ file_name = os.path.join(cache_directory, f"{task_key}_o.h5")
+ if not os.path.exists(file_name):
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
os.makedirs(cache_directory, exist_ok=True) | |
file_name = os.path.join(cache_directory, task_key + "_o.h5") | |
if task_key + "_o.h5" not in os.listdir(cache_directory): | |
os.makedirs(cache_directory, exist_ok=True) | |
file_name = os.path.join(cache_directory, f"{task_key}_o.h5") | |
if not os.path.exists(file_name): |
🤖 Prompt for AI Agents
In executorlib/task_scheduler/interactive/shared.py around lines 154 to 156,
replace the check using 'task_key + "_o.h5" not in os.listdir(cache_directory)'
with 'not os.path.exists(file_name)'. This avoids scanning the entire directory,
improves performance, prevents race conditions, and makes the code clearer by
directly checking the file's existence.
if task_key + "_o.h5" not in os.listdir(cache_directory): | ||
os.makedirs(cache_directory, exist_ok=True) | ||
file_name = os.path.join(cache_directory, task_key + "_i.h5") | ||
dump(file_name=file_name, data_dict=data_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Use exists()
instead of scanning the directory for every task
Similar to the interactive path, scanning os.listdir(cache_directory)
for each task does an unnecessary O(n) walk and risks TOCTOU issues.
- if task_key + "_o.h5" not in os.listdir(cache_directory):
+ output_path = os.path.join(cache_directory, f"{task_key}_o.h5")
+ if not os.path.exists(output_path):
You already compute the same path a few lines later; reuse it to keep the logic DRY.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if task_key + "_o.h5" not in os.listdir(cache_directory): | |
os.makedirs(cache_directory, exist_ok=True) | |
file_name = os.path.join(cache_directory, task_key + "_i.h5") | |
dump(file_name=file_name, data_dict=data_dict) | |
# check for the output file directly rather than listing the whole directory | |
output_path = os.path.join(cache_directory, f"{task_key}_o.h5") | |
if not os.path.exists(output_path): | |
os.makedirs(cache_directory, exist_ok=True) | |
file_name = os.path.join(cache_directory, f"{task_key}_i.h5") | |
dump(file_name=file_name, data_dict=data_dict) |
🤖 Prompt for AI Agents
In executorlib/task_scheduler/file/shared.py around lines 111 to 114, replace
the check that scans os.listdir(cache_directory) with a direct file existence
check using os.path.exists() for the target file to avoid inefficient directory
scans and TOCTOU issues. Also, reuse the computed file path variable instead of
reconstructing it to keep the code DRY.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #681 +/- ##
==========================================
- Coverage 96.78% 96.78% -0.01%
==========================================
Files 29 29
Lines 1307 1306 -1
==========================================
- Hits 1265 1264 -1
Misses 42 42 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary by CodeRabbit
_i.h5
,_o.h5
).