Feat: Clear cache if optimized dataset changes #308
Conversation
for more information, see https://pre-commit.ci
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             main    #308   +/-   ##
======================================
  Coverage        ?     78%
======================================
  Files           ?      34
  Lines           ?    4995
  Branches        ?       0
======================================
  Hits            ?    3893
  Misses          ?    1102
  Partials        ?       0
CI testing takes almost forever!
# download index.json file and read last_updation_timestamp
with tempfile.TemporaryDirectory() as tmp_directory:
    temp_index_filepath = os.path.join(tmp_directory, _INDEX_FILENAME)
    downloader = get_downloader_cls(input_dir.url, input_dir.path, [])  # type: ignore
Could we use tmp_directory as the cache directory here instead of input_dir.path?
Reason: the downloader might receive None as the cache directory. If the downloader uses this cache directory and finds it empty, it defaults to using the standard downloader cache, which could lead to a FileNotFoundError.
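The suggestion above can be sketched as follows — a minimal, self-contained version in which `download_index` is a hypothetical stand-in for the real downloader call and `updated_at` is an assumed key name in index.json (neither is confirmed by the PR):

```python
import json
import os
import tempfile

_INDEX_FILENAME = "index.json"


def read_updated_at(download_index) -> str:
    """Download index.json into a throwaway directory and read its timestamp.

    `download_index(dest_path)` is a hypothetical stand-in for the real
    downloader. Using a fresh TemporaryDirectory (rather than input_dir.path,
    which may be None) avoids falling back to the shared downloader cache.
    """
    with tempfile.TemporaryDirectory() as tmp_directory:
        temp_index_filepath = os.path.join(tmp_directory, _INDEX_FILENAME)
        download_index(temp_index_filepath)
        with open(temp_index_filepath) as f:
            index = json.load(f)
        # "updated_at" is an assumed field name, for illustration only
        return str(index.get("updated_at", ""))
```

The temporary directory is cleaned up automatically when the `with` block exits, so nothing leaks into any user-visible cache.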
It seems a test is hanging and there is no timeout.
Looks great. Some comments.
# for backward compatibility, use the input_dir for hashing (if no timestamp is found)
last_updation_timestamp = input_dir if input_dir else ""

hash_object = hashlib.md5((last_updation_timestamp).encode())  # noqa: S324
I wonder if we should combine input_dir / last_updated_at and delete any old last_updated_at instead. This would make things more deterministic and avoid possible issues if 2 datasets had the same exact updated_at.
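One way to realize that suggestion is to hash the remote location together with the timestamp, so two datasets that happen to share an updated_at still get distinct cache entries. A minimal sketch — the function name and separator choice are assumptions, not the PR's API:

```python
import hashlib


def cache_key(remote_url: str, updated_at: str) -> str:
    # Combine both inputs; the separator prevents ambiguous concatenations
    # such as ("ab", "c") vs ("a", "bc") producing the same key.
    combined = f"{remote_url}\n{updated_at}"
    return hashlib.md5(combined.encode()).hexdigest()  # noqa: S324
```

With a combined key, changing either the dataset location or its timestamp invalidates the cache entry, and identical timestamps on different datasets no longer collide.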
if last_updation_timestamp == "":
    # for backward compatibility, use the input_dir for hashing (if no timestamp is found)
    last_updation_timestamp = input_dir if input_dir else ""
last_updation_timestamp = input_dir if input_dir else ""
You are hashing the entire input_dir, including the cache_dir, which can be changed by the user. So you are effectively creating non-deterministic hashes.
But this is just for backward compatibility. The original code was:
hash_object = hashlib.md5((input_dir or "").encode())
There's already input_dir or "". So, if last_timestamp is empty, just do as before.
Am I missing something?
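To make the determinism concern concrete: the legacy fallback hashes whatever string input_dir happens to be, so the same remote dataset cached under two different user-chosen local paths yields two different hashes. A small illustration (the paths are made up):

```python
import hashlib


def legacy_hash(input_dir: str) -> str:
    # Mirrors the original fallback: hashlib.md5((input_dir or "").encode())
    return hashlib.md5((input_dir or "").encode()).hexdigest()  # noqa: S324


# The same dataset cached under two user-chosen directories:
hash_a = legacy_hash("/home/alice/.cache/dataset")
hash_b = legacy_hash("/home/bob/.cache/dataset")
# hash_a differs from hash_b, even though the remote dataset is identical.
```

This is exactly why hashing a remote-side value such as updated_at (or the resolved URL) is more deterministic than hashing a local path.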
Co-authored-by: thomas chaton <thomas@grid.ai>
updated_at = input_dir if input_dir else ""

dir_url_hash = hashlib.md5((resolved_input_dir.url or "").encode()).hexdigest()  # noqa: S324
updated_at_hash = hashlib.md5((updated_at).encode()).hexdigest()  # noqa: S324
Maybe we don't even need to hash last_updated. This would make it clearer which one we are using.
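Following that idea, the timestamp could be kept verbatim in the cache directory name and only the URL hashed. A sketch under assumed names (not the PR's final implementation):

```python
import hashlib


def cache_dir_name(resolved_url: str, updated_at: str) -> str:
    # Hash only the URL (which may be long or contain path separators),
    # but keep updated_at readable, so a human inspecting the cache can
    # tell which dataset version an entry belongs to.
    dir_url_hash = hashlib.md5((resolved_url or "").encode()).hexdigest()  # noqa: S324
    return f"{dir_url_hash}-{updated_at}" if updated_at else dir_url_hash
```

An MD5 hexdigest is always 32 characters, so the directory name stays a fixed-width prefix followed by the plain timestamp.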
Looks great!
Before submitting
What does this PR do?
Fixes #292.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃