Feat: Clear cache if optimized dataset changes #308


Merged: 26 commits merged into Lightning-AI:main on Aug 7, 2024

Conversation

deependujha
Collaborator

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #292.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@deependujha deependujha marked this pull request as draft August 6, 2024 09:32

codecov bot commented Aug 6, 2024

Codecov Report

Attention: Patch coverage is 87.17949% with 5 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@428752d). Learn more about missing BASE report.

Additional details and impacted files
@@          Coverage Diff          @@
##             main   #308   +/-   ##
=====================================
  Coverage        ?    78%           
=====================================
  Files           ?     34           
  Lines           ?   4995           
  Branches        ?      0           
=====================================
  Hits            ?   3893           
  Misses          ?   1102           
  Partials        ?      0           

@deependujha deependujha marked this pull request as ready for review August 6, 2024 11:32
@deependujha
Collaborator Author

CI testing takes almost forever!

# download index.json file and read last_updation_timestamp
with tempfile.TemporaryDirectory() as tmp_directory:
    temp_index_filepath = os.path.join(tmp_directory, _INDEX_FILENAME)
    downloader = get_downloader_cls(input_dir.url, input_dir.path, [])  # type: ignore
Collaborator
Could we use tmp_directory as the cache directory here instead of input_dir.path?

Reason: The downloader might receive None as the cache directory. If the downloader uses this cache directory and finds it empty, it defaults to using the standard downloader cache, which could lead to a FileNotFoundError.
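The suggested fix could look something like the sketch below. This is a hypothetical helper, not the merged code: the `download_file` method and the `updated_at` key in `index.json` are assumptions based on this thread. The point is that the temporary directory, not the user-configurable cache directory, is the download target.

```python
import json
import os
import tempfile

_INDEX_FILENAME = "index.json"  # assumed constant name from this PR


def read_updated_at(downloader, remote_index_path: str) -> str:
    """Download index.json into a throwaway directory and read its timestamp.

    `downloader` is assumed to expose a `download_file(remote, local)` method.
    Using the temporary directory as the download target avoids relying on a
    cache directory that may be None or empty.
    """
    with tempfile.TemporaryDirectory() as tmp_directory:
        temp_index_filepath = os.path.join(tmp_directory, _INDEX_FILENAME)
        downloader.download_file(remote_index_path, temp_index_filepath)
        with open(temp_index_filepath) as f:
            index = json.load(f)
        # fall back to an empty string when the key is absent (older datasets)
        return index.get("updated_at", "")
```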

Collaborator

I also noticed that on the first run, the index file gets downloaded twice.

To reproduce, we could test with an S3-uploaded dataset. In my case, I was using Hugging Face (this feature is not available yet).

@tchaton
Collaborator

tchaton commented Aug 6, 2024


It seems a test is hanging and there is no timeout.
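A common guard against tests hanging CI indefinitely (an assumption here; this repo's CI may use a different mechanism) is the pytest-timeout plugin, configured once in pytest.ini:

```ini
# pytest.ini -- requires the pytest-timeout plugin (pip install pytest-timeout)
[pytest]
timeout = 120            ; fail any single test running longer than 120 seconds
timeout_method = thread  ; interrupt via a watchdog thread (works on all platforms)
```

With this in place, a hung test fails with a timeout traceback instead of stalling the whole run.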


@tchaton tchaton left a comment


Looks great. Some comments.

# for backward compatibility, use the input_dir for hashing (if no timestamp is found)
last_updation_timestamp = input_dir if input_dir else ""

hash_object = hashlib.md5((last_updation_timestamp).encode()) # noqa: S324
Collaborator

I wonder if we shouldn't combine input_dir / last_updated_at and delete any old last_updated_at instead. This would make things more deterministic and avoid possible issues if two datasets had the exact same updated_at.
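The reviewer's idea can be sketched as follows. This is an illustration of combining both values into one hash, not the merged implementation; the function name is hypothetical.

```python
import hashlib


def cache_dir_name(input_dir: str, updated_at: str) -> str:
    """Combine the dataset location and its last-update timestamp into one
    deterministic cache-directory name (a sketch of the reviewer's suggestion).
    """
    # Hashing both values together means two datasets that happen to share
    # an updated_at value still get distinct cache directories.
    payload = f"{input_dir}:{updated_at}"
    return hashlib.md5(payload.encode()).hexdigest()  # noqa: S324
```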


if last_updation_timestamp == "":
    # for backward compatibility, use the input_dir for hashing (if no timestamp is found)
    last_updation_timestamp = input_dir if input_dir else ""
Collaborator

Suggested change
last_updation_timestamp = input_dir if input_dir else ""

You are hashing the entire input_dir including the cache_dir which can be changed by the user. So you are effectively creating non-deterministic hashes.

Collaborator Author

But this is just for backward compatibility.

The original code was:

hash_object = hashlib.md5((input_dir or "").encode())

There's already input_dir or "". So, if last_timestamp is empty, just do as before.

Am I missing something?

updated_at = input_dir if input_dir else ""

dir_url_hash = hashlib.md5((resolved_input_dir.url or "").encode()).hexdigest() # noqa: S324
updated_at_hash = hashlib.md5((updated_at).encode()).hexdigest() # noqa: S324
Collaborator

Maybe we don't even need to hash last_updated. This would make it clearer which one we are using.
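Keeping the timestamp unhashed could look like the sketch below (a hypothetical helper illustrating the suggestion, not the merged code): only the URL is hashed, and the raw updated_at stays visible in the directory name.

```python
import hashlib


def cache_dir_components(url: str, updated_at: str) -> str:
    """Sketch of the reviewer's suggestion: hash only the URL and keep the
    raw timestamp readable, so it is obvious which snapshot a cache dir holds.
    """
    dir_url_hash = hashlib.md5((url or "").encode()).hexdigest()  # noqa: S324
    # keep updated_at human-readable instead of hashing it
    return f"{dir_url_hash}-{updated_at or 'unknown'}"
```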


@tchaton tchaton left a comment


Looks great!

@deependujha deependujha merged commit d6f134b into Lightning-AI:main Aug 7, 2024
28 checks passed
@deependujha deependujha deleted the feat/clear-cache branch August 7, 2024 07:57

Successfully merging this pull request may close these issues.

Add some mechanism to clear the cache
3 participants