Feat: Clear cache if optimized dataset changes #308


Merged: 26 commits merged into Lightning-AI:main on Aug 7, 2024

Conversation

deependujha
Collaborator

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #292.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@deependujha deependujha marked this pull request as draft August 6, 2024 09:32

codecov bot commented Aug 6, 2024

Codecov Report

Attention: Patch coverage is 87.17949% with 5 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@428752d). Learn more about missing BASE report.

Additional details and impacted files
@@          Coverage Diff          @@
##             main   #308   +/-   ##
=====================================
  Coverage        ?    78%           
=====================================
  Files           ?     34           
  Lines           ?   4995           
  Branches        ?      0           
=====================================
  Hits            ?   3893           
  Misses          ?   1102           
  Partials        ?      0           

@deependujha deependujha marked this pull request as ready for review August 6, 2024 11:32
@deependujha
Collaborator Author

CI testing takes almost forever!

# download index.json file and read last_updation_timestamp
with tempfile.TemporaryDirectory() as tmp_directory:
    temp_index_filepath = os.path.join(tmp_directory, _INDEX_FILENAME)
    downloader = get_downloader_cls(input_dir.url, input_dir.path, [])  # type: ignore
Collaborator
Could we use tmp_directory as the cache directory here instead of input_dir.path?

Reason: The downloader might receive None as the cache directory. If the downloader uses this cache directory and finds it empty, it defaults to using the standard downloader cache, which could lead to a FileNotFoundError.
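The suggested fix could look something like the sketch below. This is a hypothetical helper, not the merged code: the `download_file` method and the `updated_at` key in `index.json` are assumptions based on this thread. The point is that the temporary directory, not the user-configurable cache directory, is the download target.

```python
import json
import os
import tempfile

_INDEX_FILENAME = "index.json"  # assumed constant name from this PR


def read_updated_at(downloader, remote_index_path: str) -> str:
    """Download index.json into a throwaway directory and read its timestamp.

    `downloader` is assumed to expose a `download_file(remote, local)` method.
    Using the temporary directory as the download target avoids relying on a
    cache directory that may be None or empty.
    """
    with tempfile.TemporaryDirectory() as tmp_directory:
        temp_index_filepath = os.path.join(tmp_directory, _INDEX_FILENAME)
        downloader.download_file(remote_index_path, temp_index_filepath)
        with open(temp_index_filepath) as f:
            index = json.load(f)
        # fall back to an empty string when the key is absent (older datasets)
        return index.get("updated_at", "")
```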

Collaborator

I also noticed that on the first run, the index file gets downloaded twice.

To reproduce, we could test with an S3-uploaded dataset. In my case, I was using Hugging Face (this feature is not available yet).

@tchaton
Collaborator

tchaton commented Aug 6, 2024


It seems a test is hanging and there is no timeout.
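A common guard against tests hanging CI indefinitely (an assumption here; this repo's CI may use a different mechanism) is the pytest-timeout plugin, configured once in pytest.ini:

```ini
# pytest.ini -- requires the pytest-timeout plugin (pip install pytest-timeout)
[pytest]
timeout = 120            ; fail any single test running longer than 120 seconds
timeout_method = thread  ; interrupt via a watchdog thread (works on all platforms)
```

With this in place, a hung test fails with a timeout traceback instead of stalling the whole run.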


@tchaton tchaton left a comment


Looks great. Some comments.

# for backward compatibility, use the input_dir for hashing (if no timestamp is found)
last_updation_timestamp = input_dir if input_dir else ""

hash_object = hashlib.md5((last_updation_timestamp).encode()) # noqa: S324
Collaborator

I wonder if we shouldn't combine input_dir / last_updated_at and delete any old last_updated_at instead. This would make things more deterministic and avoid possible issues if two datasets had the exact same updated_at.
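The reviewer's idea can be sketched as follows. This is an illustration of combining both values into one hash, not the merged implementation; the function name is hypothetical.

```python
import hashlib


def cache_dir_name(input_dir: str, updated_at: str) -> str:
    """Combine the dataset location and its last-update timestamp into one
    deterministic cache-directory name (a sketch of the reviewer's suggestion).
    """
    # Hashing both values together means two datasets that happen to share
    # an updated_at value still get distinct cache directories.
    payload = f"{input_dir}:{updated_at}"
    return hashlib.md5(payload.encode()).hexdigest()  # noqa: S324
```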


if last_updation_timestamp == "":
    # for backward compatibility, use the input_dir for hashing (if no timestamp is found)
    last_updation_timestamp = input_dir if input_dir else ""
Collaborator

Suggested change
last_updation_timestamp = input_dir if input_dir else ""

You are hashing the entire input_dir including the cache_dir which can be changed by the user. So you are effectively creating non-deterministic hashes.

Collaborator Author

But this is just for backward compatibility.

The original code was:

hash_object = hashlib.md5((input_dir or "").encode())

There's already input_dir or "". So, if last_timestamp is empty, just do as before.

Am I missing something?

updated_at = input_dir if input_dir else ""

dir_url_hash = hashlib.md5((resolved_input_dir.url or "").encode()).hexdigest() # noqa: S324
updated_at_hash = hashlib.md5((updated_at).encode()).hexdigest() # noqa: S324
Collaborator

Maybe we don't even need to hash last_updated. This would make it clearer which one we are using.
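Keeping the timestamp unhashed could look like the sketch below (a hypothetical helper illustrating the suggestion, not the merged code): only the URL is hashed, and the raw updated_at stays visible in the directory name.

```python
import hashlib


def cache_dir_components(url: str, updated_at: str) -> str:
    """Sketch of the reviewer's suggestion: hash only the URL and keep the
    raw timestamp readable, so it is obvious which snapshot a cache dir holds.
    """
    dir_url_hash = hashlib.md5((url or "").encode()).hexdigest()  # noqa: S324
    # keep updated_at human-readable instead of hashing it
    return f"{dir_url_hash}-{updated_at or 'unknown'}"
```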


@tchaton tchaton left a comment


Looks great!

@deependujha deependujha merged commit d6f134b into Lightning-AI:main Aug 7, 2024
28 checks passed
@deependujha deependujha deleted the feat/clear-cache branch August 7, 2024 07:57

Successfully merging this pull request may close these issues.

Add some mechanism to clear the cache
3 participants