Workaround to fix memory leak in HuggingFace tokenizer #169
Conversation
This looks good and seems like it should work around the memory leak in the transformers library. One larger comment: it looks like most of the changes were needed to accommodate running with either slow or fast tokenizers. If that is orthogonal to the memory leak issue, does it make sense to pull those changes out into a separate PR?
python/dolma/tokenizer/tokenizer.py
```python
if refresh_tokenizer_every:
    # extra copy to prevent memory leaks
    tokens = np.array(tokens, dtype=dtype)
yield TokenizerOutput.from_tokens(id=row.id, src=path, loc=i, tokens=tokens)  # pyright: ignore
i += 1
```
I know this wasn't changed and isn't going to impact anything, but why is this index incremented here?
Fixed! Good catch, even if it does nothing.
That's a valid concern, @drschwenk! However, the GC hack doesn't fully deal with the memory issues, so it is sometimes necessary to use the slow tokenizer instead 😭 hence, all in one PR.
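For context, a minimal sketch of what the fast/slow fallback looks like with the transformers API (the model name here is just a placeholder, not what dolma actually loads):

```python
from transformers import AutoTokenizer

# Fast (Rust-backed) tokenizer: much higher throughput, but affected by the
# memory growth this PR works around.
fast_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)  # "gpt2" is a placeholder

# Slow (pure-Python) tokenizer: avoids the leak at the cost of speed, which is
# why the fallback is kept in the same PR.
slow_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
```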
Adds an option to refresh the tokenizer every few steps to get around the memory leak described here.
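Roughly, the workaround amounts to periodically rebuilding the tokenizer object so memory held by the fast tokenizer can be released. A minimal sketch of the idea follows; the function and parameter names are illustrative, not the actual dolma implementation:

```python
from transformers import AutoTokenizer


def tokenize_rows(rows, model_name: str, refresh_tokenizer_every: int = 100_000):
    """Sketch: re-instantiate the HuggingFace tokenizer every N rows to work
    around memory that the fast tokenizer never gives back."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for i, text in enumerate(rows):
        if refresh_tokenizer_every and i > 0 and i % refresh_tokenizer_every == 0:
            # Drop the old tokenizer and build a fresh one; the extra np.array
            # copy of the token ids in the diff above serves a similar purpose
            # of breaking references into the tokenizer's buffers.
            del tokenizer
            tokenizer = AutoTokenizer.from_pretrained(model_name)
        yield tokenizer.encode(text)
```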