Workaround to fix memory leak in HuggingFace tokenizer #169

Merged
merged 7 commits into main from soldni/tokenizer-leak on Jun 6, 2024

Conversation

soldni (Member) commented Jun 4, 2024

Adds an option to refresh the tokenizer every few steps to get around the memory leak described here.
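
Roughly, the refresh pattern looks like this (a minimal sketch, not the actual code in this PR; the function name, the AutoTokenizer usage, and the dtype default are assumptions for the example, only refresh_tokenizer_every comes from the PR):

from transformers import AutoTokenizer
import numpy as np

def tokenize_rows(texts, model_name, refresh_tokenizer_every=0, dtype=np.uint16):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for i, text in enumerate(texts):
        if refresh_tokenizer_every and i > 0 and i % refresh_tokenizer_every == 0:
            # re-instantiate the tokenizer so memory accumulated by the
            # Rust-backed fast tokenizer is released
            del tokenizer
            tokenizer = AutoTokenizer.from_pretrained(model_name)
        yield np.array(tokenizer.encode(text), dtype=dtype)

Re-creating the tokenizer this way bounds the leaked memory to whatever accumulates within one refresh window, at the cost of periodically paying the tokenizer load time again.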

@soldni soldni requested a review from drschwenk June 4, 2024 03:55
drschwenk (Contributor) left a comment

This looks good/ seems like it should work around the memory leak in the transformers library. One larger comment- it seems like most of the changes were needed to accommodate running with slow/fast tokenizers. If this is an orthogonal change to the memory leak issue, does it make sense to pull these changes out into a separate PR?

if refresh_tokenizer_every:
    # extra copy to prevent memory leaks
    tokens = np.array(tokens, dtype=dtype)
yield TokenizerOutput.from_tokens(id=row.id, src=path, loc=i, tokens=tokens)  # pyright: ignore
i += 1
drschwenk (Contributor) commented on this diff

I know this wasn't changed and isn't going to impact anything, but why is this index incremented here?

soldni (Member, Author)

fixed! good catch even if it does nothing.

soldni (Member, Author) commented Jun 6, 2024

> This looks good/ seems like it should work around the memory leak in the transformers library. One larger comment- it seems like most of the changes were needed to accommodate running with slow/fast tokenizers. If this is an orthogonal change to the memory leak issue, does it make sense to pull these changes out into a separate PR?

That's a valid concern, @drschwenk! However, the GC hack doesn't fully deal with the memory issues, so it's sometimes necessary to use the slow tokenizer instead 😭 hence, all in one PR.
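
For reference, the slow tokenizer path in transformers is selected with use_fast=False (a minimal sketch; the model name is just an example):

from transformers import AutoTokenizer

# use_fast=False loads the pure-Python "slow" tokenizer, which avoids the
# fast (Rust-backed) tokenizer's memory growth at the cost of throughput
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)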

@soldni soldni merged commit 64886d9 into main Jun 6, 2024
14 checks passed
@soldni soldni deleted the soldni/tokenizer-leak branch June 6, 2024 18:12