Workaround to fix memory leak in HuggingFace tokenizer #169
Conversation
This looks good and seems like it should work around the memory leak in the transformers library. One larger comment: it looks like most of the changes were needed to accommodate running with either slow or fast tokenizers. If that is orthogonal to the memory leak issue, does it make sense to pull those changes out into a separate PR?
python/dolma/tokenizer/tokenizer.py
```python
if refresh_tokenizer_every:
    # extra copy to prevent memory leaks
    tokens = np.array(tokens, dtype=dtype)
yield TokenizerOutput.from_tokens(id=row.id, src=path, loc=i, tokens=tokens)  # pyright: ignore
i += 1
```
I know this wasn't changed and isn't going to impact anything, but why is this index incremented here?
Fixed! Good catch, even if it does nothing.
That's a valid concern, @drschwenk! However, the GC hack doesn't fully deal with the memory issues, so it is sometimes necessary to use the slow tokenizer instead 😭 hence, all in one PR.
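For context, a minimal sketch of what the fast/slow fallback looks like with the transformers API (the model name here is just a placeholder, not what dolma actually loads):

```python
from transformers import AutoTokenizer

# Fast (Rust-backed) tokenizer: much higher throughput, but affected by the
# memory growth this PR works around.
fast_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)  # "gpt2" is a placeholder

# Slow (pure-Python) tokenizer: avoids the leak at the cost of speed, which is
# why the fallback is kept in the same PR.
slow_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
```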
Adds an option to refresh the tokenizer every few steps to get around the memory leak described here.
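Roughly, the workaround amounts to periodically rebuilding the tokenizer object so memory held by the fast tokenizer can be released. A minimal sketch of the idea follows; the function and parameter names are illustrative, not the actual dolma implementation:

```python
from transformers import AutoTokenizer


def tokenize_rows(rows, model_name: str, refresh_tokenizer_every: int = 100_000):
    """Sketch: re-instantiate the HuggingFace tokenizer every N rows to work
    around memory that the fast tokenizer never gives back."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for i, text in enumerate(rows):
        if refresh_tokenizer_every and i > 0 and i % refresh_tokenizer_every == 0:
            # Drop the old tokenizer and build a fresh one; the extra np.array
            # copy of the token ids in the diff above serves a similar purpose
            # of breaking references into the tokenizer's buffers.
            del tokenizer
            tokenizer = AutoTokenizer.from_pretrained(model_name)
        yield tokenizer.encode(text)
```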