Added pajama data generation #1
Conversation
…-Sample Co-authored-by: blackadder <scope.denis@mail.ru>
src/datautils.py
Outdated
data = torch.load(f"./data/red_pajama_n=1024_{seqlen}_context_length.pth")[:nsamples] | ||
elif name.lower() == "refinedweb": | ||
|
||
if name.lower() == "refinedweb": | ||
data = torch.load("./data/refined_web_n=128.pth")[:nsamples] |
I'd recommend that we remove refinedweb, since we don't even know what it was tokenized with. If a user really wants that dataset, they can set DATASET="./data/refined_web_n=128.pth" directly from the CLI.
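For reference, a minimal sketch of the suggested fallback, assuming the loader treats any unrecognized dataset name as a file path. The function name and overall structure here are hypothetical, not the project's actual code:

    import os
    import torch

    def load_calibration_data(name, nsamples=128, seqlen=2048):
        if name.lower() == "pajama":
            # Built-in keyword: load the pre-tokenized RedPajama tensor.
            data = torch.load(f"./data/red_pajama_n=1024_{seqlen}_context_length.pth")[:nsamples]
        elif os.path.isfile(name):
            # e.g. DATASET="./data/refined_web_n=128.pth" passed straight from the CLI.
            data = torch.load(name)[:nsamples]
        else:
            raise ValueError(f"Unknown dataset name or missing file: {name}")
        return data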
will do
Done
Do we really intend to work with Falcon models?
We may eventually need them, but the user can still run Falcon via DATASET=./data/... as long as they know what they're doing.
My concern is that someone might accidentally try training Llama/Mistral with main.py "refinedweb" without realizing that dataset is model-specific.
Co-authored-by: blackadder <scope.denis@mail.ru>
src/datautils.py
Outdated
@@ -220,6 +237,8 @@ def get_loaders(name, nsamples=128, seed=0, seqlen=2048, eval_mode=False, model_
    if name.lower() == "wikitext2":
        data = get_wikitext2(nsamples, seqlen, tokenizer, eval_mode=eval_mode)
    elif name.lower() == "pajama":
        data = get_red_pajama(nsamples, seqlen, tokenizer)
please assert eval_mode here
Done, thank you
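For clarity, a minimal sketch of what the requested check might look like, assuming the intent is to forbid eval_mode for the calibration-only pajama set. The signature is abbreviated from the hunk header and the assertion message is an assumption:

    import torch

    def get_red_pajama(nsamples, seqlen, tokenizer):
        # Stand-in for the loader shown in the diff above.
        return torch.load(f"./data/red_pajama_n=1024_{seqlen}_context_length.pth")[:nsamples]

    def get_loaders(name, nsamples=128, seed=0, seqlen=2048, eval_mode=False, tokenizer=None):
        if name.lower() == "pajama":
            # Requested assertion: pajama is calibration data with no eval split,
            # so eval_mode must be off in this branch.
            assert not eval_mode, "pajama is calibration-only; it has no eval split"
            return get_red_pajama(nsamples, seqlen, tokenizer)
        raise ValueError(f"Unknown dataset {name}")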
Added pajama data generation from togethercomputer/RedPajama-Data-1T…
Co-authored-by: blackadder <scope.denis@mail.ru>