
Added pajama data generation #1

Merged: Vahe1994 merged 7 commits into main from pajama_data_generation on Jan 12, 2024

Conversation

Vahe1994 (Owner)

Added pajama data generation from togethercomputer/RedPajama-Data-1T-Sample

Vahe1994 and others added 2 commits on January 12, 2024 at 19:13:
  Added pajama data generation from togethercomputer/RedPajama-Data-1T-Sample
    Co-authored-by: blackadder <scope.denis@mail.ru>
  Added pajama data generation from togethercomputer/RedPajama-Data-1T-Sample
    Co-authored-by: blackadder <scope.denis@mail.ru>
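
For context, a minimal sketch of what this kind of "pajama data generation" could look like, assuming the Hugging Face datasets and transformers libraries. The dataset name comes from the PR title; the helper name generate_red_pajama_calibration, the sampling loop, and the output path are assumptions rather than the PR's actual script.

```python
# Hedged sketch: sample calibration sequences from RedPajama-Data-1T-Sample
# and save them as a single tensor file (names and defaults are assumptions).
import random

import torch
from datasets import load_dataset
from transformers import AutoTokenizer


def generate_red_pajama_calibration(model_path, nsamples=1024, seqlen=2048, out_path=None):
    """Sample `nsamples` random windows of `seqlen` tokens from RedPajama-Data-1T-Sample."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    data = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")

    samples = []
    while len(samples) < nsamples:
        text = data[random.randint(0, len(data) - 1)]["text"]
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.shape[1] <= seqlen:
            continue  # skip documents shorter than one full context window
        start = random.randint(0, ids.shape[1] - seqlen - 1)
        samples.append(ids[:, start : start + seqlen])

    samples = torch.cat(samples, dim=0)  # shape: [nsamples, seqlen]
    if out_path is not None:
        # e.g. "./data/red_pajama_n=1024_2048_context_length.pth", matching the diff below
        torch.save(samples, out_path)
    return samples
```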
src/datautils.py (Outdated)

+        data = torch.load(f"./data/red_pajama_n=1024_{seqlen}_context_length.pth")[:nsamples]
+    elif name.lower() == "refinedweb":
-    if name.lower() == "refinedweb":
         data = torch.load("./data/refined_web_n=128.pth")[:nsamples]
justheuristic (Collaborator), Jan 12, 2024:

I'd recommend that we remove refinedweb, since we don't even know what it was tokenized with. If the user really wants that dataset, they can set DATASET="./data/refined_web_n=128.pth" directly from the CLI.
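
A hedged sketch of what that CLI override could look like on the loader side; this is hypothetical, since how get_loaders actually resolves file paths is not shown in this excerpt.

```python
import os

import torch

# Hypothetical fragment meant to live inside get_loaders, where `name` and
# `nsamples` are arguments: if `name` is a path to a saved tensor file rather
# than a known alias, load it directly. This is how a user could keep using
# DATASET="./data/refined_web_n=128.pth" after the alias is removed.
if os.path.isfile(name):
    data = torch.load(name)[:nsamples]
else:
    raise ValueError(f"Unknown dataset name or file path: {name}")
```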

Vahe1994 (Owner, author):

will do

Vahe1994 (Owner, author):

Done

Collaborator:

Do we really intend to work with Falcon models?

Collaborator:

We may eventually need them, but the user can still run Falcon via DATASET=./data/... as long as they know what they're doing.

My concern is that someone might accidentally try training llama/mistral by running main.py with "refinedweb", without realizing that file is model-specific.
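
Purely illustrative (not necessarily what was merged): one way to make that failure mode loud instead of silent.

```python
# Illustrative fragment inside get_loaders: if someone still passes the removed
# "refinedweb" alias, fail with a message pointing at the explicit-path escape hatch.
if name.lower() == "refinedweb":
    raise ValueError(
        "'refinedweb' was removed because the cached file is tokenizer-specific; "
        "pass the file explicitly instead, e.g. DATASET=./data/refined_web_n=128.pth"
    )
```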

Vahe1994 and others added 4 commits on January 12, 2024 at 19:22:
  Co-authored-by: blackadder <scope.denis@mail.ru>
  Co-authored-by: blackadder <scope.denis@mail.ru>
  Co-authored-by: blackadder <scope.denis@mail.ru>
  Co-authored-by: blackadder <scope.denis@mail.ru>
src/datautils.py (Outdated)

@@ -220,6 +237,8 @@ def get_loaders(name, nsamples=128, seed=0, seqlen=2048, eval_mode=False, model_

     if name.lower() == "wikitext2":
         data = get_wikitext2(nsamples, seqlen, tokenizer, eval_mode=eval_mode)
+    elif name.lower() == "pajama":
+        data = get_red_pajama(nsamples, seqlen, tokenizer)
Collaborator:

please assert eval_mode here

Vahe1994 (Owner, author):

Done, thank you
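
For reference, one plausible shape of the requested assert (the exact condition and message are assumptions, not a quote of the merged code):

```python
elif name.lower() == "pajama":
    # RedPajama here is calibration data; make accidental use in eval mode explicit.
    assert not eval_mode, "pajama data is only supported in train (calibration) mode"
    data = get_red_pajama(nsamples, seqlen, tokenizer)
```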

1 more commit was added:
  Co-authored-by: blackadder <scope.denis@mail.ru>
Vahe1994 merged commit 0671bc7 into main on Jan 12, 2024.
2 checks passed.
Vahe1994 deleted the pajama_data_generation branch on January 12, 2024 at 16:48.