
Added pajama data generation #1

Merged: Vahe1994 merged 7 commits into main from pajama_data_generation on Jan 12, 2024

Conversation

Vahe1994 (Owner)

Added pajama data generation from togethercomputer/RedPajama-Data-1T-Sample

Vahe1994 and others added 2 commits on January 12, 2024 at 19:13:
  Added pajama data generation from togethercomputer/RedPajama-Data-1T-Sample
    Co-authored-by: blackadder <scope.denis@mail.ru>
  Added pajama data generation from togethercomputer/RedPajama-Data-1T-Sample
    Co-authored-by: blackadder <scope.denis@mail.ru>
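
For context, a minimal sketch of what this kind of "pajama data generation" could look like, assuming the Hugging Face datasets and transformers libraries. The dataset name comes from the PR title; the helper name generate_red_pajama_calibration, the sampling loop, and the output path are assumptions rather than the PR's actual script.

```python
# Hedged sketch: sample calibration sequences from RedPajama-Data-1T-Sample
# and save them as a single tensor file (names and defaults are assumptions).
import random

import torch
from datasets import load_dataset
from transformers import AutoTokenizer


def generate_red_pajama_calibration(model_path, nsamples=1024, seqlen=2048, out_path=None):
    """Sample `nsamples` random windows of `seqlen` tokens from RedPajama-Data-1T-Sample."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    data = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")

    samples = []
    while len(samples) < nsamples:
        text = data[random.randint(0, len(data) - 1)]["text"]
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.shape[1] <= seqlen:
            continue  # skip documents shorter than one full context window
        start = random.randint(0, ids.shape[1] - seqlen - 1)
        samples.append(ids[:, start : start + seqlen])

    samples = torch.cat(samples, dim=0)  # shape: [nsamples, seqlen]
    if out_path is not None:
        # e.g. "./data/red_pajama_n=1024_2048_context_length.pth", matching the diff below
        torch.save(samples, out_path)
    return samples
```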
src/datautils.py (Outdated)

+        data = torch.load(f"./data/red_pajama_n=1024_{seqlen}_context_length.pth")[:nsamples]
+    elif name.lower() == "refinedweb":
-    if name.lower() == "refinedweb":
         data = torch.load("./data/refined_web_n=128.pth")[:nsamples]
justheuristic (Collaborator), Jan 12, 2024:

I'd recommend that we remove refinedweb, since we don't even know what it was tokenized with. If the user really wants that dataset, they can set DATASET="./data/refined_web_n=128.pth" directly from the CLI.
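
A hedged sketch of what that CLI override could look like on the loader side; this is hypothetical, since how get_loaders actually resolves file paths is not shown in this excerpt.

```python
import os

import torch

# Hypothetical fragment meant to live inside get_loaders, where `name` and
# `nsamples` are arguments: if `name` is a path to a saved tensor file rather
# than a known alias, load it directly. This is how a user could keep using
# DATASET="./data/refined_web_n=128.pth" after the alias is removed.
if os.path.isfile(name):
    data = torch.load(name)[:nsamples]
else:
    raise ValueError(f"Unknown dataset name or file path: {name}")
```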

Vahe1994 (Owner, author):

will do

Vahe1994 (Owner, author):

Done

Collaborator:

Do we really intend to work with Falcon models?

Collaborator:

We may eventually need them, but the user can still run Falcon via DATASET=./data/... as long as they know what they're doing.

My concern is that someone might accidentally try training llama/mistral by running main.py with "refinedweb", without realizing that file is model-specific.
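
Purely illustrative (not necessarily what was merged): one way to make that failure mode loud instead of silent.

```python
# Illustrative fragment inside get_loaders: if someone still passes the removed
# "refinedweb" alias, fail with a message pointing at the explicit-path escape hatch.
if name.lower() == "refinedweb":
    raise ValueError(
        "'refinedweb' was removed because the cached file is tokenizer-specific; "
        "pass the file explicitly instead, e.g. DATASET=./data/refined_web_n=128.pth"
    )
```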

Vahe1994 and others added 4 commits on January 12, 2024 at 19:22:
  Co-authored-by: blackadder <scope.denis@mail.ru>
  Co-authored-by: blackadder <scope.denis@mail.ru>
  Co-authored-by: blackadder <scope.denis@mail.ru>
  Co-authored-by: blackadder <scope.denis@mail.ru>
src/datautils.py (Outdated)

@@ -220,6 +237,8 @@ def get_loaders(name, nsamples=128, seed=0, seqlen=2048, eval_mode=False, model_

     if name.lower() == "wikitext2":
         data = get_wikitext2(nsamples, seqlen, tokenizer, eval_mode=eval_mode)
+    elif name.lower() == "pajama":
+        data = get_red_pajama(nsamples, seqlen, tokenizer)
Collaborator:

please assert eval_mode here

Vahe1994 (Owner, author):

Done, thank you
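
For reference, one plausible shape of the requested assert (the exact condition and message are assumptions, not a quote of the merged code):

```python
elif name.lower() == "pajama":
    # RedPajama here is calibration data; make accidental use in eval mode explicit.
    assert not eval_mode, "pajama data is only supported in train (calibration) mode"
    data = get_red_pajama(nsamples, seqlen, tokenizer)
```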

1 more commit was added:
  Co-authored-by: blackadder <scope.denis@mail.ru>
Vahe1994 merged commit 0671bc7 into main on Jan 12, 2024.
2 checks passed.
Vahe1994 deleted the pajama_data_generation branch on January 12, 2024 at 16:48.