
Update scripts to pre-concat text and tokens #128

Merged · 76 commits · Feb 11, 2023

Conversation

@samhavens (Contributor) commented Feb 7, 2023

Pre-concatenating based on either text or tokens. Fixed the encoding issues; thanks, numpy! There is probably something worth documenting about how to serialize tensors to bytes for streaming...
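(For reference, a minimal sketch of the write-side serialization, assuming int64 token IDs; it mirrors the np.frombuffer read shown further down and is not the actual convert_c4.py code:)

import numpy as np

# Token IDs serialized to raw bytes for the shard writer; np.frombuffer
# on the read side recovers the array (illustrative values only).
tokens = [2, 100, 7, 2048]
token_bytes = np.asarray(tokens, dtype=np.int64).tobytes()

recovered = np.frombuffer(token_bytes, dtype=np.int64)
assert (recovered == np.asarray(tokens, dtype=np.int64)).all()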

Allows for streaming processing and a train_small subset. There is a performance hit (at least on a Mac) when tokenizing: ~80x slower than not tokenizing.

Example commands:

  • No pre-concat: python convert_c4.py --out_root=out/folder
  • Pre-concat text: python convert_c4.py --out_root=out/folder --splits val train_small --concat_text=5000 --bos_text="</s>"
  • Pre-concat tokens: python convert_c4.py --out_root=out/folder --splits val train_small --concat_tokens=4096 --tokenizer=facebook/opt-125m
  • Pre-concat tokens (gpt2 works now): python convert_c4.py --out_root=out/folder --splits val train_small --concat_tokens=4096 --tokenizer=gpt2 --bos_text="<|endoftext|>"

Note that --splits is variadic, so we can't use = with it.
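(That behavior follows from a variadic argparse flag; a minimal sketch, assuming the script defines --splits with nargs='+' — the actual parser may differ:)

import argparse

# A variadic flag collects space-separated values into one list, which is
# why --splits=val train_small would not parse as intended.
parser = argparse.ArgumentParser()
parser.add_argument('--splits', nargs='+')

args = parser.parse_args(['--splits', 'val', 'train_small'])
assert args.splits == ['val', 'train_small']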

Using pre-concatenated text is easy:

from streaming.base import StreamingDataset

ds = StreamingDataset(local="out/folder", split="val")

for sample in ds:
    concatenated_text = sample['text']

For pre-concatenated tokens:

import numpy as np
import torch
from streaming.base import StreamingDataset
from transformers import AutoTokenizer

# this has to be the same tokenizer they used to create the data
# do we want to enforce this?
tokenizer = AutoTokenizer.from_pretrained("your/tokenizer")

ds = StreamingDataset(local="mds-data-folder", split="val")

for sample in ds:
    # note, you need to copy the numpy array because the original is non-writeable
    # and torch does not support non-writeable tensors, so you get a scary warning and
    # if you do try to write to the tensor you get undefined behavior
    tokens = torch.from_numpy(np.frombuffer(sample['tokens'], dtype=np.int64).copy())
    text = tokenizer.decode(tokens)

@samhavens (Contributor, Author) commented Feb 8, 2023

Converting this to use streaming is rough: the signature of the map function changes, and something about the mapping/processing I do breaks the dataset iterator...

EDIT: I do not appreciate huggingface datasets as much now that I have seen its streaming "support"
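(For context, a minimal sketch of the streaming path in question; the dataset name, config, and map call here are assumptions for illustration, not the actual script code:)

from datasets import load_dataset

# streaming=True yields an IterableDataset whose .map() is applied lazily
# during iteration, one place where behavior diverges from a regular
# map-style Dataset.
ds = load_dataset('allenai/c4', 'en', split='validation', streaming=True)
ds = ds.map(lambda batch: {'text': [t.strip() for t in batch['text']]},
            batched=True)

sample = next(iter(ds))
print(sample['text'][:80])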

@samhavens marked this pull request as ready for review on February 8, 2023, 08:07
@dakinggg (Collaborator) left a comment

Haven't reviewed the code yet, just tried it out, and it works OK for me. Confirmed training starts OK with the truncate method and with concat text or no concat. Haven't tried concat tokens because it's not implemented yet. Also haven't checked that the data is "right".

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
@samhavens (Contributor, Author) commented Feb 10, 2023

Request based on some customer issues:

If data_remote is None or equal to data_local, then before even instantiating the Dataset, check whether the split directory is present in the data_local directory, rather than letting streaming attempt to build the dataset and raise a FileNotFoundError about the missing index.json. If the split isn't present, raise a ValueError saying so.
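(A minimal sketch of the requested guard; the helper and variable names are assumptions taken from the description above, not the merged code:)

import os

def check_local_split(data_remote, data_local, split):
    # Fail early with a clear error instead of a FileNotFoundError about
    # index.json from deep inside streaming (hypothetical helper).
    if data_remote is None or data_remote == data_local:
        split_dir = os.path.join(data_local, split)
        if not os.path.isdir(split_dir):
            raise ValueError(
                f'Local directory {data_local} does not contain split {split}')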

@dakinggg (Collaborator) commented
Added the extra checking of split in the local dir.

@abhi-mosaic (Contributor) left a comment

Been following this PR for a while, and the recent changes don't appear to break anything. Punting --concat_text to the next release (or whenever we actually use it) is the smart move.

Let's merge it!
