Update scripts to pre-concat text and tokens #128
Conversation
Really don't know what is getting messed up in the encoding...
Converting this to use streaming is rough: the signature of the map function changes, and something about the mapping/processing I do breaks the dataset iterator... EDIT: I do not appreciate huggingface
Haven't reviewed the code yet, just tried it out, and it works OK for me. Confirmed training starts OK with the truncate method and with concat-text or no concat. Haven't tried concat-tokens since it isn't implemented yet. Also haven't checked that the data is "right".
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
rename scripts, update README
Request based on some customer issues: If
Added the extra check for the split in the local dir
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
Been following this PR for a while, and the recent changes don't appear to break anything. Punting `--concat_text` to the next release (or whenever we actually use it) is the smart move. Let's merge it!
Pre-concatenating based on text or tokens. Fixed the encoding issues; thanks `numpy`! There is probably something to be documented about how to serialize tensors to bytes for streaming... Allows for streaming processing and a `train_small` subset. There is a performance hit (at least on Mac) when tokenizing (~80x slower than not tokenizing).

Example commands:
python convert_c4.py --out_root=out/folder
python convert_c4.py --out_root=out/folder --splits val train_small --concat_text=5000 --bos_text="</s>"
python convert_c4.py --out_root=out/folder --splits val train_small --concat_tokens=4096 --tokenizer=facebook/opt-125m
python convert_c4.py --out_root=out/folder --splits val train_small --concat_tokens=4096 --tokenizer=gpt2 --bos_text="<|endoftext|>"
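The serialization point mentioned above (how to turn token tensors into bytes for a streaming dataset) could look roughly like this minimal sketch. The function names and the `uint16` dtype are assumptions for illustration, not the script's actual schema:

```python
import numpy as np

def tokens_to_bytes(token_ids):
    """Encode a list of token IDs as raw little-endian uint16 bytes.

    uint16 is a space-saving assumption that only holds for vocab
    sizes under 65536; a larger vocab would need uint32.
    """
    return np.asarray(token_ids, dtype=np.uint16).tobytes()

def bytes_to_tokens(raw):
    """Decode raw bytes back into a numpy array of token IDs."""
    return np.frombuffer(raw, dtype=np.uint16)

# Round-trip a small sample.
sample = tokens_to_bytes([101, 2023, 2003, 102])
recovered = bytes_to_tokens(sample)
```

The key property is that the writer and reader agree on dtype and byte order; `np.frombuffer` reads the bytes zero-copy on the consuming side.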
Note that `--splits` is variadic, so we can't use `=` with it.

Using pre-concatenated text is easy.
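As a rough illustration of what text pre-concatenation might involve (this helper is hypothetical, not the script's implementation): greedily join documents, each prefixed with the BOS text, into chunks under a character budget, in the spirit of `--concat_text=5000 --bos_text="</s>"`:

```python
def concat_text(docs, max_chars, bos_text=""):
    """Greedily pack docs (each prefixed with bos_text) into
    chunks of at most max_chars characters each."""
    chunks, cur = [], ""
    for doc in docs:
        piece = bos_text + doc
        # Start a new chunk if adding this doc would exceed the budget.
        if cur and len(cur) + len(piece) > max_chars:
            chunks.append(cur)
            cur = ""
        cur += piece
    if cur:
        chunks.append(cur)
    return chunks
```

Note a single document longer than `max_chars` still becomes its own (oversized) chunk here; a real converter would likely truncate or split it.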
For pre-concatenated tokens:
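A sketch of what consuming pre-concatenated tokens downstream might look like, assuming the uint16 byte encoding from the earlier example and a fixed sequence length (both are assumptions; the names here are hypothetical):

```python
import numpy as np

def bytes_to_sequences(raw, seq_len, dtype=np.uint16):
    """Decode a byte buffer of concatenated token IDs and slice it
    into fixed-length training sequences, dropping any remainder."""
    buf = np.frombuffer(raw, dtype=dtype)
    n_seqs = len(buf) // seq_len
    return buf[: n_seqs * seq_len].reshape(n_seqs, seq_len)

# 10 token IDs sliced into sequences of 4: two full rows, remainder dropped.
raw = np.arange(10, dtype=np.uint16).tobytes()
seqs = bytes_to_sequences(raw, seq_len=4)
```

This mirrors the `--concat_tokens=4096` idea: the converter writes one long token stream, and the reader carves it into equal-length samples with no padding.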