
Update scripts to pre-concat text and tokens #128

Merged · 76 commits · Feb 11, 2023

Conversation

@samhavens (Contributor) commented Feb 7, 2023

Pre-concatenating based on either text or tokens. Fixed the encoding issues; thanks, numpy! There is probably something worth documenting about how to serialize tensors to bytes for streaming...
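(For reference, a minimal sketch of the write-side serialization, assuming int64 token IDs; it mirrors the np.frombuffer read shown further down and is not the actual convert_c4.py code:)

import numpy as np

# Token IDs serialized to raw bytes for the shard writer; np.frombuffer
# on the read side recovers the array (illustrative values only).
tokens = [2, 100, 7, 2048]
token_bytes = np.asarray(tokens, dtype=np.int64).tobytes()

recovered = np.frombuffer(token_bytes, dtype=np.int64)
assert (recovered == np.asarray(tokens, dtype=np.int64)).all()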

Allows for streaming processing and a train_small subset. There is a performance hit (at least on a Mac) when tokenizing: ~80x slower than not tokenizing.

Example commands:

  • No pre-concat: python convert_c4.py --out_root=out/folder
  • Pre-concat text: python convert_c4.py --out_root=out/folder --splits val train_small --concat_text=5000 --bos_text="</s>"
  • Pre-concat tokens: python convert_c4.py --out_root=out/folder --splits val train_small --concat_tokens=4096 --tokenizer=facebook/opt-125m
  • Pre-concat tokens (gpt2 works now): python convert_c4.py --out_root=out/folder --splits val train_small --concat_tokens=4096 --tokenizer=gpt2 --bos_text="<|endoftext|>"

Note that --splits is variadic, so we can't use = with it.
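(That behavior follows from a variadic argparse flag; a minimal sketch, assuming the script defines --splits with nargs='+' — the actual parser may differ:)

import argparse

# A variadic flag collects space-separated values into one list, which is
# why --splits=val train_small would not parse as intended.
parser = argparse.ArgumentParser()
parser.add_argument('--splits', nargs='+')

args = parser.parse_args(['--splits', 'val', 'train_small'])
assert args.splits == ['val', 'train_small']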

Using pre-concatenated text is easy:

from streaming.base import StreamingDataset

ds = StreamingDataset(local="out/folder", split="val")

for sample in ds:
    concatenated_text = sample['text']

For pre-concatenated tokens:

import numpy as np
import torch
from streaming.base import StreamingDataset
from transformers import AutoTokenizer

# this has to be the same tokenizer they used to create the data
# do we want to enforce this?
tokenizer = AutoTokenizer.from_pretrained("your/tokenizer")

ds = StreamingDataset(local="mds-data-folder", split="val")

for sample in ds:
    # note, you need to copy the numpy array because the original is non-writeable
    # and torch does not support non-writeable tensors, so you get a scary warning and
    # if you do try to write to the tensor you get undefined behavior
    tokens = torch.from_numpy(np.frombuffer(sample['tokens'], dtype=np.int64).copy())
    text = tokenizer.decode(tokens)

@samhavens (Contributor, Author) commented Feb 8, 2023

Converting this to use streaming is rough: the signature of the map function changes, and something about the mapping/processing I do breaks the dataset iterator...

EDIT: I do not appreciate huggingface datasets as much now that I have seen its streaming "support"
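(For context, a minimal sketch of the streaming path in question; the dataset name, config, and map call here are assumptions for illustration, not the actual script code:)

from datasets import load_dataset

# streaming=True yields an IterableDataset whose .map() is applied lazily
# during iteration, one place where behavior diverges from a regular
# map-style Dataset.
ds = load_dataset('allenai/c4', 'en', split='validation', streaming=True)
ds = ds.map(lambda batch: {'text': [t.strip() for t in batch['text']]},
            batched=True)

sample = next(iter(ds))
print(sample['text'][:80])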

@samhavens marked this pull request as ready for review on February 8, 2023, 08:07
@dakinggg (Collaborator) left a comment

Haven't reviewed the code yet, just tried it out, and it works OK for me. Confirmed training starts OK with the truncate method and with concat text or no concat. Haven't tried concat tokens because it's not implemented yet. Also haven't checked that the data is "right".

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
@samhavens (Contributor, Author) commented Feb 10, 2023

Request based on some customer issues:

If data_remote is None or equal to data_local, then before even instantiating the Dataset, check whether the split directory is present in the data_local directory, rather than letting streaming attempt to build the dataset and raise a FileNotFoundError about the missing index.json. If the split isn't present, raise a ValueError saying so.
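(A minimal sketch of the requested guard; the helper and variable names are assumptions taken from the description above, not the merged code:)

import os

def check_local_split(data_remote, data_local, split):
    # Fail early with a clear error instead of a FileNotFoundError about
    # index.json from deep inside streaming (hypothetical helper).
    if data_remote is None or data_remote == data_local:
        split_dir = os.path.join(data_local, split)
        if not os.path.isdir(split_dir):
            raise ValueError(
                f'Local directory {data_local} does not contain split {split}')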

@dakinggg (Collaborator) commented
Added the extra checking of split in the local dir.

@abhi-mosaic (Contributor) left a comment

Been following this PR for a while, and the recent changes don't appear to break anything. Punting --concat_text to the next release (or whenever we actually use it) is the smart move.

Let's merge it!
