Update scripts to pre-concat text and tokens #128

Merged: 76 commits from `R-387--update-scripts-to-pre-concat` into `main`, Feb 11, 2023.
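The core idea of the PR, judging by the title and commit messages ("preconcat streaming works", "sep to BOS EOS", "wrap -> no_wrap", "add --wrap"): tokenize and pack raw text into fixed-length sequences once at dataset-conversion time, instead of grouping samples in the dataloader. Below is a minimal sketch of that packing step. It assumes a Hugging Face tokenizer; the function name and exact wrap semantics are illustrative, not the scripts' actual API.

```python
# Illustrative sketch of pre-concatenating tokens (not the repo's actual code).
# Documents are tokenized, joined with EOS separators, and emitted as
# fixed-length chunks of max_seq_len tokens.
from typing import Iterable, Iterator, List

from transformers import AutoTokenizer

def concat_tokens(docs: Iterable[str], tokenizer, max_seq_len: int,
                  wrap: bool = True) -> Iterator[List[int]]:
    buffer: List[int] = []
    for doc in docs:
        # Separate documents with EOS (the PR settled on EOS, per "bos -> eos").
        buffer += tokenizer(doc)['input_ids'] + [tokenizer.eos_token_id]
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            # wrap=True carries the remainder into the next chunk so sequences
            # continue mid-document; wrap=False (roughly the --no_wrap behavior)
            # drops the leftover tokens instead.
            buffer = buffer[max_seq_len:] if wrap else []

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')  # OPT, per the commits
for chunk in concat_tokens(['first document', 'second document'], tokenizer, max_seq_len=8):
    print(tokenizer.decode(chunk))
```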
Commits (76):

- 94d291a not working (samhavens, Feb 4, 2023)
- a7e286b concat works, tokenize concat doesn't (samhavens, Feb 7, 2023)
- bbc56f8 clean up somewhat (samhavens, Feb 7, 2023)
- b813f61 clean up (samhavens, Feb 7, 2023)
- 37a13c4 clean comments (samhavens, Feb 7, 2023)
- 833fad0 Merge remote-tracking branch 'upstream/main' into R-387--update-scrip… (samhavens, Feb 7, 2023)
- eae8f36 works (samhavens, Feb 7, 2023)
- 5035151 quotes (samhavens, Feb 7, 2023)
- d56d8f9 update comment (samhavens, Feb 7, 2023)
- 466bf2d preconcat streaming works (samhavens, Feb 8, 2023)
- 1f19ae2 linting (samhavens, Feb 8, 2023)
- 841db9a fix linux test (samhavens, Feb 8, 2023)
- ca432c8 Merge remote-tracking branch 'upstream/main' into R-387--update-scrip… (samhavens, Feb 8, 2023)
- c866eb9 use self.batch (samhavens, Feb 8, 2023)
- 81c59d5 some requested changes still failing [skip ci] (samhavens, Feb 8, 2023)
- b7a1dc9 update docstring (samhavens, Feb 8, 2023)
- 08c00fd sep to BOS EOS, clean up buffers (samhavens, Feb 8, 2023)
- 82ef04b re-allow no concat option (samhavens, Feb 8, 2023)
- 2f04afb update datasets to version that works (samhavens, Feb 8, 2023)
- c8fb3f6 remove group_method from StreamingTextDataset (samhavens, Feb 8, 2023)
- bbe2cb9 remove bert and llm explicit version dep on datasets (samhavens, Feb 8, 2023)
- 6d8072b Update examples/common/text_data.py (samhavens, Feb 8, 2023)
- 0885f58 tell the user stuff (samhavens, Feb 8, 2023)
- 592388f deprecate group_method (samhavens, Feb 9, 2023)
- f7b90b7 bert don't use group_method (samhavens, Feb 9, 2023)
- 1a26271 remove group_method from other bert yamls (samhavens, Feb 9, 2023)
- 8934ede Update examples/common/convert_c4.py (samhavens, Feb 9, 2023)
- ec4d49a my editor automformatted the yamls wrong (samhavens, Feb 9, 2023)
- 64b6e60 lint (samhavens, Feb 9, 2023)
- de0e0f1 fix no concat (abhi-mosaic, Feb 9, 2023)
- 69e75f1 fix concat_text (abhi-mosaic, Feb 9, 2023)
- 0f874e7 fix tokens (abhi-mosaic, Feb 9, 2023)
- 9ab78bf add --wrap (abhi-mosaic, Feb 9, 2023)
- e1bad1a move pad exception (abhi-mosaic, Feb 9, 2023)
- 8b25588 revert max_length 1e30, fix attention_mask=None case (abhi-mosaic, Feb 9, 2023)
- 067e7b0 wrap -> no_wrap (dakinggg, Feb 9, 2023)
- f5a9259 switch to opt tokenizer so it works with concat text (dakinggg, Feb 9, 2023)
- 6b4bb14 fix mosaic_gpt attention_mask (abhi-mosaic, Feb 9, 2023)
- 82e1beb disable tokenizer parallelism within CPU workers (abhi-mosaic, Feb 9, 2023)
- 171014a revert debugging change (dakinggg, Feb 9, 2023)
- e212269 fix up readmes and add new options to text_data.py (dakinggg, Feb 9, 2023)
- 6b81013 fix up warnings (dakinggg, Feb 9, 2023)
- 35ce41c allow script to write new splits in same dir (dakinggg, Feb 9, 2023)
- 250907a remove pdb (dakinggg, Feb 9, 2023)
- 759efc4 fix bad arg checking (dakinggg, Feb 9, 2023)
- 89bcda5 fix the error (dakinggg, Feb 9, 2023)
- d0e3fb4 remove extra if (dakinggg, Feb 9, 2023)
- fec89c3 merge and fix bert instrutions (dakinggg, Feb 10, 2023)
- d86aa98 adjust readme (dakinggg, Feb 10, 2023)
- bff6b1d fix another command (dakinggg, Feb 10, 2023)
- 17614a8 more readme fixes (dakinggg, Feb 10, 2023)
- 16ad6f0 add back the tokenizer arg too (dakinggg, Feb 10, 2023)
- 806d906 Update examples/common/text_data.py (dakinggg, Feb 10, 2023)
- 526ca3c bos -> eos (dakinggg, Feb 10, 2023)
- 425d473 Update examples/common/text_data.py (dakinggg, Feb 10, 2023)
- 524556d Merge branch 'main' into R-387--update-scripts-to-pre-concat (dakinggg, Feb 10, 2023)
- 4a7ab81 try commenting out min_params (dakinggg, Feb 10, 2023)
- 299588a revert opt min params commented out (dakinggg, Feb 10, 2023)
- b3e618b add max_seq_len kwarg (samhavens, Feb 10, 2023)
- e58d2d3 Update examples/common/text_data.py (dakinggg, Feb 10, 2023)
- 31671ab Fix YAML referencing to allow better CLI overrides (#150) (abhi-mosaic, Feb 10, 2023)
- fe37b81 Update Benchmarks README, rename scripts (#152) (hanlint, Feb 10, 2023)
- 487dfdc remove unnecessary dataset attribute (dakinggg, Feb 10, 2023)
- bdaf30c merge (dakinggg, Feb 10, 2023)
- 90ae9f1 add split checking in local dir (dakinggg, Feb 11, 2023)
- dbd1f99 Update examples/common/text_data.py (dakinggg, Feb 11, 2023)
- 51c1696 set default args (dakinggg, Feb 11, 2023)
- 599cbdf comment out optional commands (dakinggg, Feb 11, 2023)
- d380736 merge (dakinggg, Feb 11, 2023)
- 9b4c9c0 remove _get_kwargs helper (dakinggg, Feb 11, 2023)
- 6a8b487 fix spacing (dakinggg, Feb 11, 2023)
- 47ec01c fix test (dakinggg, Feb 11, 2023)
- 06b03d8 remove pad setting (dakinggg, Feb 11, 2023)
- 832f262 rip out concat text (dakinggg, Feb 11, 2023)
- 73def16 fixes (dakinggg, Feb 11, 2023)
- d27a559 fix test (dakinggg, Feb 11, 2023)
examples/bert/README.md (10 additions, 7 deletions)

````diff
@@ -54,9 +54,12 @@ You can read more about the benefits of using mosaicml-streaming [here](https://
 To make yourself a copy of C4, use `convert_c4.py` like so:
 
 ```bash
-# Download the 'train_small', 'val' splits and convert to StreamingDataset format
-# This will take 20 sec to 1 min depending on your Internet bandwidth
-# You should see two folders `./my-copy-c4/train_small` and `./my-copy-c4/val` that are each ~0.5GB
+# Download the 'train_small' and 'val' splits and convert to StreamingDataset format
+# This will take 20-60 seconds depending on your Internet bandwidth
+# You should see two folders: `./my-copy-c4/train_small` and `./my-copy-c4/val` that are each ~0.5GB
+# Note: for BERT we are not doing any concatenation of samples, so we do not use the `--concat_tokens`
+# or `--concat_text` options here. Instead, samples will simply get padded or truncated to the max sequence length
+# in the dataloader
 python ../common/convert_c4.py --out_root ./my-copy-c4 --splits train_small val
 
 # Download the 'train' split if you really want to train the model (not just profile)
@@ -65,7 +68,7 @@ python ../common/convert_c4.py --out_root ./my-copy-c4 --splits train_small val
 # python ../common/convert_c4.py --out_root ./my-copy-c4 --splits train
 
 # For any of the above commands, you can also choose to compress the .mds files.
-# This is useful if your plan is to store these in an object store after conversion.
+# This is useful if your plan is to store these in object store after conversion.
 # python ../common/convert_c4.py ... --compression zstd
 ```
@@ -79,12 +82,12 @@ To verify that the dataloader works, run a quick test on your `val` split like so:
 ```bash
 # This will construct a `StreamingTextDataset` dataset from your `val` split,
 # pass it into a PyTorch Dataloader, and iterate over it and print samples.
 # Since we only provide a local path, no streaming/copying takes place.
-python ../common/text_data.py ./my-copy-c4
+python ../common/text_data.py --local_path ./my-copy-c4 --tokenizer bert-base-uncased
 
 # This will do the same thing, but stream data to {local} from {remote}.
 # The remote path can be a filesystem or object store URI.
-python ../common/text_data.py /tmp/cache-c4 ./my-copy-c4 # stream from filesystem, e.g. a slow NFS volume to fast local disk
-python ../common/text_data.py /tmp/cache-c4 s3://my-bucket/my-copy-c4 # stream from object store
+python ../common/text_data.py --local_path /tmp/cache-c4 --remote_path ./my-copy-c4 --tokenizer bert-base-uncased # stream from filesystem, e.g. a slow NFS volume to fast local disk
+python ../common/text_data.py --local_path /tmp/cache-c4 --remote_path s3://my-bucket/my-copy-c4 --tokenizer bert-base-uncased # stream from object store
 ```
 
 With our data prepared, we can now start training.
````
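The new README note says BERT samples are padded or truncated in the dataloader rather than pre-concatenated. A minimal sketch of that behavior with a Hugging Face tokenizer follows; the max_length value is illustrative, and this is not the repo's actual collator code.

```python
# Sketch of dataloader-side padding/truncation for BERT (illustrative only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
batch = tokenizer(
    ['a short sample', 'a much longer sample that would exceed the limit'],
    padding='max_length',  # pad every sample out to max_length
    truncation=True,       # cut off anything longer
    max_length=128,        # stands in for max_seq_len from the yamls
    return_tensors='pt',
)
print(batch['input_ids'].shape)  # torch.Size([2, 128])
```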
examples/bert/requirements.txt (0 additions, 1 deletion)

```diff
@@ -1,4 +1,3 @@
-datasets==2.7.1
 einops==0.5.0
 torch==1.13.1
 mosaicml==0.12.1
```
examples/bert/tests/smoketest_config_main.yaml (0 additions, 2 deletions)

```diff
@@ -21,7 +21,6 @@ train_loader:
   split: train
   tokenizer_name: *tokenizer_name
   max_seq_len: *max_seq_len
-  group_method: truncate
   predownload: 1000
   shuffle: true
   mlm_probability: *mlm_probability
@@ -37,7 +36,6 @@ eval_loader:
   split: val
   tokenizer_name: *tokenizer_name
   max_seq_len: *max_seq_len
-  group_method: truncate
   predownload: 1000
   shuffle: false
   mlm_probability: *mlm_probability
```
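These yaml changes pair with the "deprecate group_method" commit: the loader no longer groups samples itself. As a hedged sketch, a kwarg retired this way typically warns before being removed; the signature and message below are assumptions for illustration, not the repo's actual `StreamingTextDataset` code.

```python
# Hypothetical sketch of retiring a deprecated kwarg (not the repo's actual code).
import warnings
from typing import Optional

class StreamingTextDataset:
    def __init__(self, tokenizer_name: str, max_seq_len: int,
                 group_method: Optional[str] = None):
        if group_method is not None:
            # Warn but continue, so old yamls keep working during the transition.
            warnings.warn(
                "'group_method' is deprecated and ignored; pre-concatenate your "
                'data with convert_c4.py --concat_tokens instead.',
                DeprecationWarning)
        self.tokenizer_name = tokenizer_name
        self.max_seq_len = max_seq_len
```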
examples/bert/yamls/test/main.yaml (0 additions, 3 deletions)

```diff
@@ -28,7 +28,6 @@ train_loader:
   split: train_small
   tokenizer_name: *tokenizer_name
   max_seq_len: *max_seq_len
-  group_method: truncate
   predownload: 1000
   shuffle: true
   mlm_probability: *mlm_probability
@@ -43,7 +42,6 @@ eval_loader:
   split: val
   tokenizer_name: *tokenizer_name
   max_seq_len: *max_seq_len
-  group_method: truncate
   predownload: 1000
   shuffle: false
   mlm_probability: *mlm_probability
@@ -82,7 +80,6 @@ progress_bar: false
 log_to_console: true
 console_log_interval: 1ba
 
-
 callbacks:
   speed_monitor:
     window_size: 5
```