
Commit f3ba82d: Update scripts to pre-concat tokens (#128)
Parent: 6c2620f

14 files changed: +388 additions, -193 deletions

examples/bert/README.md (10 additions, 11 deletions)
@@ -79,9 +79,12 @@ You can read more about the benefits of using mosaicml-streaming [here](https://
 To make yourself a copy of C4, use `convert_c4.py` like so:
 
 ```bash
-# Download the 'train_small', 'val' splits and convert to StreamingDataset format
-# This will take 20 sec to 1 min depending on your Internet bandwidth
-# You should see two folders `./my-copy-c4/train_small` and `./my-copy-c4/val` that are each ~0.5GB
+# Download the 'train_small' and 'val' splits and convert to StreamingDataset format
+# This will take 20-60 seconds depending on your Internet bandwidth
+# You should see two folders: `./my-copy-c4/train_small` and `./my-copy-c4/val` that are each ~0.5GB
+# Note: for BERT we are not doing any concatenation of samples, so we do not use the `--concat_tokens`
+# option here. Instead, samples will simply get padded or truncated to the max sequence length
+# in the dataloader
 python ../common/convert_c4.py --out_root ./my-copy-c4 --splits train_small val
 
 # Download the 'train' split if you really want to train the model (not just profile)
@@ -90,7 +93,7 @@ python ../common/convert_c4.py --out_root ./my-copy-c4 --splits train_small val
 # python ../common/convert_c4.py --out_root ./my-copy-c4 --splits train
 
 # For any of the above commands, you can also choose to compress the .mds files.
-# This is useful if your plan is to store these in an object store after conversion.
+# This is useful if your plan is to store these in object store after conversion.
 # python ../common/convert_c4.py ... --compression zstd
 ```
 
@@ -104,16 +107,12 @@ To verify that the dataloader works, run a quick test on your `val` split like s
 # This will construct a `StreamingTextDataset` dataset from your `val` split,
 # pass it into a PyTorch Dataloader, and iterate over it and print samples.
 # Since we only provide a local path, no streaming/copying takes place.
-python ../common/text_data.py ./my-copy-c4
-```
+python ../common/text_data.py --local_path ./my-copy-c4 --tokenizer bert-base-uncased
 
-The streaming dataloader is also particularly useful when your dataset has been moved to a central location.
-For example:
-```bash
 # This will do the same thing, but stream data to {local} from {remote}.
 # The remote path can be a filesystem or object store URI.
-python ../common/text_data.py /tmp/cache-c4 ./my-copy-c4 # stream from filesystem, e.g. a slow NFS volume to fast local disk
-python ../common/text_data.py /tmp/cache-c4 s3://my-bucket/my-copy-c4 # stream from object store
+python ../common/text_data.py --local_path /tmp/cache-c4 --remote_path ./my-copy-c4 --tokenizer bert-base-uncased # stream from filesystem, e.g. a slow NFS volume to fast local disk
+# python ../common/text_data.py --local_path /tmp/cache-c4 --remote_path s3://my-bucket/my-copy-c4 --tokenizer bert-base-uncased # stream from object store
 ```
 
 With our data prepared, we can now start training.
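The pad-or-truncate behavior the updated README comments describe can be sketched in a few lines (an illustration only, not the repo's actual collator; the `pad_id=0` default and the sample token lists here are made up):

```python
def pad_or_truncate(token_ids, max_seq_len, pad_id=0):
    """Pad with pad_id, or truncate, so the sample is exactly max_seq_len long."""
    if len(token_ids) >= max_seq_len:
        return token_ids[:max_seq_len]
    return token_ids + [pad_id] * (max_seq_len - len(token_ids))

# Each sample is handled independently -- no concatenation across samples,
# which is why the BERT conversion skips the `--concat_tokens` option.
batch = [pad_or_truncate(s, 8) for s in [[101, 7592, 102], list(range(12))]]
```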

examples/bert/requirements.txt (0 additions, 1 deletion)

@@ -1,4 +1,3 @@
-datasets==2.7.1
 einops==0.5.0
 torch==1.13.1
 mosaicml==0.12.1

examples/bert/tests/smoketest_config_main.yaml (0 additions, 2 deletions)

@@ -21,7 +21,6 @@ train_loader:
   split: train
   tokenizer_name: ${tokenizer_name}
   max_seq_len: ${max_seq_len}
-  group_method: truncate
   predownload: 1000
   shuffle: true
   mlm_probability: ${mlm_probability}
@@ -37,7 +36,6 @@ eval_loader:
   split: val
   tokenizer_name: ${tokenizer_name}
   max_seq_len: ${max_seq_len}
-  group_method: truncate
   predownload: 1000
   shuffle: false
   mlm_probability: ${mlm_probability}

examples/bert/yamls/test/main.yaml (0 additions, 3 deletions)

@@ -28,7 +28,6 @@ train_loader:
   split: train_small
   tokenizer_name: ${tokenizer_name}
   max_seq_len: ${max_seq_len}
-  group_method: truncate
   predownload: 1000
   shuffle: true
   mlm_probability: ${mlm_probability}
@@ -43,7 +42,6 @@ eval_loader:
   split: val
   tokenizer_name: ${tokenizer_name}
   max_seq_len: ${max_seq_len}
-  group_method: truncate
   predownload: 1000
   shuffle: false
   mlm_probability: ${mlm_probability}
@@ -82,7 +80,6 @@ progress_bar: false
 log_to_console: true
 console_log_interval: 1ba
 
-
 callbacks:
   speed_monitor:
     window_size: 5
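The `group_method: truncate` setting removed from these YAMLs grouped samples at load time; the commit title suggests that any grouping now happens during conversion instead, via the `--concat_tokens` path the README mentions. Pre-concatenating tokens into fixed-length blocks can be sketched roughly as follows (illustrative only, assuming an EOS separator token; not the repo's actual converter):

```python
def concat_tokens(samples, max_seq_len, eos_id):
    """Join tokenized samples into one stream, separated by eos_id,
    and emit fixed-length blocks of max_seq_len tokens."""
    buffer = []
    for token_ids in samples:
        buffer.extend(token_ids + [eos_id])
        # Flush full blocks as soon as the buffer is long enough.
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]

# Three short samples become three dense blocks with no padding.
blocks = list(concat_tokens([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_len=4, eos_id=0))
```

Doing this once at conversion time means every stored sample is already exactly `max_seq_len` tokens, so the dataloader no longer needs a `group_method` option.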

0 commit comments

Comments
 (0)