I changed ConcatTokensDataset.__iter__ to this:

def __iter__(self) -> Iterable[Dict[str, bytes]]:
    buffer = []
    # Tokenize in batches of self.write_batch_size (10_000 here) instead of
    # one example at a time.
    num_rows = self.hf_dataset.num_rows
    # Stepping with range() avoids an empty trailing shard when num_rows is
    # an exact multiple of the batch size.
    for start in range(0, num_rows, self.write_batch_size):
        shard = self.hf_dataset[start : start + self.write_batch_size]
        encoded_shard = self.tokenizer(
            shard["text"], truncation=False, padding=False
        )
        for iids in encoded_shard["input_ids"]:
            buffer = buffer + self.bos_tokens + iids + self.eos_tokens
            while len(buffer) >= self.max_length:
                concat_sample = buffer[: self.max_length]
                buffer = buffer[self.max_length :] if self.should_wrap else []
                yield {
                    # convert to bytes to store in MDS binary format
                    "tokens": np.asarray(concat_sample).tobytes(),
                    "num_tokens": len(concat_sample),
                }

Processing 7B tokens takes around 20 hours with the original code and about 30 minutes with this change. It's not very robust, though, and it doesn't scale well: a fast tokenizer hangs after a while on very long texts, and using more than 16 threads doesn't seem to give any further speedup.
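
For anyone who wants to try the batching idea outside the class, here is a minimal standalone sketch of the same batch-then-pack loop. It is not the library's code: `pack_tokens` and `toy_tokenize_batch` are hypothetical names, and the toy tokenizer is only a stand-in so the snippet runs without any dependencies. In real use you would pass a function that calls your tokenizer on a list of strings and returns its `input_ids`.

```python
from typing import Callable, Iterator, List, Sequence


def pack_tokens(
    texts: Sequence[str],
    tokenize_batch: Callable[[List[str]], List[List[int]]],
    max_length: int,
    batch_size: int = 10_000,
    bos: Sequence[int] = (),
    eos: Sequence[int] = (),
    should_wrap: bool = True,
) -> Iterator[List[int]]:
    """Tokenize `texts` in batches and yield chunks of exactly `max_length` ids."""
    buffer: List[int] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start : start + batch_size])
        # One tokenizer call per batch rather than one call per text.
        for ids in tokenize_batch(batch):
            buffer.extend(bos)
            buffer.extend(ids)
            buffer.extend(eos)
            while len(buffer) >= max_length:
                yield buffer[:max_length]
                # Keep the overflow for the next chunk, or drop it.
                buffer = buffer[max_length:] if should_wrap else []
    # Tokens left in `buffer` at the end are dropped, as in the loop above.


def toy_tokenize_batch(batch: List[str]) -> List[List[int]]:
    """Hypothetical stand-in tokenizer: each word becomes its length."""
    return [[len(word) for word in text.split()] for text in batch]


chunks = list(
    pack_tokens(
        ["a bb ccc", "dddd", "ee f"],
        toy_tokenize_batch,
        max_length=4,
        batch_size=2,
        eos=[0],
    )
)
print(chunks)
```

As far as I understand, the speedup comes from the batched call itself: Hugging Face fast tokenizers encode a list of texts in parallel internally, and that behavior is controlled by the `TOKENIZERS_PARALLELISM` environment variable.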

Thanks for the update! Did you modify any other files to enable multithreading?

How to support multi-threaded parallel data preprocessing? #870
