How to speed up "Generating train split" #6205
-
How to speed up "Generating train split". I used
My dataset's features include images, and generating the train split is too slow.
Beta Was this translation helpful? Give feedback.
Replies: 8 comments 2 replies
-
To parallelize the loading, the `gen_kwargs` requires a list that can be split into `num_proc` parts (shards), which are then passed to the generator — e.g., pass a list of image files or a list of directories (with the images) to parallelize over them.
Beta Was this translation helpful? Give feedback.
-
@mariosasko I passed in a path list, but now the progress is not shown: |
Beta Was this translation helpful? Give feedback.
-
import datasets
import os
from PIL import Image
import json
import torch
import cv2
import numpy as np
class ImagesConfig(datasets.BuilderConfig):
    """BuilderConfig for the ``Images`` builder.

    Adds no options beyond the base ``datasets.BuilderConfig``; it exists
    so each named config (``similar_pairs``, ``image_prompt_pairs``) has a
    dedicated config type.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
class Images(datasets.GeneratorBasedBuilder):
    """Builder for a local image dataset described by a ``meta_data.json`` file.

    Two configs are supported:
      * ``similar_pairs``       — each row is a pair of similar images plus a
                                  float similarity score.
      * ``image_prompt_pairs``  — each row is a single image plus its prompt.

    ``meta_data.json`` (expected in ``self.config.data_dir``) maps each image
    path to a dict holding a ``similar_images`` list of ``(path, similarity)``
    pairs and/or a ``prompt`` string.
    """

    BUILDER_CONFIGS = [
        ImagesConfig(
            name="similar_pairs",
            # typo fixed: "simliar pair dataset,item" -> readable description
            description="similar pair dataset, item is a pair of similar images",
        ),
        ImagesConfig(
            name="image_prompt_pairs",
            description="image prompt pairs",
        ),
    ]

    def __init__(self, **kwargs):
        # A smaller writer batch size bounds memory use while Arrow-encoding
        # rows that embed full images; must be set before the base __init__.
        self.DEFAULT_WRITER_BATCH_SIZE = 100
        super().__init__(**kwargs)

    def _info(self):
        """Return the feature schema for the active config."""
        if self.config.name == "similar_pairs":
            return datasets.DatasetInfo(
                features=datasets.Features(
                    {
                        "image1": datasets.features.Image(),
                        "image1_path": datasets.Value("string"),
                        "image2": datasets.features.Image(),
                        "image2_path": datasets.Value("string"),
                        "similarity": datasets.Value("float32"),
                    }
                )
            )
        elif self.config.name == "image_prompt_pairs":
            return datasets.DatasetInfo(
                features=datasets.Features(
                    {
                        "image": datasets.features.Image(),
                        "image_path": datasets.Value("string"),
                        "prompt": datasets.Value("string"),
                    }
                )
            )

    def _split_generators(self, dl_manager: datasets.DownloadManager):
        """Collect lightweight (path, ...) tuples for the train split.

        Images themselves are NOT opened here — only paths and scores are
        gathered, so the split scan stays fast; decoding happens lazily in
        ``_generate_examples``.
        """
        with open(os.path.join(self.config.data_dir, "meta_data.json"), "r") as f:
            meta_data = json.load(f)
        data = []
        if self.config.name == "similar_pairs":
            for image1_path in meta_data:
                for image2_path, similarity in meta_data[image1_path]["similar_images"]:
                    data.append((image1_path, image2_path, similarity))
        elif self.config.name == "image_prompt_pairs":
            for image_path in meta_data:
                # BUG FIX: the original passed two positional arguments to
                # list.append (a TypeError at runtime); the pair must be a tuple.
                data.append((image_path, meta_data[image_path]["prompt"]))
        print("data size:", len(data))
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"split": datasets.Split.TRAIN, "data": data},
            )
        ]

    def _generate_examples(self, split, data):
        """Yield ``(key, example)`` pairs, opening images lazily per example."""
        if self.config.name == "similar_pairs":
            for image1_path, image2_path, similarity in data:
                yield image1_path + ":" + image2_path, {
                    "image1": Image.open(
                        os.path.join(self.config.data_dir, image1_path)
                    ),
                    "image1_path": image1_path,
                    "image2": Image.open(
                        os.path.join(self.config.data_dir, image2_path)
                    ),
                    "image2_path": image2_path,
                    "similarity": similarity,
                }
        elif self.config.name == "image_prompt_pairs":
            # BUG FIX: the original generator had no branch for this config,
            # so _split_generators built data that was never yielded.
            for image_path, prompt in data:
                yield image_path, {
                    "image": Image.open(
                        os.path.join(self.config.data_dir, image_path)
                    ),
                    "image_path": image_path,
                    "prompt": prompt,
                }
# Usage: load the dataset via the local builder-script directory.
# (Import restored — it was fused into the preceding line by extraction.)
from datasets import load_dataset

ds = load_dataset(
    "/home/aihao/workspace/DeepLearningContent/datasets/images",
    "similar_pairs",
    # Pass data_dir explicitly by keyword rather than as a bare third
    # positional argument, so the call stays correct if the signature shifts.
    data_dir="/home/aihao/workspace/DeepLearningContent/datasets/images",
    split="train",
)
Beta Was this translation helpful? Give feedback.
-
@mariosasko It will output "data size: 126454". But calling "Image.open" in |
Beta Was this translation helpful? Give feedback.
-
@mariosasko I tested it again on ubuntu, which turned out to be wsl. It's incredibly fast, but it seems to be slower to train. I'm confused |
Beta Was this translation helpful? Give feedback.
-
@mariosasko It's fine again now. It doesn't matter. thanks |
Beta Was this translation helpful? Give feedback.
-
@mariosasko Does it precompute 400k image pairs in a few seconds? It's incredibly fast. The original code took me about 24 hours for 100k image pairs |
Beta Was this translation helpful? Give feedback.
-
Hi @mariosasko and @aihao2000. I have looked through this solution and I am still confused about how to obtain an efficient solution for an online example. I have a minimal reproducible example below, where I have modified the
The error
For some reason, this is slower than if I were to stream the entire dataset (all shards), and it defaults to |
Beta Was this translation helpful? Give feedback.
To parallelize the loading, the `gen_kwargs` requires a list that can be split into `num_proc` parts (shards), which are then passed to the generator (e.g., pass a list of image files or a list of directories (with the images) to parallelize over them).