How to avoid recomputation #6245
-
I used `map` for this preprocessing. I expected it to be faster with `with_transform`, but that didn't work. Also, how do I guarantee that the result is not recomputed? I found that just changing the `num_proc` parameter triggers a recomputation.

```python
from torchvision import transforms

# args, accelerator, dataset, and image_column are defined elsewhere
# in the surrounding training script.
image_transforms = transforms.Compose(
    [
        transforms.Resize(
            args.resolution, interpolation=transforms.InterpolationMode.BILINEAR
        ),
        transforms.CenterCrop(args.resolution),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)

def preprocess_train(examples):
    images = [image.convert("RGB") for image in examples[image_column]]
    images = [image_transforms(image) for image in images]
    return {"pixel_values": images}

with accelerator.main_process_first():
    if args.max_train_samples is not None:
        dataset["train"] = (
            dataset["train"]
            .shuffle(seed=args.seed)
            .select(range(args.max_train_samples))
        )
    # Set the training transforms
    if args.load_dataset_streaming:
        train_dataset = dataset["train"].map(
            preprocess_train,
            batched=True,
        )
        train_dataset = train_dataset.shuffle(seed=args.seed)
    else:
        if args.dataset_map:
            train_dataset = dataset["train"].map(
                preprocess_train,
                batch_size=args.train_batch_size,
                batched=True,
                num_proc=args.load_dataset_num_proc,
            )
        else:
            train_dataset = dataset["train"].with_transform(preprocess_train)

print(type(train_dataset[0]["pixel_values"]))
```

The output is a plain Python `list`, not a tensor.
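One way to make the reuse explicit, instead of relying on the automatic fingerprint cache, is to materialize the mapped dataset once and reload it on later runs. A minimal sketch, assuming a local directory path of your choice (`preprocessed_dir` is hypothetical, not part of the script above):

```python
import os
from datasets import load_from_disk

preprocessed_dir = "preprocessed_train"  # hypothetical local path

if os.path.isdir(preprocessed_dir):
    # Reuse the preprocessed data regardless of num_proc or other map() arguments.
    train_dataset = load_from_disk(preprocessed_dir)
else:
    train_dataset = dataset["train"].map(
        preprocess_train,
        batched=True,
        num_proc=args.load_dataset_num_proc,
    )
    train_dataset.save_to_disk(preprocessed_dir)
```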
-
I noticed that I need to use `dataset.map` and then convert back with `torch.tensor` in the dataloader's `collate_fn`. Now I want to know how to control whether recomputation happens: merely changing `num_proc` automatically triggers a recomputation. Runs with different `num_proc` values use the same cache, right?
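For reference, a minimal sketch of the kind of `collate_fn` conversion this refers to, matching the `preprocess_train` output above (the dataloader wiring is illustrative, not taken from the original script):

```python
import torch

# After .map(), "pixel_values" comes back as nested Python lists,
# so convert each example to a tensor and stack the batch.
def collate_fn(examples):
    pixel_values = torch.stack(
        [torch.tensor(example["pixel_values"]) for example in examples]
    )
    return {"pixel_values": pixel_values}

train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.train_batch_size,
    shuffle=True,
    collate_fn=collate_fn,
)
```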
-
The default formatting returns built-in types as values (lists, dictionaries, etc.). To get `torch` tensors, use `.set_format("pt")` on the dataset object. Changing `num_proc` in many scenarios leads to a slightly different result (e.g., tokenization with truncation in batched mode), which is why it requires recomputation: we cannot be sure the result will be the same.
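A minimal illustration of this suggestion, reusing the names from the question above:

```python
# After map(), pixel_values are stored as nested Python lists in the
# Arrow cache; set_format("pt") converts them to torch tensors on access.
train_dataset = dataset["train"].map(
    preprocess_train,
    batched=True,
    num_proc=args.load_dataset_num_proc,  # keep this fixed across runs to hit the cache
)
train_dataset.set_format("pt", columns=["pixel_values"])

print(type(train_dataset[0]["pixel_values"]))  # <class 'torch.Tensor'>
```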
-
@mariosasko thanks!