VAE training sample script #3801

Closed

aandyw wants to merge 29 commits into huggingface:main from aandyw:vae-training
Conversation

@aandyw
Contributor

@aandyw aandyw commented Jun 15, 2023

PR for Issue #3726

Todos

  • implement training loop for VAE
  • KL loss implementation (see the sketch after this list)
  • evaluate performance of VAE training
  • fix script to work for mixed precision
  • integration with a1111
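
For reference, below is a minimal sketch of the training step these todos describe, assuming diffusers' AutoencoderKL and the loss terms discussed later in this thread (KL + MSE + LPIPS). The dataloader, optimizer, and the weight names args.kl_weight and args.lpips_weight are illustrative placeholders, not the script's final API.

import lpips
import torch.nn.functional as F

# Perceptual loss; net="alex" matches the choice visible later in this PR.
lpips_loss_fn = lpips.LPIPS(net="alex").to(accelerator.device)

for batch in train_dataloader:
    target = batch["pixel_values"]  # images scaled to [-1, 1]
    posterior = vae.encode(target).latent_dist
    z = posterior.sample()
    pred = vae.decode(z).sample

    kl_loss = posterior.kl().mean()  # KL against the unit Gaussian prior
    mse_loss = F.mse_loss(pred, target, reduction="mean")
    lpips_loss = lpips_loss_fn(pred, target).mean()

    # kl_weight / lpips_weight are hypothetical knobs for this sketch
    loss = mse_loss + args.kl_weight * kl_loss + args.lpips_weight * lpips_loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()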

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@aandyw
Contributor Author

aandyw commented Jun 24, 2023

[06/24/2023] VAE fine-tuning runs successfully but will need to test/evaluate image results.

@aandyw aandyw marked this pull request as ready for review June 24, 2023 19:13
--dataset_name="<DATASET_NAME>" \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing


Suggested change
--gradient_checkpointing
--gradient_checkpointing \

@aandyw aandyw changed the title [WIP] VAE training sample script VAE training sample script Jul 27, 2023
Comment on lines +390 to +395
with accelerator.main_process_first():
    # Split into train/test
    dataset = dataset["train"].train_test_split(test_size=args.test_samples)
    # Set the training transforms
    train_dataset = dataset["train"].with_transform(preprocess)
    test_dataset = dataset["test"].with_transform(preprocess)

@zhuliyi0 zhuliyi0 Jul 31, 2023


Support loading the test set from a dedicated test_data_dir folder.

Suggested change
with accelerator.main_process_first():
    # Split into train/test
    dataset = dataset["train"].train_test_split(test_size=args.test_samples)
    # Set the training transforms
    train_dataset = dataset["train"].with_transform(preprocess)
    test_dataset = dataset["test"].with_transform(preprocess)
with accelerator.main_process_first():
    if args.test_data_dir is not None and args.train_data_dir is not None:
        # Load test data from test_data_dir
        logger.info(f"load test data from {args.test_data_dir}")
        test_dir = os.path.join(args.test_data_dir, "**")
        test_dataset = load_dataset(
            "imagefolder",
            data_files=test_dir,
            cache_dir=args.cache_dir,
        )
        # Set the training transforms
        train_dataset = dataset["train"].with_transform(preprocess)
        test_dataset = test_dataset["train"].with_transform(preprocess)
    else:
        # Split into train/test
        dataset = dataset["train"].train_test_split(test_size=args.test_samples)
        # Set the training transforms
        train_dataset = dataset["train"].with_transform(preprocess)
        test_dataset = dataset["test"].with_transform(preprocess)
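
Note the design choice here: when a dedicated test_data_dir is supplied alongside train_data_dir, the evaluation set stays fixed across runs instead of being re-sampled by train_test_split, which keeps validation results comparable between experiments.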

    type=int,
    default=4,
    help="Number of images to remove from training set to be used as validation.",
)


Add a new argument, test_data_dir, for a dedicated test data folder.

Suggested change
)
)
parser.add_argument(
    "--test_data_dir",
    type=str,
    default=None,
    help=(
        "If not None, will override test_samples arg and use data inside this dir as test dataset."
    ),
)
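
For context, with this suggestion applied the dedicated test set would be selected on the command line; the script name and paths below are placeholders:

accelerate launch train_vae.py \
  --train_data_dir="<TRAIN_DIR>" \
  --test_data_dir="<TEST_DIR>" \
  --train_batch_size=1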

if tracker.name == "tensorboard":
    np_images = np.stack([np.asarray(img) for img in images])
    tracker.writer.add_images(
        "Original (left) / Reconstruction (right)", np_images, epoch


Change the file name to be compatible with Windows.

Suggested change
"Original (left) / Reconstruction (right)", np_images, epoch
"Original (left)-Reconstruction (right)", np_images, epoch

progress_bar.set_description("Steps")

lpips_loss_fn = lpips.LPIPS(net="alex").to(accelerator.device)



Suggested change
# initial validation as baseline
with torch.no_grad():
    log_validation(test_dataloader, vae, accelerator, weight_dtype, 0)

Run one validation before training starts, as a baseline for comparison.
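
Logging one validation pass at step 0 records the pretrained VAE's reconstructions before any fine-tuning, so later epochs can be compared against that baseline rather than only against each other.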

aandyw and others added 2 commits August 5, 2023 18:39
Co-authored-by: zhuliyi0 <48817897+zhuliyi0@users.noreply.github.com>
pred = vae.decode(z).sample

kl_loss = posterior.kl().mean()
mse_loss = F.mse_loss(pred, target, reduction="mean")


In the original stable-diffusion repo and the SDXL repo, the VAE loss is averaged over the batch dim only, which means the reconstruction term is summed over the channel, height, and width dims. Is this the correct way to average the reconstruction loss?
https://github.com/CompVis/stable-diffusion/blob/21f890f9da3cfbeaba8e2ac3c425ee9e998d5229/ldm/modules/losses/contperceptual.py#L58
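
For reference, the two reductions differ by a constant factor of C * H * W for a fixed image size, but that factor changes how strongly the reconstruction term weighs against the KL term. A minimal sketch of the difference, with dummy tensors standing in for pred and target:

import torch
import torch.nn.functional as F

pred = torch.randn(4, 3, 256, 256)   # dummy reconstruction, shape [B, C, H, W]
target = torch.randn_like(pred)      # dummy target images

# Reduction used in this PR: mean over every element (batch, channel, H, W).
mse_mean = F.mse_loss(pred, target, reduction="mean")

# CompVis/SDXL-style reduction: sum over C, H, W, then mean over the batch dim.
mse_sum_chw = ((pred - target) ** 2).sum(dim=[1, 2, 3]).mean()

# They differ by exactly C * H * W, so switching between them effectively
# rescales the reconstruction loss relative to the KL weight.
assert torch.allclose(mse_sum_chw, mse_mean * 3 * 256 * 256, rtol=1e-4)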

@github-actions
Contributor

github-actions bot commented Sep 2, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Sep 2, 2023
@github-actions github-actions bot closed this Sep 12, 2023

@JunzheJosephZhu JunzheJosephZhu left a comment


lgtm

@lavinal712
Contributor

Is there any progress now?
