model_refactor (#571) #572

Merged
merged 157 commits from train_refactor into staging on Feb 9, 2019

Conversation

@torzdf (Collaborator) commented Jan 2, 2019

This is alpha! It is getting close to merging to staging, but bear in mind that:
Some items won't work.
Some items will work.
Some will stay the same.
Some will be changed.

Do not rely on anything in this branch until it has been merged to master. That said, you are welcome to test and report bugs. If reporting bugs please provide a crash report.

This PR significantly refactors the training part of Faceswap.

New models
Support for the following models has been added:

  • dfaker (@dfaker)

  • dfl h128

  • villain, a very resource-intensive model by @VillainGuy

  • GAN has been removed, with a view to adding GAN v2.2 down the line.

  • lowmem has been removed, but you can access the same functionality by enabling the 'lowmem' option in the config.ini for the original model

Config ini files
Config files for each section will be generated the first time that section is run, or when the GUI is launched, and will be placed in /<faceswap folder>/config/. These config files contain customizable options for each of the plugins, plus some global options. They are also accessible from the "Edit" menu in the GUI.
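
For illustration, here is a minimal sketch of tweaking one of the generated files with Python's configparser. The file name and the section name below are assumptions, not the exact layout faceswap generates; only the 'lowmem' option is taken from the description above.

```python
# Illustrative only: "config/train.ini" and "model.original" are assumed names;
# "lowmem" is the option mentioned in the description above.
import configparser

config = configparser.ConfigParser()
config.read("config/train.ini")              # generated on the first training run
config["model.original"]["lowmem"] = "True"  # enable the low-memory original model
with open("config/train.ini", "w") as handle:
    config.write(handle)
```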

Converters have been re-written (see #574)

Known Bugs:
[GUI] Preview not working on Windows?
[Postponed - minor issue] Training takes a long time to start?

Todo:

  • Confirm warp-to-landmarks works as expected
  • [Postponed] [GUI] Resume last session option
  • [Postponed] Check for bk files if model files don't exist
  • [Postponed] GAN v2.2 port (maybe)
  • [Postponed] 'Ping-pong' training option?
  • [GUI] Auto switch to tab on open recent
  • [GUI] Read session type from saved file
  • [GUI] Analysis tab, allow filtering by loss type
  • [GUI] Remove graph animation and replace with refresh button
  • Read loss in GUI from model loss file
  • Store loss to file
  • Parallel model saving
  • Reinstate OHR RC4Fix
  • TensorBoard support
  • Add dfaker "landmarks based warping" option to all models
  • Tweak converter
  • Update config to delete old items as well as insert new items
  • Confirm timelapse working
  • Cleanup preview for masked training
  • Add masks to all models
  • Add coverage option to all models
  • [converters] Histogram currently non-functional; working on mask/image interactions
  • Parametrize size and padding across the code
  • Merge @kvrooman PRs
  • Add dfaker mask to converters. [Cancelled] Too similar to facehull to be worth it
  • Standardise NN Blocks
  • Fix for Backwards compatibility
  • Converters for new models (Consolidation of converters & refactor #574)
  • Input shape. Decide which can be configured and which must be static
  • Load input shapes from state file for saved models
  • expand out state file (Add current model config to this and load from here)
  • Save model definition
  • merge RC4_fix into original_hires
  • Backup state file with model
  • Improve GUI CPU handling for graph
  • Config options in GUI
  • Model corruption protection

Detail
A lot of the code has been standardized across all models, so they now all share the same loading/saving/training/preview functions. NN blocks and other training functions have been separated out into their own libraries so that they can be used by multiple models. This should make it easier to develop new models by using or adding objects in the lib/model data store.
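
As a rough illustration of the kind of shared block lib/model is meant to hold, here is a minimal sketch of an upscale block in Keras. The function name and layer order follow the conv2d -> leaky_re_lu -> pixel_shuffler pattern discussed later in this thread, and tf.nn.depth_to_space stands in for faceswap's own PixelShuffler layer; this is not the repo's code.

```python
# Minimal sketch of a reusable upscale block (illustrative, not repo code).
import tensorflow as tf
from keras.layers import Conv2D, Lambda, LeakyReLU

def upscale(inp, filters, kernel_size=3):
    # 4x the filters so that a 2x2 depth-to-space leaves `filters` channels.
    x = Conv2D(filters * 4, kernel_size, padding="same")(inp)
    x = LeakyReLU(0.1)(x)
    x = Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)  # pixel-shuffle upsample
    return x
```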

Abandoned

  • [on advice] Add adjustable input sizes to all models

commits:

  • original model to new structure

  • IAE model to new structure

  • OriginalHiRes to new structure

  • Fix trainer for different resolutions

  • Initial config implementation

  • Configparse library added

  • improved training data loader

  • dfaker model working

  • Add logging to training functions

  • Non blocking input for cli training

  • Add error handling to threads. Add non-mp queues to queue_handler

  • Improved Model Building and NNMeta

  • refactor lib/models

  • training refactor. DFL H128 model Implementation

  • Dfaker - use hashes

  • Move timelapse. Remove perceptual loss arg

  • Update INSTALL.md. Add logger formatting. Update Dfaker training

  • DFL h128 partially ported

@kvrooman (Contributor) commented Jan 2, 2019

I like the re-org; I have a few of the same losses in an offline repo. Along with the testing, I can add some of the loss code.

@Enyakk commented Jan 2, 2019

I am getting the following error when trying the dfaker model:
Configuration: Windows 10, GTX 2080, CUDA 9.0, TensorFlow 1.12

@torzdf (Collaborator, Author) commented Jan 2, 2019

@kvrooman Thanks, would be appreciated
@Enyakk Please post crash log

@Enyakk commented Jan 2, 2019

Here is the complete log:
crash_report.2019.01.02.160142268336.log

@torzdf (Collaborator, Author) commented Jan 2, 2019

There is another bug in the loss function for dfaker, so this may be related. On the face of it, I don't see an issue with your setup.

Next on my list is to implement the mask for dfaker and fix the loss function, so hold tight until the next update (probably tomorrow) and try again.

* Remove old models. Add mask to dfaker
@kvrooman (Contributor) commented Jan 3, 2019

Looking at the I/O pre-processing into the GAN model:
you should build the normalization input step into either the configs or, alternatively, into the GAN v2.2 model file itself,
i.e. after inp = Input(shape=(input_size, input_size, self.num_chans_g_inp)) insert
varx = Lambda(lambda x: x * 2.0 - 1.0)(inp)
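
For context, a self-contained sketch of that normalization step; input_size and the channel count below are assumed example values, not taken from the GAN model file:

```python
# Sketch of the suggested normalization on the generator input;
# input_size and num_channels are assumed example values.
from keras.layers import Input, Lambda

input_size, num_channels = 128, 3
inp = Input(shape=(input_size, input_size, num_channels))
varx = Lambda(lambda x: x * 2.0 - 1.0)(inp)  # rescale [0, 1] images to [-1, 1]
```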

* DFL H128 Mask. Mask type selectable in config.
@torzdf (Collaborator, Author) commented Jan 3, 2019

@Enyakk I just randomly hit the same bug. Should be fixed now.

@DUZszyi commented Feb 8, 2019

Hi I took this branch for a spin. I have been using (a fork of) dfaker's repo for a while and I wanted to check this project out. My fork didn't touch the model architecture at all so I figured I could use my weights files on the dfaker model of your branch.

Let me first say that I love the refactor done in this branch! When I last checked out master a few months ago I quickly gave up. Thanks for all that work.

That said, I did run into some issues:

  1. dfaker model is no longer compatible with "legacy dfaker" model. @kvrooman made some changes to the nn_blocks which caused this (listed below). Do we want to keep compatibility with legacy models (I would like that)? I assume the changes to the model are made with good reason, so should we then have a model 'dfaker original' and another 'dfaker kvrooman' for example?
  2. At first I did not enable previews and the whole thing ran very slowly. Later, when I realized that it would run at normal speed when previews were enabled, I may have found the cause of the slowness: monitor_console has a tight while loop without any sleep at all (see the sketch after this list). This may eat up resources unnecessarily. I haven't confirmed this though; if I do, I'll follow up on it. (Could be related to "[Postponed - minor issue] Training takes a long time to start?")
  3. This is not a big issue, but I was wondering if it would be at all possible to write a converter from dfaker's alignment.json format to the format used by this project? I took a quick look and saw that there is already a converter for DeepFaceLab's format.
  4. While I haven't done a proper investigation, it looks like the image processing in the training data acquisition process/thread is not fast enough to keep a single 1080 Ti busy. Whatever the cause, the GPU is not fully utilized (it gets to about 30 to 40%). The original dfaker project wasn't able to exhaust the GPU either, but there it was able to keep it at about 80% (on this same hardware configuration). The CPU is an i7 3770; the disk is fast and does not appear to be a bottleneck. The CPU is not 100% utilized either; is there a way to spawn more Python processes to parallelise the training data processing?
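
To make point 2 concrete, here is a hedged sketch of the kind of change I have in mind; the function body below is hypothetical, not the actual monitor_console code, and the point is only the sleep that yields the CPU between checks.

```python
# Hypothetical shape of the console-monitoring loop; only the sleep matters.
import time

def monitor_console(stop_event, check_for_input, poll_interval=0.1):
    while not stop_event.is_set():
        check_for_input()          # placeholder for the real console check
        time.sleep(poll_interval)  # yield the CPU instead of spinning
```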

Regarding the model changes (from point 1), here are the changes I reverted to make the model compatible again:

  • There is now a res_block_follows param to upscale; when it is true, the LeakyReLU gets added in the res_block. However, upscale still adds a PixelShuffler, which results in a reversed order of these layers compared to the original model, i.e. original upscale: conv2d -> leaky_re_lu -> pixel_shuffler; model in this branch: conv2d -> pixel_shuffler -> leaky_re_lu (c9d6698)
  • With the change that moved the LeakyReLU from upscale to res_block, the alpha changed from 0.1 to 0.2 (c9d6698)
  • Added a Scale layer (62f2b6f)
  • Removal of Bias in res_block's conv2d layers (268ccf2)

I've only started looking into this codebase today, so I apologize if I missed anything. I don't want to step on anyone's toes here; I just want to share some thoughts while I have them.

Please let me know your thoughts, thanks!

@torzdf (Collaborator, Author) commented Feb 8, 2019

Thanks for the feedback! To answer some of your points.

In an ideal world, dfaker would be model compatible with the original @dfaker model. My main goal was to keep it as 'vanilla' as possible whilst extending functionality where possible. The main purpose of this refactor is to standardize as much as possible, and to make any resource used in one model available for all existing models and any new models. Unfortunately some of those ideals don't necessarily play nicely with each other, so I will always choose to move towards standardization over maintaining custom compatibility. That said, @kvrooman would be better placed to comment on the reasoning behind the changes to the nn_blocks.

When you say legacy dfaker, do you mean earlier versions in this branch? If so, unfortunately we won't maintain backwards compatibility. Anything and everything in this branch is subject to change until it gets merged to staging (hopefully very soon).

Whilst "While True" loops aren't particularly great practice, they also don't generally eat too many CPU cycles, so it shouldn't be too much of an issue here. There is definitely an issue with the feeders, and the plan is to move A and B into their own processes, as everything is competing for single-threaded CPU time at the moment. I have also noticed that dfaker feeds particularly slowly, and I will investigate why. I've decided to put it on the backburner for now as, whilst it isn't great, it isn't model breaking, and moving to multiprocessing is likely to involve a fairly hefty rewrite to keep everything thread-safe. It is high on the list once we've got this migrated into master though.
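
For the curious, a very rough sketch of that idea: one feeder process per side pushing finished batches onto a queue that the trainer consumes. Everything here, including make_batch and the paths, is a stand-in rather than repo code.

```python
# Rough sketch of per-side feeder processes; make_batch and paths are stand-ins.
import multiprocessing as mp
import numpy as np

def make_batch(image_paths, batch_size=16):
    # Placeholder: a real feeder would load, warp and augment the face images.
    return np.zeros((batch_size, 64, 64, 3), dtype="float32")

def feeder(image_paths, batch_queue):
    while True:                                   # runs until the process is killed
        batch_queue.put(make_batch(image_paths))  # blocks while the queue is full

if __name__ == "__main__":
    paths = {"a": ["faces_a"], "b": ["faces_b"]}  # hypothetical input folders
    queues = {side: mp.Queue(maxsize=8) for side in ("a", "b")}
    for side in ("a", "b"):
        mp.Process(target=feeder, args=(paths[side], queues[side]), daemon=True).start()
    batch_a = queues["a"].get()                   # the trainer pulls ready-made batches
```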

I will look into the possibility of adding a converter for dfaker's alignments files.

@DUZszyi commented Feb 8, 2019

Thanks for the answers, that clears up some things I was wondering.

Where I said "legacy dfaker" I was referring to the models defined in the original df repo. I have quite some weights files that match this model, which I would love to re-use with this project. They have many hours of training in them and in my experience it works quite well to re-use existing weights from decoders as a crude form of transfer learning.

Regarding the required resources for the dfaker feeder: if it is similar to the original dfaker code, with its warping and matching of similar landmarks, I can imagine it would be slower than the others. I understand this gets a lower priority compared to getting this merged into master. In the meantime I will dig into this part of the code base and see whether there is some low-hanging fruit, or whether I can start making the feeding multiprocess; at the very least I'll get an idea of what needs to be done there.

Thanks!

@kvrooman (Contributor) commented Feb 8, 2019

Some notes on the items you highlighted.

There was a focus on improving performance and stability for the models, especially when some training instability errors crept up. I generally applied some typical ResNet practices to our code. I realize this may cause some backwards-compatibility issues with legacy models. You could use a weight loader to load your old weights into the updated model arch, as the layers are still all the same. We don't do this in our code, but I've done it myself for other models.

---There is now a res_block_follows param to upscale; when it is true, the LeakyReLU gets moved into the res_block. However, upscale still adds a PixelShuffler, which results in a reversed order of these layers compared to the original model, i.e. original upscale: conv2d -> leaky_re_lu -> pixel_shuffler; model in this branch: conv2d -> pixel_shuffler -> leaky_re_lu (c9d6698)

  • Multiplying by a constant (as in a ReLU) and then reshaping, versus reshaping with the same method and then multiplying by the same constant, won't change any values: the order of operations won't affect the result, as PixelShuffler is just a fancy reshape function and involves no arithmetic. The advantage is better flow through the res_block with a pre-activation-style residual.
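
A quick numpy check of that claim, assuming a NHWC pixel shuffle implemented as a reshape/transpose (not faceswap's actual PixelShuffler, but the same rearrangement):

```python
# Elementwise activation commutes with a pure rearrangement such as pixel shuffle.
import numpy as np

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)        # elementwise, position-independent

def pixel_shuffle(x, r=2):
    n, h, w, c = x.shape                        # NHWC, c divisible by r*r
    x = x.reshape(n, h, w, r, r, c // (r * r))
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(n, h * r, w * r, c // (r * r))

x = np.random.randn(1, 4, 4, 8).astype("float32")
before = pixel_shuffle(leaky_relu(x))           # activation, then shuffle
after = leaky_relu(pixel_shuffle(x))            # shuffle, then activation
assert np.allclose(before, after)               # identical results either way
```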

---With the change that moved the LeakyReLU from upscale to res_block, the alpha changed from 0.1 to 0.2 (c9d6698)

  • Ideally the alpha of the LeakyReLU should be the same throughout the whole model and be fine-tuned, i.e. is 0.105 better than 0.10? That being said, the current code base uses 0.1 and 0.2 in the res_block. I left the res_block alpha as it was, but we can at least keep the first pre-activation ReLU at the same 0.1, and consider moving the rest to 0.1 as well.

---Added a Scale layer (62f2b6f)
  • This was related to stability issues in some models where they would have exploding gradients. Adding a learnable scale layer, with the multiplier initially set to zero or a small value, helps with the stability and speed of training of the preceding residual block. An old model's weights in the residual block would have internally learned this scaling factor.
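
A minimal sketch of such a learnable scale layer in Keras, as an illustration of the idea rather than faceswap's actual Scale layer:

```python
# Illustrative learnable residual scale, initialised at zero as described above.
from keras.layers import Layer

class ResidualScale(Layer):
    """Multiplies its input by a single learnable scalar, starting at zero."""
    def build(self, input_shape):
        self.gamma = self.add_weight(name="gamma", shape=(1,),
                                     initializer="zeros", trainable=True)
        super().build(input_shape)

    def call(self, inputs):
        return inputs * self.gamma
```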

---Removal of Bias=False in res_block's conv2d layers

  • Traditionally, residual blocks will always use Batch Normalization after the convolution. BN has a bias adder internally, so the bias in the preceding convolution is usually removed as superfluous to speed up calculation. We don't use BN, as it worsens/stops the identity swapping, so adding the bias back into the conv adds more expressive power, as seen in every other convolution in the model.
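
A small sketch contrasting the two conventions (illustrative layer sizes, not repo code):

```python
# The classic residual-block convention vs. the no-BatchNorm case described above.
from keras.layers import Input, Conv2D, BatchNormalization

inp = Input(shape=(64, 64, 64))

# With BatchNorm: BN adds its own bias, so the conv drops use_bias as redundant.
with_bn = BatchNormalization()(Conv2D(64, 3, padding="same", use_bias=False)(inp))

# Without BatchNorm (omitted here because it hurts the identity swap),
# the convolution keeps its bias for the extra expressive power.
without_bn = Conv2D(64, 3, padding="same", use_bias=True)(inp)
```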

@DUZszyi commented Feb 8, 2019

Thanks for the explanations of these changes; it all makes a lot of sense now.

I am interested in loading weights into a different model arch. There is already a convert_legacy_weights function in place which uses load_weights. From what I read in the docs, one could pass by_name:

By default, the architecture is expected to be unchanged. To load weights into a different architecture (with some layers in common), use by_name=True to load only those layers with the same name.

but that will not work here, because not only are the layer names different, layers with the same name are also used in different places, i.e. a conv2d_10 in the new model is in a different location than the conv2d_10 of the old model. Do you have any suggestions on how to approach this?

Would I need to make a mapping between the old and new names of corresponding layers and then layer.set_weights each layer individually?
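
Something like this per-layer sketch is what I have in mind; the file paths and layer names below are purely illustrative, not the actual dfaker/faceswap ones.

```python
# Rough sketch of the per-layer approach; paths and layer names are illustrative.
from keras.models import load_model

old_model = load_model("legacy_dfaker_decoder_A.h5")   # hypothetical files
new_model = load_model("new_dfaker_decoder_A.h5")

# Built by hand, by inspecting both model.summary() outputs side by side.
name_map = {
    "conv2d_10": "conv2d_7",   # illustrative pairing only
    # ...
}

for old_name, new_name in name_map.items():
    weights = old_model.get_layer(old_name).get_weights()
    new_model.get_layer(new_name).set_weights(weights)
```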

Thanks!

@torzdf torzdf merged commit cd00859 into staging Feb 9, 2019
@torzdf torzdf deleted the train_refactor branch February 9, 2019 18:37