model_refactor (#571) #572

Merged
merged 157 commits from train_refactor into staging on Feb 9, 2019

Conversation

@torzdf (Collaborator) commented Jan 2, 2019

This is alpha! It is getting close to merging to staging, but bear in mind that:
Some items won't work.
Some items will work.
Some will stay the same.
Some will be changed.

Do not rely on anything in this branch until it has been merged to master. That said, you are welcome to test and report bugs. If reporting bugs please provide a crash report.

This PR significantly refactors the training part of Faceswap.

New models
Support for the following models has been added:

  • dfaker (@dfaker)

  • dfl h128

  • villain, a very resource-intensive model by @VillainGuy

  • GAN has been removed, with a view to adding GAN v2.2 down the line.

  • lowmem has been removed, but you can access the same functionality by enabling the 'lowmem' option in the config.ini for the original model

Config ini files
Config files for each section will be generated the first time that section is run, or when the GUI is launched, and will be placed in /<faceswap folder>/config/. These config files contain customizable options for each of the plugins, plus some global options. They are also accessible from the "Edit" menu in the GUI.
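
For illustration, here is a minimal sketch of tweaking one of the generated files with Python's configparser. The file name and the section name below are assumptions, not the exact layout faceswap generates; only the 'lowmem' option is taken from the description above.

```python
# Illustrative only: "config/train.ini" and "model.original" are assumed names;
# "lowmem" is the option mentioned in the description above.
import configparser

config = configparser.ConfigParser()
config.read("config/train.ini")              # generated on the first training run
config["model.original"]["lowmem"] = "True"  # enable the low-memory original model
with open("config/train.ini", "w") as handle:
    config.write(handle)
```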

Converters have been re-written (see #574)

Known Bugs:
[GUI] Preview not working on Windows?
[Postponed - minor issue] Training takes a long time to start?

Todo:

  • Confirm warp-to-landmarks works as expected
  • [Postponed] [GUI] Resume last session option
  • [Postponed] Check for bk files if model files don't exist
  • [Postponed] GAN v2.2 port (maybe)
  • [Postponed] 'Ping-pong' training option?
  • [GUI] Auto switch to tab on open recent
  • [GUI] Read session type from saved file
  • [GUI] Analysis tab, allow filtering by loss type
  • [GUI] Remove graph animation and replace with refresh button
  • Read loss in GUI from model loss file
  • Store loss to file
  • Parallel model saving
  • Reinstate OHR RC4Fix
  • TensorBoard support
  • Add dfaker "landmarks based warping" option to all models
  • Tweak converter
  • Update config to delete old items as well as insert new items
  • Confirm timelapse working
  • Cleanup preview for masked training
  • Add masks to all models
  • Add coverage option to all models
  • [converters] Histogram currently non-functional; working on mask/image interactions
  • Parametrize size and padding across the code
  • Merge @kvrooman PRs
  • Add dfaker mask to converters. [Cancelled] Too similar to facehull to be worth it
  • Standardise NN Blocks
  • Fix for Backwards compatibility
  • Converters for new models (Consolidation of converters & refactor #574)
  • Input shape. Decide which can be configured and which must be static
  • Load input shapes from state file for saved models
  • expand out state file (Add current model config to this and load from here)
  • Save model definition
  • merge RC4_fix into original_hires
  • Backup state file with model
  • Improve GUI CPU handling for graph
  • Config options in GUI
  • Model corruption protection

Detail
A lot of the code has been standardized across all models, so they now all share the same loading/saving/training/preview functions. NN blocks and other training functions have been separated out into their own libraries so that they can be used by multiple models. This should make it easier to develop new models by using or adding objects in the lib/model data store.
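
As a rough illustration of the kind of shared block lib/model is meant to hold, here is a minimal sketch of an upscale block in Keras. The function name and layer order follow the conv2d -> leaky_re_lu -> pixel_shuffler pattern discussed later in this thread, and tf.nn.depth_to_space stands in for faceswap's own PixelShuffler layer; this is not the repo's code.

```python
# Minimal sketch of a reusable upscale block (illustrative, not repo code).
import tensorflow as tf
from keras.layers import Conv2D, Lambda, LeakyReLU

def upscale(inp, filters, kernel_size=3):
    # 4x the filters so that a 2x2 depth-to-space leaves `filters` channels.
    x = Conv2D(filters * 4, kernel_size, padding="same")(inp)
    x = LeakyReLU(0.1)(x)
    x = Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)  # pixel-shuffle upsample
    return x
```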

Abandoned

  • [on advice] Add adjustable input sizes to all models

commits:

  • original model to new structure

  • IAE model to new structure

  • OriginalHiRes to new structure

  • Fix trainer for different resolutions

  • Initial config implementation

  • Configparse library added

  • improved training data loader

  • dfaker model working

  • Add logging to training functions

  • Non blocking input for cli training

  • Add error handling to threads. Add non-mp queues to queue_handler

  • Improved Model Building and NNMeta

  • refactor lib/models

  • training refactor. DFL H128 model Implementation

  • Dfaker - use hashes

  • Move timelapse. Remove perceptual loss arg

  • Update INSTALL.md. Add logger formatting. Update Dfaker training

  • DFL h128 partially ported

@kvrooman (Contributor) commented Jan 2, 2019

I like the re-org; I have a few of the same losses in an offline repo. Along with the testing, I can add some of the loss code.

@Enyakk commented Jan 2, 2019

I am getting the following error when trying the dfaker model:
Configuration: Windows 10, GTX 2080, CUDA 9.0, TensorFlow 1.12

@torzdf (Collaborator, Author) commented Jan 2, 2019

@kvrooman Thanks, would be appreciated
@Enyakk Please post crash log

@Enyakk commented Jan 2, 2019

Here is the complete log:
crash_report.2019.01.02.160142268336.log

@torzdf (Collaborator, Author) commented Jan 2, 2019

There is another bug in the loss function for dfaker, so this may be related. On the face of it, I don't see an issue with your setup.

Next on my list is to implement the mask for dfaker and fix the loss function, so hold tight until the next update (probably tomorrow) and try again.

* Remove old models. Add mask to dfaker
@kvrooman (Contributor) commented Jan 3, 2019

Looking at the I/O pre-processing into the GAN model:
you should build the normalization input step into either the configs or, alternatively, into the GAN v2.2 model file itself,
i.e. after inp = Input(shape=(input_size, input_size, self.num_chans_g_inp)) insert
varx = Lambda(lambda x: x * 2.0 - 1.0)(inp)
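
For context, a self-contained sketch of that normalization step; input_size and the channel count below are assumed example values, not taken from the GAN model file:

```python
# Sketch of the suggested normalization on the generator input;
# input_size and num_channels are assumed example values.
from keras.layers import Input, Lambda

input_size, num_channels = 128, 3
inp = Input(shape=(input_size, input_size, num_channels))
varx = Lambda(lambda x: x * 2.0 - 1.0)(inp)  # rescale [0, 1] images to [-1, 1]
```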

* DFL H128 Mask. Mask type selectable in config.
@torzdf (Collaborator, Author) commented Jan 3, 2019

@Enyakk I just randomly hit the same bug. Should be fixed now.

@DUZszyi commented Feb 8, 2019

Hi I took this branch for a spin. I have been using (a fork of) dfaker's repo for a while and I wanted to check this project out. My fork didn't touch the model architecture at all so I figured I could use my weights files on the dfaker model of your branch.

Let me first say that I love the refactor done in this branch! When I last checked out master a few months ago I quickly gave up. Thanks for all that work.

That said, I did run into some issues:

  1. dfaker model is no longer compatible with "legacy dfaker" model. @kvrooman made some changes to the nn_blocks which caused this (listed below). Do we want to keep compatibility with legacy models (I would like that)? I assume the changes to the model are made with good reason, so should we then have a model 'dfaker original' and another 'dfaker kvrooman' for example?
  2. At first I did not enable previews and the whole thing ran very slowly. Later, when I realized that it would run at normal speed when previews were enabled, I may have found the cause of the slowness: monitor_console has a tight while loop without any sleep at all (see the sketch after this list). This may eat up resources unnecessarily. I haven't confirmed this though; if I do, I'll follow up on it. (Could be related to "[Postponed - minor issue] Training takes a long time to start?")
  3. This is not a big issue, but I was wondering if it would be at all possible to write a converter from dfaker's alignment.json format to the format used by this project? I took a quick look and saw that there is already a converter for DeepFaceLab's format.
  4. While I haven't done a proper investigation, it looks like the image processing in the training data acquisition process/thread is not fast enough to keep a single 1080 Ti busy. Whatever the cause, the GPU is not fully utilized (it gets to about 30 to 40%). The original dfaker project wasn't able to exhaust the GPU either, but there it was able to keep it at about 80% (on this same hardware configuration). The CPU is an i7 3770; the disk is fast and does not appear to be a bottleneck. The CPU is not 100% utilized either; is there a way to spawn more Python processes to parallelise the training data processing?
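
To make point 2 concrete, here is a hedged sketch of the kind of change I have in mind; the function body below is hypothetical, not the actual monitor_console code, and the point is only the sleep that yields the CPU between checks.

```python
# Hypothetical shape of the console-monitoring loop; only the sleep matters.
import time

def monitor_console(stop_event, check_for_input, poll_interval=0.1):
    while not stop_event.is_set():
        check_for_input()          # placeholder for the real console check
        time.sleep(poll_interval)  # yield the CPU instead of spinning
```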

Regarding the model changes (from point 1), here are the changes I reverted to make the model compatible again:

  • There is now a res_block_follows param to upscale; when it is true, the LeakyReLU gets added in the res_block. However, upscale still adds a PixelShuffler, which results in a reversed order of these layers compared to the original model, i.e. original upscale: conv2d -> leaky_re_lu -> pixel_shuffler; model in this branch: conv2d -> pixel_shuffler -> leaky_re_lu (c9d6698)
  • With the change that moved the LeakyReLU from upscale to res_block, the alpha changed from 0.1 to 0.2 (c9d6698)
  • Added a Scale layer (62f2b6f)
  • Removal of Bias in res_block's conv2d layers (268ccf2)

I've only started looking into this codebase today, so I apologize if I missed anything. I don't want to step on anyone's toes here; I just want to share some thoughts while I have them.

Please let me know your thoughts, thanks!

@torzdf (Collaborator, Author) commented Feb 8, 2019

Thanks for the feedback! To answer some of your points.

In an ideal world, dfaker would be model compatible with the original @dfaker model. My main goal was to keep it as 'vanilla' as possible whilst extending functionality where possible. The main purpose of this refactor is to standardize as much as possible, and to make any resource used in one model available for all existing models and any new models. Unfortunately some of those ideals don't necessarily play nicely with each other, so I will always choose to move towards standardization over maintaining custom compatibility. That said, @kvrooman would be better placed to comment on the reasoning behind the changes to the nn_blocks.

When you say legacy dfaker, do you mean earlier versions in this branch? If so, unfortunately we won't maintain backwards compatibility. Anything and everything in this branch is subject to change until it gets merged to staging (hopefully very soon).

Whilst "While True" loops aren't particularly great practice, they also don't generally eat too many CPU cycles, so it shouldn't be too much of an issue here. There is definitely an issue with the feeders, and the plan is to move A and B into their own processes, as everything is competing for single-threaded CPU time at the moment. I have also noticed that dfaker feeds particularly slowly, and I will investigate why. I've decided to put it on the backburner for now as, whilst it isn't great, it isn't model breaking, and moving to multiprocessing is likely to involve a fairly hefty rewrite to keep everything thread-safe. It is high on the list once we've got this migrated into master though.
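
For the curious, a very rough sketch of that idea: one feeder process per side pushing finished batches onto a queue that the trainer consumes. Everything here, including make_batch and the paths, is a stand-in rather than repo code.

```python
# Rough sketch of per-side feeder processes; make_batch and paths are stand-ins.
import multiprocessing as mp
import numpy as np

def make_batch(image_paths, batch_size=16):
    # Placeholder: a real feeder would load, warp and augment the face images.
    return np.zeros((batch_size, 64, 64, 3), dtype="float32")

def feeder(image_paths, batch_queue):
    while True:                                   # runs until the process is killed
        batch_queue.put(make_batch(image_paths))  # blocks while the queue is full

if __name__ == "__main__":
    paths = {"a": ["faces_a"], "b": ["faces_b"]}  # hypothetical input folders
    queues = {side: mp.Queue(maxsize=8) for side in ("a", "b")}
    for side in ("a", "b"):
        mp.Process(target=feeder, args=(paths[side], queues[side]), daemon=True).start()
    batch_a = queues["a"].get()                   # the trainer pulls ready-made batches
```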

I will look into the possibility of adding a converter for dfaker's alignments files.

@DUZszyi commented Feb 8, 2019

Thanks for the answers, that clears up some things I was wondering.

Where I said "legacy dfaker" I was referring to the models defined in the original df repo. I have quite some weights files that match this model, which I would love to re-use with this project. They have many hours of training in them and in my experience it works quite well to re-use existing weights from decoders as a crude form of transfer learning.

Regarding the required resources for the dfaker feeder: if it is similar to the original dfaker code, with its warping and matching of similar landmarks, I can imagine it would be slower than the others. I understand this gets a lower priority compared to getting this merged into master. In the meantime I will dig into this part of the code base and see whether there is some low-hanging fruit, or whether I can start making the feeding multiprocess; at the very least I'll get an idea of what needs to be done there.

Thanks!

@kvrooman (Contributor) commented Feb 8, 2019

Some notes on the items you highlighted.

There was a focus on improving performance and stability for the models, especially when some training instability errors crept up. I generally applied some typical ResNet practices to our code. I realize this may cause some backwards-compatibility issues with legacy models. You could use a weight loader to load your old weights into the updated model arch, as the layers are still all the same. We don't do this in our code, but I've done it myself for other models.

---There is now a res_block_follows param to upscale; when it is true, the LeakyReLU gets moved into the res_block. However, upscale still adds a PixelShuffler, which results in a reversed order of these layers compared to the original model, i.e. original upscale: conv2d -> leaky_re_lu -> pixel_shuffler; model in this branch: conv2d -> pixel_shuffler -> leaky_re_lu (c9d6698)

  • Multiplying by a constant (as in a ReLU) and then reshaping, versus reshaping with the same method and then multiplying by the same constant, won't change any values: the order of operations won't affect the result, as PixelShuffler is just a fancy reshape function and involves no arithmetic. The advantage is better flow through the res_block with a pre-activation-style residual.
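
A quick numpy check of that claim, assuming a NHWC pixel shuffle implemented as a reshape/transpose (not faceswap's actual PixelShuffler, but the same rearrangement):

```python
# Elementwise activation commutes with a pure rearrangement such as pixel shuffle.
import numpy as np

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)        # elementwise, position-independent

def pixel_shuffle(x, r=2):
    n, h, w, c = x.shape                        # NHWC, c divisible by r*r
    x = x.reshape(n, h, w, r, r, c // (r * r))
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(n, h * r, w * r, c // (r * r))

x = np.random.randn(1, 4, 4, 8).astype("float32")
before = pixel_shuffle(leaky_relu(x))           # activation, then shuffle
after = leaky_relu(pixel_shuffle(x))            # shuffle, then activation
assert np.allclose(before, after)               # identical results either way
```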

---With the change that moved the LeakyReLU from upscale to res_block, the alpha changed from 0.1 to 0.2 (c9d6698)

  • Ideally the alpha of the LeakyReLU should be the same throughout the whole model and be fine-tuned, i.e. is 0.105 better than 0.10? That being said, the current code base uses 0.1 and 0.2 in the res_block. I left the res_block alpha as it was, but we can at least keep the first pre-activation ReLU at the same 0.1, and consider moving the rest to 0.1 as well.

---Added a Scale layer (62f2b6f)
  • This was related to stability issues in some models where they would have exploding gradients. Adding a learnable scale layer, with the multiplier initially set to zero or a small value, helps with the stability and speed of training of the preceding residual block. An old model's weights in the residual block would have internally learned this scaling factor.
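
A minimal sketch of such a learnable scale layer in Keras, as an illustration of the idea rather than faceswap's actual Scale layer:

```python
# Illustrative learnable residual scale, initialised at zero as described above.
from keras.layers import Layer

class ResidualScale(Layer):
    """Multiplies its input by a single learnable scalar, starting at zero."""
    def build(self, input_shape):
        self.gamma = self.add_weight(name="gamma", shape=(1,),
                                     initializer="zeros", trainable=True)
        super().build(input_shape)

    def call(self, inputs):
        return inputs * self.gamma
```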

---Removal of Bias=False in res_block's conv2d layers

  • Traditionally, residual blocks will always use Batch Normalization after the convolution. BN has a bias adder internally, so the bias in the preceding convolution is usually removed as superfluous to speed up calculation. We don't use BN, as it worsens/stops the identity swapping, so adding the bias back into the conv adds more expressive power, as seen in every other convolution in the model.
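
A small sketch contrasting the two conventions (illustrative layer sizes, not repo code):

```python
# The classic residual-block convention vs. the no-BatchNorm case described above.
from keras.layers import Input, Conv2D, BatchNormalization

inp = Input(shape=(64, 64, 64))

# With BatchNorm: BN adds its own bias, so the conv drops use_bias as redundant.
with_bn = BatchNormalization()(Conv2D(64, 3, padding="same", use_bias=False)(inp))

# Without BatchNorm (omitted here because it hurts the identity swap),
# the convolution keeps its bias for the extra expressive power.
without_bn = Conv2D(64, 3, padding="same", use_bias=True)(inp)
```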

@DUZszyi commented Feb 8, 2019

Thanks for the explanations of these changes; it all makes a lot of sense now.

I am interested in loading weights into a different model arch. There is already a convert_legacy_weights function in place which uses load_weights. From what I read in the docs, one could pass by_name:

By default, the architecture is expected to be unchanged. To load weights into a different architecture (with some layers in common), use by_name=True to load only those layers with the same name.

but that will not work here, because not only are the layer names different, layers with the same name are also used in different places, i.e. a conv2d_10 in the new model is in a different location than the conv2d_10 of the old model. Do you have any suggestions on how to approach this?

Would I need to make a mapping between the old and new names of corresponding layers and then layer.set_weights each layer individually?
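
Something like this per-layer sketch is what I have in mind; the file paths and layer names below are purely illustrative, not the actual dfaker/faceswap ones.

```python
# Rough sketch of the per-layer approach; paths and layer names are illustrative.
from keras.models import load_model

old_model = load_model("legacy_dfaker_decoder_A.h5")   # hypothetical files
new_model = load_model("new_dfaker_decoder_A.h5")

# Built by hand, by inspecting both model.summary() outputs side by side.
name_map = {
    "conv2d_10": "conv2d_7",   # illustrative pairing only
    # ...
}

for old_name, new_name in name_map.items():
    weights = old_model.get_layer(old_name).get_weights()
    new_model.get_layer(new_name).set_weights(weights)
```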

Thanks!

@torzdf torzdf merged commit cd00859 into staging Feb 9, 2019
@torzdf torzdf deleted the train_refactor branch February 9, 2019 18:37