Auto rotate frames containing faces that are not vertical to fix crappy insight face bug. #364
roop/ProcessMgr.py
Outdated
    rotated_bbox = self.rotate_bbox_clockwise(original_face.bbox, frame)
    frame = rotate_clockwise(frame)
    target_face = self.get_rotated_target_face(rotated_bbox, frame)
else:
a # here too?
Was a little confused when I saw this comment because I thought I removed this. Turns out I forgot that one last commit. Fixed it now.
I tried, sure does reduce flickering. Impressive
Right! Wish I had more energy to improve it further, but it's pretty cool.
Have you ever thought about what would happen if we changed the original refacer white square mask to a green one? Wouldn't it be much easier to detect? There is so much overexposed and natural white everywhere, and green almost nowhere. I mean that `mask_h_inds, mask_w_inds = np.where(img_matte==255)` thing. I have an issue with something silly that might take you less time to solve than to read this.
Interesting note about the white square masking box. I don't know how the actual faceswap works, as I didn't need to read that part of the code to achieve this, but what you say makes sense on an intuitive level, since that's exactly why they use green screens in video production. Can I ask what problems you've encountered? How do you know when you're encountering such a problem and that's the cause of it? I may have been encountering some of those issues too and just not been able to spot them. Another case that really annoys the heck out of me is frames where the subject is looking downwards and it's just not able to reliably match up the shape of the face/head in a way that makes sense. I think in theory it's possible to just drop all of those kinds of shitty frames if you were to train a classifier which takes the detected face as input and spits out a "yeah this isn't going to work, so just skip the frame" rather than trying to do a low quality swap. Though to be honest, that's the approach I could have taken in this PR too, but I haven't trained my own custom model that isn't dreambooth, so I just tried to work out if I could do it based on heuristics from the data coming back from the inferenced face detection. Maybe that means the same tactic might work for that case too... like for example if the landmarks that run from the forehead to the chin are too squished together then you can assume the subject is looking down and just skip the frame. Hmm...
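The "squished landmarks" idea above can be sketched roughly as follows. This is only an illustration under assumptions: it uses a 5-point landmark layout (left eye, right eye, nose, left mouth corner, right mouth corner) rather than forehead-to-chin points, and the function names and the `min_ratio` cut-off are hypothetical and untuned. A head tilted far upwards would compress the same ratio, so the check cannot tell up from down.

```python
import numpy as np

def vertical_compression(kps):
    """Ratio of eye-to-mouth distance over inter-eye distance.

    Assumes kps is a 5x2 array ordered: left eye, right eye, nose,
    left mouth corner, right mouth corner (an assumption, see lead-in).
    """
    kps = np.asarray(kps, dtype=np.float64)
    eye_mid = (kps[0] + kps[1]) / 2.0
    mouth_mid = (kps[3] + kps[4]) / 2.0
    eye_dist = np.linalg.norm(kps[1] - kps[0])
    if eye_dist == 0:
        return 0.0  # degenerate landmarks, treat as fully compressed
    return float(np.linalg.norm(mouth_mid - eye_mid) / eye_dist)

def looks_pitched(kps, min_ratio=0.6):
    """True when the face looks vertically squished (hypothetical cut-off)."""
    return vertical_compression(kps) < min_ratio
```

A frame flagged by `looks_pitched` could then be routed to whatever "no face" action is configured instead of being swapped.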
I'm trying to improve the experience and results on Mac. If drawthings.ai (a really good Stable Diffusion app for Mac) can leverage GPU + CPU + Neural Engine, add its own Metal optimizations, manage memory without dumping everything into the flexible_size_partition_for_swap that's only limited by available disk (an issue with Python sometimes), and avoid being Core ML exclusive while still offering a temporary Core ML model conversion... The white rectangle struck me because I used to have an issue with a visible square, and also because the majority of my videos kind of have a white background, so I found it silly. Side note: the processor draws a 1 px dark border to delimit it. The problems I find are systematic when the top of the face is out of view, like in a close-up where the top of the forehead is cut off: no face detected. I think detection should focus on the eyes and nose rather than the overall shape, and then use some sort of multi-point trapezoid correction to morph source onto target (or vice versa), focusing mostly on the eyes, nose and mouth. You managed the double 90 rotations, why not a 360 rotation? Maybe there is something that allows for a quick complete rotation. Or add some entropy, do a few rotations, and take the best. And I talked about needing a human and multiple steps? I wrote more than I expected
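The "do a few rotations and take the best" suggestion could look something like the sketch below. It is not what the PR does; `score_fn` is a hypothetical callable standing in for "run face detection on this frame and return a confidence", and only the four right-angle rotations are tried.

```python
import numpy as np

def best_rotation(frame, score_fn):
    """Try 0, 90, 180 and 270 degree rotations and keep the best score.

    score_fn is a hypothetical callable: frame -> float (higher is better).
    Returns (k, score), where k is the number of 90 degree
    counter-clockwise turns passed to np.rot90.
    """
    best_k, best_score = 0, float("-inf")
    for k in range(4):
        score = score_fn(np.rot90(frame, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

After swapping on `np.rot90(frame, k)`, rotating the result with `np.rot90(result, -k)` restores the original orientation. Running detection four times per frame is the obvious cost of this approach.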
Interesting discussion you got here and a nice sounding PR. I'm hoping to find the time for a merge this week.
I was reading the PR with intense fascination! C0unt, you made one heck of an amazing workspace. I use it very often and it's very impressive. I wish I had the skill to work with some of these PRs I've read around this project. I managed to get the TensorRT provider working on my RTX 4050 by making some adjustments. I just get the occasional black box instead of a face. But I'm trying my best to learn and follow along with you guys. Thank you very much for looking out for my SSDs and providing an in-memory processing option.
return frame
if roop.globals.no_face_action == 2:
if roop.globals.no_face_action == skip_frame:
#This only works with in-mem processing, as it simply skips the frame.
Can we duplicate the previous frame in lieu of using the unprocessed frame?
I think the conclusion that I came to when originally thinking about this question was that you could, and it would work well for a subset of cases, but it would likely result in other scenarios where there are long periods where the video appears to be frozen because this happens many times in a row. Think about the behavior when there is no face in the frame: is that the desired behavior under that circumstance?
Alternatively you could add that as an option. But it's a quick and dirty fix with limited utility that adds complexity to the code, making it even harder to work with than it already is.
I think what's really needed is to track what actions were taken during processing, and use that info to do a second pass, where the second pass employs heuristics that solve known edge cases.
Ideally you don't want the user to have to navigate an increasingly complex web of options in order to process their video successfully; you just kind of want them to hit process and have it work.
@LatentLoser I finally had the time to look into your PR and I have some questions:
This will skip the swapped face if it is very different from the reference face. This however is likely to fail quite often, especially with faces looking sideways, even more if the target face morphology is very different, and finally when the face similarity value used is the strict default value. Another drawback: the similarity test is performed after all of the post-processing, and all of the Enhancers change quite a lot of the face identity. IMO there should be at least a second setting just for this, or some percentage like a 20% tolerance on top of the regular one. While testing I had perfectly swapped faces where the similarity value was > 1.0, which usually means that it is a different person, and they would be skipped by your code. Enough nitpicking. I'm happy when people try to improve this actively, thanks again!
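One way to read the tolerance suggestion above is as a looser threshold used only for the post-swap check. A minimal sketch, assuming the similarity value is a cosine distance between face embeddings (lower means more similar), which this thread does not confirm; the function names, the example threshold and the 20% default are all hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors (0 = same direction)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_skip_swapped_face(swapped_emb, reference_emb, threshold, tolerance=0.2):
    """Skip only when the distance exceeds the regular threshold plus a margin.

    tolerance=0.2 widens the regular threshold by 20% for this check only,
    to allow for identity drift introduced by enhancers (hypothetical default).
    """
    return cosine_distance(swapped_emb, reference_emb) > threshold * (1.0 + tolerance)
```

With a regular threshold of 0.65, a post-processed face at distance 0.7 would be kept under the widened limit of 0.78 but rejected under the strict one. A fixed percentage would still reject the "> 1.0 but visually fine" cases mentioned above, so a separate setting may be the more flexible of the two options.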
That dependency might just be left over from an earlier attempt to accomplish what I was trying to do, and I missed cleaning it up in the PR? The rotation is the thing that makes the massive difference; it's a well known bug in insightface. It's particularly obvious if you have a face that's at a 90 degree angle, as the swapper often severely mangles the face, generating some real nightmare fuel, and this is due to the 2d landmark positions getting all screwed up. If you haven't naturally happened upon videos that contain this problem, you can simulate it using any video you have by just using ffmpeg to first rotate it so that the face appears horizontally, and then try running it with and without the autorotation fix. As for the similarity comparison thing, I'm not sure I really tested how much of a difference that actually made, but it seemed like a logical thing to do. If it will cause a problem elsewhere then it's probably fine, or even better, to remove it. It's the autorotation that does all the heavy lifting in terms of improvement. As for the actual rotation part, yeah, it potentially could be done differently; this is just the way I thought of at the time and it worked, so I was happy. It could almost certainly be improved or expanded, but what's there at least works and makes a massive difference.
- New auto rotation of horizontal faces, fixing bad landmark positions (expanded on [PR 364](#364))
- Simple VR Option for stereo Images/Movies, best used in selected face mode
- Added RestoreFormer Enhancer - https://github.com/wzhouxiff/RestoreFormer
- Bumped up package versions for onnx/Torch etc.
This PR is kind of huge and contains both a big feature upgrade and several major bug fixes. I'm not planning on making any major changes to this PR in order to help get it merged, so it will just be as-is, but I did still want to share the code in the hopes it helps the roop community move forward from this stupid bug in insight face that can't deal properly with faces that don't appear vertically in frames.
I developed this weeks ago, and from memory the several bug fixes also in it include:
The feature: Auto-rotating frames containing faces that are not vertical
The feature upgrade itself is auto rotating frames that contain a face that is not vertically oriented in order to make it vertically oriented during the faceswap, and then rotating it back to its original position afterwards. Normally, if you try to perform a faceswap with the insight face model on a video that has faces in a scene where someone is lying down or doing a confused dog pose with their head tilted sideways, then the face swap fails and generates horrible garbled results.
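The rotate, swap, rotate-back idea can be sketched with NumPy as below. The names `rotate_clockwise` and `rotate_bbox_clockwise` mirror the calls visible in the review diff, but the bodies here are a reconstruction under assumptions, not the PR's actual code; the anticlockwise helpers are hypothetical counterparts. Bounding boxes are assumed to be `(x1, y1, x2, y2)` in pixel coordinates.

```python
import numpy as np

def rotate_clockwise(frame):
    """Rotate an image array 90 degrees clockwise."""
    return np.rot90(frame, k=-1)

def rotate_anticlockwise(frame):
    """Rotate an image array 90 degrees counter-clockwise."""
    return np.rot90(frame, k=1)

def rotate_bbox_clockwise(bbox, frame):
    """Map a bbox from `frame` into the clockwise-rotated frame."""
    x1, y1, x2, y2 = bbox
    height = frame.shape[0]  # original height becomes the rotated width
    return (height - y2, x1, height - y1, x2)

def rotate_bbox_anticlockwise(bbox, frame):
    """Map a bbox from `frame` into the counter-clockwise-rotated frame."""
    x1, y1, x2, y2 = bbox
    width = frame.shape[1]  # original width becomes the rotated height
    return (y1, width - x2, y2, width - x1)
```

The swap would run on the rotated frame, after which `rotate_anticlockwise` restores the original orientation. Only the bounding box is carried across the rotation, since the 2d landmarks from the unrotated frame are the part that cannot be trusted.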
The options you need to select to use this feature:
How the bug in insight face works
How the bug works is that the insight face model returns the correct location for the bounding box of the face, but it tends to really screw up the position of the 2d landmarks in the face, and this is why the faceswap goes horribly wrong and becomes garbled. A number of users in the community have naturally happened upon the solution of simply rotating the video first before rooping it, and then rotating it back once it has been rooped. This works, but is a giant pain in the ass to prepare in cases where such a scene occurs part way through a video. Hence, I had a look to see if this process could be automated, and to my astonishment it actually can.
How the algorithm to fix it works:
Notes
This works almost perfectly with a few caveats, and could almost certainly be further improved; I just didn't care to at the time. The main caveat is that it sometimes fails to detect the face in a frame, which results in the face flickering back and forth in the final video. The results I got were pretty usable, but to make them basically perfect by default what I did was just select the option to skip frames where no face was detected. This works near perfectly, but results in some video artifacts where the video looks a little bit choppy in parts, and of course the output won't contain any frames that don't contain a face, which may be undesirable, but that's just the trade-off at this point.
I'm sure this can be vastly improved to get near perfect results on basically any video without having to do any work to prepare the video in any way, shape or form. I don't really like working in Python and find untested, dynamically typed codebases frustrating in general, and the structure of this codebase isn't well factored enough to easily make the subsequent changes needed to really take it to the next level, so I kind of ran out of patience with it and put it down at "good enough", but it could be improved to the point where the problem is basically completely solved.
Some ideas to push this much further:
There are basically two sub-problems remaining to solve here.
I think to solve these problems you probably need to shift away from attempting to process each frame in isolation, and instead keep track of what processing actions took place on what frames, then do a second pass at the end where you apply some heuristics to look for cases where the model has likely made a mistake. For example, for each frame that was skipped, check if the frames before and after contained a face that was rotated and swapped, and if they did then assume it was skipped by mistake and instead interpolate the missing frames.
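A rough sketch of that second pass, under several assumptions: a per-frame action log exists (it does not in the current code), skipped frames are kept in the list as placeholders rather than dropped, and a plain linear cross-fade stands in for real frame interpolation. The action labels, function names and `max_gap` limit are all hypothetical.

```python
import numpy as np

ROTATED_SWAP = "rotated_swap"  # hypothetical action labels
SKIPPED = "skipped"

def find_suspect_gaps(actions, max_gap=5):
    """Return (start, end) ranges, end exclusive, of short skipped runs
    that are flanked on both sides by rotated-and-swapped frames."""
    gaps, i, n = [], 0, len(actions)
    while i < n:
        if actions[i] != SKIPPED:
            i += 1
            continue
        start = i
        while i < n and actions[i] == SKIPPED:
            i += 1
        end = i
        flanked = (start > 0 and end < n
                   and actions[start - 1] == ROTATED_SWAP
                   and actions[end] == ROTATED_SWAP)
        if flanked and end - start <= max_gap:
            gaps.append((start, end))
    return gaps

def fill_gaps(frames, actions, max_gap=5):
    """Replace suspect skipped frames with a blend of their neighbours."""
    out = list(frames)
    for start, end in find_suspect_gaps(actions, max_gap):
        before = frames[start - 1].astype(np.float32)
        after = frames[end].astype(np.float32)
        steps = end - start + 1
        for k, idx in enumerate(range(start, end), start=1):
            t = k / steps  # blend weight moves from `before` to `after`
            out[idx] = ((1 - t) * before + t * after).astype(frames[idx].dtype)
    return out
```

The `max_gap` limit is there so that long stretches with genuinely no face are left alone, which avoids the frozen-video effect discussed in the review thread.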
In theory if you develop a reliable algorithm to detect and fix this bug in insight face then that would allow you to build a pipeline that builds a dataset that could be used to train a model that augments insightface and fixes this problem without requiring all the code here.
Further to that, if someone takes this and turns it into a ComfyUI node, and that allows us to build a workflow with it where we can feed it a video, have that video rooped reliably, then apply an img2img pass using a LoRA of our rooped subject to greatly enhance the quality of the face, then we could theoretically pump massive volumes of videos through that to build a dataset to train a better model which can do high resolution faceswaps. Probably anyway, right? I'm not a machine learning engineer, but it appears that's how this sort of thing works.