Describe the bug
We recently started using phashes for matching against StashDB, which has exposed bugs in the phash generation process. A broken or unparsable file produces a common phash, which is then matched to a random scene in StashDB for which that same broken phash was uploaded. We may not be able to fix the underlying ffmpeg issues, but we can work around them. During phash generation and use, we should validate that the phash doesn't match a known bad phash (the result of a solid color, color bars, etc.) and that it passes other phash validation rules.
Further investigation is still needed, but I've identified some facts already:
Real phashes should have a (roughly?) equal number of 1 and 0 bits.
Many bad phashes 'look' strange: they have low entropy.
If ffmpeg determines that the video duration is zero, the phash is almost always junk.
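The observations above could be turned into a cheap pre-check. The following is only a sketch, not Stash's actual code: it assumes a 64-bit phash held as an integer, and the function names and the `tolerance` threshold are hypothetical and would need tuning against real data.

```python
def bit_balance(phash: int, bits: int = 64) -> float:
    """Return the fraction of set bits in a `bits`-wide hash (assumed 64-bit)."""
    mask = (1 << bits) - 1
    return bin(phash & mask).count("1") / bits


def looks_suspicious(phash: int, duration_seconds: float,
                     tolerance: float = 0.25) -> bool:
    """Flag a phash that is probably junk.

    `tolerance` is a hypothetical threshold: how far the share of 1 bits
    may drift from 50% before the hash is flagged.
    """
    # A zero (or negative) duration means ffmpeg could not parse the file.
    if duration_seconds <= 0:
        return True
    # Real phashes should have a roughly equal number of 1 and 0 bits.
    return abs(bit_balance(phash) - 0.5) > tolerance
```

Note that bit balance alone does not capture low entropy: a hash such as `0xFFFFFFFF00000000` is perfectly balanced but still looks strange, so a check like this would only complement a known-bad list.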
Known bad phashes so far:
(Note: they may vary in the wild by 1-3 bits, so checks should use a Hamming-distance match rather than exact equality.)
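A Hamming-distance match against the known-bad list might look like the sketch below. It assumes phashes are held as integers; the function names are hypothetical, the `max_distance` default of 3 is taken from the 1-3 bit variation noted above, and the list of known bad phashes is passed in by the caller rather than guessed here.

```python
from typing import Iterable


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def matches_known_bad(phash: int, known_bad: Iterable[int],
                      max_distance: int = 3) -> bool:
    """Return True if `phash` is within `max_distance` bits of any known bad phash."""
    return any(hamming_distance(phash, bad) <= max_distance for bad in known_bad)
```

For example, with a made-up known-bad value of `0x0` (a placeholder, not one of the reported hashes), `matches_known_bad(0x7, [0x0])` flags the hash because it differs in only three bits, while `matches_known_bad(0xF, [0x0])` does not.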