I'm looking for a way to perform a (potentially destructive) synchronization of the audio tracks of old (dubbed in a different language) and remastered versions of movies.

In my scenario, applying a single audio shift is not enough: sooner or later the tracks drift out of sync, at least due to:

- unpredictable frame drops in both tracks
- mismatched overall average speed (often with a higher pitch for the faster track)
Any interest in supporting such a scenario?
Any existing projects that try to solve this problem?
Any ideas on the best way to implement it?
Naive idea for an implementation (a rough code sketch follows the list):
- do the initial synchronization
- until the old dubbed audio ends:
  - detect whether the current segment potentially contains voice (with something like silero-vad) or something non-silent but non-voiced (ideally, a music segment)
  - somehow measure the tempo difference between the old and new audio segments:
    - if it's voice, recognize it (with something like whisper.cpp) and compare the time span between the first and last word of the segment in the old versus the new audio
    - if it's something else, probably just compare the positions of the two loudest points in the old and new segments
  - shrink/stretch (speed up/slow down) the old (dubbed) audio segments (the analyzed non-silent/non-voiced segment and any of the next N segments)
  - repeat
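A minimal sketch of that loop, assuming silero-vad for voice detection, plain cross-correlation against the remastered track for per-window drift estimation, and librosa for time-stretching (file names, the window length, and the drift-to-rate mapping are placeholders/assumptions, not a worked-out design):

```python
# Rough sketch of the per-window drift-correction idea above.
# Assumptions: both tracks are loaded as 16 kHz mono and processed in
# fixed 30 s windows; librosa's phase-vocoder stretch is "good enough".
import numpy as np
import torch
import librosa
from scipy.signal import correlate

SR = 16_000
WIN = 30 * SR  # 30-second analysis window (placeholder length)

# silero-vad via torch.hub (https://github.com/snakers4/silero-vad)
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = vad_utils

remastered = read_audio("remastered.wav", sampling_rate=SR).numpy()  # placeholder paths
dubbed = read_audio("old_dub.wav", sampling_rate=SR).numpy()

def lag_samples(ref: np.ndarray, other: np.ndarray) -> int:
    """Lag (in samples) at which `other` best matches `ref`, via cross-correlation."""
    corr = correlate(ref, other, mode="full")
    return int(np.argmax(corr)) - (len(other) - 1)

out, prev_drift = [], 0
for start in range(0, min(len(remastered), len(dubbed)) - WIN, WIN):
    ref_seg = remastered[start:start + WIN]
    dub_seg = dubbed[start:start + WIN]

    # Raw correlation is unreliable while people are talking (the dialog
    # differs between languages), so keep the previous estimate there;
    # a fuller version would use whisper word timings instead.
    speech = get_speech_timestamps(torch.from_numpy(dub_seg), vad_model, sampling_rate=SR)
    mostly_speech = sum(s["end"] - s["start"] for s in speech) > WIN // 2

    drift = prev_drift if mostly_speech else lag_samples(ref_seg, dub_seg)

    # Drift accumulated within this window -> speed the dub up or down.
    # (Sign conventions and exact segment boundaries need care in a real tool.)
    extra = drift - prev_drift
    rate = (WIN + extra) / WIN  # rate > 1 speeds up, < 1 slows down
    out.append(librosa.effects.time_stretch(dub_seg, rate=rate))
    prev_drift = drift

corrected = np.concatenate(out)
# e.g. soundfile.write("old_dub.synced.wav", corrected, SR)
```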
Thanks!
alopatindev changed the title from "[Request/Suggestion] Support unpredictable frame drops and unmatching speed/pitch" to "[Request/Suggestion] Support unpredictable frame drops and unmatching speed/pitch (drift correction)" on Dec 29, 2023
Sorry for the late reply, and thanks for the suggestion!
Audalign currently has a "locality" feature, which breaks up audio files into segments and aligns based on the strength of the match between segments of the audio files (more info in the wiki). This could relatively easily be used to stretch the audio files, but it wouldn't handle frame drops.
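For context, a rough sketch of how the locality feature might be used from Python; align_files and fine_align follow the README, but the exact config attribute name and units for locality are an assumption here, so check the wiki:

```python
# Sketch of segment-based ("locality") alignment with audalign.
import audalign as ad

recognizer = ad.CorrelationRecognizer()
recognizer.config.locality = 30  # assumed attribute/units: match in ~30 s chunks (see wiki)

results = ad.align_files(
    "remastered.wav",
    "old_dub.wav",
    destination_path="aligned/",  # placeholder output folder
    recognizer=recognizer,
)

# Optional fine-alignment pass on top of the rough result.
fine_results = ad.fine_align(results)
```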
It looks like AudioAlign's graph/feature is purely based on correlation? I don't have much time to work on this in the near future, but if it's an easy change I'd be happy to work on it. Or, I'd gladly accept pull requests!
silero-vad and whisper look like a neat idea for a new recognizer! For this case, would translated audio segments necessarily line up with word starts and ends? Would translated segments be viable as time markers, or would the shrinking/stretching have to be done based on the background?
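On the word-timing question, a minimal sketch of what the marker extraction could look like, assuming openai-whisper's word-level timestamps (file names and the naive one-to-one segment pairing are placeholders):

```python
# Transcribe each track separately and use only the *timings* of the first
# and last word of each speech segment as drift markers (the words
# themselves differ because the languages differ).
import whisper  # openai-whisper

model = whisper.load_model("base")

def segment_spans(path):
    """(start, end) seconds of the first/last word of each transcribed segment."""
    result = model.transcribe(path, word_timestamps=True)
    spans = []
    for seg in result["segments"]:
        words = seg.get("words") or []
        if words:
            spans.append((words[0]["start"], words[-1]["end"]))
    return spans

old_spans = segment_spans("old_dub.wav")        # placeholder paths
new_spans = segment_spans("remastered.wav")

# Naively pair segments up in order and derive a per-segment stretch ratio
# that could drive the shrink/stretch step; real matching would have to be
# smarter, since segments won't line up one-to-one across dubs.
for (o_start, o_end), (n_start, n_end) in zip(old_spans, new_spans):
    ratio = (o_end - o_start) / max(1e-3, n_end - n_start)
    print(f"old {o_start:8.2f}-{o_end:8.2f}  new {n_start:8.2f}-{n_end:8.2f}  stretch {ratio:.3f}")
```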