Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fire onSpeechEnd or onVadMisfire on pause #63

Merged
merged 8 commits into from
Dec 15, 2023
Merged

Conversation

HayatoYagi
Copy link
Contributor

Fix #30 based on #45

Currently, the last speech segment can be discarded on pause. This PR fixes the behavior to treat a pause as something similar to stopping speaking.

@HayatoYagi HayatoYagi requested a review from ricky0123 November 28, 2023 15:20
@HayatoYagi HayatoYagi self-assigned this Nov 28, 2023
@HayatoYagi HayatoYagi changed the base branch from update-test-site to master November 28, 2023 15:24
@HayatoYagi HayatoYagi changed the base branch from master to update-test-site November 28, 2023 15:25
@HayatoYagi
Copy link
Contributor Author

HayatoYagi commented Nov 28, 2023

I wanted to test this PR with my updated test site so the base branch is #62, but these two PRs have nothing to do with each other. I will rebase to master after #62 is addressed.

Base automatically changed from update-test-site to master December 2, 2023 17:50
@ricky0123
Copy link
Owner

Hi @HayatoYagi , thanks for opening this PR. This would be a breaking change, though, and so I would not want to merge it as is. I would be open to enabling this behavior via the method I suggested here, or really any way that doesn't change the default behavior.

@HayatoYagi
Copy link
Contributor Author

@ricky0123
In my opinion, the current behavior is a bug, and this change is acceptable. The difference in timing when pressing the pause button for just one frame, which is uncontrollable for the users, after finishing a long speech leads to the inconsistency whether entire speech is discarded or not. It creates significant confusion. I wonder if there are cases where the current behavior is desired. Additionally, I believe adding options indiscriminately has a negative impact on the developer experience. However, if you still prefer not to change the current behavior, I will explore ways to add an option based on your suggestion!

@HayatoYagi
Copy link
Contributor Author

This change does not require any adjustments on the calling side, so I would categorize it as a bug fix or a minor specification change rather than a "breaking change".
Actually, we were surprised and confused when we first observed this behavior, as the audio segment just before the pause was completely lost.

@ricky0123
Copy link
Owner

Hi @HayatoYagi

The difference in timing when pressing the pause button for just one frame, which is uncontrollable for the users, after finishing a long speech leads to the inconsistency whether entire speech is discarded or not.

I don't quite understand what you mean here, can you explain?

In my view, the pause button is something the user presses to stop all functionality associated with the VAD, not to clip and submit the current audio.

In any case, it's a substantial change from the default behavior, so I don't want to do it. Possibly a better solution than an extra parameter to the onSpeechEnd callback is to add a new option at initialization. Eg

myvad = await MicVAD.new({ submitUserSpeechOnPause: true })

or something similar. I agree with you that each option we add increases the complexity for developers, but this is clearly an important issue about which people have different intuitions, so it's an important option to add. And I do not want to change the default behavior right now. If we had enough of a community on Discord, I would maybe leave it to a poll, but we have few enough people on Discord that I wouldn't trust the poll to be representative. So let's just implement the initialization option unless you have a different idea.

@HayatoYagi
Copy link
Contributor Author

I don't quite understand what you mean here, can you explain?

1a. The user finishes speaking.
2a. Press the pause button after redemptionFrames have elapsed.
3a. onSpeechEnd is triggered.

1b. The user finishes speaking.
2b. Press the pause button when redemptionFrames-1 have elapsed.
3b. The entire last speech segment is discarded.

The difference between 2a and 2b is only a 1-frame timing distinction. While users shouldn't be conscious of the redemption frame, they are required to wait for the redemption frame to elapse before pressing the pause button.

@HayatoYagi
Copy link
Contributor Author

@ricky0123
I've added an option to enable this change.

If we had enough of a community on Discord, I would maybe leave it to a poll

How many people do you think is enough?

@ricky0123
Copy link
Owner

Thanks for your PR @HayatoYagi I'll try to look at it tomorrow.

@ricky0123 ricky0123 merged commit d34fcd6 into master Dec 15, 2023
@ricky0123
Copy link
Owner

Hi @HayatoYagi thanks for your contribution. I refactored it a little, added a couple of test pages, merged, and published it.

Thank you for clarifying what you meant about the redemption frames. I can see both perspectives. I guess it depends partially on why you think the user is pressing the pause button. For my use case, it was mostly something they would press if they were interrupted and wanted to discard the current speech segment (i.e. immediately stopping all functionality associated with the vad). I do agree that if you are using a long redemption period and they hit pause towards the end of it, it would produce an unexpected result from their perspective.

I haven't thought concretely about how many people I would want to participate in a poll. Hopefully, this PR will satisfy most people.

Thanks!

@antoine-lizee
Copy link
Contributor

Hello 👋

After testing, we are losing some capabilities compared to what we had in my fork.

The main thing is that we can't know at onSpeechEnd if it comes from a toggle or a natural redemption timeout.

  • In my implementation, I set vad.listening to false before the callback is executed, so we can use it as a proxy (if you're not listening, you got "stopped").
  • In your implementation, everything happens at the same time (because you trigger self.endSegment() directly)

This is important because I want to remove all the redemption frames in the natural case, and not in the forced-stop case.

We have 2 options:

  • Give visibility on this natural vs forced case, typically with something like earlyStop in onSpeechEnd()
  • Do the trim of the redemption frames in the VAD directly, which would make sense! We could even do it in both natural and forced stop cases, by removing all the redemption frames that have been measured, whenever we stop. (we probably want to keep n frames)

@ricky0123, @HayatoYagi, what do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pause VAD during speech not triggering onSpeechEnd or onVADMisfire
3 participants