[.NET microphone app] Audio from playback of AI interpreted as user input #21

Open
Ben-Pattinson opened this issue Oct 4, 2024 · 3 comments

Ben-Pattinson commented Oct 4, 2024

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Using a desktop PC with speakers and a mic, NOT a headset, say something, then watch the chaos unfold.
The AI mistakes its own reply for your speech, then interrupts itself and replies to its own reply, again and again. This lasted about 10 seconds before it said goodbye to itself and stopped.

Any log messages given by the failure

Expected/desired behavior

As per the OpenAI app with advanced voice, the AI should be able to differentiate between its own voice and someone interrupting it.

OS and Version?

Windows 10, but it would probably happen on any OS

Versions

It's the .NET console implementation that has the problem.

Mention any other details that might be useful

This has probably only been tested or considered with a headset. That has value, but many of us who work from home have invested in decent mic/speaker setups to avoid the pain of headsets. Both the phone app and the playground work fine with open mics, so this is possible to solve.


Thanks! We'll be in touch soon.

@trrwilson
Member

trrwilson commented Oct 4, 2024

Thanks, @Ben-Pattinson; it looks like this may be a limitation with NAudio's cross-platform integration with Windows's built-in AEC. I'll look into whether there's a good mitigation to make cancellation kick in appropriately without needing to specifically target Windows; if any astute readers have better audio abstractions, contributions are greatly welcomed!

It's also possible to turn the voice detection threshold up a bit to mitigate (TurnDetectionOptions on ConversationSessionOptions), but at some point that's not going to be adequate for true far-field use.

@trrwilson changed the title from "Audio from playback of AI interpreted as user input" to "[.NET microphone app] Audio from playback of AI interpreted as user input" on Oct 4, 2024
@tlaukkanen

Most likely not only a .NET issue. I'm seeing the same chaos unfold when running with Python on Linux on a Raspberry Pi 4, with mics on a WM8960 Audio HAT and connected speakers. I was trying with these turn_detection settings:

turn_detection=ServerVAD(type="server_vad", threshold=0.5, prefix_padding_ms=200, silence_duration_ms=200)

Tried with various options like:

turn_detection=ServerVAD(type="server_vad", threshold=0.8, prefix_padding_ms=1000, silence_duration_ms=2000)

...but it's still picking up its own voice as input and starts to babble with itself.
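For anyone who wants to compare the two attempts above side by side, here is a minimal sketch of how the same settings might look as a raw `session.update` event. The helper name `build_session_update` is hypothetical, and the event shape and the 0.0 to 1.0 threshold range are assumptions based on my reading of the Realtime API, not something confirmed in this thread. It only builds the payload; it does not open a connection or send anything.

```python
import json


def build_session_update(threshold, prefix_padding_ms, silence_duration_ms):
    """Build a server-VAD `session.update` payload (hypothetical helper).

    Assumption: the service accepts a `turn_detection` object with these
    four keys and a threshold between 0.0 and 1.0.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if prefix_padding_ms < 0 or silence_duration_ms < 0:
        raise ValueError("durations must be non-negative")
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }


if __name__ == "__main__":
    # The two configurations tried in the comment above.
    first_attempt = build_session_update(0.5, 200, 200)
    stricter_attempt = build_session_update(0.8, 1000, 2000)
    print(json.dumps(stricter_attempt, indent=2))
```

As noted earlier in the thread, a higher threshold only makes the detector less sensitive. It does not remove the speaker signal from the mic input, so it is not a substitute for echo cancellation.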

@tlaukkanen

...ok, didn't think this through 😄 Most likely related to the hardware setup rather than the library itself. I should check, for example, the PulseAudio echo cancellation settings for this to work :) Not sure if there is something similar on Windows.
