M.E.AI.Abstractions - Speech to Text Abstraction #5838
ADR - Introducing Speech To Text Abstraction
Problem Statement
The project requires the ability to transcribe and translate speech audio to text. This proof of concept validates the `ISpeechToTextClient` abstraction against different transcription and translation APIs, providing a consistent interface for the project to use.

Note: The names used for the proposed abstractions below are open and can be changed at any time given broader consensus.
Considered Options
Option 1: Generic Multi-Modality Abstraction `IModelClient<TInput, TOutput>` (Discarded)

This option would have provided a single generic abstraction for all models, including audio transcription. However, it would have made the abstraction too generic, and it raised several concerns during the meeting:

- Usability concerns: A generic interface could make the API less intuitive and harder to use, as users would not be guided toward the specific options they need.
- Naming and clarity: Generic names like "complete streaming" do not convey the specific functionality, making it difficult for users to understand what a method does. Specific names like "transcribe" or "generate song" would be clearer.
- Implementation complexity: Implementing a generic interface would still require concrete implementations for each permutation of input and output types, which could be complex and cumbersome.
- Specific use cases: Different services have specific requirements and optimizations for their modalities, which may not be effectively captured by a generic interface.
- Future-proofing vs. practicality: While a generic interface aims to be future-proof, it may not be practical for current needs and could lead to an explosion of permutations that are not all relevant.
- Separation of streaming and non-streaming: Separating streaming and non-streaming interfaces could complicate the API further.
Option 2: Speech to Text Abstraction `ISpeechToTextClient` (Preferred)

This option provides a specific abstraction for audio transcription and audio translation, which is more intuitive and easier to use. A dedicated interface allows for better optimization and customization for each service.

Initially the idea was to have two interfaces, one for the streaming API and another for the non-streaming API, but after some discussion it was decided to have a single interface, similar to what we have in `IChatClient`.

Note: Further modality abstractions will mostly follow this as a standard moving forward.
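The single-interface design described above might be sketched as follows. This is an illustrative shape only, with local stand-in types for the options and response classes; the actual definitions in Microsoft.Extensions.AI.Abstractions may differ in names and signatures.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Illustrative stand-ins for the types this ADR proposes; the real
// definitions live in Microsoft.Extensions.AI.Abstractions and may differ.
public class SpeechToTextOptions { public string? ModelId { get; set; } }
public class SpeechToTextResponse { public string? Text { get; set; } }
public class SpeechToTextResponseUpdate { public string? Text { get; set; } }

// A single interface covers both the non-streaming and streaming APIs,
// mirroring the IChatClient design discussed above.
public interface ISpeechToTextClient : IDisposable
{
    // Non-streaming: transcribe the whole audio stream, return one response.
    Task<SpeechToTextResponse> GetTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default);

    // Streaming: yield updates as the service produces them.
    IAsyncEnumerable<SpeechToTextResponseUpdate> GetStreamingTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default);
}
```

Taking `Stream` as the audio input keeps both methods usable for large files and real-time capture, as the Inputs section below describes.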
Inputs:

- `Stream audioSpeechStream` allows streaming audio data contents to the service. This enables usage of large audio files or real-time transcription (without having to load the full file in memory) and can easily be extended to support different audio input types, such as a single `DataContent` or a `Stream` instance, supporting scenarios like a `DataContent`-typed input extension.
- `SpeechToTextOptions`, analogous to the existing `ChatOptions`, allows providing additional options to the service on both the streaming and non-streaming APIs, such as language, model, or other parameters:
  - `ModelId` is a unique identifier for the model to use for transcription.
  - `SpeechLanguage` is the language of the audio content (see Azure Cognitive Speech - Supported languages).
  - `SpeechSampleRate` is the sample rate of the audio content. Real-time speech to text generation requires a specific sample rate.

Outputs:
- `SpeechToTextResponse`, for the non-streaming API, analogous to the existing `ChatResponse`; it provides the generated text result and additional information about the speech response:
  - `ResponseId` is a unique identifier for the response.
  - `ModelId` is a unique identifier for the model used for transcription.
  - `StartTime` and `EndTime` are timestamps for where the text starts and ends relative to the speech audio length. For example, if the audio starts with 30 seconds of instrumental music before any speech, the transcription should start from 30 seconds onward; the same applies to the end time.
  - Note: `TimeSpan` is used to represent the timestamps as it is more intuitive and easier to work with; some services give the time in milliseconds, ticks, or other formats.
- `SpeechToTextResponseUpdate`, for the streaming API, analogous to the existing `ChatResponseUpdate`; it provides the speech to text result as multiple update chunks that represent the content generated as well as any important information about the processing:
  - `ResponseId` is a unique identifier for the speech to text response.
  - `StartTime` and `EndTime` for the given transcribed chunk represent the timestamps where it starts and ends relative to the audio length. For example, if the audio starts with 30 seconds of instrumental music before any speech, the first transcription chunk will flush with a `StartTime` of 30 seconds onward, up to the last word of the chunk, which determines the end time.
  - Note: `TimeSpan` is used to represent the timestamps here as well, for the same reasons as above.
  - `Contents` is a list of `AIContent` objects that represent the transcription result. Most use cases will have one `TextContent` object that can be retrieved from the `Text` property, similarly to `Text` in `ChatMessage`.
  - `Kind` is a `struct`, similar to `ChatRole`.
General Update Kinds:
- `SessionOpen` - when the transcription session is opened.
- `TextUpdating` - when the speech to text is in progress, without waiting for silence (preferable for UI updates). Different APIs use different names for this, e.g. `PartialTranscriptReceived`, `SegmentData`, `RecognizingSpeech`.
- `TextUpdated` - when a speech to text block is complete after a small period of silence. Different API names for this include, e.g., `FinalTranscriptReceived`, `RecognizedSpeech`.
- `SessionClose` - when the transcription session is closed.
- `Error` - when an error occurs during the speech to text process. Errors during streaming can happen and normally won't block the ongoing process, but they can provide more detailed information about the failure. For this reason, instead of throwing an exception, the error can be provided as part of the ongoing stream using a dedicated content type, `ErrorContent`.

Specific API categories: