Speech recognition is a fun task. Many speech-to-text APIs are available today, which makes it easy for users to pick one. However, processing lengthy audio files is quite challenging. I have used the Google Speech-to-Text API for this.
(Use Google Chrome or Microsoft Edge to view the demo.)
Speech2Text-Demo.mp4
Google Speech-to-Text has three types of API requests, based on the audio content:
- Synchronous request: The audio should be approximately 1 minute or shorter. In this type of request, the user does not have to upload the data to Google Cloud. This gives users the flexibility to keep the audio file on their local computer or server and call the API to get the text.
- Asynchronous request: The audio can be up to approximately 480 minutes (8 hours). In this type of request, the user has to upload the data to Google Cloud. This is the type I am using here.
- Streaming request: This is suitable for streaming data, where the user is talking directly into a microphone and needs the speech transcribed. This type of request is apt for chatbots. Again, the streaming data should be approximately a minute long for this type of request.
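As a rough illustration of how the three request types divide the work, the sketch below picks a client method from the audio duration. The limits are the approximate figures quoted above, and `pick_request_method` is a hypothetical helper, not part of this project's code. The returned strings are the names of the corresponding `SpeechClient` methods in the `google-cloud-speech` Python package.

```python
# Approximate limits taken from the description above (assumptions, not exact quotas).
SYNC_LIMIT_SECONDS = 60            # about 1 minute
ASYNC_LIMIT_SECONDS = 480 * 60     # about 480 minutes (8 hours)


def pick_request_method(duration_seconds, live=False):
    """Return the SpeechClient method name that suits the audio (hypothetical helper)."""
    if live:
        # Microphone or other live input is handled by the streaming request.
        return "streaming_recognize"
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        # Short local files can be sent inline with a synchronous request.
        return "recognize"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        # Long files must be uploaded to Google Cloud and processed asynchronously.
        return "long_running_recognize"
    raise ValueError("audio is longer than the asynchronous limit; split it first")
```

For example, a 45-minute recording falls into the asynchronous case, which is why the rest of this project uploads files to a storage bucket.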
- Before we begin, we need to do some initial setup: create the API client and store the credential details, which you will need later. Please follow this link
- Once the API client is created, the next step is to create a storage bucket.
My methodology for converting speech to text:
- Importing the necessary packages.
- Audio file encoding. You can read about it here.
- Audio file specifications: Another limitation is that the API does not support stereo audio files, so a stereo file has to be converted to mono before using the API. In addition, we have to provide the audio frame rate for the file. I have already implemented a function in the code to convert the audio files to .wav format.
- Upload files to Google storage: To perform an asynchronous request, the file is uploaded to Google Cloud.
- Delete files in Google storage: Once the speech-to-text operation is completed, the file can be deleted from Google Cloud to avoid unnecessary costs.
- Transcribe: Convert the speech to plain text and save the results as separate transcripts (text files). A sample transcript looks like this:
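The steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: `to_mono_wav` and `transcribe_long_audio` are hypothetical names, the conversion uses only the standard library and assumes 16-bit PCM WAV input on a little-endian machine, and the bucket name, blob name, language code and timeout are placeholders. The Google calls assume the `google-cloud-speech` (2.x) and `google-cloud-storage` packages with credentials already configured.

```python
import array
import wave


def to_mono_wav(src_path, dst_path):
    """Write a mono copy of a 16-bit PCM WAV file and return its frame rate."""
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    # Assumes a little-endian host, which matches the WAV byte order.
    samples = array.array("h", frames)
    if channels > 1:
        # Samples are interleaved (L, R, L, R, ...); average each frame.
        samples = array.array(
            "h",
            (sum(samples[i:i + channels]) // channels
             for i in range(0, len(samples), channels)),
        )

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(samples.tobytes())
    return rate


def transcribe_long_audio(wav_path, bucket_name, blob_name, language_code="en-US"):
    """Upload a file, run an asynchronous request, delete the upload, return the text."""
    # Imported here so the conversion helper works without the Google packages.
    from google.cloud import speech, storage

    mono_path = wav_path + ".mono.wav"
    rate = to_mono_wav(wav_path, mono_path)

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(mono_path)
    try:
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=rate,
            language_code=language_code,
        )
        audio = speech.RecognitionAudio(uri=f"gs://{bucket_name}/{blob_name}")
        operation = speech.SpeechClient().long_running_recognize(config=config, audio=audio)
        response = operation.result(timeout=3600)  # placeholder timeout
    finally:
        # Remove the uploaded file even if transcription fails, to avoid storage costs.
        blob.delete()

    return " ".join(
        r.alternatives[0].transcript.strip() for r in response.results if r.alternatives
    )
```

The returned string can then be written to a text file, one transcript per audio file. Deleting the blob in a `finally` block is a design choice of this sketch so that a failed request does not leave files behind in the bucket.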
Speaker diarization is the process of distinguishing speakers in an audio file. I use the Google Speech-to-Text API to perform speaker diarization, which is provided as a separate script. The final transcripts generated by Google after speaker diarization look like the one below.
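For readers who want a feel for what a diarization request involves, here is a minimal sketch. It is not the separate script mentioned above: `group_by_speaker` and `transcribe_with_speakers` are hypothetical names, the speaker count, language code and timeout are placeholders, and the audio is assumed to be a mono 16-bit WAV file already uploaded to a bucket. The Google calls assume the `google-cloud-speech` (2.x) package.

```python
def group_by_speaker(words):
    """Merge consecutive (word, speaker_tag) pairs into (speaker_tag, text) turns."""
    turns = []
    for word, tag in words:
        if turns and turns[-1][0] == tag:
            # Same speaker is still talking; extend the current turn.
            turns[-1][1].append(word)
        else:
            # Speaker changed; start a new turn.
            turns.append((tag, [word]))
    return [(tag, " ".join(turn_words)) for tag, turn_words in turns]


def transcribe_with_speakers(gcs_uri, sample_rate_hertz, speaker_count=2,
                             language_code="en-US"):
    """Run an asynchronous request with diarization enabled and return speaker turns."""
    # Imported here so group_by_speaker can be used without the Google package.
    from google.cloud import speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate_hertz,
        language_code=language_code,
        diarization_config=speech.SpeakerDiarizationConfig(
            enable_speaker_diarization=True,
            min_speaker_count=speaker_count,
            max_speaker_count=speaker_count,
        ),
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    operation = speech.SpeechClient().long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=3600)  # placeholder timeout

    # With diarization enabled, the word list of the last result is expected to
    # contain every word of the audio together with its speaker tag.
    words = response.results[-1].alternatives[0].words
    return group_by_speaker((w.word, w.speaker_tag) for w in words)
```

Each returned turn can be written out as a line such as `Speaker 1: ...` to produce a transcript split by speaker.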