
Commit 134f085

feat: Add Google Speech to Text API Document Loader (#12298)
- Add Document Loader for Google Speech to Text
- Similar Structure to [Assembly AI Document Loader][1]

[1]: https://python.langchain.com/docs/integrations/document_loaders/assemblyai
1 parent 52c194e commit 134f085

File tree

6 files changed: +395 -3 lines changed

docs/api_reference/guide_imports.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/docs/integrations/document_loaders/google_speech_to_text.ipynb

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google Speech-to-Text Audio Transcripts\n",
    "\n",
    "The `GoogleSpeechToTextLoader` allows you to transcribe audio files with the [Google Cloud Speech-to-Text API](https://cloud.google.com/speech-to-text) and load the transcribed text into documents.\n",
    "\n",
    "To use it, you should have the `google-cloud-speech` python package installed, and a Google Cloud project with the [Speech-to-Text API enabled](https://cloud.google.com/speech-to-text/v2/docs/transcribe-client-libraries#before_you_begin).\n",
    "\n",
    "- [Bringing the power of large models to Google Cloud’s Speech API](https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation & setup\n",
    "\n",
    "First, you need to install the `google-cloud-speech` python package.\n",
    "\n",
    "You can find more info about it on the [Speech-to-Text client libraries](https://cloud.google.com/speech-to-text/v2/docs/libraries) page.\n",
    "\n",
    "Follow the [quickstart guide](https://cloud.google.com/speech-to-text/v2/docs/sync-recognize) in the Google Cloud documentation to create a project and enable the API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install google-cloud-speech\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example\n",
    "\n",
    "The `GoogleSpeechToTextLoader` requires the `project_id` and `file_path` arguments. Audio files can be specified as a Google Cloud Storage URI (`gs://...`) or a local file path.\n",
    "\n",
    "The loader only supports synchronous requests, which have a [limit of 60 seconds or 10 MB](https://cloud.google.com/speech-to-text/v2/docs/sync-recognize#:~:text=60%20seconds%20and/or%2010%20MB) per audio file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import GoogleSpeechToTextLoader\n",
    "\n",
    "project_id = \"<PROJECT_ID>\"\n",
    "file_path = \"gs://cloud-samples-data/speech/audio.flac\"\n",
    "# or a local file path: file_path = \"./audio.wav\"\n",
    "\n",
    "loader = GoogleSpeechToTextLoader(project_id=project_id, file_path=file_path)\n",
    "\n",
    "docs = loader.load()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: Calling `loader.load()` blocks until the transcription is finished."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transcribed text is available in the `page_content`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0].page_content\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "\"How old is the Brooklyn Bridge?\"\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `metadata` contains additional information from the recognition result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0].metadata\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "  'language_code': 'en-US',\n",
    "  'result_end_offset': datetime.timedelta(seconds=1)\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recognition Config\n",
    "\n",
    "You can specify the `config` argument to use different speech recognition models and enable specific features.\n",
    "\n",
    "Refer to the [Speech-to-Text recognizers documentation](https://cloud.google.com/speech-to-text/v2/docs/recognizers) and the [`RecognizeRequest`](https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognizeRequest) API reference for information on how to set a custom configuration.\n",
    "\n",
    "If you don't specify a `config`, the following options will be selected automatically:\n",
    "\n",
    "- Model: [Chirp Universal Speech Model](https://cloud.google.com/speech-to-text/v2/docs/chirp-model)\n",
    "- Language: `en-US`\n",
    "- Audio Encoding: Automatically Detected\n",
    "- Automatic Punctuation: Enabled"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud.speech_v2 import AutoDetectDecodingConfig, RecognitionConfig, RecognitionFeatures\n",
    "from langchain.document_loaders import GoogleSpeechToTextLoader\n",
    "\n",
    "project_id = \"<PROJECT_ID>\"\n",
    "location = \"global\"\n",
    "recognizer_id = \"<RECOGNIZER_ID>\"\n",
    "file_path = \"./audio.wav\"\n",
    "\n",
    "config = RecognitionConfig(\n",
    "    auto_decoding_config=AutoDetectDecodingConfig(),\n",
    "    language_codes=[\"en-US\"],\n",
    "    model=\"long\",\n",
    "    features=RecognitionFeatures(\n",
    "        enable_automatic_punctuation=False,\n",
    "        profanity_filter=True,\n",
    "        enable_spoken_punctuation=True,\n",
    "        enable_spoken_emojis=True,\n",
    "    ),\n",
    ")\n",
    "\n",
    "loader = GoogleSpeechToTextLoader(\n",
    "    project_id=project_id,\n",
    "    location=location,\n",
    "    recognizer_id=recognizer_id,\n",
    "    file_path=file_path,\n",
    "    config=config,\n",
    ")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

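The notebook's sample clip yields a single result, but the loader (see `google_speech_to_text.py` later in this commit) returns one `Document` per recognition result, so longer clips within the synchronous limits can produce several Documents. A minimal post-processing sketch, assuming the same placeholder project ID and public sample file as the notebook, that stitches the results back into one transcript:

```python
from langchain.document_loaders import GoogleSpeechToTextLoader

# Placeholders as in the notebook above; replace with a real project ID.
loader = GoogleSpeechToTextLoader(
    project_id="<PROJECT_ID>",
    file_path="gs://cloud-samples-data/speech/audio.flac",
)
docs = loader.load()  # blocks until the transcription is finished

# Each Document holds one recognition result; join them for the full
# transcript and keep the per-result offsets from the metadata if needed.
full_transcript = " ".join(doc.page_content for doc in docs)
offsets = [doc.metadata["result_end_offset"] for doc in docs]
print(full_transcript)
```
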
docs/docs/integrations/platforms/google.mdx

Lines changed: 20 additions & 2 deletions
@@ -89,10 +89,28 @@ See a [usage example and authorizing instructions](/docs/integrations/document_l
 from langchain.document_loaders import GoogleDriveLoader
 ```

+### Speech-to-Text
+
+> [Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text) is an audio transcription API powered by Google's speech recognition models.
+
+This document loader transcribes audio files and outputs the text results as Documents.
+
+First, we need to install the python package.
+
+```bash
+pip install google-cloud-speech
+```
+
+See a [usage example and authorizing instructions](/docs/integrations/document_loaders/google_speech_to_text).
+
+```python
+from langchain.document_loaders import GoogleSpeechToTextLoader
+```
+
 ## Vector Store
-### Google Vertex AI Vector Search
+### Vertex AI Vector Search

-> [Google Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/matching-engine/overview),
+> [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/matching-engine/overview),
 > formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale
 > low latency vector database. These vector databases are commonly
 > referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.

libs/langchain/langchain/document_loaders/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -87,6 +87,7 @@
 from langchain.document_loaders.git import GitLoader
 from langchain.document_loaders.gitbook import GitbookLoader
 from langchain.document_loaders.github import GitHubIssuesLoader
+from langchain.document_loaders.google_speech_to_text import GoogleSpeechToTextLoader
 from langchain.document_loaders.googledrive import GoogleDriveLoader
 from langchain.document_loaders.gutenberg import GutenbergLoader
 from langchain.document_loaders.hn import HNLoader
@@ -267,6 +268,7 @@
     "GitbookLoader",
     "GoogleApiClient",
     "GoogleApiYoutubeLoader",
+    "GoogleSpeechToTextLoader",
     "GoogleDriveLoader",
     "GutenbergLoader",
     "HNLoader",

libs/langchain/langchain/document_loaders/google_speech_to_text.py

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
from __future__ import annotations

from typing import TYPE_CHECKING, List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.utilities.vertexai import get_client_info

if TYPE_CHECKING:
    from google.cloud.speech_v2 import RecognitionConfig
    from google.protobuf.field_mask_pb2 import FieldMask


class GoogleSpeechToTextLoader(BaseLoader):
    """
    Loader for Google Cloud Speech-to-Text audio transcripts.

    It uses the Google Cloud Speech-to-Text API to transcribe audio files
    and loads the transcribed text into one or more Documents,
    one Document per recognition result.

    To use, you should have the ``google-cloud-speech`` python package installed.

    Audio files can be specified via a Google Cloud Storage URI or a local file path.

    For a detailed explanation of Google Cloud Speech-to-Text, refer to the product
    documentation.
    https://cloud.google.com/speech-to-text
    """

    def __init__(
        self,
        project_id: str,
        file_path: str,
        location: str = "us-central1",
        recognizer_id: str = "_",
        config: Optional[RecognitionConfig] = None,
        config_mask: Optional[FieldMask] = None,
    ):
        """
        Initializes the GoogleSpeechToTextLoader.

        Args:
            project_id: Google Cloud Project ID.
            file_path: A Google Cloud Storage URI or a local file path.
            location: Speech-to-Text recognizer location.
            recognizer_id: Speech-to-Text recognizer id.
            config: Recognition options and features.
                For more information:
                https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognitionConfig
            config_mask: The list of fields in config that override the values in the
                ``default_recognition_config`` of the recognizer during this
                recognition request.
                For more information:
                https://cloud.google.com/python/docs/reference/speech/latest/google.cloud.speech_v2.types.RecognizeRequest
        """
        try:
            from google.api_core.client_options import ClientOptions
            from google.cloud.speech_v2 import (
                AutoDetectDecodingConfig,
                RecognitionConfig,
                RecognitionFeatures,
                SpeechClient,
            )
        except ImportError as exc:
            raise ImportError(
                "Could not import google-cloud-speech python package. "
                "Please install it with `pip install google-cloud-speech`."
            ) from exc

        self.project_id = project_id
        self.file_path = file_path
        self.location = location
        self.recognizer_id = recognizer_id
        # Config must be set in speech recognition request.
        self.config = config or RecognitionConfig(
            auto_decoding_config=AutoDetectDecodingConfig(),
            language_codes=["en-US"],
            model="chirp",
            features=RecognitionFeatures(
                # Automatic punctuation could be useful for language applications
                enable_automatic_punctuation=True,
            ),
        )
        self.config_mask = config_mask

        self._client = SpeechClient(
            client_info=get_client_info(module="speech-to-text"),
            client_options=(
                ClientOptions(api_endpoint=f"{location}-speech.googleapis.com")
                if location != "global"
                else None
            ),
        )
        self._recognizer_path = self._client.recognizer_path(
            project_id, location, recognizer_id
        )

    def load(self) -> List[Document]:
        """Transcribes the audio file and loads the transcript into documents.

        It uses the Google Cloud Speech-to-Text API to transcribe the audio file
        and blocks until the transcription is finished.
        """
        try:
            from google.cloud.speech_v2 import RecognizeRequest
        except ImportError as exc:
            raise ImportError(
                "Could not import google-cloud-speech python package. "
                "Please install it with `pip install google-cloud-speech`."
            ) from exc

        request = RecognizeRequest(
            recognizer=self._recognizer_path,
            config=self.config,
            config_mask=self.config_mask,
        )

        if "gs://" in self.file_path:
            request.uri = self.file_path
        else:
            with open(self.file_path, "rb") as f:
                request.content = f.read()

        response = self._client.recognize(request=request)

        return [
            Document(
                page_content=result.alternatives[0].transcript,
                metadata={
                    "language_code": result.language_code,
                    "result_end_offset": result.result_end_offset,
                },
            )
            for result in response.results
        ]

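The `config_mask` argument is documented in the docstring above but not exercised anywhere in this commit. A hedged sketch of how it might be used with a custom recognizer, assuming the standard `google.protobuf.field_mask_pb2.FieldMask` type referenced in the loader's type hints, and assuming mask paths are given relative to `config` as described in the `RecognizeRequest` reference linked in the notebook:

```python
from google.cloud.speech_v2 import RecognitionConfig, RecognitionFeatures
from google.protobuf.field_mask_pb2 import FieldMask

from langchain.document_loaders import GoogleSpeechToTextLoader

# Override only the `model` and `features` fields of the recognizer's
# default_recognition_config; unlisted fields keep the recognizer defaults.
# The path strings below are an assumption based on the RecognizeRequest docs.
config = RecognitionConfig(
    model="long",
    features=RecognitionFeatures(profanity_filter=True),
)
config_mask = FieldMask(paths=["model", "features"])

loader = GoogleSpeechToTextLoader(
    project_id="<PROJECT_ID>",  # placeholder
    location="global",
    recognizer_id="<RECOGNIZER_ID>",  # placeholder
    file_path="gs://cloud-samples-data/speech/audio.flac",
    config=config,
    config_mask=config_mask,
)
docs = loader.load()
```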