Commit b481ca2

CrateDB: Documentation about Vector Store, Document Loader, and Memory
1 parent 94e5765 commit b481ca2

6 files changed, +1429 -1 lines changed


docs/docs/.gitignore

Lines changed: 2 additions & 1 deletion
@@ -4,4 +4,5 @@ node_modules/
 
 .docusaurus
 .cache-loader
-docs/api
+docs/api
+example.sqlite
Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CrateDB Document Loader\n",
+    "\n",
+    "> [CrateDB] is capable of performing both vector and lexical search.\n",
+    "> It is built on top of the Apache Lucene library, talks SQL,\n",
+    "> is PostgreSQL-compatible, and scales like Elasticsearch.\n",
+    "\n",
+    "This notebook covers how to get started with the CrateDB document loader.\n",
+    "\n",
+    "The CrateDB document loader is based on [SQLAlchemy] and uses LangChain's\n",
+    "`SQLDatabaseLoader`. It loads the result of a database query with one document\n",
+    "per row.\n",
+    "\n",
+    "[CrateDB]: https://github.com/crate/crate\n",
+    "[SQLAlchemy]: https://www.sqlalchemy.org/\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "The `CrateDBLoader` class helps you get your unstructured content from CrateDB\n",
+    "into LangChain's `Document` format.\n",
+    "\n",
+    "You must provide an SQLAlchemy-compatible connection string and a query\n",
+    "expression written in SQL.\n",
+    "\n",
+    "### Integration details\n",
+    "\n",
+    "| Class | Package | Local | Serializable | JS support |\n",
+    "| :--- | :--- | :---: | :---: | :---: |\n",
+    "| [CrateDBLoader](https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html) | [langchain_cratedb](https://python.langchain.com/api_reference/cratedb/index.html) | ✅ | ❌ | ❌ |\n",
+    "### Loader features\n",
+    "| Source | Document Lazy Loading | Async Support |\n",
+    "| :---: | :---: | :---: |\n",
+    "| CrateDBLoader | ✅ | ❌ |\n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "You can run CrateDB Community Edition on your premises, or you can use CrateDB Cloud.\n",
+    "\n",
+    "### Credentials\n",
+    "\n",
+    "You will supply credentials through a regular SQLAlchemy connection string, like\n",
+    "`crate://username:password@cratedb.example.org/`."
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "### Installation\n",
+    "\n",
+    "Install the **langchain-community** and **sqlalchemy-cratedb** packages."
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": "%pip install -qU langchain-community sqlalchemy-cratedb",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## Initialization\n",
+    "\n",
+    "Now, initialize the loader and start loading documents."
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": [
+    "from langchain_community.document_loaders import CrateDBLoader\n",
+    "\n",
+    "loader = CrateDBLoader(\"SELECT * FROM sys.summits\", url=\"crate://crate@localhost/\")"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "source": "## Load",
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": [
+    "documents = loader.load()\n",
+    "print(documents)"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": "## Lazy Load\n"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": [
+    "page = []\n",
+    "for doc in loader.lazy_load():\n",
+    "    page.append(doc)\n",
+    "    if len(page) >= 10:\n",
+    "        # do some paged operation, e.g.\n",
+    "        # index.upsert(page)\n",
+    "\n",
+    "        page = []"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## API reference\n",
+    "\n",
+    "For detailed documentation of all `CrateDBLoader` features and configurations, head to the API reference: https://python.langchain.com/api_reference/cratedb/document_loaders/langchain_cratedb.document_loaders.cratedb.CrateDBLoader.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Tutorial\n",
+    "\n",
+    "### Populate database"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "!crash < ./example_data/mlb_teams_2012.sql\n",
+    "!crash --command \"REFRESH TABLE mlb_teams_2012;\""
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "source": "### Usage",
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "from langchain_community.document_loaders import CrateDBLoader\n",
+    "from pprint import pprint\n",
+    "\n",
+    "CONNECTION_STRING = \"crate://crate@localhost/\"\n",
+    "\n",
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    ")\n",
+    "documents = loader.load()"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "pprint(documents)"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "### Specifying Which Columns are Content vs Metadata"
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "source": [
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    "    page_content_columns=[\"Team\"],\n",
+    "    metadata_columns=[\"Payroll (millions)\"],\n",
+    ")\n",
+    "documents = loader.load()"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "source": [
+    "pprint(documents)"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "### Adding Source to Metadata"
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "source": [
+    "loader = CrateDBLoader(\n",
+    "    'SELECT * FROM mlb_teams_2012 ORDER BY \"Team\" LIMIT 5;',\n",
+    "    url=CONNECTION_STRING,\n",
+    "    source_columns=[\"Team\"],\n",
+    ")\n",
+    "documents = loader.load()"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "source": [
+    "pprint(documents)"
+   ],
+   "outputs": [],
+   "execution_count": null
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}

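The "Lazy Load" cell in the notebook above collects documents into pages of ten before processing them. A minimal sketch of the same batching pattern as a reusable generator might look like the following. The names `batched_documents` and `fake_lazy_load` are hypothetical and not part of the commit; `fake_lazy_load` is only a stand-in so the example runs without a CrateDB server, and `loader.lazy_load()` would take its place with a real loader. Unlike the notebook loop, which leaves a final partial page unprocessed, this sketch also yields the remainder.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_documents(docs: Iterable[T], page_size: int = 10) -> Iterator[List[T]]:
    """Group a lazily produced stream of documents into pages (hypothetical helper)."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    page: List[T] = []
    for doc in docs:
        page.append(doc)
        if len(page) >= page_size:
            yield page
            page = []
    # Yield whatever is left over, so a trailing partial page is not lost.
    if page:
        yield page


def fake_lazy_load(count: int) -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(); yields placeholder strings."""
    for i in range(count):
        yield f"document-{i}"


pages = list(batched_documents(fake_lazy_load(25), page_size=10))
print([len(page) for page in pages])  # page sizes: [10, 10, 5]
```

Only one page is held in memory at a time, which is the point of preferring `lazy_load()` over `load()` for large result sets.
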
docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 -- Provisioning table "mlb_teams_2012".
 --
 -- psql postgresql://postgres@localhost < mlb_teams_2012.sql
+-- crash < mlb_teams_2012.sql
 
 DROP TABLE IF EXISTS mlb_teams_2012;
 CREATE TABLE mlb_teams_2012 ("Team" VARCHAR, "Payroll (millions)" FLOAT, "Wins" BIGINT);

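The table defined above has three columns, and the notebook turns each result row into one document, with `page_content_columns` and `metadata_columns` deciding which columns end up where. The sketch below imitates that mapping using Python's built-in `sqlite3` module as a stand-in for CrateDB, so it runs without a server. The helper `rows_to_documents`, the plain-dict document shape, the `column: value` text layout, and the placeholder rows are all assumptions made for illustration; they are neither the loader's actual implementation nor the real 2012 data.

```python
import sqlite3
from typing import Any, Dict, List, Sequence


def rows_to_documents(
    cursor: sqlite3.Cursor,
    page_content_columns: Sequence[str],
    metadata_columns: Sequence[str],
) -> List[Dict[str, Any]]:
    """Map each result row to one dict-shaped document (hypothetical helper)."""
    names = [description[0] for description in cursor.description]
    documents = []
    for row in cursor:
        record = dict(zip(names, row))
        # Assumed layout: one "column: value" line per content column.
        content = "\n".join(f"{name}: {record[name]}" for name in page_content_columns)
        metadata = {name: record[name] for name in metadata_columns}
        documents.append({"page_content": content, "metadata": metadata})
    return documents


# In-memory SQLite database as a stand-in for CrateDB.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE mlb_teams_2012 ("Team" VARCHAR, "Payroll (millions)" FLOAT, "Wins" BIGINT)'
)
# Placeholder rows, not real data.
conn.executemany(
    "INSERT INTO mlb_teams_2012 VALUES (?, ?, ?)",
    [("Team B", 2.0, 20), ("Team A", 1.0, 10)],
)
cursor = conn.execute('SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5')
documents = rows_to_documents(cursor, ["Team"], ["Payroll (millions)"])
print(documents)
conn.close()
```

Columns named in neither list, such as `"Wins"` here, are simply left out of the resulting document in this sketch.
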