Commit 6a905e3

Uploading and fixing example notebooks to spark-nlp (#14137)
* adding Classifier Training notebook using INSTRUCTOR Embeddings
* adding NER training using DeBertaEmbeddings
* adding example notebook for DocumentTokenSplitter
* Delete OpenAICompletion.ipynb for replacing
* Create openai-completion
* fixing OpenAICompletion: updating the OpenAICompletion model from text-davinci-003 to gpt-3.5-turbo; fixing null Colab link
1 parent 8677147 commit 6a905e3

File tree

5 files changed: 377 additions & 2 deletions
examples/python/annotation/text/english/DocumentTokenSplitter.ipynb
Lines changed: 372 additions & 0 deletions
@@ -0,0 +1,372 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "97EiXueJA9cY"
   },
   "source": [
    "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zmxL_blSA9ce"
   },
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/DocumentTokenSplitter.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uI7yhCibA9cf"
   },
   "source": [
    "## Colab + Data Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4WQLLrIUA9cg",
    "outputId": "93e96731-45c2-4c82-97fe-f08472b649fe"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Installing PySpark 3.2.3 and Spark NLP 5.2.2\n",
      "setup Colab for PySpark 3.2.3 and Spark NLP 5.2.2\n"
     ]
    }
   ],
   "source": [
    "!wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "id": "nVTDX8SdiSD9"
   },
   "outputs": [],
   "source": [
    "# Use the raw file URL: the github.com .../blob/... page URL downloads HTML, not the text itself\n",
    "!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/587f79020de7bc09c2b2fceb37ec258bad57e425/src/test/resources/spell/sherlockholmes.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_S-XJDfUA9ci"
   },
   "source": [
    "# Download DocumentTokenSplitter Model and Create Spark NLP Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "KzMHa0HdA9ch",
    "outputId": "a1c6ff34-8b07-40e6-c207-b6f77894ad74"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning::Spark Session already created, some configs may not take.\n",
      "Spark NLP version 5.2.2\n",
      "Apache Spark version: 3.2.3\n"
     ]
    }
   ],
   "source": [
    "import sparknlp\n",
    "from sparknlp.base import *\n",
    "from sparknlp.annotator import *\n",
    "from pyspark.ml import Pipeline\n",
    "\n",
    "spark = sparknlp.start()\n",
    "\n",
    "print(f\"Spark NLP version {sparknlp.version()}\\nApache Spark version: {spark.version}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "id": "6qAa9p6ohtfi"
   },
   "outputs": [],
   "source": [
    "textDF = spark.read.text(\n",
    "    \"sherlockholmes.txt\",\n",
    "    wholetext=True\n",
    ").toDF(\"text\")"
   ]
  },
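  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With `wholetext=True`, Spark reads the whole file into a single row, which is what we want before splitting. A quick sanity check (a minimal sketch, assuming the download above succeeded):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# wholetext=True yields exactly one row holding the entire file\n",
    "print(textDF.count())\n",
    "print(len(textDF.first()[\"text\"]))"
   ]
  },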
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "DVHludGFMSCk",
    "outputId": "bced22c6-794b-4fd8-ad78-2bc0a1880f5a"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sparknlp.annotator.document_token_splitter.DocumentTokenSplitter"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "DocumentTokenSplitter"
   ]
  },
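  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the annotator can be configured with, we can list its parameters. A quick sketch using the standard `explainParams()` inherited from pyspark.ml (output omitted here):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the annotator's configurable parameters and their documentation\n",
    "print(DocumentTokenSplitter().explainParams())"
   ]
  },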
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O4uPbdrSA9ci"
   },
   "source": [
    "Let's create a Spark NLP pipeline with the following stages: a `DocumentAssembler` followed by a `DocumentTokenSplitter` that cuts the document into 512-token chunks with a 10-token overlap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ASQ5Ot2NA9ci",
    "outputId": "3a8c06d6-f8ce-442f-b8c9-b107610d7b54"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------------------------------------------------------------------+-----+-----+------+------+\n",
      "|                                                                          result|begin|  end|length|tokens|\n",
      "+--------------------------------------------------------------------------------+-----+-----+------+------+\n",
      "|[{\"payload\":{\"allShortcutsEnabled\":false,\"fileTree\":{\"src/test/resources/spel...|    0|11335| 11335|   512|\n",
      "|[the case of the Trepoff murder, of his clearing up\",\"of the singular tragedy...|11280|14436|  3156|   512|\n",
      "|[order to remove crusted mud from it.\",\"Hence, you see, my double deduction t...|14379|17697|  3318|   512|\n",
      "|[a \\\"P,\\\" and a\",\"large \\\"G\\\" with a small \\\"t\\\" woven into the texture of th...|17644|20993|  3349|   512|\n",
      "|[which he had apparently adjusted that very moment,\",\"for his hand was still ...|20928|24275|  3347|   512|\n",
      "|[his high white forehead, \\\"you\",\"can understand that I am not accustomed to ...|24214|27991|  3777|   512|\n",
      "|[send it on the day when the\",\"betrothal was publicly proclaimed. That will b...|27927|31354|  3427|   512|\n",
      "|[and helpless, in the\",\"chair.\",\"\",\"\\\"What is it?\\\"\",\"\",\"\\\"It's quite too fun...|31273|34428|  3155|   512|\n",
      "+--------------------------------------------------------------------------------+-----+-----+------+------+\n",
      "only showing top 8 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "documentAssembler = DocumentAssembler() \\\n",
    "    .setInputCol(\"text\") \\\n",
    "    .setOutputCol(\"document\")\n",
    "\n",
    "textSplitter = DocumentTokenSplitter() \\\n",
    "    .setInputCols([\"document\"]) \\\n",
    "    .setOutputCol(\"splits\") \\\n",
    "    .setNumTokens(512) \\\n",
    "    .setTokenOverlap(10) \\\n",
    "    .setExplodeSplits(True)\n",
    "\n",
    "pipeline = Pipeline().setStages([documentAssembler, textSplitter])\n",
    "result = pipeline.fit(textDF).transform(textDF)\n",
    "\n",
    "result.selectExpr(\n",
    "    \"splits.result as result\",\n",
    "    \"splits[0].begin as begin\",\n",
    "    \"splits[0].end as end\",\n",
    "    \"splits[0].end - splits[0].begin as length\",\n",
    "    \"splits[0].metadata.numTokens as tokens\") \\\n",
    "    .show(8, truncate=80)"
   ]
  },
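  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The 10-token overlap shows up in the character offsets above: each chunk begins before the previous one ends (e.g. 11280 < 11335). A minimal sketch of that check, assuming `result` from the previous cell with exploded splits in document order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Each exploded row carries one split; with a 10-token overlap,\n",
    "# consecutive character spans must themselves overlap\n",
    "spans = result.selectExpr(\n",
    "    \"splits[0].begin as begin\",\n",
    "    \"splits[0].end as end\"\n",
    ").collect()\n",
    "\n",
    "for prev, cur in zip(spans, spans[1:]):\n",
    "    assert cur.begin < prev.end\n",
    "print(\"all consecutive chunks overlap\")"
   ]
  },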
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CALoU6tSofto"
   },
   "source": [
    "# Now let's build a second pipeline to verify the splitting behavior on a small example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "H5DFx2DOosri"
   },
   "source": [
    "Let's get the data ready:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "id": "ZqR7pcQ9pw7a"
   },
   "outputs": [],
   "source": [
    "df = spark.createDataFrame([\n",
    "    [(\"All emotions, and that\\none particularly, were abhorrent to his cold, \"\n",
    "      \"precise but\\nadmirably balanced mind.\\n\\nHe was, I take it, the most \"\n",
    "      \"perfect\\nreasoning and observing machine that the world has seen.\")]\n",
    "]).toDF(\"text\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ArsOgKafoft0"
   },
   "source": [
    "Let's create a Spark NLP pipeline with the same stages as before, this time splitting into 3-token chunks with a 1-token overlap:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "id": "x5ZwHjKSoft2"
   },
   "outputs": [],
   "source": [
    "documentAssembler = DocumentAssembler() \\\n",
    "    .setInputCol(\"text\") \\\n",
    "    .setOutputCol(\"document\")\n",
    "\n",
    "document_token_splitter = DocumentTokenSplitter() \\\n",
    "    .setInputCols(\"document\") \\\n",
    "    .setOutputCol(\"splits\") \\\n",
    "    .setNumTokens(3) \\\n",
    "    .setTokenOverlap(1) \\\n",
    "    .setExplodeSplits(True) \\\n",
    "    .setTrimWhitespace(True)\n",
    "\n",
    "pipeline = Pipeline().setStages([documentAssembler, document_token_splitter])\n",
    "pipeline_df = pipeline.fit(df).transform(df)\n",
    "\n",
    "results = pipeline_df.select(\"splits\").collect()\n",
    "\n",
    "# Normalize newlines so each chunk reads as a single line\n",
    "splits = [\n",
    "    row[\"splits\"][0].result.replace(\"\\n\\n\", \" \").replace(\"\\n\", \" \")\n",
    "    for row in results\n",
    "]"
   ]
  },
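  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With `numTokens=3` and `tokenOverlap=1`, the window advances by numTokens - tokenOverlap = 2 tokens, so each chunk repeats the last token of the previous one. Here is a minimal pure-Python sketch of that sliding window (assuming simple whitespace tokenization, which matches this example; the annotator's own tokenization may differ on other inputs):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def token_windows(tokens, num_tokens=3, overlap=1):\n",
    "    # Slide a num_tokens-wide window forward by (num_tokens - overlap)\n",
    "    step = num_tokens - overlap\n",
    "    return [\n",
    "        \" \".join(tokens[i:i + num_tokens])\n",
    "        for i in range(0, len(tokens) - overlap, step)\n",
    "    ]\n",
    "\n",
    "# str.split() tokenizes on any whitespace, including the newlines in the text\n",
    "token_windows(df.first()[\"text\"].split())[:4]"
   ]
  },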
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mjUiY6sOp-jY"
   },
   "source": [
    "**Evaluation**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "s5wMKcnVp94o",
    "outputId": "9a4ef0f9-76af-403d-81e3-0117e538f887"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "expected = [\n",
    "    \"All emotions, and\",\n",
    "    \"and that one\",\n",
    "    \"one particularly, were\",\n",
    "    \"were abhorrent to\",\n",
    "    \"to his cold,\",\n",
    "    \"cold, precise but\",\n",
    "    \"but admirably balanced\",\n",
    "    \"balanced mind. He\",\n",
    "    \"He was, I\",\n",
    "    \"I take it,\",\n",
    "    \"it, the most\",\n",
    "    \"most perfect reasoning\",\n",
    "    \"reasoning and observing\",\n",
    "    \"observing machine that\",\n",
    "    \"that the world\",\n",
    "    \"world has seen.\",\n",
    "]\n",
    "\n",
    "splits == expected"
   ]
  },
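  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The 1-token overlap is also directly visible: the last token of each chunk equals the first token of the next. A quick check over the `splits` list computed above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# With tokenOverlap=1, consecutive chunks share exactly one token\n",
    "for prev, cur in zip(splits, splits[1:]):\n",
    "    assert prev.split()[-1] == cur.split()[0]\n",
    "print(\"1-token overlap verified\")"
   ]
  },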
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Wq4G03A2qB5U"
   },
   "source": [
    "Great, it works!"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python [conda env:tempspark]",
   "language": "python",
   "name": "conda-env-tempspark-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@


examples/python/training/english/classification/ClassifierDL_Training_using_INSTRUCTOR_Embeddings.ipynb

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
