Commit 860dcaa

Add files via upload
1 parent 845f687 commit 860dcaa

2 files changed

+3540
-0
lines changed

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
1+
{
2+
"nbformat": 4,
3+
"nbformat_minor": 0,
4+
"metadata": {
5+
"colab": {
6+
"provenance": []
7+
},
8+
"kernelspec": {
9+
"name": "python3",
10+
"display_name": "Python 3"
11+
},
12+
"language_info": {
13+
"name": "python"
14+
}
15+
},
16+
"cells": [
17+
{
18+
"cell_type": "code",
19+
"execution_count": null,
20+
"metadata": {
21+
"id": "WNtUZ497_0kd"
22+
},
23+
"outputs": [],
24+
"source": [
25+
"!pip install 'datafog==3.0.1'"
26+
]
27+
},
28+
{
29+
"cell_type": "markdown",
30+
"source": [
31+
"# Example: Annotating PII from text\n"
32+
],
33+
"metadata": {
34+
"id": "GWl2SrygBvM8"
35+
}
36+
},
37+
{
38+
"cell_type": "markdown",
39+
"source": [
40+
"## Setup"
41+
],
42+
"metadata": {
43+
"id": "0p1U8dU8KBS3"
44+
}
45+
},
46+
{
47+
"cell_type": "code",
48+
"source": [
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder \\\n",
"    .appName(\"DataFog\") \\\n",
"    .config(\"spark.driver.memory\", \"8g\") \\\n",
"    .config(\"spark.executor.memory\", \"8g\") \\\n",
"    .getOrCreate()"
55+
],
56+
"metadata": {
57+
"id": "fwv9QpGIEuAn"
58+
},
59+
"execution_count": null,
60+
"outputs": []
61+
},
62+
{
63+
"cell_type": "markdown",
64+
"source": [],
65+
"metadata": {
66+
"id": "DVd1AtvqIkuA"
67+
}
68+
},
69+
{
70+
"cell_type": "markdown",
71+
"source": [
72+
"## Spark Functions to broadcast over DataFrame"
73+
],
74+
"metadata": {
75+
"id": "2u8MhJAPImW0"
76+
}
77+
},
78+
{
79+
"cell_type": "code",
80+
"source": [
"from pyspark.sql import SparkSession\n",
"from pyspark.sql.functions import udf\n",
"from pyspark.sql.types import ArrayType, StringType\n",
"import spacy\n",
"import requests\n",
"\n",
"PII_ANNOTATION_LABELS = [\"DATE_TIME\", \"LOC\", \"NRP\", \"ORG\", \"PER\"]\n",
"MAXIMAL_STRING_SIZE = 1000000\n",
"\n",
"def pii_annotator(text: str, broadcasted_nlp) -> list[list[str]]:\n",
"    \"\"\"Extract PII entities from text using the broadcast spaCy model.\n",
"\n",
"    Returns:\n",
"        list[list[str]]: Entity texts grouped by label, in the order defined in PII_ANNOTATION_LABELS.\n",
"    \"\"\"\n",
"    if text:\n",
"        if len(text) > MAXIMAL_STRING_SIZE:\n",
"            # Truncate overly long strings to the maximum supported size\n",
"            text = text[:MAXIMAL_STRING_SIZE]\n",
"        nlp = broadcasted_nlp.value\n",
"        doc = nlp(text)\n",
"\n",
"        # Pre-create a dictionary with one empty list per expected label\n",
"        classified_entities: dict[str, list[str]] = {\n",
"            _label: [] for _label in PII_ANNOTATION_LABELS\n",
"        }\n",
"        for ent in doc.ents:\n",
"            # Skip labels outside PII_ANNOTATION_LABELS to avoid a KeyError\n",
"            if ent.label_ in classified_entities:\n",
"                classified_entities[ent.label_].append(ent.text)\n",
"\n",
"        return list(classified_entities.values())\n",
"    else:\n",
"        return [[] for _ in PII_ANNOTATION_LABELS]\n",
"\n",
"def broadcast_pii_annotator_udf(spark_session: SparkSession, spacy_model: str = \"en_spacy_pii_fast\"):\n",
"    \"\"\"Broadcast the spaCy PII model across the Spark cluster and return an annotation UDF.\"\"\"\n",
"    broadcasted_nlp = spark_session.sparkContext.broadcast(\n",
"        spacy.load(spacy_model)\n",
"    )\n",
"\n",
"    pii_annotation_udf = udf(\n",
"        lambda text: pii_annotator(text, broadcasted_nlp),\n",
"        ArrayType(ArrayType(StringType())),\n",
"    )\n",
"    return pii_annotation_udf"
127+
],
128+
"metadata": {
129+
"id": "H-q24tYIF-Bw"
130+
},
131+
"execution_count": null,
132+
"outputs": []
133+
},
134+
{
135+
"cell_type": "code",
136+
"source": [
137+
"sotu_url = 'https://gist.githubusercontent.com/sidmohan0/1aa3ec38b4e6594d3c34b113f2e0962d/raw/42e57146197be0f85a5901cd1dcdd9ad15b31bab/sotu_2023.txt'\n",
138+
"\n",
139+
"# Fetch the content of the text file\n",
"response = requests.get(sotu_url, timeout=30)\n",
"response.raise_for_status()  # fail fast if the download did not succeed\n",
141+
"sotu_text = response.text\n",
142+
"\n",
143+
"# Create a DataFrame from the text data\n",
144+
"df = spark.createDataFrame([(line,) for line in sotu_text.split('\\n') if line], [\"text\"])\n",
145+
"df.show()\n"
146+
],
147+
"metadata": {
148+
"colab": {
149+
"base_uri": "https://localhost:8080/"
150+
},
151+
"id": "-GzGCpA6JKqB",
152+
"outputId": "491e31ea-d965-4d8b-8a58-c6e202d6d01b"
153+
},
154+
"execution_count": null,
155+
"outputs": [
156+
{
157+
"output_type": "stream",
158+
"name": "stdout",
159+
"text": [
160+
"+--------------------+\n",
161+
"| text|\n",
162+
"+--------------------+\n",
163+
"|Mr. Speaker, Mada...|\n",
164+
"|And, by the way, ...|\n",
165+
"|Members of the Ca...|\n",
166+
"|You know, I start...|\n",
167+
"|Speaker, I don’t ...|\n",
168+
"|And I want to con...|\n",
169+
"|He won despite th...|\n",
170+
"|Congratulations t...|\n",
171+
"|And congratulatio...|\n",
172+
"|Well, I tell you ...|\n",
173+
"|Folks, the story ...|\n",
174+
"|We’re the only co...|\n",
175+
"|Look, folks, that...|\n",
176+
"|Two years ago, th...|\n",
177+
"|Two years ago — a...|\n",
178+
"|And two years ago...|\n",
179+
"|As we gather here...|\n",
180+
"|When world leader...|\n",
181+
"|You know, we’re o...|\n",
182+
"|Yes, we disagreed...|\n",
183+
"+--------------------+\n",
184+
"only showing top 20 rows\n",
185+
"\n"
186+
]
187+
}
188+
]
189+
},
190+
{
191+
"cell_type": "markdown",
192+
"source": [
"## Feature Extraction"
194+
],
195+
"metadata": {
196+
"id": "6SkvKcgKJ79m"
197+
}
198+
},
199+
{
200+
"cell_type": "code",
201+
"source": [
202+
"extract_features_udf = broadcast_pii_annotator_udf(spark, spacy_model=\"en_spacy_pii_fast\")\n",
203+
"\n",
204+
"df = df.withColumn(\"en_spacy_pii_fast\", extract_features_udf(df.text))\n",
205+
"df.show(truncate=False)"
206+
],
207+
"metadata": {
208+
"colab": {
209+
"base_uri": "https://localhost:8080/"
210+
},
211+
"id": "euhCyV-4JoU0",
212+
"outputId": "99e883a3-3e02-497c-bd58-4b50ac3996d4"
213+
},
214+
"execution_count": null,
215+
"outputs": [
216+
{
217+
"output_type": "stream",
218+
"name": "stdout",
219+
"text": [
220+
"+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+\n",
221+
"|text |en_spacy_pii_fast |\n",
222+
"+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+\n",
223+
"|Mr. Speaker, Madam Vice President, our First Lady and Second Gentleman — good to see you guys up there — members of Congress — |[[], [Madam], [], [First Lady, Congress], [Speaker, Second Gentleman]] |\n",
224+
"|And, by the way, Chief Justice, I may need a court order. She gets to go to the game tomorr- — next week. I have to stay home. We got to work something out here. |[[next week], [], [], [], [Chief Justice]] |\n",
225+
"|Members of the Cabinet, leaders of our military, Chief Justice, Associate Justices, and retired Justices of the Supreme Court, and to you, my fellow Americans: |[[], [], [Americans], [Cabinet, Chief Justice, the Supreme Court], []] |\n",
226+
"|You know, I start tonight by congratulating the 118th Congress and the new Speaker of the House, Kevin McCarthy. |[[tonight], [], [], [Congress, House], [Kevin McCarthy]] |\n",
227+
"|Speaker, I don’t want to ruin your reputation, but I look forward to working with you. |[[], [], [], [], []] |\n",
228+
"|And I want to congratulate the new Leader of the House Democrats, the first African American Minority Leader in history, Hakeem Jeffries. |[[], [], [Democrats, African American], [House, Hakeem Jeffries], []] |\n",
229+
"|He won despite the fact I campaigned for him. |[[], [], [], [], []] |\n",
230+
"|Congratulations to the longest-serving Leader in the history of the United States Senate, Mitch McConnell. Where are you, Mitch? |[[], [the United States Senate], [], [Leader], [Mitch McConnell, Mitch]] |\n",
231+
"|And congratulations to Chuck Schumer, another — you know, another term as Senate Minority [Majority] Leader. You know, I think you — only this time you have a slightly bigger majority, Mr. Leader. And you’re the Majority Leader. About that much bigger? Yeah. |[[], [Yeah], [], [Senate, Leader], [Chuck Schumer, Leader]] |\n",
232+
"|Well, I tell you what — I want to give specolec- — special recognition to someone who I think is going to be considered the greatest Speaker in the history of the House of Representatives: Nancy Pelosi. |[[], [], [], [the House of Representatives], [Nancy Pelosi]] |\n",
233+
"|Folks, the story of America is a story of progress and resilience, of always moving forward, of never, ever giving up. It’s a story unique among all nations. |[[], [America], [], [], []] |\n",
234+
"|We’re the only country that has emerged from every crisis we’ve ever entered stronger than we got into it. |[[], [], [], [], []] |\n",
235+
"|Look, folks, that’s what we’re doing again. |[[], [], [], [], []] |\n",
236+
"|Two years ago, the economy was reeling. I stand here tonight, after we’ve created, with the help of many people in this room, 12 million new jobs — more jobs created in two years than any President has created in four years — because of you all, because of the American people.|[[Two years ago, tonight, two years, four years], [], [American], [], []]|\n",
237+
"|Two years ago — and two years ago, COVID had shut down — our businesses were closed, our schools were robbed of so much. And today, COVID no longer controls our lives. |[[Two years ago, two years ago, today], [], [], [COVID], []] |\n",
238+
"|And two years ago, our democracy faced its greatest threat since the Civil War. And today, though bruised, our democracy remains unbowed and unbroken. |[[two years ago, today], [], [], [], []] |\n",
239+
"|As we gather here tonight, we’re writing the next chapter in the great American story — a story of progress and resilience. |[[tonight], [], [American], [], []] |\n",
240+
"|When world leaders ask me to define America — and they do, believe it or not — I say I can define it in one word, and I mean this: possibilities. We don’t think anything is beyond our capacity. Everything is a possibility. |[[], [America], [], [], []] |\n",
241+
"|You know, we’re often told that Democrats and Republicans can’t work together. But over the past two years, we proved the cynics and naysayers wrong. |[[the past two years], [], [Democrats, Republicans], [], []] |\n",
242+
"|Yes, we disagreed plenty. And yes, there were times when Democrats went alone. |[[], [], [Democrats], [], []] |\n",
243+
"+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+\n",
244+
"only showing top 20 rows\n",
245+
"\n"
246+
]
247+
}
248+
]
249+
},
250+
{
251+
"cell_type": "markdown",
252+
"source": [
253+
"#"
254+
],
255+
"metadata": {
256+
"id": "-Abubt0jKPRD"
257+
}
258+
}
259+
]
260+
}
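
The grouping step inside `pii_annotator` can be tried without Spark or spaCy. What follows is a minimal sketch, assuming entities arrive as plain `(label, text)` pairs rather than spaCy spans. `group_entities_by_label` is a hypothetical helper that is not part of DataFog, and the `CARDINAL` label is included only to illustrate a label outside `PII_ANNOTATION_LABELS`.

```python
PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]


def group_entities_by_label(entities, labels=PII_ANNOTATION_LABELS):
    """Group (label, text) pairs into one list per label, in label order.

    Hypothetical stand-in for the loop over doc.ents in the notebook.
    """
    grouped = {label: [] for label in labels}
    for label, text in entities:
        # Ignore unexpected labels instead of raising KeyError
        if label in grouped:
            grouped[label].append(text)
    return list(grouped.values())


# Entities resembling the notebook output for the Kevin McCarthy sentence,
# plus one illustrative label ("CARDINAL") to show that unknown labels are skipped.
sample = [
    ("DATE_TIME", "tonight"),
    ("ORG", "Congress"),
    ("ORG", "House"),
    ("PER", "Kevin McCarthy"),
    ("CARDINAL", "118th"),
]
result = group_entities_by_label(sample)
print(result)
```

The result always has one inner list per label, so it fits the `ArrayType(ArrayType(StringType()))` return type declared for the UDF even when a row has no entities.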

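Each value in the `en_spacy_pii_fast` column is a positional list, so reading it requires the label order from `PII_ANNOTATION_LABELS`. The sketch below pairs one row from the output above with its labels. `label_row` is a hypothetical helper introduced here for illustration, and the row is copied from the displayed result for the Kevin McCarthy sentence.

```python
PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]


def label_row(annotations, labels=PII_ANNOTATION_LABELS):
    """Pair each inner list with its label (hypothetical helper)."""
    if len(annotations) != len(labels):
        raise ValueError("expected one list per label")
    return dict(zip(labels, annotations))


# Annotation value shown in the notebook output for the fourth row
row = [["tonight"], [], [], ["Congress", "House"], ["Kevin McCarthy"]]
labeled = label_row(row)
print(labeled["ORG"])
print(labeled["PER"])
```

A dictionary keyed by label is easier to inspect than a nested list, at the cost of relying on the column and the label list staying in the same order.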