Commit 4c23530

author
codebasics
committed
tokenization
1 parent c46f9e6 commit 4c23530

6 files changed

+1150
-0
lines changed

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 align=\"center\">spaCy Tokenizer Exercise Solution</h3>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Collecting dataset websites from a book paragraph</h3>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Collecting Data Websites From Think Stats Book Paragraph\n",
    "\n",
    "https://greenteapress.com/thinkstats2/thinkstats2.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = '''\n",
    "Look for data to help you address the question. Governments are good\n",
    "sources because data from public research is often freely available. Good\n",
    "places to start include http://www.data.gov/, and http://www.science.\n",
    "gov/, and in the United Kingdom, http://data.gov.uk/.\n",
    "Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, \n",
    "and the European Social Survey at http://www.europeansocialsurvey.org/.\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['http://www.data.gov/',\n",
       " 'http://www.science',\n",
       " 'http://data.gov.uk/.',\n",
       " 'http://www3.norc.org/gss+website/',\n",
       " 'http://www.europeansocialsurvey.org/.']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = nlp(text)  # nlp: a spaCy pipeline created in an earlier cell, e.g. spacy.blank(\"en\")\n",
    "data_websites = [token.text for token in doc if token.like_url]\n",
    "data_websites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Figure out all transactions from this text with amount and currency</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "two $\n",
      "500 €\n"
     ]
    }
   ],
   "source": [
    "transactions = \"Tony gave two $ to Peter, Bruce gave 500 € to Steve\"\n",
    "doc = nlp(transactions)\n",
    "for token in doc:\n",
    "    if token.like_num and token.i + 1 < len(doc) and doc[token.i + 1].is_currency:\n",
    "        print(token.text, doc[token.i + 1].text)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
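
Note on the recorded URL output: two entries keep a trailing sentence period (`'http://data.gov.uk/.'`, `'http://www.europeansocialsurvey.org/.'`), and the second entry stops at `http://www.science`, apparently because that URL wraps across a line break in the paragraph. A minimal post-processing sketch follows, assuming that stripping trailing `.`, `,`, `;` and `:` is acceptable for this text. `clean_url` is a hypothetical helper introduced here for illustration; it is not part of spaCy or of the committed notebook, and it does not repair the wrapped URL.

```python
# Recorded output of the URL cell, copied verbatim from the notebook.
data_websites = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]


def clean_url(url: str) -> str:
    """Hypothetical helper: drop trailing sentence punctuation from a URL token."""
    return url.rstrip(".,;:")


cleaned = [clean_url(url) for url in data_websites]
for url in cleaned:
    print(url)
```

Entries that already end in `/` pass through unchanged, so the helper can be applied to the whole list.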

4_tokenization/Exercise/students.txt

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
Dayton high school, 8th grade students information
==================================================

Name      birth day         email
-----     ------------      ------
Virat     5 June, 1882      virat@kohli.com
Maria     12 April, 2001    maria@sharapova.com
Serena    24 June, 1998     serena@williams.com
Joe       1 May, 1997       joe@root.com
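
One way the rows of `students.txt` could be turned into records is sketched below using only the standard library. This is a regex-based stand-in, not the tokenizer approach the exercise is built around (spaCy's `token.like_email` attribute would be the tokenizer-based counterpart). The row pattern `ROW` and the function `parse_students` are assumptions made for illustration, based on the four data rows shown in this file.

```python
import re

# Contents of students.txt as shown in this commit (column whitespace collapsed).
students_txt = """\
Dayton high school, 8th grade students information
==================================================

Name birth day email
----- ------------ ------
Virat 5 June, 1882 virat@kohli.com
Maria 12 April, 2001 maria@sharapova.com
Serena 24 June, 1998 serena@williams.com
Joe 1 May, 1997 joe@root.com
"""

# Assumed row shape: <name> <day> <month>, <year> <email>
ROW = re.compile(r"^(\w+)\s+(\d{1,2}\s+\w+,\s+\d{4})\s+(\S+@\S+)$")


def parse_students(text):
    """Return one dict per data row; title, header, and rule lines are skipped."""
    records = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, birth_day, email = match.groups()
            records.append({"name": name, "birth_day": birth_day, "email": email})
    return records


records = parse_students(students_txt)
emails = [record["email"] for record in records]
print(emails)
```

Because the pattern separates fields with `\s+`, it matches whether the columns are divided by single spaces or by aligned runs of spaces.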