Commit 90d02ae

Merge pull request #23 from sarahvs99/master
Adding pandas lesson (from York 2025 meeting)
2 parents 14f659b + ea138d3 commit 90d02ae

3 files changed

+446
-0
lines changed

notebooks/DataFrame.png

38.7 KB

notebooks/intro_to_pandas.ipynb

Lines changed: 327 additions & 0 deletions
@@ -0,0 +1,327 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "4e9376f84a7c26bc",
6+
"metadata": {},
7+
"source": [
8+
"# Introduction to the Pandas Library"
9+
]
10+
},
11+
{
12+
"cell_type": "markdown",
13+
"id": "16a10868d3450642",
14+
"metadata": {},
15+
"source": [
16+
"*pandas* is a Python library designed for data analysis. Like Excel, it can handle large datasets, but with\n",
17+
" the advantage that the data can be manipulated programmatically.\n",
18+
" You can\n",
19+
"find the pandas documentation [here](https://pandas.pydata.org/docs/).\n",
20+
"\n",
21+
"\n",
22+
"There is an [introductory video available](https://youtu.be/_T8LGqJtuGc) that tries to teach the basics of pandas in just 10 minutes!"
23+
]
24+
},
25+
{
26+
"cell_type": "markdown",
27+
"id": "5ddeb90892d82a5b",
28+
"metadata": {},
29+
"source": [
30+
"### Prerequisites\n",
31+
"- variables and data types\n",
32+
"- libraries (not sure if this is needed)\n",
33+
"- Boolean operators\n",
34+
"- print\n",
35+
"- f-strings"
36+
]
37+
},
38+
{
39+
"cell_type": "markdown",
40+
"id": "a73114b516278ac5",
41+
"metadata": {},
42+
"source": [
43+
"### Learning Outcomes\n",
44+
"- Read and write files\n",
45+
"- Understand what a DataFrame is\n",
46+
"- Check files are imported correctly\n",
47+
"- Select a subset of a DataFrame\n",
48+
"- Add new columns to a DataFrame\n",
49+
"- Calculate summary statistics\n"
50+
]
51+
},
52+
{
53+
"cell_type": "markdown",
54+
"id": "5409de65537887d8",
55+
"metadata": {},
56+
"source": [
57+
"The community standard alias for the pandas package is *pd*, which is assumed in the pandas documentation and in a lot of code you may see online."
58+
]
59+
},
60+
{
61+
"cell_type": "code",
62+
"id": "705306f1027fa7e",
63+
"metadata": {},
64+
"source": "import pandas as pd",
65+
"outputs": [],
66+
"execution_count": null
67+
},
68+
{
69+
"cell_type": "markdown",
70+
"id": "159944926f25cdc9",
71+
"metadata": {},
72+
"source": "## Reading files"
73+
},
74+
{
75+
"cell_type": "markdown",
76+
"id": "9f8ce7a24299e71c",
77+
"metadata": {},
78+
"source": [
79+
"In pandas, it is useful to read data into a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame),\n",
80+
"which is similar to an Excel spreadsheet:\n",
81+
"\n",
82+
"![Pandas DataFrame](DataFrame.png)\n",
83+
"\n",
84+
"There are many ways to read data into pandas depending on the file type, but for regular delimited files,\n",
85+
" the function [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) can be used."
86+
]
87+
},
88+
{
89+
"cell_type": "code",
90+
"id": "6ef4f4222b561d3e",
91+
"metadata": {},
92+
"source": [
93+
"data = pd.read_csv(\"periodic_table.csv\")\n",
94+
"data"
95+
],
96+
"outputs": [],
97+
"execution_count": null
98+
},
99+
{
100+
"cell_type": "markdown",
101+
"id": "946227594d5d4492",
102+
"metadata": {},
103+
"source": [
104+
"> This function assumes the data is comma-separated; for other separators, specify one using the `delimiter` parameter. If the separator is not a\n",
105+
"> regular character (e.g. a tab or multiple spaces), an internet search should tell you what string to use. E.g. for a *tab*-separated file:\n",
106+
">\n",
107+
"> ```data_tab = pd.read_csv(\"**need to get a file**\", delimiter=\"\\t\")```\n",
108+
">\n",
109+
"> There are other parameters available, to specify the headers, the datatype etc. See [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for full details.\n"
110+
]
111+
},
112+
{
113+
"metadata": {},
114+
"cell_type": "markdown",
115+
"source": "### Viewing the data",
116+
"id": "613367f256897f36"
117+
},
118+
{
119+
"metadata": {},
120+
"cell_type": "markdown",
121+
"source": [
122+
"Now that we have imported the data, it is important to view it to fully understand how it is formatted and ensure we imported it correctly. As you\n",
123+
"may have noticed, when we try to display the dataframe, only some of the rows display. This is because only the first and last 5 rows will be shown\n",
124+
" by default. There are functions we can use to display specific\n",
125+
"parts of the\n",
126+
"dataframe:\n",
127+
"\n",
128+
"- `data.head()` shows rows from the top of the file\n",
129+
"- `data.tail()` shows rows from the bottom of the file\n",
130+
"- `data.columns` shows the column names (header)\n",
131+
"\n",
132+
"If a number is given to `head` or `tail`, that many rows are displayed (the default is 5).\n",
133+
"\n",
134+
"It can also be useful to check how pandas *interpreted* the data, and then change it if necessary. The data type can be checked using `.dtypes` and\n",
135+
"it can be changed using `.astype()`.\n",
136+
"\n",
137+
"To display the data type of every column, we can use the `.dtypes` attribute on the whole DataFrame:"
138+
],
139+
"id": "c00ce268787d2503"
140+
},
141+
{
142+
"metadata": {},
143+
"cell_type": "code",
144+
"source": "data.dtypes",
145+
"id": "de5e7c4b8c29071a",
146+
"outputs": [],
147+
"execution_count": null
148+
},
149+
{
150+
"metadata": {},
151+
"cell_type": "markdown",
152+
"source": "Or we can check a single column using its `.dtype` attribute:",
153+
"id": "5d9551818a2553db"
154+
},
155+
{
156+
"metadata": {},
157+
"cell_type": "code",
158+
"source": "data[\"AtomicNumber\"].dtype",
159+
"id": "e4f7fa55f0ad8042",
160+
"outputs": [],
161+
"execution_count": null
162+
},
163+
{
164+
"metadata": {},
165+
"cell_type": "markdown",
166+
"source": "To change the data type, we need to reassign that column. E.g. to change the \"Name\" data to a string:",
167+
"id": "b870cf77a1aea35f"
168+
},
169+
{
170+
"metadata": {},
171+
"cell_type": "code",
172+
"source": [
173+
"print(f'Data type before change: {data[\"Name\"].dtype}')\n",
174+
"data[\"Name\"] = data[\"Name\"].astype(\"string\")\n",
175+
"print(f'Data type after change: {data[\"Name\"].dtype}')"
176+
],
177+
"id": "d976fecb52130b29",
178+
"outputs": [],
179+
"execution_count": null
180+
},
181+
{
182+
"metadata": {},
183+
"cell_type": "markdown",
184+
"source": [
185+
"## Exercise\n",
186+
"\n",
187+
"Display the first 8 elements."
188+
],
189+
"id": "822ab5f3e84a6ff2"
190+
},
191+
{
192+
"metadata": {},
193+
"cell_type": "code",
194+
"source": "# Add your answer here",
195+
"id": "bce6df361acf974",
196+
"outputs": [],
197+
"execution_count": null
198+
},
199+
{
200+
"metadata": {},
201+
"cell_type": "code",
202+
"source": [
203+
"# Answer\n",
204+
"data.head(8)"
205+
],
206+
"id": "ac14452b9f70836e",
207+
"outputs": [],
208+
"execution_count": null
209+
},
210+
{
211+
"metadata": {},
212+
"cell_type": "markdown",
213+
"source": "Which element has atomic number 110? Hint: The table has 118 elements in it.",
214+
"id": "ba7c9cb041afd40d"
215+
},
216+
{
217+
"metadata": {},
218+
"cell_type": "code",
219+
"source": "# Add your answer here",
220+
"id": "1c4beea42f5bb2d8",
221+
"outputs": [],
222+
"execution_count": null
223+
},
224+
{
225+
"metadata": {},
226+
"cell_type": "code",
227+
"source": [
228+
"# Answer\n",
229+
"data.tail(9)\n",
230+
"\n",
231+
"# The element with an atomic number of 110 is Darmstadtium."
232+
],
233+
"id": "82f5627d2fea26b7",
234+
"outputs": [],
235+
"execution_count": null
236+
},
237+
{
238+
"metadata": {},
239+
"cell_type": "markdown",
240+
"source": "Change the \"Symbol\" data to strings. Check the data type of the column afterwards.",
241+
"id": "9885f5ed07d28703"
242+
},
243+
{
244+
"metadata": {},
245+
"cell_type": "code",
246+
"source": "# Add your answer here",
247+
"id": "7fa9904a9de0f284",
248+
"outputs": [],
249+
"execution_count": null
250+
},
251+
{
252+
"metadata": {},
253+
"cell_type": "code",
254+
"source": [
255+
"# Answer\n",
256+
"data[\"Symbol\"] = data[\"Symbol\"].astype(\"string\")\n",
257+
"print(f'Data type after change: {data[\"Symbol\"].dtype}')"
258+
],
259+
"id": "d6403b10cf05d3b9",
260+
"outputs": [],
261+
"execution_count": null
262+
},
263+
{
264+
"metadata": {},
265+
"cell_type": "markdown",
266+
"source": [
267+
"## Writing files\n",
268+
"\n",
269+
"As with reading files, there are many ways to write data to a file depending on the file type wanted, but for regular delimited files,\n",
270+
" the function [`to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) can be used.\n",
271+
"\n",
272+
"As DataFrames have an index column, we have to decide if we want to keep this or not. We can do this using the `index` parameter. To **NOT**\n",
273+
"include the index column, use `index=False`."
274+
],
275+
"id": "420135f8853d1421"
276+
},
277+
{
278+
"metadata": {},
279+
"cell_type": "code",
280+
"source": "data.to_csv(\"periodic_table_out.csv\", index=False)",
281+
"id": "484f5eeecf6e9533",
282+
"outputs": [],
283+
"execution_count": null
284+
},
285+
{
286+
"metadata": {},
287+
"cell_type": "markdown",
288+
"source": [
289+
"> As with reading files, we can specify the separator used when writing the data with `sep`. There are many other useful parameters for\n",
290+
"> specifying what data to save and how to save it. See [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) for more information."
291+
],
292+
"id": "8cb03b854e801781"
293+
},
294+
{
295+
"metadata": {},
296+
"cell_type": "markdown",
297+
"source": [
298+
"# To Do\n",
299+
"- select a subset of a df\n",
300+
"- create new columns\n",
301+
"- calculate statistics"
302+
],
303+
"id": "73f5ded338418595"
304+
}
305+
],
306+
"metadata": {
307+
"kernelspec": {
308+
"display_name": "Python 3 (ipykernel)",
309+
"language": "python",
310+
"name": "python3"
311+
},
312+
"language_info": {
313+
"codemirror_mode": {
314+
"name": "ipython",
315+
"version": 3
316+
},
317+
"file_extension": ".py",
318+
"mimetype": "text/x-python",
319+
"name": "python",
320+
"nbconvert_exporter": "python",
321+
"pygments_lexer": "ipython3",
322+
"version": "3.9.6"
323+
}
324+
},
325+
"nbformat": 4,
326+
"nbformat_minor": 5
327+
}
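
The notebook above depends on `periodic_table.csv`, which is not shown in this view. As a rough, self-contained sketch of the same read, inspect, convert, and write steps, the snippet below uses a small in-memory table instead. The four rows and the `csv_text`, `buffer`, and `round_trip` names are assumptions made for illustration; only the column names `AtomicNumber`, `Name`, and `Symbol` come from the notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for periodic_table.csv: the three column names used
# in the notebook, with only the first four elements.
csv_text = """AtomicNumber,Name,Symbol
1,Hydrogen,H
2,Helium,He
3,Lithium,Li
4,Beryllium,Be
"""

# read_csv accepts a file-like object, so no file on disk is needed here.
data = pd.read_csv(io.StringIO(csv_text))

first_two = data.head(2)           # rows from the top
last_one = data.tail(1)            # rows from the bottom
column_names = list(data.columns)  # the header

# Convert a column to the pandas string dtype by reassigning it.
data["Name"] = data["Name"].astype("string")

# Write tab-separated text without the index column, then read it back
# using the matching delimiter.
buffer = io.StringIO()
data.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer, delimiter="\t")

print(first_two)
print(last_one)
print(column_names)
print(data["Name"].dtype)
print(round_trip.shape)
```

Writing to an `io.StringIO` buffer rather than a file keeps the sketch free of side effects; swapping the buffer for a filename gives the same calls the notebook uses.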
