pythoninchemistry
diff --git a/‎notebooks/DataFrame.png‎
38.7 KB b/‎notebooks/DataFrame.png‎
38.7 KB
diff --git a/‎notebooks/intro_to_pandas.ipynb‎
Lines changed: 327 additions & 0 deletions b/‎notebooks/intro_to_pandas.ipynb‎
Lines changed: 327 additions & 0 deletions
@@ -0,0 +1,327 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4e9376f84a7c26bc",
+   "metadata": {},
+   "source": [
+    "# Introduction to the Pandas Library"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "16a10868d3450642",
+   "metadata": {},
+   "source": [
+    "*pandas* is a library within python that is designed to be used for data analysis. It is similar to Excel as it can handle large datasets, but with\n",
+    " the advantage of being able to manipulate the data in a programmable way.\n",
+    " You can\n",
+    "find the pandas documentation [here](https://pandas.pydata.org/docs/).\n",
+    "\n",
+    "\n",
+    "There is an [introductory video available](https://youtu.be/_T8LGqJtuGc) that tries to teach the basics of pands in just 10 minutes!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ddeb90892d82a5b",
+   "metadata": {},
+   "source": [
+    "### Prerequisites\n",
+    "- variables and data types\n",
+    "- libraries (not sure if this is needed)\n",
+    "- Boolean operators\n",
+    "- print\n",
+    "- f-strings"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a73114b516278ac5",
+   "metadata": {},
+   "source": [
+    "### Learning Outcomes\n",
+    "- Read and write files\n",
+    "- Understand what a dataframe is\n",
+    "- Check files are imported correctly\n",
+    "- Select a subset of a DataFrame\n",
+    "- Add new columns to a dataframe\n",
+    "- Calculate summary statistics\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5409de65537887d8",
+   "metadata": {},
+   "source": [
+    "The community standard alias for the pandas package is *pd*, which is assumed in the pandas documentation and in a lot of code you may see online."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "705306f1027fa7e",
+   "metadata": {},
+   "source": "import pandas as pd",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "159944926f25cdc9",
+   "metadata": {},
+   "source": "## Reading files"
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9f8ce7a24299e71c",
+   "metadata": {},
+   "source": [
+    "In pandas, it is useful to read data into a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame),\n",
+    "which is similar to an Excel spreadsheet:\n",
+    "\n",
+    "![Pandas DataFrame](DataFrame.png)\n",
+    "\n",
+    "There are many ways to read data into pandas depending on the file type, but for regular delimited files,\n",
+    " the function [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) can be used."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "6ef4f4222b561d3e",
+   "metadata": {},
+   "source": [
+    "data = pd.read_csv(\"periodic_table.csv\")\n",
+    "data"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "id": "946227594d5d4492",
+   "metadata": {},
+   "source": [
+    "> This function assumes the data is comma separated, for other separators you can specify it using the delimiter parameter. If the separator is not a\n",
+    "regular character (e.g. a tab, multiple spaces), an internet search should tell you what string to use. E.g. for a *tab* separated file:\n",
+    ">\n",
+    "> ```data_tab = pd.read_csv(\"**need to get a file**\", delimiter=\"\\t\")```\n",
+    ">\n",
+    "> There are other parameters available, to specify the headers, the datatype etc. See [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for full details.\n"
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": "### Viewing the data",
+   "id": "613367f256897f36"
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "Now that we have imported the data, it is important to view it is fully understand how it is formatted and ensure we imported it correctly. As you\n",
+    "may have noticed, when we try to display the dataframe, only some of the rows display. This is because only the first and last 5 rows will be shown\n",
+    " by default. There are functions we can use to display specific\n",
+    "parts of the\n",
+    "dataframe:\n",
+    "\n",
+    "- `data.head()` shows rows from the top of the file\n",
+    "- `data.tail()` shows rows from the bottom of the file\n",
+    "- `data.columns` shows the column names (header)\n",
+    "\n",
+    "If a number is given to `head` and `tail`, it will display that many rows.\n",
+    "\n",
+    "It can also be useful to check how pandas *interpreted* the data, and then change it if necessary. The data type can be checked using `.dtypes` and\n",
+    "it can be changed using `.astype()`.\n",
+    "\n",
+    "To display the datatype of all columns, we can run the function on the whole dataframe:"
+   ],
+   "id": "c00ce268787d2503"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": "data.dtypes",
+   "id": "de5e7c4b8c29071a",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": "Or we can instead run the function on only one column:",
+   "id": "5d9551818a2553db"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": "data[\"AtomicNumber\"].dtype",
+   "id": "e4f7fa55f0ad8042",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": "To change the data type, we need to reassign that column. E.g. to change the \"Name\" data to a string:",
+   "id": "b870cf77a1aea35f"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": [
+    "print(f'Data type before change: {data[\"Name\"].dtype}')\n",
+    "data[\"Name\"] = data[\"Name\"].astype(\"string\")\n",
+    "print(f'Data type after change: {data[\"Name\"].dtype}')"
+   ],
+   "id": "d976fecb52130b29",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## Exercise\n",
+    "\n",
+    "Display the first 8 elements."
+   ],
+   "id": "822ab5f3e84a6ff2"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": "# Add your answer here",
+   "id": "bce6df361acf974",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": [
+    "# Answer\n",
+    "data.head(8)"
+   ],
+   "id": "ac14452b9f70836e",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": "What element has atomic number 110? Hint: The table has 118 elements in it.",
+   "id": "ba7c9cb041afd40d"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": "# Add your answer here",
+   "id": "1c4beea42f5bb2d8",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": [
+    "# Answer\n",
+    "data.tail(9)\n",
+    "\n",
+    "# The element with an atomic number of 110 is Darmstadtium."
+   ],
+   "id": "82f5627d2fea26b7",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": "Change the \"Symbol\" data to strings. Check the data type of the column after.",
+   "id": "9885f5ed07d28703"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": "# Add your answer here",
+   "id": "7fa9904a9de0f284",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": [
+    "# Answer\n",
+    "data[\"Symbol\"] = data[\"Symbol\"].astype(\"string\")\n",
+    "print(f'Data type after change: {data[\"Symbol\"].dtype}')"
+   ],
+   "id": "d6403b10cf05d3b9",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## Writing files\n",
+    "\n",
+    "As with reading files, there are many ways to write data to a file depending on the file type wanted, but for regular delimited files,\n",
+    " the function [`to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) can be used.\n",
+    "\n",
+    "As DataFrames have an index column, we have to decide if we want to keep this or not. We can do this using the `index` parameter. To **NOT**\n",
+    "include the index column, use `index=False`."
+   ],
+   "id": "420135f8853d1421"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": "data.to_csv(\"periodic_table_out.csv\", index=False)",
+   "id": "484f5eeecf6e9533",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "> As with reading files, we can specify what separator we want the data to be written using `sep`. There are many other useful parameters for\n",
+    "> specifying what data to save and how to save it. See [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) for more infromation."
+   ],
+   "id": "8cb03b854e801781"
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "# To Do\n",
+    "- select a subset of a df\n",
+    "- create new columns\n",
+    "- calculate statistics"
+   ],
+   "id": "73f5ded338418595"
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}