added intake notebook draft
scottyhq committed Dec 6, 2019
1 parent 5b7da73 commit 1f9d144
Showing 1 changed file with 252 additions and 0 deletions.
252 changes: 252 additions & 0 deletions notebooks/intake.ipynb
@@ -0,0 +1,252 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intake\n",
"\n",
"\n",
"<div class=\"alert-info\">\n",
"\n",
"### Overview\n",
" \n",
"* **teaching:** 20 minutes\n",
"* **exercises:** 0\n",
"* **questions:**\n",
"    * How does Intake simplify data discovery, distribution, and loading?\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Table of contents\n",
"1. [**Intake primer**](#Intake-primer)\n",
"1. [**Build an intake catalog**](#Build-an-intake-catalog)\n",
"1. [**Work with an intake catalog**](#Work-with-an-intake-catalog)\n",
"1. [**Intake xarray example**](#Intake-xarray-example)\n",
"1. [**Intake STAC example**](#Intake-STAC-example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Intake primer\n",
"\n",
"<img src=\"https://intake.readthedocs.io/en/latest/_static/images/logo.png\" alt=\"intake logo\" width=\"200\" align=\"right\"/>\n",
"\n",
"\n",
"[Intake](https://intake.readthedocs.io/en/latest/index.html) is a lightweight package for finding, investigating, loading and disseminating data. This notebook illustrates the usefulness of Intake for a \"Data User\". Intake simplifies loading data from [many formats](https://intake.readthedocs.io/en/latest/plugin-directory.html#plugin-directory) into familiar Python objects like Pandas DataFrames or Xarray Datasets. Intake is especially useful for remote datasets: it allows us to bypass downloading data and instead load directly into a Python object for analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build an intake catalog\n",
"\n",
"Let's say we want to save a version of the data from our geopandas.ipynb tutorial for easy sharing and future use. Intake has CSV support by default, but to load data with geopandas we need to make sure the [intake_geopandas plugin](https://github.com/intake/intake_geopandas) is installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import intake\n",
"intake.__version__"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save data locally from our queries\n",
"import pandas as pd\n",
"import geopandas as gpd\n",
"\n",
"server = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?'\n",
"query = 'service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'\n",
"df = pd.read_csv(server+query)\n",
"df.to_csv('votw.csv', index=False)\n",
"\n",
"# Alternatively, request the same query results as JSON,\n",
"# load them directly with geopandas, and save them as GeoJSON\n",
"query = 'service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=json'\n",
"gf = gpd.read_file(server+query)\n",
"gf.to_file('votw.geojson', driver='GeoJSON')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile votw-intake-catalog.yaml\n",
"\n",
"metadata:\n",
"  version: 1\n",
"\n",
"sources:\n",
"  votw_pandas:\n",
"    args:\n",
"      csv_kwargs:\n",
"        blocksize: null # prevent reading in parallel with dask\n",
"      #urlpath: 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'\n",
"      urlpath: './votw.csv'\n",
"    description: 'Smithsonian_VOTW_Holocene_Volcanoes 4.8.4'\n",
"    driver: csv\n",
"    metadata:\n",
"      citation: 'Global Volcanism Program, 2013. Volcanoes of the World, v. 4.8.4. Venzke, E (ed.). Smithsonian Institution. Downloaded 06 Dec 2019. https://doi.org/10.5479/si.GVP.VOTW4-2013'\n",
"      plots:\n",
"        last_eruption_year:\n",
"          kind: violin\n",
"          by: 'Region'\n",
"          y: 'Last_Eruption_Year'\n",
"          invert: True\n",
"          width: 700\n",
"          height: 500\n",
"\n",
"\n",
"  votw_geopandas:\n",
"    args:\n",
"      #urlpath: 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=json'\n",
"      urlpath: './votw.geojson'\n",
"    description: 'Smithsonian_VOTW_Holocene_Volcanoes 4.8.4'\n",
"    driver: geojson\n",
"    metadata:\n",
"      citation: 'Global Volcanism Program, 2013. Volcanoes of the World, v. 4.8.4. Venzke, E (ed.). Smithsonian Institution. Downloaded 06 Dec 2019. https://doi.org/10.5479/si.GVP.VOTW4-2013'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Put this catalog, votw.csv, and votw.geojson in a public place like GitHub.\n",
"# This facilitates sharing and version-controlled analysis.\n",
"cat = intake.open_catalog('votw-intake-catalog.yaml')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(list(cat))\n",
"cat.votw_pandas.description"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Loading the data is now straightforward.\n",
"# We know the data will be read into a Pandas DataFrame because\n",
"# the source reports its container type:\n",
"cat.votw_pandas.container"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = cat.votw_pandas.read()\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Notice we also specified some pre-defined plots in the catalog\n",
"# This requires hvplot\n",
"import hvplot.pandas\n",
"source = cat.votw_pandas\n",
"source.plot.last_eruption_year()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load a different dataset in the same catalog\n",
"source = cat.votw_geopandas\n",
"source.description"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gf = source.read()\n",
"test = gf.loc[:,['Last_Eruption_Year', 'Volcano_Name', 'geometry']]\n",
"test.hvplot.points(geo=True, hover_cols=['Volcano_Name'], color='Last_Eruption_Year')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Intake xarray example\n",
"\n",
"We've seen a plugin that loads geospatial vector data into geopandas GeoDataFrames. There is also a plugin, [intake-xarray](https://github.com/intake/intake-xarray), that facilitates loading geospatial raster data into xarray DataArrays."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Intake STAC example\n",
"\n",
"Instead of creating your own metadata catalogs from scratch as YAML files, you can use intake plugins that read catalogs in other formats. For example, for geospatial data on the web, [SpatioTemporal Asset Catalogs (STAC)](https://stacspec.org/) are emerging as a standard way to describe data that you want to search for based on georeferenced location, time, and perhaps other metadata fields. The [intake-stac](https://github.com/pangeo-data/intake-stac) plugin greatly facilitates loading datasets referenced in STAC catalogs into Python Xarray objects for analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
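
A note on the download cell: the two WFS request URLs are built by concatenating `server` with a hand-written query string that differs only in `outputFormat`. A minimal sketch of assembling the same URLs from a parameter dictionary with the standard library is shown here. The helper name `build_wfs_url` and the constants `SERVER` and `TYPE_NAME` are hypothetical and not part of the notebook or of intake; the endpoint and layer name are copied from the cell.

```python
from urllib.parse import urlencode

# Endpoint and layer name copied from the notebook's download cell
# (the trailing "?" is added when the URL is assembled).
SERVER = "https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows"
TYPE_NAME = "GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes"


def build_wfs_url(output_format, server=SERVER, type_name=TYPE_NAME):
    """Return a WFS GetFeature URL for the given output format.

    Hypothetical helper for illustration only.
    """
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeName": type_name,
        "outputFormat": output_format,
    }
    # safe=":" leaves the namespace separator in typeName unescaped,
    # so the result matches the hand-written query strings.
    return f"{server}?{urlencode(params, safe=':')}"


csv_url = build_wfs_url("csv")
json_url = build_wfs_url("json")
print(csv_url)
print(json_url)
```

The resulting strings could be passed to `pd.read_csv` and `gpd.read_file` exactly as `server+query` is in the notebook; only the URL construction changes, so a typo in one of the two query strings can no longer make them drift apart.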
