Description
Some users are looking for tools to help them assemble ERDDAP urls for use in their own workflows, while others would prefer to work at a higher, more opinionated level. I believe we can more cleanly separate functionality to help support the spectrum of erddapy users better.
Originally erddapy was meant to be a URL builder only. We added the main class later and stopped somewhere in between, trying to support many different usage patterns via the single primary class ERDDAP.
Issues to address
- Users in interactive workflows are required to transform ERDDAP objects as they wish to connect to new datasets, for example when moving from searching a server to visiting the datasets.
- Adding constraints to an ERDDAP object is a stateful, in-place change, whereas most interactive users are used to NumPy/Pandas/xarray style workflows where you can return or chain together changes.
- Switching out IO is currently non-trivial due to URL generation and data transformations being tightly coupled to IO.
Proposed Solution
I propose that we separate erddapy into more functional layers, roughly following the SQLAlchemy core/ORM model.
Core Layer
The core layer would contain two primary components of functionality: url generation & data transformation. This layer makes no choices or assumptions about IO allowing it to be reused easily.
- URL generation - Functions to generate valid URLs from bare components, such as a dataset name, format, and dictionary of constraints, to tabledap/M01_met_all.csv?time%2Cair_temperature%26air_temperature_qc%3D0%26time>%3D"2020-12-09T15%3A25%3A00.000Z"
- Data transformation - Functions to convert a raw response (.csv, .nc, ...) into Pandas DataFrames, xarray Datasets...
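As a rough sketch of what a core-layer URL function could look like (not a proposed final API): the name url_segment, its signature, and the convention that constraint keys carry their operator (such as "time>=") are all assumptions made for illustration. It percent-encodes the whole query with the standard library, which is slightly stricter than the example above, and it never touches the network.

```python
from urllib.parse import quote


def url_segment(dataset_id, file_type, variables, constraints, protocol="tabledap"):
    """Build the part of an ERDDAP URL that follows the server base URL.

    Hypothetical helper: constraint keys carry their operator (for example
    "time>="), and string values are wrapped in double quotes.
    """
    query = ",".join(variables)
    for key, value in constraints.items():
        if isinstance(value, str):
            value = f'"{value}"'
        query += f"&{key}{value}"
    # Percent-encode the entire query string; no IO happens here.
    return f"{protocol}/{dataset_id}.{file_type}?{quote(query, safe='')}"


segment = url_segment(
    "M01_met_all",
    "csv",
    variables=["time", "air_temperature"],
    constraints={"air_temperature_qc=": 0, "time>=": "2020-12-09T15:25:00.000Z"},
)
print(segment)
```

Because the function only returns a string, users could hand the result to requests, httpx, fsspec, or anything else they prefer.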
Object (or opinionated) Layer
The object (or opinionated) layer would present higher level objects for searching servers and accessing datasets with a Pandas or xarray like returning or chainable API compared to the transformational API of the current ERDDAP class.
This layer uses much of the core functionality and presents it in easy to use ways with an opinion as to the access method.
Additionally if possible these objects should be serializable, so they can be pickled and passed to other processes/machines (Dask/Dagster/Prefect).
- class ERDDAPConnection
  - While most ERDDAP servers allow connections via a bare URL, some servers may require authentication to access data.
  - .get(url_part: str) -> bytes or str
    - Method that actually requests the data.
    - Uses requests by default, similar to most of the current erddapy data fetching functionality.
    - Can be overridden to use httpx, and potentially aiohttp or other async functionality, which could hopefully make anything else async compatible. (investigate await_me_maybe)
  - .open(url_part: str) -> fp
    - Yields a file-like object for access (probably use fsspec.open under the hood) for file types/tools that don't enjoy getting passed a string.
  - @property(server) -> ERDDAPConnection
    - Return a new ERDDAPConnection if trying to set a new server or change other attributes, rather than changing it in place.
For all of the remaining classes, either an ERDDAPConnection or a bare ERDDAP server url that will be transformed into an ERDDAPConnection can be passed in.
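To make the connection idea a bit more concrete, here is a minimal sketch under a few assumptions. The fetch field, with_server, and as_connection are hypothetical names; a plain Python property setter cannot return a new object, so the "return a new connection" behaviour is shown as a method. The default transport uses the standard library only to keep the sketch self-contained, whereas the proposal suggests requests.

```python
from dataclasses import dataclass, replace
from typing import Callable, Union
from urllib.request import urlopen


def _default_fetch(url: str) -> bytes:
    # Stand-in transport from the standard library; the proposal itself
    # suggests requests as the default.
    with urlopen(url) as response:
        return response.read()


@dataclass(frozen=True)
class ERDDAPConnection:
    server: str
    fetch: Callable[[str], bytes] = _default_fetch  # swappable IO (assumption)

    def with_server(self, server: str) -> "ERDDAPConnection":
        """Return a new connection instead of mutating this one."""
        return replace(self, server=server)

    def get(self, url_part: str) -> bytes:
        return self.fetch(self.server.rstrip("/") + "/" + url_part.lstrip("/"))


def as_connection(connection: Union[str, ERDDAPConnection]) -> ERDDAPConnection:
    """Accept either a bare server URL or an existing connection."""
    if isinstance(connection, ERDDAPConnection):
        return connection
    return ERDDAPConnection(server=connection)


# Demonstration with an in-memory transport, so nothing is requested over the network.
conn = ERDDAPConnection("http://neracoos.org/erddap/", fetch=lambda url: url.encode())
other = conn.with_server("https://example.org/erddap")
print(conn.get("tabledap/A01_met.csv"))
```

Swapping fetch for an httpx-based or authenticated callable would be one way to override IO without touching URL generation.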
- class ERDDAPServer
  - .__init__(connection: str | ERDDAPConnection)
  - .full_text_search(query: str) -> dict[str, ERDDAPDataset]
    - Use the native ERDDAP full text search capabilities.
    - Returns a dictionary of search results with dataset ids as keys and ERDDAPDataset values.
  - .search(query: str) -> dict[str, ERDDAPDataset]
    - Points to .full_text_search
  - .advanced_search(**kwargs) -> dict[str, ERDDAPDataset]
    - Uses ERDDAP's advanced search capabilities (may return pre-filtered datasets).
- class ERDDAPDataset
  - Base class for more focused table or grid datasets.
  - @property(connection)
    - Underlying ERDDAPConnection
  - .get(file_type: str) -> bytes or str
    - Requests the data using the .connection.get() method.
  - .open(file_type: str) -> fp
    - Yields a file-like object for access.
  - .get_meta()
    - Pulls the dataset info and caches it on the _meta attribute.
  - ._meta
    - Set by .get_meta()
    - Passed when a setter returns a subclass.
    - .attrs -> pd.DataFrame
      - Dataframe of dataset attributes.
    - .variables -> dict
      - Dictionary of variables as keys, and maximum extent of constraints as values.
  - @property(meta)
    - Returns the ._meta values, and will call .get_meta() if they are not already cached.
  - @property(variables)
    - Lists the variables currently requested from the dataset.
    - Setting variables returns a new ERDDAPDataset subclass.
    - If _meta is cached and an invalid variable is set, throw a KeyError instead of returning.
  - @property(constraints)
    - Returns the current constraints on the dataset.
    - Setting constraints returns a new ERDDAPDataset subclass.
    - If _meta is cached and an invalid constraint is set, throw a KeyError instead of returning.
  - .url_segment(file_type: str) -> str
    - Everything but the base section of the URL (http://neracoos.org/erddap/), so tabledap/A01_met.csv....
  - .url(file_type: str) -> str
    - Returns a URL constructed using the underlying ERDDAPConnection base class server info, the dataset ID, access method (tabledap/griddap), file type, variables, and constraints.
    - This allows ERDDAPDataset subclasses to be used as more opinionated URL constructors while still not tying the users to a specific IO method.
    - Not guaranteed to capture all the specifics of formatting a request, such as if a server requires specific auth or headers.
  - .to_dataset()
    - Open the dataset as an xarray dataset by downloading a subset NetCDF.
  - .opendap_dataset()
    - Open the full dataset in xarray via OpenDAP.
- class TableDataset(ERDDAPDataset)
  - .to_dataframe()
    - Open the dataset as a Pandas DataFrame.
- class GridDataset(ERDDAPDataset)
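The "setting returns a new object" behaviour described for the dataset classes could be sketched roughly as follows. Everything here is an assumption for illustration: the class name DatasetSketch, the with_variables and with_constraints methods (used because a plain property setter cannot return a value), the meta_variables field standing in for a cached _meta, and the variable names used in the demonstration.

```python
from dataclasses import dataclass, replace
from typing import Optional
from urllib.parse import quote


@dataclass(frozen=True)
class DatasetSketch:
    server: str
    dataset_id: str
    protocol: str = "tabledap"
    variables: tuple = ()
    constraints: tuple = ()  # pairs such as ("time>=", "2020-12-09T15:25:00Z")
    meta_variables: Optional[frozenset] = None  # stands in for a cached _meta

    def _check(self, name):
        # Only validate when metadata has been cached, as the proposal describes.
        if self.meta_variables is not None and name not in self.meta_variables:
            raise KeyError(name)

    def with_variables(self, *names):
        for name in names:
            self._check(name)
        return replace(self, variables=tuple(names))

    def with_constraints(self, constraints):
        for key in constraints:
            self._check(key.rstrip("<>=!~"))  # drop the trailing operator
        return replace(self, constraints=self.constraints + tuple(constraints.items()))

    def url_segment(self, file_type):
        query = ",".join(self.variables)
        for key, value in self.constraints:
            if isinstance(value, str):
                value = f'"{value}"'
            query += f"&{key}{value}"
        return f"{self.protocol}/{self.dataset_id}.{file_type}?{quote(query, safe='')}"

    def url(self, file_type):
        return self.server.rstrip("/") + "/" + self.url_segment(file_type)


base = DatasetSketch(
    "http://neracoos.org/erddap/",
    "A01_met",
    meta_variables=frozenset({"time", "air_temperature"}),  # illustrative names
)
subset = base.with_variables("time", "air_temperature").with_constraints(
    {"time>=": "2020-12-09T15:25:00Z"}
)
print(subset.url("csv"))
```

Since every call returns a new frozen object, base is left untouched and can be reused for other subsets, and plain immutable fields like these should also keep the objects straightforward to pickle.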
In Practice
So how do these work in practice?
Let's look at a few different scenarios.
Interactive Search
Let's say that a user wants to find and query all datasets on a server that contain sea_water_temperature data.
First they initialize their server object.
This can be done by passing in the server URL, the short name of the server, or an ERDDAPConnection object if authentication or IO methods need to be overridden.
[1] from erddapy import ERDDAPServer
[2] server = ERDDAPServer("neracoos")
Then they can use the native ERDDAP full text search to find datasets.
[3] water_temp_datasets = server.search("sea_water_temperature")
    water_temp_datasets
[3] {"nefsc_emolt_erddap": <TableDataset ...>, "UCONN_ARTG_WQ_BTM": <TableDataset...>, ...}
From there the user can access datasets in a variety of ways depending on their needs.
[4] for dataset_id, dataset in water_temp_datasets.items():
        df = dataset.to_dataframe()
        # Whatever esoteric things fisheries people do with their dataframes