GSoC 2022 ideas #228

@ocefpaf

Some users are looking for tools to help them assemble ERDDAP urls for use in their own workflows, while others would prefer to work at a higher, more opinionated level. I believe we can more cleanly separate functionality to help support the spectrum of erddapy users better.

Originally erddapy was meant to be a URL builder only. We added the main class later and stopped somewhere in between, trying to support many different usage patterns via the single primary class ERDDAP.

Issues to address

  • Users in interactive workflows are required to transform ERDDAP objects as they wish to connect to new datasets, for example when moving from searching a server to visiting the datasets.
  • Adding constraints to an ERDDAP object is a stateful, in-place change, whereas most interactive users are used to NumPy/pandas/xarray-style workflows where changes are returned and can be chained together.
  • Switching out IO is currently non-trivial due to URL generation and data transformations being tightly coupled to IO.

Proposed Solution

I propose that we separate erddapy into more functional layers, roughly following the SQLAlchemy core/ORM model.

Core Layer

The core layer would contain two primary components of functionality: URL generation and data transformation. This layer makes no choices or assumptions about IO, allowing it to be reused easily.

  • URL generation - Functions to generate valid URLs from bare components, such as a dataset name, format, and dictionary of constraints to tabledap/M01_met_all.csv?time%2Cair_temperature%26air_temperature_qc%3D0%26time>%3D"2020-12-09T15%3A25%3A00.000Z"
  • Data transformation - Functions to convert a raw response (.csv, .nc, ...) into Pandas DataFrames, xarray Datasets...
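As a rough illustration of the URL generation half of the core layer, here is a minimal sketch of a pure function with no IO. The function name `tabledap_url_segment`, its signature, and the rule of wrapping string values in double quotes are assumptions made for this example, not existing erddapy API. It percent-encodes the whole query, mirroring the style of the example above; whether a given server accepts that exact encoding is not verified here.

```python
from urllib.parse import quote


def tabledap_url_segment(dataset_id, file_type, variables, constraints):
    """Hypothetical core-layer helper: build the URL part after the server base.

    `constraints` maps "variable + operator" keys (e.g. "time>=") to values.
    """
    query_parts = [",".join(variables)]
    for key, value in constraints.items():
        # Assumption for this sketch: strings are double-quoted, numbers are bare.
        formatted = f'"{value}"' if isinstance(value, str) else str(value)
        query_parts.append(f"{key}{formatted}")
    # Percent-encode everything in the query, including "," and "&".
    query = quote("&".join(query_parts), safe="")
    return f"tabledap/{dataset_id}.{file_type}?{query}"


segment = tabledap_url_segment(
    "M01_met_all",
    "csv",
    ["time", "air_temperature"],
    {"air_temperature_qc=": 0, "time>=": "2020-12-09T15:25:00.000Z"},
)
print(segment)
```

Because the function only returns a string, the caller stays free to fetch it with whatever IO library they prefer.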

Object (or opinionated) Layer

The object (or opinionated) layer would present higher level objects for searching servers and accessing datasets, with a pandas- or xarray-like API in which changes are returned and can be chained, in contrast to the in-place, transformational API of the current ERDDAP class.
This layer uses much of the core functionality and presents it in easy to use ways with an opinion as to the access method.

Additionally if possible these objects should be serializable, so they can be pickled and passed to other processes/machines (Dask/Dagster/Prefect).

  • class ERDDAPConnection
    • While most ERDDAP servers allow connections via a bare url, some servers may require authentication to access data.
    • .get(url_part: str) -> bytes or str
      • Method that actually requests the data.
      • Uses requests by default similar to most of the current erddapy data fetching functionality.
      • Can be overridden to use httpx, and potentially aiohttp or other async functionality, which could hopefully make anything else async compatible. (investigate await_me_maybe)
    • .open(url_part: str) -> fp
      • Yields a file-like object for access (probably use fsspec.open under the hood) for file types/tools that don't enjoy getting passed a string.
    • @property(server) -> ERDDAPConnection
      • Return a new ERDDAPConnection if trying to set a new server, or change other attributes rather than changing it in place.
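A minimal sketch of what such a connection could look like, under some assumptions. A plain Python property setter cannot return a value, so this sketch swaps in a hypothetical `with_server` method for the "setting returns a new connection" behaviour. The `fetch` field is also an assumption: it is an injectable callable so that requests can be replaced by another client, or by a fake in tests.

```python
from dataclasses import dataclass, replace
from typing import Callable


def _requests_fetch(url: str) -> bytes:
    # Imported lazily so the sketch can be used without requests installed.
    import requests

    response = requests.get(url)
    response.raise_for_status()
    return response.content


@dataclass(frozen=True)
class ERDDAPConnection:
    """Sketch only: an immutable connection with a swappable IO function."""

    server: str
    fetch: Callable[[str], bytes] = _requests_fetch

    def get(self, url_part: str) -> bytes:
        # Join the server base and the dataset-specific part of the URL.
        return self.fetch(self.server.rstrip("/") + "/" + url_part.lstrip("/"))

    def with_server(self, server: str) -> "ERDDAPConnection":
        # Return a modified copy instead of mutating this connection.
        return replace(self, server=server)


# Usage with a fake fetcher, so nothing touches the network.
requested = []


def fake_fetch(url: str) -> bytes:
    requested.append(url)
    return b"time,air_temperature\n"


conn = ERDDAPConnection("http://neracoos.org/erddap/", fetch=fake_fetch)
data = conn.get("tabledap/A01_met.csv")
other = conn.with_server("https://example.org/erddap")  # placeholder server
```

Keeping the object frozen means `conn` is unchanged after `with_server`, which is the chainable behaviour the proposal asks for.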

For all of the remaining classes, either an ERDDAPConnection or a bare ERDDAP server url that will be transformed into an ERDDAPConnection can be passed in.

  • class ERDDAPServer

    • .__init__(connection: str | ERDDAPConnection)
    • .full_text_search(query: str) -> dict[str, ERDDAPDataset]
      • Use the native ERDDAP full text search capabilities
      • Returns a dictionary of search results with dataset ids as keys and ERDDAPDataset values.
    • .search(query: str) -> dict[str, ERDDAPDataset]
      • Points to .full_text_search
    • .advanced_search(**kwargs) -> dict[str, ERDDAPDataset]
      • Uses ERDDAP's advanced search capabilities (may return pre-filtered datasets)
  • class ERDDAPDataset

    Base class for more focused table or grid datasets.

    • @property(connection)
      • Underlying ERDDAPConnection
    • .get(file_type: str) -> bytes or str
      • Requests the data using the .connection.get() method.
    • .open(file_type: str) -> fp
      • Yields a file-like object for access.
    • .get_meta()
      • Pulls the dataset info and caches it on the _meta attribute.
    • ._meta
      • Set by .get_meta()
      • Passed when a setter returns a subclass.
      • .attrs -> pd.DataFrame - DataFrame of dataset attributes.
      • .variables -> dict - Dictionary of variables as keys, and maximum extent of constraints as values.
    • @property(meta)
      • Returns the ._meta values, and will call .get_meta() if they are not already cached.
    • @property(variables)
      • Lists the variables currently requested from the dataset.
      • Setting variables returns a new ERDDAPDataset subclass.
      • If _meta is cached and an invalid variable is set, throw a KeyError instead of returning.
    • @property(constraints)
      • Returns the current constraints on the dataset.
      • Setting constraints returns a new ERDDAPDataset subclass.
      • If _meta is cached and an invalid constraint is set, throw a KeyError instead of returning.
    • .url_segment(file_type: str) -> str
      • Everything but the base section of the url (http://neracoos.org/erddap/), so tabledap/A01_met.csv....
    • .url(file_type: str) -> str
      • Returns a URL constructed using the underlying ERDDAPConnection base class server info, the dataset ID, access method (tabledap/griddap), file type, variables, and constraints.
      • This allows ERDDAPDataset subclasses to be used as more opinionated URL constructors while still not tying the users to a specific IO method.
      • Not guaranteed to capture all the specifics of formatting a request, such as if a server requires specific auth or headers.
    • .to_dataset() - Open the dataset as an xarray dataset by downloading a subset NetCDF.
    • .opendap_dataset() - Open the full dataset in xarray via OpenDAP.
  • class TableDataset(ERDDAPDataset)

    • .to_dataframe() - Open the dataset as a Pandas DataFrame.
  • class GridDataset(ERDDAPDataset)
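To make the "setting returns a new object" idea concrete, here is a small sketch of a chainable table dataset. Everything in it is hypothetical: the class name `TableDatasetSketch`, the `with_variables` and `with_constraints` methods (used instead of property setters, which cannot return values in Python), and the `known_variables` field that stands in for the cached `_meta`. Percent-encoding is left out to keep the example short.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, replace
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class TableDatasetSketch:
    """Hypothetical stand-in for TableDataset; not erddapy's actual API."""

    server: str
    dataset_id: str
    variables: tuple = ()
    constraints: tuple = ()  # (key, value) pairs, e.g. ("time>=", "2020-...")
    known_variables: Optional[frozenset] = None  # stands in for cached _meta

    def _check(self, names) -> None:
        # Only validate when metadata has been cached.
        if self.known_variables is None:
            return
        for name in names:
            if name not in self.known_variables:
                raise KeyError(name)

    def with_variables(self, *names: str) -> "TableDatasetSketch":
        self._check(names)
        return replace(self, variables=tuple(names))

    def with_constraints(self, new: Mapping[str, Any]) -> "TableDatasetSketch":
        # The variable name is the leading identifier of a key like "time>=".
        self._check(re.match(r"\w+", key).group(0) for key in new)
        merged = dict(self.constraints)
        merged.update(new)
        return replace(self, constraints=tuple(merged.items()))

    def url_segment(self, file_type: str) -> str:
        parts = [",".join(self.variables)]
        parts += [f"{key}{value}" for key, value in self.constraints]
        return f"tabledap/{self.dataset_id}.{file_type}?" + "&".join(parts)

    def url(self, file_type: str) -> str:
        return self.server.rstrip("/") + "/" + self.url_segment(file_type)


base = TableDatasetSketch(
    "http://neracoos.org/erddap/",
    "A01_met",
    # Assumed variable names, for illustration only.
    known_variables=frozenset({"time", "air_temperature"}),
)
subset = base.with_variables("time", "air_temperature").with_constraints(
    {"time>=": "2020-12-09T15:25:00Z"}
)
print(subset.url("csv"))
```

Each call returns a new frozen object, so `base` can be reused for other subsets, and an unknown variable raises `KeyError` before any request is made.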

In Practice

So how do these work in practice?
Let's look at a few different scenarios.

Interactive Search

Let's say that a user wants to find and query all datasets on a server that contain sea_water_temperature data.

First they initialize their server object.
This can be done by passing in the server URL, the short name of the server, or an ERDDAPConnection object if authentication or IO methods need to be overridden.

[1] from erddapy import ERDDAPServer

[2] server = ERDDAPServer("neracoos")

Then they can use the native ERDDAP full text search to find datasets.

[3] water_temp_datasets = server.search("sea_water_temperature")
    water_temp_datasets

[3] {"nefsc_emolt_erddap": <TableDataset ...>, "UCONN_ARTG_WQ_BTM": <TableDataset...>, ...}

From there the user can access datasets a variety of ways depending on their needs.

[4] for dataset_id, dataset in water_temp_datasets.items():
        df = dataset.to_dataframe()
        # Whatever esoteric things fisheries people do with their dataframes
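For a sense of how a search method might sit on top of swappable IO, here is a rough sketch. The class `ServerSketch` and its `fetch` argument are hypothetical, and the code assumes that ERDDAP's `search/index.csv` response has `Dataset ID`, `tabledap`, and `griddap` columns. To stay short it returns a plain "table"/"grid" label per dataset id instead of dataset objects.

```python
import csv
import io
from typing import Callable, Dict
from urllib.parse import quote


class ServerSketch:
    """Hypothetical server object; `fetch` is any callable returning text."""

    def __init__(self, base_url: str, fetch: Callable[[str], str]):
        self.base_url = base_url.rstrip("/")
        self.fetch = fetch

    def search(self, query: str) -> Dict[str, str]:
        url = f"{self.base_url}/search/index.csv?searchFor={quote(query)}"
        text = self.fetch(url)
        results = {}
        for row in csv.DictReader(io.StringIO(text)):
            # Assumption: a non-empty "tabledap" column marks a table dataset.
            kind = "table" if row.get("tabledap") else "grid"
            results[row["Dataset ID"]] = kind
        return results


def fake_fetch(url: str) -> str:
    # Canned response so the example runs without a network connection.
    return (
        "griddap,tabledap,Title,Dataset ID\n"
        ",http://example.org/erddap/tabledap/A01_met,Buoy A01,A01_met\n"
        "http://example.org/erddap/griddap/sst,,Sea surface temp,sst\n"
    )


server = ServerSketch("http://example.org/erddap/", fake_fetch)
found = server.search("sea_water_temperature")
for dataset_id, kind in found.items():
    print(dataset_id, kind)
```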
