GSoC 2022 ideas

Some users are looking for tools to help them assemble ERDDAP URLs for use in their own workflows, while others would prefer to work at a higher, more opinionated level. I believe we can separate functionality more cleanly to better support the full spectrum of erddapy users.

Originally erddapy was meant to be a URL builder only. We added the main class later and ended up somewhere in between, trying to support many different usage patterns via the single primary class [`ERDDAP`](https://github.com/ioos/erddapy/blob/9653982fbfabbf6b9310ec356bcfa65519564291/erddapy/erddapy.py#L68).


## Issues to address

- Users in interactive workflows are required to transform `ERDDAP` objects as they wish to connect to new datasets, for example when moving from searching a server to visiting the datasets.
- Adding constraints to an `ERDDAP` object is a stateful, in-place change, whereas most interactive users are used to NumPy/Pandas/xarray style workflows where changes are returned and can be chained together.
- Switching out IO is currently non-trivial due to URL generation and data transformations being tightly coupled to IO.

## Proposed Solution

I propose that we separate erddapy into more functional layers, roughly following the [SQLAlchemy core/ORM](https://docs.sqlalchemy.org/en/14/intro.html#overview) model.

### Core Layer

The core layer would contain two primary components of functionality: URL generation and data transformation. This layer makes no choices or assumptions about IO, allowing it to be reused easily.

- URL generation - Functions to generate valid URLs from bare components, such as a dataset name, format, and dictionary of constraints to `tabledap/M01_met_all.csv?time%2Cair_temperature%26air_temperature_qc%3D0%26time>%3D"2020-12-09T15%3A25%3A00.000Z"`
- Data transformation - Functions to convert a raw response (.csv, .nc, ...) into Pandas DataFrames, xarray Datasets...
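
As a rough illustration of the URL generation half of the core layer, the sketch below builds a tabledap URL segment from bare components without doing any IO. The function name `build_tabledap_segment` is hypothetical, and the choice to percent-encode each constraint while leaving the `,` and `&` separators literal is an assumption for this example. Constraint keys carry their operator (`"time>="`), following the convention of the current `ERDDAP.constraints` dictionary. Quoting of string-valued constraints is left out.

```python
from urllib.parse import quote


def build_tabledap_segment(dataset_id, file_type, variables, constraints):
    """Return the part of a tabledap URL that follows the server base URL.

    Hypothetical helper: pure string building, no network access.
    """
    # Requested variables form the first, comma-separated part of the query.
    query_parts = [",".join(variables)]
    # Each constraint key carries its operator, e.g. "time>=".
    for key, value in constraints.items():
        query_parts.append(quote(key, safe="") + quote(str(value), safe=""))
    # Skip empty parts so an empty variable list does not leave a stray "&".
    query = "&".join(part for part in query_parts if part)
    return f"tabledap/{dataset_id}.{file_type}?{query}"


segment = build_tabledap_segment(
    "M01_met_all",
    "csv",
    ["time", "air_temperature"],
    {"air_temperature_qc=": 0, "time>=": "2020-12-09T15:25:00Z"},
)
print(segment)
```

Because the function only returns a string, the same segment could be handed to requests, httpx, fsspec, or anything else the caller prefers.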

### Object (or opinionated) Layer

The object (or opinionated) layer would present higher-level objects for searching servers and accessing datasets. Its API would be Pandas- or xarray-like, returning new objects that can be chained, in contrast to the in-place, transformational API of the current `ERDDAP` class.
This layer uses much of the core functionality and presents it in easy-to-use ways, with an opinion as to the access method.

Additionally, if possible, these objects should be serializable, so they can be pickled and passed to other processes/machines (Dask/Dagster/Prefect).

- `class ERDDAPConnection`
    - While most ERDDAP servers allow connections via a bare url, some servers may require authentication to access data.
    - `.get(url_part: str) -> bytes or str`
        - Method that actually requests the data.
        - Uses requests by default similar to most of the current erddapy data fetching functionality. 
        - Can be overridden to use httpx, and potentially aiohttp or other async functionality, which could hopefully make anything else async compatible. (investigate [await_me_maybe](https://simonwillison.net/2020/Sep/2/await-me-maybe/))
    - `.open(url_part: str) -> fp`
        - Yields a file-like object for access (probably use [`fsspec.open`](https://filesystem-spec.readthedocs.io/en/latest/usage.html#higher-level) under the hood) for file types/tools that don't enjoy getting passed a string.
    - `@property(server) -> ERDDAPConnection`
        - Return a new `ERDDAPConnection` if trying to set a new server, or change other attributes rather than changing it in place.
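
A minimal sketch of how such a connection object might look, under a few assumptions: a Python property setter cannot return a value, so a hypothetical `with_server` method stands in for the "setting returns a new connection" behaviour; the `fetch` callable is an invented hook for swapping the IO backend; and the default backend uses `urllib` from the standard library only to keep the example self-contained, where the proposal suggests requests.

```python
from dataclasses import dataclass, replace
from typing import Callable
from urllib.request import urlopen


def _default_fetch(url: str) -> bytes:
    """Default IO backend for this sketch (the proposal suggests requests)."""
    with urlopen(url) as response:
        return response.read()


@dataclass(frozen=True)
class ERDDAPConnection:
    """Hypothetical immutable connection: a server base URL plus a fetch callable."""

    server: str
    fetch: Callable[[str], bytes] = _default_fetch

    def get(self, url_part: str) -> bytes:
        """Join the server base URL and `url_part`, then request the data."""
        url = self.server.rstrip("/") + "/" + url_part.lstrip("/")
        return self.fetch(url)

    def with_server(self, server: str) -> "ERDDAPConnection":
        """Return a new connection for `server`, leaving this one untouched."""
        return replace(self, server=server)


# Swap in a fake backend that echoes the URL, so the example needs no network.
conn = ERDDAPConnection("http://example.com/erddap/", fetch=lambda url: url.encode())
other = conn.with_server("http://other.example/erddap")
print(conn.get("tabledap/A01_met.csv"))
print(other.get("info/index.json"))
```

Overriding `fetch` is one possible way to plug in httpx or an authenticated session; subclassing and overriding `.get` would work just as well.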

For all of the remaining classes, either an `ERDDAPConnection` or a bare ERDDAP server url that will be transformed into an `ERDDAPConnection` can be passed in.

- `class ERDDAPServer`

    - `.__init__(connection: str | ERDDAPConnection)`
    - `.full_text_search(query: str) -> dict[str, ERDDAPDataset]`
        - Use the native ERDDAP full text search capabilities
        - Returns a dictionary of search results with dataset ids as keys and `ERDDAPDataset` values.
    - `.search(query: str) -> dict[str, ERDDAPDataset]`
        - Points to `.full_text_search`
    - `.advanced_search(**kwargs) -> dict[str, ERDDAPDataset]`
        - Uses ERDDAP's advanced search capabilities (may return pre-filtered datasets)

- `class ERDDAPDataset`

    Base class for more focused table or grid datasets.

    - `@property(connection)`
        - Underlying `ERDDAPConnection`
    - `.get(file_type: str) -> bytes or str`
        - Requests the data using the `.connection.get()` method.
    - `.open(file_type: str) -> fp`
        - Yields a file-like object for access.
    - `.get_meta()`
        - Pulls the dataset info and caches it on the `_meta` attribute.
    - `._meta`
        - Set by `.get_meta()`
        - Passed when a setter returns a subclass.
        - `.attrs -> pd.DataFrame`- Dataframe of dataset attributes.
        - `.variables -> dict` - Dictionary of variables as keys, and maximum extent of constraints as values.
    - `@property(meta)`
        - Returns the `._meta` values, and will call `.get_meta()` if they are not already cached.
    - `@property(variables)`
        - Lists the variables currently requested from the dataset.
        - Setting `variables` returns a new `ERDDAPDataset` subclass.
        - If `_meta` is cached and an invalid variable is set, throw a `KeyError` instead of returning.
    - `@property(constraints)`
        - Returns the current constraints on the dataset.
        - Setting `constraints` returns a new `ERDDAPDataset` subclass.
        - If `_meta` is cached and an invalid constraint is set, throw a `KeyError` instead of returning.
    - `.url_segment(file_type: str) -> str`
        - Everything but the base section of the url (`http://neracoos.org/erddap/`), so `tabledap/A01_met.csv...`.
    - `.url(file_type: str) -> str`
        - Returns a URL constructed using the underlying `ERDDAPConnection` base class server info, the dataset ID, access method (tabledap/griddap), file type, variables, and constraints. 
        - This allows `ERDDAPDataset` subclasses to be used as more opinionated URL constructors while still not tying users to a specific IO method.
        - Not guaranteed to capture all the specifics of formatting a request, such as if a server requires specific auth or headers.
    - `.to_dataset()` - Open the dataset as an xarray dataset by downloading a subset NetCDF.
    - `.opendap_dataset()` - Open the full dataset in xarray via OpenDAP.

- `class TableDataset(ERDDAPDataset)`

    - `.to_dataframe()` - Open the dataset as a Pandas DataFrame.

- `class GridDataset(ERDDAPDataset)`
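
The returning, chainable behaviour described for `variables` and `constraints` could be sketched roughly as follows. Everything here is an assumption for illustration: the class name `DatasetSketch`, the `with_variables` / `with_constraints` methods (used because a property setter cannot return a new object), and `known_variables` as a stand-in for the cached `_meta`. The sketch does no IO and no percent-encoding.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DatasetSketch:
    """Hypothetical stand-in for `ERDDAPDataset`; holds state, does no IO."""

    dataset_id: str
    protocol: str = "tabledap"
    variables: tuple = ()
    constraints: tuple = ()  # (key, value) pairs; keys carry the operator
    known_variables: Optional[frozenset] = None  # stand-in for cached `_meta`

    def with_variables(self, *names):
        """Return a copy requesting `names`; validate only if metadata is cached."""
        if self.known_variables is not None:
            unknown = [name for name in names if name not in self.known_variables]
            if unknown:
                raise KeyError(f"unknown variables: {unknown}")
        return replace(self, variables=tuple(names))

    def with_constraints(self, new_constraints):
        """Return a copy with `new_constraints` merged over the existing ones."""
        merged = {**dict(self.constraints), **new_constraints}
        return replace(self, constraints=tuple(merged.items()))

    def url_segment(self, file_type):
        """Build the URL part after the server base (no percent-encoding here)."""
        parts = [",".join(self.variables)]
        parts += [f"{key}{value}" for key, value in self.constraints]
        query = "&".join(part for part in parts if part)
        return f"{self.protocol}/{self.dataset_id}.{file_type}?{query}"


base = DatasetSketch("A01_met", known_variables=frozenset({"time", "air_temperature"}))
subset = base.with_variables("time", "air_temperature").with_constraints(
    {"time>=": "2020-12-09T00:00:00Z"}
)
print(subset.url_segment("csv"))
```

Since each call returns a new frozen object, `base` stays unchanged and can be reused for other subsets, which is the main difference from the in-place `ERDDAP` class.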

### In Practice

So how do these work in practice?
Let's look at a few different scenarios.

#### Interactive Search

Let's say that a user wants to find and query all datasets on a server that contain `sea_water_temperature` data.

First they initialize their server object.
This can be done by passing in the server URL, the short name of the server, or an `ERDDAPConnection` object if authentication or IO methods need to be overridden.

```py
[1] from erddapy import ERDDAPServer

[2] server = ERDDAPServer("neracoos")
```

Then they can use the native ERDDAP full text search to find datasets.

```py
[3] water_temp_datasets = server.search("sea_water_temperature")
    water_temp_datasets

[3] {"nefsc_emolt_erddap": <TableDataset ...>, "UCONN_ARTG_WQ_BTM": <TableDataset...>, ...}
```

From there the user can access the datasets in a variety of ways, depending on their needs.

```py
[4] for dataset_id, dataset in water_temp_datasets.items():
        df = dataset.to_dataframe()
        # Whatever esoteric things fisheries people do with their dataframes
```
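
Behind the search step above, the server object would need to turn a raw search response into typed dataset objects. The sketch below shows one way that mapping might start, deciding between table and grid datasets per result. The function name is hypothetical, and the `table` / `columnNames` / `rows` layout and the `Dataset ID` and `tabledap` column names are assumptions about the JSON search response; the sample payload is made up.

```python
import json


def dataset_ids_by_protocol(payload):
    """Map each dataset id in a search response to "tabledap" or "griddap"."""
    table = json.loads(payload)["table"]
    names = table["columnNames"]
    id_col = names.index("Dataset ID")
    tabledap_col = names.index("tabledap")
    # Assumption: a non-empty tabledap cell means the dataset is tabular.
    return {
        row[id_col]: "tabledap" if row[tabledap_col] else "griddap"
        for row in table["rows"]
    }


# Made-up payload in the assumed shape; the URLs and "example_grid" are placeholders.
sample = json.dumps(
    {
        "table": {
            "columnNames": ["griddap", "tabledap", "Dataset ID"],
            "rows": [
                ["", "http://example.com/erddap/tabledap/UCONN_ARTG_WQ_BTM", "UCONN_ARTG_WQ_BTM"],
                ["http://example.com/erddap/griddap/example_grid", "", "example_grid"],
            ],
        }
    }
).encode()

print(dataset_ids_by_protocol(sample))
```

A real `ERDDAPServer.search` could then use the protocol to pick `TableDataset` or `GridDataset` for each id.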
