Collections endpoint #386

Open. Wants to merge 15 commits into base: develop. Changes from 6 commits.
35 changes: 33 additions & 2 deletions optimade.rst
@@ -149,7 +149,7 @@ The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SH
For example, a :entry:`structures` entry is comprised by data that pertain to a single structure.

**Entry type**
Entries are categorized into types, e.g., :entry:`structures`, :entry:`calculations`, :entry:`references`.
Entries are categorized into types, e.g., :entry:`structures`, :entry:`calculations`, :entry:`references`, :entry:`collections`.
merkys marked this conversation as resolved.
Entry types MUST be named according to the rules for identifiers.

**Entry property**
@@ -196,6 +196,9 @@ The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SH
This is particularly relevant for the default JSON-based response format.
In this case, **field** refers to the name part of the name-value pairs of JSON objects.

**Collection**
A Collection defines a relationship between a group of Entry resources. A Collection can be used to store metadata that applies to all of the entries in the group, and to aggregate metadata from each entry in the group.
Member

I'm not sure I would define collections here, as the list is primarily external definitions of terminology that we are going to use in OPTIMADE, not terminology we are really defining ourselves (e.g. structure is not in this list).


Data types
----------

@@ -323,7 +326,7 @@ Index Meta-Database
A database provider MAY publish a special Index Meta-Database base URL. The main purpose of this base URL is to allow for automatic discoverability of all databases of the provider. Thus, it acts as a meta-database for the database provider's implementation(s).

The index meta-database MUST only provide the :endpoint:`info` and :endpoint:`links` endpoints, see sections `Info Endpoints`_ and `Links Endpoint`_.
It MUST NOT expose any entry listing endpoints (e.g., :endpoint:`structures`).
It MUST NOT expose any entry listing endpoints (e.g., :endpoint:`structures` and :endpoint:`collections`).
merkys marked this conversation as resolved.

These endpoints do not need to be queryable, i.e., they MAY be provided as static JSON files.
However, they MUST return the correct and updated information on all currently provided implementations.
@@ -1075,6 +1078,7 @@ Example:
},
"available_endpoints": [
"structures",
"collections",
"calculations",
"info",
"links"
@@ -2318,6 +2322,33 @@ structure\_features

- A structure having implicit atoms and using assemblies: :val:`["assemblies", "implicit_atoms"]`

Collections Entries
-------------------
A Collection is used to define groups of Entry resources. It can be used to store metadata that applies to all of the entries in the group, or metadata that is generated by aggregating fields from each of the entries in the group. The group of entries is defined using :field:`relationships` as described in the `Relationships`_ section.

An example use case would be to define a relationship between a collection of Structure entries that are all conceptually related (e.g., "A collection of FCC Al structures containing a single vacancy defect").

:entry:`collections` entries have the properties described in the section `Properties Used by Multiple Entry Types`_ as well as the following properties: `additional_metadata`_ and `aggregated_fields`_.

additional_metadata
~~~~~~~~~~~~~~~~~~~
- **Description**: Additional metadata that applies to all of the entries in `relationships`.
- **Type**: a dictionary
- **Requirements/Conventions**:

  - **Support**: OPTIONAL support in implementations, i.e., MAY be :val:`null`.
  - **Query**: support for queries on this property is OPTIONAL. If supported, only a subset of the filter features MAY be supported.
  - The keys should be short strings describing the type of metadata being supplied.
  - The values can be any string, which may be human-readable.
Member

I would prefer to go down the route of some defined (optional) fields for e.g. description, name, then let provider-specific fields do the rest of the work (which can then be described in /info/collections), e.g. _exmpl_dft_parameters if the collection defines a consistent set of DFT calculations.
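
As a rough illustration of this suggestion, a collection entry with optional `name` and `description` fields plus one provider-specific field might look like the sketch below. Everything in it is an assumption made for illustration: the `id` values, the exact field names `name` and `description`, and the contents of `_exmpl_dft_parameters` are not taken from the PR or from the specification.

```python
import json

# Hypothetical collection entry; every value below is illustrative only.
collection_entry = {
    "type": "collections",
    "id": "example-collection-1",  # made-up identifier
    "attributes": {
        # Assumed standard (optional) fields, as proposed in the comment.
        "name": "FCC Al single-vacancy structures",
        "description": "FCC Al structures containing a single vacancy defect",
        # Provider-specific field using the example prefix "_exmpl_";
        # its keys and values are invented for this sketch.
        "_exmpl_dft_parameters": {"functional": "PBE", "cutoff_ev": 500},
    },
    # Members of the collection, expressed as JSON:API-style relationships.
    "relationships": {
        "structures": {
            "data": [
                {"type": "structures", "id": "example-structure-1"},
                {"type": "structures", "id": "example-structure-2"},
            ]
        }
    },
}

print(json.dumps(collection_entry, indent=2))
```

A provider would then describe `_exmpl_dft_parameters` under `/info/collections`, so clients know how to interpret it.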


aggregated_fields
Member

Maybe properties instead of fields, or am I missing something?

Member

I think I need an example to see what this field is used for. Does it just aggregate field/property names but not values? Does every entry have to have a value for each field listed here?

Member Author, @jvita, Sep 24, 2021

I had only said "fields" (or "properties" seems fine too) instead of also including values because:

  1. I wasn't sure what the best way would be to specify the reduction operations over the values (for example, sometimes you might want to sum the values, other times you might want to build a set from a list of values, etc.)
  2. I wasn't sure when the values should be reduced. Should the reduction occur before the collection is uploaded, meaning the reduced values wouldn't change even if the linked entries were edited? This wouldn't seem ideal, but I also don't know if it's acceptable (or possible) to specify that the reduction would be performed every time the collection was accessed.

A basic example of this, which we've been using for the OpenKIM/ColabFit project, is to have a "StructuresCollection" that aggregates all of the `attributes.elements` fields of the linked structures to provide a single set of elements present in the collection. Something like `structure1.attributes.elements = ['C', 'Fe']`, `structure2.attributes.elements = ['Al']`, `collection.attributes.elements = ['Al', 'C', 'Fe']`. Another simple example would be to aggregate `attributes.nsites` to count the total number of sites in the collection.

> Does every entry have to have a value of each field listed here?

Though it's a bit restrictive, I think that I'd say yes, every entry should have a value for each of the aggregated fields. I think that a collection should be assumed to be homogeneous, but perhaps that could use some discussion.
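
The `elements` and `nsites` aggregation from the example above could be computed along these lines. This is a minimal sketch, not how OpenKIM/ColabFit implements it: the helper names and the `nsites` values are assumptions, and the reduction is performed once, when the collection attributes are built, which is only one of the two timing options discussed in point 2.

```python
# Hypothetical linked structures, reduced to the two attributes discussed here.
# The nsites values are invented for illustration.
structures = [
    {"id": "structure1", "attributes": {"elements": ["C", "Fe"], "nsites": 2}},
    {"id": "structure2", "attributes": {"elements": ["Al"], "nsites": 1}},
]


def aggregate_elements(entries):
    """Return the union of all `elements` lists as a sorted list."""
    found = set()
    for entry in entries:
        found.update(entry["attributes"]["elements"])
    return sorted(found)


def aggregate_nsites(entries):
    """Return the total number of sites over all entries."""
    return sum(entry["attributes"]["nsites"] for entry in entries)


collection_attributes = {
    "elements": aggregate_elements(structures),
    "nsites": aggregate_nsites(structures),
    # Names of the fields that were produced by aggregation.
    "aggregated_fields": ["elements", "nsites"],
}
print(collection_attributes)
```

Both reductions read every entry's value directly, so the sketch assumes the homogeneous case in which each linked entry provides every aggregated field.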

Contributor

I think you should update the collection every time one of the entries gets updated.
If entries get updated regularly, you could place relationships in each entry pointing to the collections they belong to. If this rarely happens, you could probably query all the collections to check whether a particular entry is in that collection and then update it. We don't have to specify how to update the data belonging to the collections in the specification though, only that it SHOULD be updated.
In some cases, however, it could be worthwhile to create a new structure rather than to update the existing one, for example when you want a collection you refer to in an article to stay the same.

Perhaps you could make a dictionary for each OPTIMADE property, which could, depending on the property, hold a set or the minimum, average and maximum value in the collection.

When making the properties for these collections I think it would be good to think about how you would search for collections.

The number of entries in your collection would probably also be a good property to include.

There is also the info endpoint, where you can specify which properties are shared for each endpoint.
For collections, it would be /info/collections. You therefore do not have to specify which properties are available for the collections. (I do have a field like that in the trajectories endpoint because in that case the fields do not need to be queryable.) If they are queryable, you could use the IS KNOWN query to check whether an entry has the particular field.
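
The per-property dictionary suggested above could look roughly like the following sketch. The function name, the keys `set`, `min`, `average`, `max` and `n_entries`, and the sample data are all assumptions for illustration; nothing here is defined by the specification.

```python
from statistics import mean


def summarize_property(values):
    """Summarize one property over the entries of a collection.

    List-valued properties (e.g. elements) become a sorted set of items;
    numeric properties (e.g. nsites) become min/average/max.
    """
    if all(isinstance(v, list) for v in values):
        return {"set": sorted({item for v in values for item in v})}
    return {"min": min(values), "average": mean(values), "max": max(values)}


# Hypothetical attribute values of two linked entries.
entries = [
    {"elements": ["C", "Fe"], "nsites": 2},
    {"elements": ["Al"], "nsites": 4},
]

summary = {
    name: summarize_property([entry[name] for entry in entries])
    for name in ("elements", "nsites")
}
# The number of entries in the collection, as also suggested above.
summary["n_entries"] = len(entries)
print(summary)
```

Summaries of this shape would also give a natural target for searching collections, for instance by the range of `nsites` or by the elements present.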

Contributor

I have difficulty seeing the utility of aggregated_fields being an OPTIMADE standardized property. There is of course no problem for OpenKIM serving, e.g., an _openkim_aggregated_elements that aggregates the values of the elements, etc.; but it just seems the definition means this field anyway needs to be interpreted differently depending on which database is being queried.

~~~~~~~~~~~~~~~~~~~~~~
- **Description**: Names of fields that were generated by aggregating over the corresponding fields in each of the entries specified in `relationships`.
- **Type**: a list of strings
- **Requirements/Conventions**:

  - **Support**: OPTIONAL support in implementations, i.e., MAY be :val:`null`.
  - **Query**: support for queries on this property is OPTIONAL. If supported, only a subset of the filter features MAY be supported.
  - Strings provided in this list should correspond to other queryable fields within the `collections` entry.

Calculations Entries
--------------------
