RFC: JSON-LD / Schema.org mapping #378

Closed
@mojodna

As part of my work on STAC Browser, I've just merged preliminary JSON-LD support intended to facilitate indexing, searching, and display by Google Dataset Search.

I've tried to follow their guidelines, mapping Catalogs and Collections to schema.org DataCatalogs and Items to Datasets.

Catalog / Collection → DataCatalog

{
  "@context": "https://schema.org/",
  "@type": "DataCatalog",

  // required
  name: catalog.title,
  description: catalog.description, // as HTML

  // recommended
  identifier: catalog.properties["sci:doi"] || catalog.id,
  citation: catalog.properties["sci:citation"], // if available
  keywords: catalog.keywords,
  isBasedOn: catalog.url, // canonical STAC catalog URL (JSON)
  version: catalog.version,
  url: <STAC Browser URL>,
  // if available
  workExample: catalog.properties["sci:publications"].map(p => ({
    identifier: p.doi,
    citation: p.citation
  })),

  // if license is "proprietary"
  license: catalog.links.find(x => x.rel === "license").href,
  // if license is SPDX-compatible
  license: `https://spdx.org/licenses/${catalog.license}.html`,

  // if a spatial extent is available
  spatialCoverage: {
    "@type": "Place",
    geo: {
      "@type": "GeoShape",
      box: catalog.extent.spatial.join(" ")
    }
  },

  // if a temporal extent is available
  temporalCoverage: catalog.extent.temporal.map(x => x || "..").join("/"),

  // if a parent catalog is defined:
  isPartOf: {
    "@type": "DataCatalog",
    name: parent.title || parent.id, // if available
    isBasedOn: parent.url,
    url: <STAC Browser URL>
  },

  // for each child catalog:
  hasPart: {
    "@type": "DataCatalog",
    name: child.title,
    isBasedOn: child.url,
    url: <STAC Browser URL>
  },

  // for each referenced item:
  dataset: {
    identifier: item.id, // if available; requires loading the Item
    name: item.properties.title || item.id, // if available; requires loading the Item
    isBasedOn: item.url,
    url: <STAC Browser URL>
  }
}
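
The conditional parts of the mapping above (license, spatial extent, temporal extent) can be sketched as small helpers. This is a minimal sketch, not the STAC Browser code: the function names `licenseUrl`, `spatialCoverage`, and `temporalCoverage` are hypothetical, and the input shapes are assumed to match the fields used in the mapping.

```javascript
// Hypothetical helpers; names are illustrative and not taken from STAC Browser.

// Resolve a license URL: follow the "license" link for proprietary licenses,
// otherwise assume the value is an SPDX identifier.
function licenseUrl(entity) {
  if (!entity.license) return undefined;
  if (entity.license === "proprietary") {
    const link = (entity.links || []).find(x => x.rel === "license");
    return link ? link.href : undefined;
  }
  return `https://spdx.org/licenses/${entity.license}.html`;
}

// Build spatialCoverage from a flat array of coordinates, joined in their
// stored order exactly as in the mapping above.
function spatialCoverage(bbox) {
  if (!Array.isArray(bbox) || bbox.length === 0) return undefined;
  return {
    "@type": "Place",
    geo: { "@type": "GeoShape", box: bbox.join(" ") }
  };
}

// Build a "start/end" interval string; a missing end point becomes "..".
function temporalCoverage(interval) {
  if (!Array.isArray(interval) || interval.length === 0) return undefined;
  return interval.map(x => x || "..").join("/");
}
```

Returning `undefined` for missing inputs means the key is dropped by `JSON.stringify`, which matches the "if available" comments in the mapping.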

providers are mapped according to roles (when multiple roles are specified, the provider is duplicated):

  • licensor → copyrightHolder
  • producer → producer
  • processor → contributor
  • host → provider

and rendered as:

{
  // ...
  [mapped role]: {
    description: provider.description, // if available
    name: provider.name,
    url: provider.url // if available
  }
}
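
The role-based duplication described above could look roughly like the following. This is a sketch under stated assumptions: `ROLE_TO_PROPERTY` and `mapProviders` are hypothetical names, each provider is assumed to carry a `roles` array, and entries are collected into arrays so that several providers can share one mapped property (the rendering above shows a single object per role).

```javascript
// Role mapping as listed above (STAC provider role -> schema.org property).
const ROLE_TO_PROPERTY = {
  licensor: "copyrightHolder",
  producer: "producer",
  processor: "contributor",
  host: "provider"
};

// Hypothetical helper: group providers under their mapped properties.
// A provider with several roles appears once under each mapped property.
function mapProviders(providers = []) {
  const result = {};
  for (const provider of providers) {
    for (const role of provider.roles || []) {
      const property = ROLE_TO_PROPERTY[role];
      if (!property) continue; // unknown roles are skipped in this sketch
      const entry = { name: provider.name };
      if (provider.description) entry.description = provider.description;
      if (provider.url) entry.url = provider.url;
      (result[property] = result[property] || []).push(entry);
    }
  }
  return result;
}
```

The result can be spread into the DataCatalog object, for example `{ ...jsonLd, ...mapProviders(catalog.providers) }`.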

Item → Dataset

{
  "@context": "https://schema.org/",
  "@type": "Dataset",

  // required
  name: item.properties.title || item.id,
  description: item.properties.description, // if available

  // recommended
  identifier: item.properties["sci:doi"] || item.id,
  citation: item.properties["sci:citation"], // if available
  keywords: collection.keywords || rootCatalog.keywords, // inherit collection / root catalog keywords, if available
  // if license is "proprietary"
  license: [...item.links, ...collection.links, ...rootCatalog.links].find(x => x.rel === "license").href,
  // if license is SPDX-compatible
  license: `https://spdx.org/licenses/${item.properties["item:license"] || collection.license || rootCatalog.license}.html`,
  isBasedOn: item.url, // canonical STAC item URL (JSON)
  url: <STAC Browser URL>,
  // if available
  workExample: item.properties["sci:publications"].map(p => ({
    identifier: p.doi,
    citation: p.citation
  })),
  image: item.assets.thumbnail,

  // for associated collections + parent catalogs
  includedInDataCatalog: {
    isBasedOn: c.href,
    url: <STAC Browser URL>
  },

  spatialCoverage: {
    "@type": "Place",
    geo: {
      "@type": "GeoShape",
      box: item.bbox.join(" ")
    }
  },

  temporalCoverage: item.properties["dtr:start_datetime"]
    ? [
        item.properties["dtr:start_datetime"],
        item.properties["dtr:end_datetime"]
      ]
        .map(x => x || "..")
        .join("/")
    : item.properties.datetime,

  // for each asset in item.assets
  distribution: {
    contentUrl: asset.href,
    fileFormat: asset.type,
    name: asset.title
  }
}
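
The two per-Item pieces that involve some logic, `temporalCoverage` and `distribution`, can be sketched as follows. The helper names `itemTemporalCoverage` and `distributions` are hypothetical, and `assets` is assumed to be an object keyed by asset name, as in a STAC Item.

```javascript
// Hypothetical helper: use the datetime-range fields when a start is present,
// otherwise fall back to the single datetime, as in the mapping above.
function itemTemporalCoverage(properties) {
  const start = properties["dtr:start_datetime"];
  const end = properties["dtr:end_datetime"];
  if (start) {
    return [start, end].map(x => x || "..").join("/");
  }
  return properties.datetime;
}

// Hypothetical helper: one distribution entry per asset.
function distributions(assets = {}) {
  return Object.values(assets).map(asset => ({
    contentUrl: asset.href,
    fileFormat: asset.type,
    name: asset.title
  }));
}
```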

This implementation is live (with pre-rendered HTML) at https://planet.stac.cloud. I've submitted the sitemap, so hopefully Google, including Dataset Search, will index it more fully in the coming days, at which point we can see how well this mapping renders.

Meanwhile, the OpenLink Structured Data Sniffer extension for Chrome will extract JSON-LD to allow inspection.
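
For anyone inspecting the output by hand: JSON-LD is commonly embedded in a page as a `script` element of type `application/ld+json`. The sketch below shows one way to produce such an element from a mapped object. `jsonLdScriptTag` is a hypothetical helper, not the STAC Browser implementation.

```javascript
// Hypothetical helper: serialize a JSON-LD object into a script element.
// "<" is escaped as \u003c so that string values (for example an HTML
// description) cannot close the script element early.
function jsonLdScriptTag(doc) {
  const json = JSON.stringify(doc).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}
```

The escaped form is still valid JSON, so parsers recover the original strings unchanged.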

Thoughts?

Refs #285
