Deprecate remaining package metadata and add bulk data format #1084

@jpmckinney

There are issues with packages as discussed in open-contracting/infrastructure#89, #605, and CRM-4282 (all relevant comments reflected here), and the current packaging formats confer very few benefits.

Benefits

The benefits of the current packaging format are:

  1. A standardized way to publish multiple releases/records as a single file
  2. Easy access to metadata:
    • publisher
    • version and extensions
    • license and publicationPolicy

A package also sets uri and publishedDate, but these describe the package itself, not the releases/records it contains.

Discussion

Metadata

Regarding license and publicationPolicy, paraphrasing open-contracting/infrastructure#89:

  • License and publication policy metadata are important, but it isn't critical that they be distributed as part of the data. They can instead be expressed in the machine-readable description of the OCDS dataset in a data registry, using DCAT for example: DCAT has a property for license, and a property for publication policy can be added as an extension, as DCAT-US does for other properties.
  • Most open data formats (CSV, etc.) have no means of declaring a license or publication policy, but this poses no major problem for reuse: these are instead declared on the HTML pages that serve or link to the data. Users generally only need to refer to them once, so this doesn't disrupt data workflows.

See similar comments in #325 (comment)

As such, all metadata provided by the package can be omitted or moved to the release level without major issue.

Format

We still want a standardized way to publish multiple releases/records as a single file. A minimal package in the current format with all metadata removed would be:

{
  "releases": [
    // big list of releases
  ]
}

The problem with this format is that naive applications load the entire file into memory. Because bulk OCDS downloads can be very large (gigabytes), doing so exhausts memory on much consumer hardware. Iterative JSON parsers like ijson can seek to the releases array and yield one release at a time (as OCDS Kit does, for example); however, relatively few users are aware of such libraries, and many common data analysis tools (Pandas, for example) don't use them. Indeed, no OCDS software written by ODS uses iterative parsing, so critical tools like the Data Review Tool run out of memory on medium-to-large datasets; retrofitting these tools to parse iteratively is not trivial.

Any JSON format that puts releases/records in JSON arrays will suffer the same issue. The only reasonable options are:

  1. Line-delimited JSON
  2. ZIP files containing individual releases/records

There are other JSON streaming options besides line-delimited JSON, but:

  1. Line-delimited JSON has the widest support and is easy to publish and use, using common JSON libraries
  2. Record separator-delimited JSON is an unusual format that relies on the rarely used record separator character
  3. Concatenated JSON requires specialized JSON libraries
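
To make the difference in tooling concrete, here is a rough sketch that reads the same two releases from each of the three streaming formats using only Python's standard json module. The sample releases and the function names are hypothetical, and the record separator variant assumes the publisher prefixes each JSON text with U+001E and ends it with a newline.

```python
import json

# Hypothetical sample releases; real OCDS releases carry many more fields.
releases = [
    {"ocid": "ocds-213czf-1", "id": "1"},
    {"ocid": "ocds-213czf-2", "id": "2"},
]

line_delimited = "".join(json.dumps(r) + "\n" for r in releases)
rs_delimited = "".join("\x1e" + json.dumps(r) + "\n" for r in releases)
concatenated = "".join(json.dumps(r) for r in releases)


def read_line_delimited(text):
    """One JSON text per line: split on newlines and parse each line."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def read_rs_delimited(text):
    """Each JSON text is preceded by the record separator (U+001E)."""
    return [json.loads(chunk) for chunk in text.split("\x1e") if chunk.strip()]


def read_concatenated(text):
    """No delimiter at all: the decoder has to report where each text ends."""
    decoder = json.JSONDecoder()
    items, pos = [], 0
    while pos < len(text):
        item, pos = decoder.raw_decode(text, pos)
        items.append(item)
        # Skip any whitespace between consecutive JSON texts.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return items
```

The line-delimited reader is a one-liner that any JSON library can reproduce. The concatenated reader works here only because the whole text is already in memory as a string; reading it incrementally from a large file would need a parser designed for that.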

An advantage of a ZIP file is that it can contain additional information, e.g. a LICENSE.txt or publicationPolicy.pdf. However, OCDS datasets can contain millions of releases/records. Unless the publisher organizes them into directories somehow, the ZIP file will expand into millions of files, which is a barrier to use for many users.
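
As a sketch of the ZIP option, the following writes one JSON file per release alongside a LICENSE.txt, and then reads the releases back one member at a time with Python's standard zipfile module. The directory layout (releases/<ocid>/<id>.json) and the function names are assumptions made for illustration; nothing here is a proposed convention.

```python
import io
import json
import zipfile


def write_zip(releases, license_text):
    """Return ZIP bytes containing LICENSE.txt and one JSON file per release."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("LICENSE.txt", license_text)
        for release in releases:
            # Hypothetical layout: group each contracting process's releases
            # under a directory named after its ocid.
            name = f"releases/{release['ocid']}/{release['id']}.json"
            archive.writestr(name, json.dumps(release))
    return buffer.getvalue()


def iter_zip_releases(data):
    """Yield releases one at a time without extracting the archive to disk."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.startswith("releases/") and name.endswith(".json"):
                with archive.open(name) as member:
                    yield json.load(member)
```

Reading in place like this sidesteps the problem of expanding millions of files, but it assumes the user writes code against the archive rather than unzipping it first.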

A single (large) line-delimited JSON file is easier to work with by comparison.

Proposal

Deprecate packages, and recommend publication of OCDS releases/records as line-delimited JSON.
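
A minimal sketch of what publishing and consuming line-delimited JSON could look like, using only Python's standard library. The function names are hypothetical, and the sketch assumes one release (or record) per line with no enclosing object.

```python
import io
import json


def write_jsonl(releases, stream):
    """Write one release per line."""
    for release in releases:
        # json.dumps emits no raw newlines by default, so each release
        # is guaranteed to occupy exactly one line.
        stream.write(json.dumps(release, separators=(",", ":")) + "\n")


def iter_jsonl(stream):
    """Yield one release at a time; only the current line is held in memory."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# Usage with an in-memory stream; a real publisher would open a file instead.
buffer = io.StringIO()
write_jsonl([{"ocid": "ocds-213czf-1", "id": "1"}, {"ocid": "ocds-213czf-2", "id": "2"}], buffer)
buffer.seek(0)
ocids = [release["ocid"] for release in iter_jsonl(buffer)]
```

Because the reader is a plain loop over lines, memory use depends on the largest single release rather than on the size of the file.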

Labels: Focus - Packages (relating to release packages and record packages); Schema (relating to other changes in the JSON Schema: renamed fields, schema properties, etc.)
