Consider best practices for developing a controlled vocabulary #222

Open
@cthoyt

I just saw your publication (https://doi.org/10.1016/j.jrt.2025.100110) and was happy to see it reference UBERON and other biomedical ontologies as inspiration and justification for constructing the data hazards vocabulary. However, it did not discuss the best practices used to construct such ontologies and controlled vocabularies. These are two sources that these communities find very important:

Personally, I would like to see the following things:

  1. Assign each data hazard a "persistent identifier". This means the following things:
    • Decide on a "prefix" to use when referencing entities from this controlled vocabulary in semantic web/linked data contexts. Usually this is an acronym. For this project, maybe it should be datahazard
    • Decide on a local unique identifier scheme. OBO ontologies usually use numbers zero-padded to seven digits. Some other resources just use bare sequential numbers. I'd suggest the OBO style, so https://datahazards.com/hazards/general-hazard.html might get 0000001
  2. Use these identifiers to construct semantic web/linked data-ready URLs
    • Decide on a "persistent URL" scheme. This means each URL ends with a local unique identifier, so something like https://datahazards.com/hazards/0000001
    • Once you have these things, you can also tell people they can write "compact URIs" (CURIEs) like datahazard:0000001, which a centralized authority like the Bioregistry (disclaimer: this is my project) can expand to the URL above
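To make the relationship between the prefix, the local unique identifier, the persistent URL, and the compact URI concrete, here is a minimal Python sketch. It assumes the `datahazard` prefix and the `https://datahazards.com/hazards/` URL base floated above, neither of which has been agreed on or registered anywhere, and the function names are purely illustrative.

```python
import re

# Assumptions for illustration only: neither the prefix nor the URL base
# has been decided on by the project.
PREFIX = "datahazard"
URI_PREFIX = "https://datahazards.com/hazards/"

# OBO-style local unique identifier: exactly seven ASCII digits.
LOCAL_ID_PATTERN = re.compile(r"[0-9]{7}")


def format_local_id(number: int) -> str:
    """Zero-pad a sequential number to a seven-digit local identifier."""
    if not 0 < number < 10_000_000:
        raise ValueError(f"number out of range for seven digits: {number}")
    return f"{number:07d}"


def expand_curie(curie: str) -> str:
    """Turn a compact URI like 'datahazard:0000001' into a persistent URL."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix != PREFIX:
        raise ValueError(f"unexpected prefix in CURIE: {curie!r}")
    if not LOCAL_ID_PATTERN.fullmatch(local_id):
        raise ValueError(f"invalid local identifier: {local_id!r}")
    return URI_PREFIX + local_id


def compress_url(url: str) -> str:
    """Turn a persistent URL back into its compact URI."""
    if not url.startswith(URI_PREFIX):
        raise ValueError(f"URL does not start with {URI_PREFIX!r}")
    local_id = url[len(URI_PREFIX):]
    if not LOCAL_ID_PATTERN.fullmatch(local_id):
        raise ValueError(f"invalid local identifier: {local_id!r}")
    return f"{PREFIX}:{local_id}"
```

With these assumptions, `expand_curie("datahazard:0000001")` gives `https://datahazards.com/hazards/0000001`, and `compress_url` reverses it. In practice a registry would hold the prefix-to-URL mapping so that consumers do not need to hard-code it.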

I am happy to contribute to this repo to make this work. It looks like the single source of truth is the markdown files, where this metadata could be added.

Update: I'm not sure if Sphinx supports frontmatter, so would you be open to renaming the documents in https://github.com/very-good-science/data-hazards/tree/main/site/hazards to use numeric identifiers instead? Of course, it's also possible to build a controlled vocabulary using ad-hoc text identifiers, but this doesn't follow good practice (as outlined in Identifiers for the 21st century) since it makes items difficult to relabel later, difficult to internationalize, etc.
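As a rough sketch of what the renaming could involve, the following Python assigns seven-digit identifiers to existing text slugs while leaving already-assigned identifiers untouched. The function name and the `another-hazard` slug are hypothetical, and only `general-hazard` comes from the current site. The point is that the identifier stays fixed even if a hazard is later relabeled or translated.

```python
def assign_identifiers(slugs, existing=None):
    """Map each slug to a stable, zero-padded seven-digit identifier.

    `existing` holds identifiers assigned in earlier runs; they are never
    changed or reused, so published identifiers remain persistent.
    """
    assigned = dict(existing or {})
    # Continue numbering after the highest identifier already in use.
    next_number = max((int(v) for v in assigned.values()), default=0) + 1
    for slug in slugs:
        if slug not in assigned:
            assigned[slug] = f"{next_number:07d}"
            next_number += 1
    return assigned


# First run: 'general-hazard' exists on the site, 'another-hazard' is made up.
first = assign_identifiers(["general-hazard", "another-hazard"])

# Later run with a newly added (hypothetical) hazard: earlier IDs are kept.
second = assign_identifiers(["new-hazard", "general-hazard"], existing=first)
```

A mapping like this could be stored alongside the markdown files, so each document can be renamed to its identifier while the human-readable label lives in the page content.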
