Description
I just saw your publication (https://doi.org/10.1016/j.jrt.2025.100110) and was happy to see it reference UBERON and other biomedical ontologies as inspiration and justification for constructing the data hazards vocabulary. However, I found that it did not consider the best practices used to construct such ontologies and controlled vocabularies. Two sources that these communities consider very important are:
- OBO Foundry in 2021: operationalizing open data principles to evaluate ontologies (doi:10.1093/database/baab069) and the actively maintained web-based overview of the OBO Foundry Principles
- Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data (doi:10.1371/journal.pbio.2001414)
Personally, I would like to see the following things:
- Assign each data hazard a "persistent identifier". This means the following things:
  - decide on a "prefix" that should be used when referencing entities from this controlled vocabulary in semantic web/linked data. Usually this is an acronym. For this project, maybe it should be `datahazard`
  - decide on a local unique identifier scheme. OBO ontologies usually use numbers zero-padded to seven digits. Some other resources just use bare sequential numbers. I'd suggest the OBO style, so https://datahazards.com/hazards/general-hazard.html might get `0000001`
- Use these identifiers to construct semantic web-/linked data-ready URLs
  - decide on a "persistent URL" scheme. This means that you have a URL that ends with a local unique identifier, so something like https://datahazards.com/hazards/0000001
  - once you have these things, you can also tell people they can write "compact URIs" like `datahazard:0000001` that a centralized authority like the Bioregistry (disclaimer: this is my project) can expand to the URL above
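
To make the pieces above concrete, here is a minimal sketch of how the prefix, the seven-digit local unique identifier, the persistent URL, and the compact URI relate to each other. It assumes the `datahazard` prefix and the `https://datahazards.com/hazards/` URL scheme suggested in this issue, neither of which has been agreed on or registered yet. The function names are hypothetical, and this is not how the Bioregistry itself is implemented.

```python
import re

# Assumptions taken from the suggestions in this issue, not from any existing registry.
PREFIX = "datahazard"
URI_PREFIX = "https://datahazards.com/hazards/"
LOCAL_ID = re.compile(r"[0-9]{7}")  # OBO-style: exactly seven digits


def mint_local_id(number: int) -> str:
    """Zero-pad a sequential number into a seven-digit local unique identifier."""
    if not 0 < number < 10_000_000:
        raise ValueError(f"{number} does not fit in seven digits")
    return f"{number:07d}"


def to_curie(local_id: str) -> str:
    """Build a compact URI such as datahazard:0000001."""
    if not LOCAL_ID.fullmatch(local_id):
        raise ValueError(f"invalid local identifier: {local_id!r}")
    return f"{PREFIX}:{local_id}"


def expand(curie: str) -> str:
    """Expand a compact URI into its persistent URL."""
    prefix, _, local_id = curie.partition(":")
    if prefix != PREFIX or not LOCAL_ID.fullmatch(local_id):
        raise ValueError(f"not a valid {PREFIX} CURIE: {curie!r}")
    return URI_PREFIX + local_id


def compress(url: str) -> str:
    """Turn a persistent URL back into a compact URI."""
    if not url.startswith(URI_PREFIX):
        raise ValueError(f"URL does not start with {URI_PREFIX}")
    return to_curie(url[len(URI_PREFIX):])
```

With these definitions, `expand(to_curie(mint_local_id(1)))` gives `https://datahazards.com/hazards/0000001`, and `compress` reverses that. In practice the expansion would be delegated to a registry rather than hard-coded like this.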
I am happy to contribute to this repo to make this work. It looks like the single source of truth is the markdown files, where this metadata could be added.
Update: I'm not sure whether Sphinx supports frontmatter, so would you be open to renaming the documents in https://github.com/very-good-science/data-hazards/tree/main/site/hazards to use numeric identifiers instead? Of course, it's also possible to build a controlled vocabulary using ad-hoc text labels as identifiers, but that doesn't follow good practice (as outlined in Identifiers for the 21st century), since it makes items difficult to relabel later, difficult to internationalize, etc.
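
As a rough illustration of what such a rename could look like, here is a small sketch that plans a mapping from the existing label-based file names to numeric ones. It assumes the hazard documents are `.md` files named after their labels. `hypothetical-hazard.md` and `another-hypothetical-hazard.md` are made-up names for the example, and `plan_renames` is a hypothetical helper rather than anything in the repo. The important property is that an identifier, once assigned, is never reassigned, so the label can change later without breaking references.

```python
from pathlib import PurePosixPath


def plan_renames(filenames, assigned=None):
    """Plan renames from label-based file names to numeric identifiers.

    `assigned` maps a label-based stem to an identifier handed out earlier;
    those assignments are reused so that identifiers stay persistent.
    Returns (rename plan, updated assignments).
    """
    assigned = dict(assigned or {})
    next_number = max((int(v) for v in assigned.values()), default=0) + 1
    plan = {}
    for name in filenames:
        path = PurePosixPath(name)
        if path.stem not in assigned:
            # New document: mint the next seven-digit identifier.
            assigned[path.stem] = f"{next_number:07d}"
            next_number += 1
        plan[name] = assigned[path.stem] + path.suffix
    return plan, assigned


# File names other than general-hazard are invented for this example.
plan, assigned = plan_renames(["general-hazard.md", "hypothetical-hazard.md"])
```

The `assigned` mapping would need to be stored somewhere in the repo (for example alongside the documents) so that later runs keep earlier assignments. The old label-based URLs would also need redirects to the new ones, which this sketch does not cover.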