Consider best practices for developing a controlled vocabulary

I just saw your publication (https://doi.org/10.1016/j.jrt.2025.100110) and was happy to see it cite UBERON and other biomedical ontologies as inspiration and justification for constructing the data hazards vocabulary. However, I found that it did not consider the best practices used to construct such ontologies and controlled vocabularies. These are two sources that those communities consider very important:

- [OBO Foundry in 2021: operationalizing open data principles to evaluate ontologies (`doi:10.1093/database/baab069`)](https://doi.org/10.1093/database/baab069) and the actively maintained web-based overview of the [OBO Foundry Principles](https://obofoundry.org/principles/fp-000-summary.html)
- [Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data (`doi:10.1371/journal.pbio.2001414`)](https://doi.org/10.1371/journal.pbio.2001414)

Personally, I would like to see the following things:

1. Assign each data hazard a "persistent identifier". This means the following things:
   - decide on a "prefix" to use when referencing entities from this controlled vocabulary in semantic web/linked data contexts. Usually this is an acronym. For this project, maybe it should be `datahazard`
   - decide on a local unique identifier scheme. OBO ontologies usually use numbers zero-padded to seven digits. Some other resources just use bare sequential numbers. I'd suggest using the OBO style, so https://datahazards.com/hazards/general-hazard.html might get `0000001`
2. Use these identifiers to construct semantic web-/linked data-ready URLs
   - decide on a "persistent URL" scheme. This means that you have a URL that ends with a local unique identifier, so something like https://datahazards.com/hazards/0000001
   - once you have these things, you can also tell people they can write "compact URIs" like `datahazard:0000001` that a centralized authority like the [Bioregistry](https://bioregistry.io/) (disclaimer: this is my project) can expand to the URL above
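
To make the two steps above concrete, here is a minimal sketch in Python of minting an OBO-style local identifier and expanding a compact URI into a persistent URL. The `datahazard` prefix and the `https://datahazards.com/hazards/` URL pattern are only the proposals from this issue, not something that exists yet, and the function names are hypothetical. It only mimics locally what a resolver would do once a prefix is registered.

```python
import re

# Assumptions from this issue, not existing infrastructure:
PREFIX = "datahazard"                             # proposed prefix
URI_PREFIX = "https://datahazards.com/hazards/"   # proposed persistent URL base
LOCAL_ID_RE = re.compile(r"[0-9]{7}")             # OBO style: exactly seven digits


def mint_local_id(n: int) -> str:
    """Zero-pad a sequential number to a seven-digit local unique identifier."""
    if not 1 <= n <= 9_999_999:
        raise ValueError(f"cannot represent {n} in seven digits")
    return f"{n:07d}"


def to_curie(local_id: str) -> str:
    """Build a compact URI (CURIE) from a local unique identifier."""
    if not LOCAL_ID_RE.fullmatch(local_id):
        raise ValueError(f"invalid local identifier: {local_id!r}")
    return f"{PREFIX}:{local_id}"


def expand_curie(curie: str) -> str:
    """Expand a CURIE such as datahazard:0000001 into its persistent URL."""
    prefix, sep, local_id = curie.partition(":")
    if sep != ":" or prefix != PREFIX or not LOCAL_ID_RE.fullmatch(local_id):
        raise ValueError(f"not a valid {PREFIX} CURIE: {curie!r}")
    return URI_PREFIX + local_id


if __name__ == "__main__":
    curie = to_curie(mint_local_id(1))
    print(curie)                # datahazard:0000001
    print(expand_curie(curie))  # https://datahazards.com/hazards/0000001
```

Because the identifier carries no label text, a hazard can later be renamed or translated without changing its CURIE or URL.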

I am happy to contribute to this repo to make this work. It looks like the single source of truth is the markdown files, where this metadata could be added.

Update: I'm not sure if Sphinx supports frontmatter, so would you be open to renaming the documents in https://github.com/very-good-science/data-hazards/tree/main/site/hazards to use numeric identifiers instead? Of course, it's also possible to build a controlled vocabulary using ad-hoc text as identifiers, but this doesn't follow good practice (as outlined in Identifiers for the 21st century), since it makes it difficult to relabel items later, difficult to internationalize, etc.
