@gouttegd
Contributor

Resolves [#359]

  • [ ] docs/ have been added/updated if necessary documentation is embedded in the LinkML schema.
  • make test has been run locally
  • [ ] tests have been added/updated (if applicable)
  • CHANGELOG.md has been updated.

If you are proposing a change to the SSSOM metadata model, you must

  • provide a full, working and valid example in examples/
  • provide a link to the related GitHub issue in the see_also field of the linkml model
  • provide a link to a valid example in the see_also field of the linkml model
  • make sure any new slot is annotated with the appropriate added_in annotation
  • run SSSOM-Py test suite against the updated model

This PR adds a new slot to the Mapping class, record_id, intended to hold a unique identifier for a given mapping.

The slot is optional, so as not to break compatibility with existing SSSOM 1.0 sets. This also means that, while we can define the slot as a “unique key” for the Mapping class, we cannot define it as the “identifier”, because in LinkML identifier slots are automatically mandatory.

The identifier is intended to be completely opaque. How to generate identifiers is left to the producers of SSSOM sets, and no meaning of any sort should be assigned to an identifier.

@gouttegd gouttegd self-assigned this May 31, 2025
@gouttegd gouttegd requested a review from matentzn May 31, 2025 19:18
Add a new slot to the Mapping class, `record_id`, intended to hold a
unique identifier for a given mapping.

The slot is optional, so as not to break compatibility with existing
SSSOM 1.0 sets. This also means that, while we can define the slot as a
"unique key" for the Mapping class, we _cannot_ define it as the
"identifier", because in LinkML identifier slots are automatically
mandatory.

The identifier is intended to be completely opaque. How to generate
identifiers is left to the producers of SSSOM sets, and no meaning of
any sort should be assigned to an identifier.

closes #359
Member

@cthoyt cthoyt left a comment

There are a few things that come to mind that could make this a stronger proposal:

  1. Add a concrete use case(s) as part of the description. For example, JSKOS interoperability is a strong use case. So is supporting the creation of mapping databases like SeMRA and OXO2.
  2. Make more explicit what is meant by a "record" (does this mean only fields in a mapping, or fields that are part of a mapping + the mapping set it's in?)
  3. Add several more examples that demonstrate which fields are included in the record identity. For example, do extension fields get included in this? I think you would also want to show over multiple example SSSOM TSVs how the same row might or might not have the same record ID
  4. Documentation/guidance on how one could go about assigning IDs. Do you expect this to be just a random text field where anyone can put anything?
    • Ideally, it would be great to propose an algorithm for auto-generating an ID for a given record (something like a hash)

@gouttegd
Contributor Author

gouttegd commented Jun 2, 2025

Add a concrete use case(s) as part of the description.

I didn’t think the need to identify one record within a set needed much elaboration, but fine.

Make more explicit what is meant by a "record"

A “record” is basically an instance of the Mapping class. But I am trying to avoid the word “mapping” because many people use that word for something else. In particular, it is very often used to refer to a triple “subject predicate object”, independently of all the SSSOM metadata.

Add several more examples that demonstrate which fields are included in the record identity.

There is no notion of “fields being included in the record identity”. The record identifier is just an opaque string that uniquely identifies one record. The contents of the record do not matter. Two records may differ by only one slot; they are still different records and must have different identifiers. In fact two records may even be absolutely identical on all their other slots and still have different record identifiers (which would typically be the case if they originate from two different sets).

I think you would also want to show over multiple example SSSOM TSVs how the same row might or might not have the same record ID

It is explicitly said in the PR that this scenario (two records with the same ID) is an unsupported one.

Documentation/guidance on how one could go about assigning IDs. Do you expect this to be just a random text field where anyone can put anything?

Yes, that’s what is meant by “opaque” identifier. It can be any string (modulo the fact that it must be possible to turn it into a CURIE), as long as it is unique.

People are free to use whatever ID scheme they want, it’s not the role of the spec to mandate one scheme in particular. IDs can be serially allocated numbers

https://example.org/myset/rec0001
https://example.org/myset/rec0002
https://example.org/myset/rec0003

they can be randomly generated UUIDs

https://example.org/myset/d8097f3d-60b7-4792-8f48-080327f01f54
https://example.org/myset/272b8dd1-ca5c-4adb-acde-42564c9f04e5
https://example.org/myset/38a8fcba-5c9b-4c1c-aa13-7e9079001e45

they can be a randomly picked set of words

https://example.org/myset/correct_horse_battery_staple
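
To illustrate the first two schemes listed above, here is a minimal Python sketch. It is only one possible way of producing such identifiers, not something the spec prescribes; the `BASE` namespace and the function names `serial_ids` and `random_ids` are hypothetical, chosen to match the example URLs.

```python
import uuid

# Hypothetical namespace, matching the example URLs above.
BASE = "https://example.org/myset/"


def serial_ids(count, start=1):
    """Serially allocated identifiers: rec0001, rec0002, ..."""
    return [f"{BASE}rec{n:04d}" for n in range(start, start + count)]


def random_ids(count):
    """Identifiers built from randomly generated (version 4) UUIDs."""
    return [f"{BASE}{uuid.uuid4()}" for _ in range(count)]


print(serial_ids(3))
print(random_ids(2))
```

Either function yields strings that are unique within the set, which is all that is asked of a record identifier here.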

Ideally, it would be great to propose an algorithm for auto-generating an ID for a given record (something like a hash)

Maybe, but this is out of scope here.

And for the record, I am flatly against the use of identifiers that depend on the contents of what they identify.

@graybeal

graybeal commented Jun 3, 2025

Some thoughts on the totality of the thread, to understand the rules a bit better.

Do you want record_id to be universally unique, or just locally unique to the mapping set? If the former, then this is not enough: it would probably be a good idea to make them using the mapping set ID as a base, otherwise you can't guarantee uniqueness. If you're going to go to the trouble of putting this in, letting it be not globally unique seems a huge loss.

For maximum linkability and discoverability I'd like to request the format of the identifier be specified as + '/' + fragment. I'm OK with the fragment being a random (IRI-compatible) string, but again, if you're going to the trouble of creating an identifier field, let's standardize it appropriately the first time. Web page anchors and semantic artifacts have taught us some useful lessons.

If an identical record (ignoring the record_id) in another mapping set (file) is considered a different record, then there is no supported duplication (copying) of records—at a minimum you have to change the record_id. But we only would ever care/notice about duplication across mapping sets if the record_id is supposed to be universally unique.

Is an identical record (ignoring the record_id) in the same mapping set legal? (I think we said it was, sorry if I misremember.) If so, and if it has a record_id, it would only be legal if the record_id does not match its nominal twin, right?

Re identifiers for the triple vs the whole record: It may be helpful to consider that if you have an identifier for the whole record, it is trivial to support a variant which is identifying the triple within that record.

I really wish record_id could be a formal identifier, as in required. If that breaks previous schemas then call it a breaking schema and support both. (OK, in some worlds that argument works, maybe this isn't one of them. But otherwise you are forever locked in to all the weaknesses of your schema for, you know, forever. If it's an important enough feature then a new schema version might be worth it.) Alternatively, call it a profile, and provide a way to indicate schemas that support the profile, so that validation can be performed against the profile for a given schema.

And for the record, I am flatly against the use of identifiers that depend on the contents of what they identify.

Umm, there is a credible argument for including a checksum (validation) information in the identifier. Is that also included in your objection? I appreciate it is arguably too much complexity, but wanted to ask. In the end I'm fine with explicitly avoiding specifying the scheme for the record_id. But you know people will make the record_id depend on the content if you don't disallow that.

Re versioning, I am pretty sure the system will be much more robust if identifiers do not embed versioning information. That information about a record can be (and is) handled by another attribute.

@gouttegd
Contributor Author

gouttegd commented Jun 3, 2025

Do you want record_id to be universally unique, or just locally unique to the mapping set?

For the spec, what matters is uniqueness within a mapping set.

I expect many users will want to use globally unique identifiers, and they can absolutely do so, but I believe the spec should not enforce that. At most, we can promote that as a “good practice” in the documentation section of the SSSOM website, but this is out of scope for the specification.

For maximum linkability and discoverability I'd like to request the format of the identifier be specified as + '/' + fragment. I'm OK with the fragment being a random (IRI-compatible) string, but again, if you're going to the trouble of creating an identifier field, let's standardize it appropriately the first time. Web page anchors and semantic artifacts have taught us some useful lessons.

Flatly disagree. The spec should not mandate anything about the “format” of the identifier. As far as the spec is concerned, identifiers have no “format”, they are opaque strings. Any attempt at enforcing a particular format will only reduce the flexibility and the usefulness of the format.

(For an example of what I mean: it’s because we treat identifiers as merely opaque strings without mandating any particular format that an idea such as the “URI Expression language” is even possible; the idea would have been dead in the water if someone had decided that semantic identifiers had to follow a specific format.)

Is an identical record (ignoring the record_id) in the same mapping set legal?

It is not explicitly forbidden, so it is legal, yes.

If so, and if it has a record_id, it would only be legal if the record_id does not match its nominal twin, right?

As currently written in the present PR, yes. More precisely, two records in the same set with the same record_id is an undefined behaviour scenario – that is, we don’t tell implementations what they should do in that case. They can reject the set outright, they can drop the first record, drop the second record, attempt to merge the records… it’s up to them.

I am open to the idea of defining an exact behaviour to adopt (and to mandate in the spec) in that scenario, if we can agree on what the correct behaviour would be.
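
Whatever behaviour an implementation picks, it first has to notice the collision. The following Python sketch shows one way to do that; the function name `duplicated_record_ids` and the sample data are hypothetical, and the sketch assumes a bare TSV table without the commented metadata block that real SSSOM/TSV files usually carry.

```python
import csv
import io
from collections import Counter


def duplicated_record_ids(tsv_text):
    """Return the record_id values used by more than one record."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Records without a record_id are ignored: the slot is optional.
    counts = Counter(row["record_id"] for row in rows if row.get("record_id"))
    return sorted(rid for rid, n in counts.items() if n > 1)


# Hypothetical sample: rec1 and the first rec2 carry the same triple under
# different IDs (legal); the two rec2 rows share one ID (undefined behaviour).
SAMPLE = "\n".join([
    "\t".join(["record_id", "subject_id", "predicate_id", "object_id"]),
    "\t".join(["ex:rec1", "A:1", "skos:exactMatch", "B:1"]),
    "\t".join(["ex:rec2", "A:1", "skos:exactMatch", "B:1"]),
    "\t".join(["ex:rec2", "A:2", "skos:exactMatch", "B:2"]),
]) + "\n"

print(duplicated_record_ids(SAMPLE))
```

What to do with the reported identifiers (reject, drop, merge) is deliberately left out, since the PR does not define it.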

Re identifiers for the triple vs the whole record: It may be helpful to consider that if you have an identifier for the whole record, it is trivial to support a variant which is identifying the triple within that record.

I am on the record (pun intended) as being opposed to the idea of an identifier for the subject-predicate-object triple (let’s call that the “SPO ID” here):

  • this creates more failure modes: what if two records have the same SPO ID but a different SPO triple? or the same SPO triple but a different SPO ID?
  • the SPO can already be uniquely identified by the combination of the identifiers of its three components;
  • this is neither (1) what has been requested in New slot mapping_id #359 (and needed for mapping to JSKOS) nor (2) what we need for RDF serialisation.

In any case, this is orthogonal to the idea of an identifier for an entire record. If people really want a SPO ID, they can open a new issue and make a case for it. But for now I have yet to see anyone make a real case for SPO IDs.

I really wish record_id could be a formal identifier, as in required. If that breaks previous schemas then call it a breaking schema and support both.

SSSOM has been in use for several years already and until now the lack of a required record identifier has not been seen as a problem – or at least, not problematic enough to motivate people to do something about it.

I am not flatly opposed to the idea of making record_id mandatory for SSSOM 1.1+ sets (provided we mandate that implementations SHOULD still support ID-less 1.0 sets), but I’d like to hear more from users before making such a drastic change. I know my use cases for SSSOM certainly do not require a record identifier, and in fact SSSOM would be slightly more cumbersome to use if such identifiers were mandatory.

Umm, there is a credible argument for including a checksum (validation) information in the identifier. Is that also included in your objection?

Yes. For at least two reasons (I’m pretty sure I already exposed them in #359):

  • With a content-dependent identifier, any change to the contents of the record changes the identifier, which means the identifier is not stable (you fixed a typo somewhere? your record ID has changed; oh, you already used that record ID to refer to that particular record in your database? well, I guess you need to go update your database then) – yes, there may be cases where this is actually a desired behaviour (see for example Git commit IDs), but I seriously doubt this would always be desired.
  • If the identifier is derived in any way from the contents of the record (e.g., through some kind of hashing), it means that it cannot possibly be manually created by the human editing the mapping set. Any creation or modification of a SSSOM/TSV file would require passing the file through an ID-generating step to ensure that all IDs are correctly generated. This breaks an important promise of SSSOM, namely that SSSOM/TSV files are simple files that do not require editors to use specialised tools.
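
The stability problem described in the first point can be seen with a small Python sketch. The hashing scheme below (`content_id`, SHA-256 over the slots in sorted order, truncated to 16 hex digits) is purely hypothetical and is not an algorithm proposed anywhere in SSSOM; any content-derived scheme would behave the same way.

```python
import hashlib


def content_id(record):
    """Hypothetical content-derived ID: hash of all slots, in sorted key order."""
    canonical = "\n".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


record = {
    "subject_id": "A:1",
    "predicate_id": "skos:exactMatch",
    "object_id": "B:1",
    "comment": "Reveiwed by curator",  # note the typo
}
before = content_id(record)

record["comment"] = "Reviewed by curator"  # typo fixed, nothing else changed
after = content_id(record)

# The identifier changed even though the mapping itself did not.
print(before != after)
```

Anything that had stored `before` as a reference to this record would now point at an identifier that no longer exists in the set.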

In the end I'm fine with explicitly avoiding specifying the scheme for the record_id. But you know people will make the record_id depend on the content if you don't disallow that.

Yes! That’s the whole point of not specifying any scheme! For the spec, the identifier is an opaque string and nothing more.

If some users want to use content-derived IDs, they can absolutely do so. The spec will not get in their way. In fact, the spec could even define a standard algorithm to generate such a content-derived ID (the idea has been floated here #436) that implementations could then provide to the users. I have nothing against that at all. I am simply flatly opposed to mandating the use of such IDs.

Re versioning, I am pretty sure the system will be much more robust if identifiers do not embed versioning information. That information about a record can be (and is) handled by another attribute.

Fully agree.

@gouttegd
Contributor Author

gouttegd commented Jun 3, 2025

As far as the spec is concerned, identifiers have no “format”, they are opaque strings.

Let me correct that as it’s not quite true: we mandate at least that identifiers must be URIs, since they are typed as EntityReference.

That’s already an important deviation from the idea of completely “opaque” strings, and I am not willing to deviate any more than that by mandating on top of it a certain format of URI.

@matentzn
Collaborator

matentzn commented Jun 4, 2025

Great discussion all - as always it is a bit hard to follow complex intertwined discussions on the main body of the pull request; Maybe in the future we should always move discussions back to the issue discussion, especially for big changes like this one.

I only want to add two thoughts to this PR (beyond what was already suggested):

  1. To reduce the scope of this PR to the barebones spec part (identifier format, i.e. entity reference, examples), and separate best practices into a separate PR (global uniqueness, mapping set id as base, etc). I agree with most of @graybeal's suggestions on promoting best practice, but I also fully agree with @gouttegd's general sentiment to keep such considerations out of the spec.
  2. I am firmly against the requirement to assert a record_id in SSSOM, either now, or in the future. ID management is a big burden for anyone producing a mapping set, and the main, original community for which SSSOM is designed is not the interoperability nerds like we are, but the scientists who need to publish a mapping as part of their work/publication in a table format. However, I would be open to a "standardised" ID generation protocol for the RDF-based serialisations. (I have no opinion on content-derived ids but would find it convenient if there was a way for two people generating record_ids for the same exact mapping set to end up having the same exact record_id in RDF (not relevant to PR)).

@gouttegd there is a lot of talk in this PR - I am not sure from your responses which you are planning to address.

@gouttegd
Contributor Author

gouttegd commented Jun 4, 2025

@matentzn I do plan to address some of @cthoyt's comments, namely:

Add a concrete use case(s) as part of the description.

As I already said, I didn’t think that it would be necessary at all, but the fact that both you and @cthoyt did not understand what was meant by “record identifier” makes it pretty clear that the current description is not enough.

Make more explicit what is meant by a "record" (does this mean only fields in a mapping, or fields that are part of a mapping + the mapping set it's in?)

Can we all agree that an instance of the Mapping class, or a row in a SSSOM/TSV file, can appropriately be referred to as a “mapping record”? And that we should, as much as possible, avoid talking about simply “mapping”, since the term is much too ambiguous (some people use it to refer to the SPO triple, some people use it to refer to an entire record, some people use it to refer to a “conceptual mapping” that can have several versions and therefore can be represented by several records)?

If we do agree on that, I will:

  1. quickly¹ fix the current PR to immediately clarify, in the description of record_id, what is meant by “record”;

  2. plan another PR to formally introduce the concept of “mapping record” and use it whenever appropriate in the entire spec/documentation.

If we don’t agree on that, then we might just as well close this PR for now and first fix the terminology issue once and for all. That is, we discuss and come up with a name we all agree on to refer to an instance of the Mapping class (“mapping” is not a good name IMHO), and then we can come back to creating an identifier for it.

Add several more examples that demonstrate which fields are included in the record identity

This doesn’t need to be addressed since it is irrelevant (and I hope my new description will make that clear enough). The identifier is for the entire “record”. There is no question of “which fields are included in it”.

Documentation/guidance on how one could go about assigning IDs.

Out of scope for this PR.


¹ For some values of “quickly”. Hopefully by the end of the week, but no promises.

@gouttegd
Contributor Author

gouttegd commented Jun 4, 2025

As for @graybeal's comments:

It boils down to whether we agree that the spec should not be prescriptive about the format of identifiers and how they are generated. If we do, then there’s nothing else to say.

If we don’t, then again we might as well close this PR and postpone the creation of a record ID slot (and therefore also postpone specifying the RDF serialisation) until we agree on what SSSOM record identifiers should or should not be.

@gouttegd
Contributor Author

gouttegd commented Jun 4, 2025

I’d also like to point out that the PR is merely a direct implementation of what I proposed more than a year ago (modulo the name, which I changed from mapping_id to record_id, also as announced) as a solution to #359. At the time and in the following 13 months, nobody raised an objection to that particular proposal; instead the discussion got bogged down on issues about content-derived identifiers or core-mapping-id vs record-id.

@gouttegd
Contributor Author

gouttegd commented Jun 4, 2025

I have no opinion on content-derived ids but would find it convenient if there was a way for two people generating record_ids for the same exact mapping set to end up having the same exact record_id in RDF

I guess we can discuss that when we come to discussing the RDF spec, but I will only agree with that if it is a mere possibility, not a requirement.

The resource identifier for the RDF representation of a mapping record should be the value of the record_id slot and that’s it. Independently of how that record_id has been generated.

Now if a set has no record_id slots, I don’t mind suggesting that implementations MAY auto-generate record_id values by deriving them from the records (though I’d much rather suggest doing something like mapping_set_id + incremental serial number).

But if a set does have record_id slots, I am flatly opposed to mandating that the RDF serialisation should discard them and use its own auto-generated IDs instead.
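
As a rough sketch of that fallback, the Python function below fills in identifiers only for records that lack one, using the mapping set ID plus an incremental serial number, and never touches an existing record_id. The function name `fill_record_ids`, the dictionary representation of records, and the `/` separator are all assumptions made for illustration; nothing here is mandated by the spec.

```python
def fill_record_ids(mapping_set_id, records):
    """Return copies of the records where ID-less ones get <mapping_set_id>/<serial>.

    Existing record_id values are kept untouched.
    """
    used = {r["record_id"] for r in records if r.get("record_id")}
    serial = 0
    result = []
    for record in records:
        record = dict(record)  # do not modify the caller's data
        if not record.get("record_id"):
            # Skip serial numbers that would collide with an existing ID.
            while True:
                serial += 1
                candidate = f"{mapping_set_id}/{serial}"
                if candidate not in used:
                    break
            record["record_id"] = candidate
            used.add(candidate)
        result.append(record)
    return result


records = [
    {"subject_id": "A:1"},
    {"record_id": "https://example.org/myset/1", "subject_id": "A:2"},
    {"subject_id": "A:3"},
]
for r in fill_record_ids("https://example.org/myset", records):
    print(r["record_id"], r["subject_id"])
```

An RDF serialiser could then use the record_id value directly as the resource identifier, whether it was authored by hand or filled in by a step like this one.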

@graybeal

graybeal commented Jun 4, 2025

No disagreement if it is optional. I’d lean against being prescriptive or descriptive about how it is made; leave it for the future. I wouldn’t be directive about what tools do/can/can’t/don’t do, either.

I think a few of my questions should be addressed by the added documentation. I agree with the statement of what a ‘mapping record’ is and that it should be a defined and used concept.

Go forth…

Complete the description of the new record_id slot to:

* clarify what a "record" is;
* dispel any misconception that several records can share the same
  record_id;
* explicitly state that record identifiers must be URIs (which should
  not be needed since this is what the `range: EntityReference` is here
  for, but just in case);
* explicitly state that record identifiers are opaque.
@gouttegd gouttegd requested a review from matentzn June 6, 2025 13:07
matentzn
matentzn previously approved these changes Jun 10, 2025
Collaborator

@matentzn matentzn left a comment

I am happy with this proposal. To make me perfectly happy I would like to see a sentence that clarifies that we interpret slot condensing as syntactic sugar, and that a condensed slot at mapping_set level still belongs to the record (while a mapping set level slot in general does not), but maybe this does not belong in the spec.

@cthoyt what do you think? IMO the purpose of this proposal is mostly to enable a standard way to generate resource IDs in RDF, and I hope (expect) to not see these record ids so much in the wild. I think this is good.

Co-authored-by: Nico Matentzoglu <nicolas.matentzoglu@gmail.com>
@gouttegd
Contributor Author

IMO the purpose of this proposal is mostly to enable a standard way to generate resource IDs in RDF

Not generate. Store. The only purpose of this proposal is to have a way to represent and store a record identifier (which is, among other things, needed for RDF serialisation since people seem to have issues with the idea of representing records with blank nodes in RDF). How those record identifiers are generated is explicitly out of scope for this PR; I thought this was clear enough.

@matentzn
Collaborator

Not generate. Store.

Haha my bad. I do understand this proposal is about storing, not about generating. Bad choice of words.

matentzn
matentzn previously approved these changes Jun 10, 2025
cthoyt
cthoyt previously approved these changes Jun 16, 2025
Member

@cthoyt cthoyt left a comment

I am happy with the improvement of explanation, but would still like to see more human-level documentation be part of proposals beyond the gobbledygook that comprises changing the LinkML schema.

Since @gouttegd said he would follow up with this, I think you can accept my suggestion to improve the example file then finish this

@cthoyt
Member

cthoyt commented Jun 16, 2025

final comment though, are you sure you don't want to switch this back to being mapping_id instead of record_id? Then discussions about assigning the SPO or SPO + "not" could be about a "triple ID"

Add a new record with the same subject/predicate/object triple but with a different record ID, to demonstrate that the same triple might appear in multiple records, each record getting a unique ID.

Co-authored-by: Charles Tapley Hoyt <cthoyt@gmail.com>
@gouttegd
Contributor Author

are you sure you don't want to switch this back to being mapping_id instead of record_id?

Yes, absolutely sure.

I have already given my first reason for that in this comment (mapping_id could be mistakenly interpreted as “the ID for a SPO triple”). Overall, there is currently too much inconsistency in the way we use the term “mapping”:

  • sometimes we use “mapping” to refer to the SPO triple, and “mapping metadata” to refer to the other slots that make up the Mapping class
  • sometimes we use “core mapping” to refer to the SPO triple, and “mapping” to refer to an instance of the Mapping class.

I really think we need to clean that up (I plan to do a complete pass on the spec/documentation later to do precisely that), but for now it must start by not using an ambiguous term directly in the name of a slot.

“Record” is fairly standard terminology to refer to “an entry in a database”, “a row in a spreadsheet”, or even before computers were a thing, “a card in a card file cabinet”. I think this is exactly what we need here.

Furthermore, it is likely that people have already been using mapping_id as a custom field in their own SSSOM stuff, because it is quite an “obvious” name. I know for a fact that this is precisely the case for the EBI’s Oxo2 system. Introducing an official mapping_id slot in the spec would cause unnecessary, easily avoidable issues.

@cthoyt
Member

cthoyt commented Jun 17, 2025

thanks for the detailed response!

@matentzn matentzn merged commit 70d7afe into master Jun 17, 2025
4 checks passed
@matentzn matentzn deleted the record-id branch June 17, 2025 11:15
cthoyt added a commit to biopragmatics/semra that referenced this pull request Oct 20, 2025
Add output of SSSOM field for record_id which was added in
mapping-commons/sssom#452. This field
corresponds to an evidence's reference.