Add record_id slot.
#452
Add a new slot to the Mapping class, `record_id`, intended to hold a unique identifier for a given mapping.

The slot is optional, so as not to break compatibility with existing SSSOM 1.0 sets. This also means that, while we can define the slot as a "unique key" for the Mapping class, we _cannot_ define it as the "identifier", because in LinkML identifier slots are automatically mandatory.

The identifier is intended to be completely opaque. How to generate identifiers is left to the producers of SSSOM sets, and no meaning of any sort should be assigned to an identifier.

Closes #359
cthoyt
left a comment
There are a few things that come to mind that could make this a stronger proposal:
- Add one or more concrete use cases as part of the description. For example, JSKOS interoperability is a strong use case. So is supporting the creation of mapping databases like SeMRA and OXO2.
- Make more explicit what is meant by a "record" (does this mean only fields in a mapping, or fields that are part of a mapping + the mapping set it's in?)
- Add several more examples that demonstrate which fields are included in the record identity. For example, do extension fields get included in this? I think you would also want to show over multiple example SSSOM TSVs how the same row might or might not have the same record ID
- Documentation/guidance on how one could go about assigning IDs. Do you expect this to be just a random text field where anyone can put anything?
- Ideally, it would be great to propose an algorithm for auto-generating an ID for a given record (something like a hash)
I didn’t think the need to identify one record within a set needed much elaboration, but fine.
A “record” is basically an instance of the
There is no notion of “fields being included in the record identity”. The record identifier is just an opaque string that uniquely identifies one record. The contents of the record do not matter. Two records may differ only by only one slot, they are still different records and must have different identifiers. In fact two records may even be absolutely identical on all their other slots and still have different record identifiers (which would typically be the case if they originate from two different sets).
It is explicitly said in the PR that this scenario (two records with the same ID) is an unsupported one.
Yes, that’s what is meant by “opaque” identifier. It can be any string (modulo the fact that it must be possible to turn it into a CURIE), as long as it is unique. People are free to use whatever ID scheme they want; it’s not the role of the spec to mandate one scheme in particular. IDs can be serially allocated numbers, randomly generated UUIDs, or randomly picked sets of words.
Maybe, but this is out of scope here. And for the record, I am flatly against the use of identifiers that depend on the contents of what they identify.
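For illustration only, here is a minimal Python sketch of the three kinds of ID schemes mentioned above. Nothing in it comes from the PR: the `myset` prefix, the word list, and the function names are all hypothetical, and the spec deliberately does not favour any of these schemes.

```python
import itertools
import secrets
import uuid

# Hypothetical CURIE prefix; assumed to be declared in the set's prefix map.
PREFIX = "myset"

# Tiny illustrative word list; a real one would need to be much larger.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "heron"]


def serial_ids(prefix=PREFIX, start=1):
    """Yield serially allocated IDs: myset:1, myset:2, ..."""
    for n in itertools.count(start):
        yield f"{prefix}:{n}"


def uuid_id(prefix=PREFIX):
    """Return an ID built from a randomly generated (version 4) UUID."""
    return f"{prefix}:{uuid.uuid4()}"


def words_id(prefix=PREFIX, n_words=3):
    """Return an ID built from randomly picked words.

    With a small word list collisions are likely, so the producer
    still has to check uniqueness within the set.
    """
    picked = (secrets.choice(WORDS) for _ in range(n_words))
    return f"{prefix}:" + "-".join(picked)
```

Whichever scheme is used, a consumer would treat the resulting value as an opaque string and read nothing into its structure.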
Some thoughts on the totality of the thread, to understand the rules a bit better.

Do you want record_id to be universally unique, or just locally unique to the mapping set? If the former, then this is not enough to say

For maximum linkability and discoverability I'd like to request the format of the identifier be specified as + '/' + fragment. I'm OK with the fragment being a random (IRI-compatible) string, but again, if you're going to the trouble of creating an identifier field, let's standardize it appropriately the first time. Web page anchors and semantic artifacts have taught us some useful lessons.

If an identical record (ignoring the record_id) in another mapping set (file) is considered a different record, then there is no supported duplication (copying) of records; at a minimum you have to change the record_id. But we would only ever care about or notice duplication across mapping sets if the record_id is supposed to be universally unique.

Is an identical record (ignoring the record_id) in the same mapping set legal? (I think we said it was, sorry if I misremember.) If so, and if it has a record_id, it would only be legal if the record_id does not match its nominal twin, right?

Re identifiers for the triple vs the whole record: It may be helpful to consider that if you have an identifier for the whole record, it is trivial to support a variant which identifies the triple within that record.

I really wish record_id could be a formal identifier, as in required. If that breaks previous schemas then call it a breaking schema and support both. (OK, in some worlds that argument works, maybe this isn't one of them. But otherwise you are locked in to all the weaknesses of your schema for, you know, forever. If it's an important enough feature then a new schema version might be worth it.)
Alternatively, call it a profile, and provide a way to indicate schemas that support the profile, so that validation can be performed against the profile for a given schema.
Umm, there is a credible argument for including checksum (validation) information in the identifier. Is that also included in your objection? I appreciate it is arguably too much complexity, but wanted to ask. In the end I'm fine with explicitly avoiding specifying the scheme for the record_id. But you know people will make the record_id depend on the content if you don't disallow that. Re versioning, I am pretty sure the system will be much more robust if identifiers do not embed versioning information. That information about a record can be (and is) handled by another attribute.
For the spec, what matters is uniqueness within a mapping set. I expect many users will want to use globally unique identifiers, and they can absolutely do so, but I believe the spec should not enforce that. At most, we can promote that as a “good practice” in the documentation section of the SSSOM website, but this is out of scope for the specification.
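A minimal sketch of what checking "uniqueness within a mapping set" could look like, assuming records are modelled as plain Python dicts keyed by slot name (an assumption made for this example, not how any SSSOM implementation necessarily represents them). The function name and sample values are hypothetical.

```python
from collections import Counter


def duplicated_record_ids(records):
    """Return the set of record_id values used by more than one record.

    Records without a record_id are ignored, since the slot is optional.
    """
    counts = Counter(r["record_id"] for r in records if r.get("record_id"))
    return {rid for rid, n in counts.items() if n > 1}


triple = {
    "subject_id": "A:1",
    "predicate_id": "skos:exactMatch",
    "object_id": "B:1",
}

mapping_set = [
    {"record_id": "myset:1", **triple},
    # Same triple in a second record: fine, as long as the ID differs.
    {"record_id": "myset:2", **triple},
    # No record_id at all: also fine, the slot is optional.
    dict(triple),
]

print(duplicated_record_ids(mapping_set))
```

The check only looks inside one set; whether IDs also happen to be globally unique is left to the producer.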
Flatly disagree. The spec should not mandate anything about the “format” of the identifier. As far as the spec is concerned, identifiers have no “format”; they are opaque strings. Any attempt at enforcing a particular format will only reduce the flexibility and the usefulness of the format. (For an example of what I mean: it’s because we treat identifiers as merely opaque strings without mandating any particular format that an idea such as the “URI Expression language” is even possible; the idea would have been dead in the water if someone had decided that semantic identifiers had to follow a specific format.)
It is not explicitly forbidden, so it is legal, yes.
As currently written in the present PR, yes. More precisely, 2 records in the same set with the same I am open to the idea of defining an exact behaviour to adopt (and to mandate in the spec) in that scenario, if we can agree on what the correct behaviour would be.
I am on the record (pun intended) being opposed to the idea of an identifier for the subject-predicate-object triple (let’s call that the “SPO ID” here):
In any case, this is orthogonal to the idea of an identifier for an entire record. If people really want a SPO ID, they can open a new issue and make a case for it. But for now I have yet to see anyone make a real case for SPO IDs.
SSSOM has been in use for several years already and until now the lack of a required record identifier has not been seen as a problem – or at least, not problematic enough to motivate people to do something about it. I am not flatly opposed to the idea of making
Yes. For at least two reasons (I’m pretty sure I already exposed them in #359):
Yes! That’s the whole point of not specifying any scheme! For the spec, the identifier is an opaque string and nothing more. If some users want to use content-derived IDs, they can absolutely do so. The spec will not get in their way. In fact, the spec could even define a standard algorithm to generate such a content-derived ID (the idea has been floated here #436) that implementations could then provide to the users. I have nothing against that at all. I am simply flatly opposed to mandating the use of such IDs.
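To make "content-derived ID" concrete, here is a minimal sketch of one way a producer might derive an ID from a record's contents. It is purely illustrative and is not the algorithm floated in #436: the dict-based record model, the `myset` prefix, the choice of SHA-256, and the 16-character truncation are all assumptions made for this example.

```python
import hashlib
import json


def content_derived_id(record, prefix="myset"):
    """Derive an ID from every slot of the record except record_id itself."""
    payload = {k: v for k, v in record.items() if k != "record_id"}
    # Sort keys so that slot order does not change the resulting hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest[:16]}"


record = {
    "subject_id": "A:1",
    "predicate_id": "skos:exactMatch",
    "object_id": "B:1",
}
print(content_derived_id(record))
```

Note that under such a scheme two records with identical contents get the same ID, so a producer using it would have to handle identical records within one set separately.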
Fully agree. |
Let me correct that as it’s not quite true: we mandate at least that identifiers must be URIs, since they are typed as EntityReference. That’s already an important deviation from the idea of completely “opaque” strings, and I am not willing to deviate any more than that by mandating on top of it a certain format of URI.
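As a rough illustration of the URI constraint, the sketch below expands a CURIE-style record identifier into a full URI using a prefix map. The function name, the `myset` prefix, and the `https://example.org/` namespace are hypothetical; real tooling would rely on the prefix map declared by the mapping set.

```python
def expand_curie(curie, prefix_map):
    """Expand 'prefix:local' into a full URI using the given prefix map."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"cannot expand {curie!r}: unknown or missing prefix")
    return prefix_map[prefix] + local


# Hypothetical prefix map for the example.
prefix_map = {"myset": "https://example.org/myset/"}
print(expand_curie("myset:42", prefix_map))
```

Beyond being expandable in this way, the local part carries no meaning as far as the spec is concerned.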
Great discussion all. As always, it is a bit hard to follow complex intertwined discussions on the main body of the pull request; maybe in the future we should always move discussions back to the issue discussion, especially for big changes like this one. I only want to add two thoughts to this PR (beyond what was already suggested):
@gouttegd there is a lot of talk in this PR - I am not sure from your responses which you are planning to address.
@matentzn I do plan to address some of @cthoyt comments, namely:
As I already said, I didn’t think that it would be necessary at all, but the fact that both you and @cthoyt did not understand what was meant by “record identifier“ makes it pretty clear that the current description is not enough.
Can we all agree that an instance of the If we do agree on that, I will:
If we don’t agree on that, then we might just as well close this PR for now and first fix the terminology issue once and for all. That is, we discuss and come out with a name we all agree on to refer to an instance of the
This doesn’t need to be addressed since it is irrelevant (and I hope my new description will make that clear enough). The identifier is for the entire “record”. There is no question of “which fields are included in it”.
Out of scope for this PR.

¹ For some values of “quickly”. Hopefully by the end of the week, but no promises.
As for @graybeal comments: It boils down to whether we agree that the spec should not be prescriptive about the format of identifiers and how they are generated. If we do, then there’s nothing else to say. If we don’t, then again we might as well close this PR and postpone the creation of a record ID slot (and therefore also postpone specifying the RDF serialisation) until we agree on what SSSOM record identifiers should or should not be.
I’d also like to point out that the PR is merely a direct implementation of what I proposed more than a year ago (modulo the name, which I changed from |
I guess we can discuss that when we come to discussing the RDF spec, but I will only agree with that if it is a mere possibility, not a requirement. The resource identifier for the RDF representation of a mapping record should be the value of the Now if a set has no But if a set does have |
No disagreement if it is optional. I’d lean against being prescriptive or descriptive about how it is made; leave it for the future. I wouldn’t be directive about what tools do/can/can’t/don’t do, either. I think a few of my questions should be addressed by the added documentation. I agree with the statement of what a ‘mapping record’ is and that it should be a defined and used concept. Go forth…
Complete the description of the new record_id slot to:
* clarify what a "record" is;
* dispel any misconception that several records can share the same record_id;
* explicitly state that record identifiers must be URIs (which should not be needed since this is what the `range: EntityReference` is here for, but just in case);
* explicitly state that record identifiers are opaque.
matentzn
left a comment
I am happy with this proposal. To make me perfectly happy I would like to see a sentence that clarifies that we interpret slot condensing as syntactic sugar, and that a condensed slot at mapping_set level still belongs to the record (while a mapping set level slot in general does not), but maybe this does not belong in the spec.
@cthoyt what do you think? IMO the purpose of this proposal is mostly to enable a standard way to generate resource IDs in RDF, and I hope (expect) to not see these record ids so much in the wild. I think this is good.
Co-authored-by: Nico Matentzoglu <nicolas.matentzoglu@gmail.com>
Not generate. Store. The only purpose of this proposal is to have a way to represent and store a record identifier (which is, among other things, needed for RDF serialisation since people seem to have issues with the idea of representing records with blank nodes in RDF). How those record identifiers are generated is explicitly out of scope for this PR; I thought this was clear enough.
Haha my bad. I do understand this proposal is about storing, not about generating. Bad choice of words.
cthoyt
left a comment
I am happy with the improvement of explanation, but would still like to see more human-level documentation be part of proposals beyond the gobbledygook that comprises changing the LinkML schema.
Since @gouttegd said he would follow up with this, I think you can accept my suggestion to improve the example file then finish this
final comment though, are you sure you don't want to switch this back to being |
Add a new record with the same subject/predicate/object triple but with a different record ID, to demonstrate that the same triple might appear in multiple records, each record getting a unique ID. Co-authored-by: Charles Tapley Hoyt <cthoyt@gmail.com>
Yes, absolutely sure. I have already given my first reason for that in this comment (
I really think we need to clean that up (I plan to do a complete pass on the spec/documentation later to do precisely that), but for now it must start by not using an ambiguous term directly in the name of a slot. “Record” is fairly standard terminology to refer to “an entry in a database”, “a row in a spreadsheet”, or even before computers were a thing, “a card in a card file cabinet”. I think this is exactly what we need here. Furthermore, it is likely that people have already been using |
Thanks for the detailed response!
Add output of SSSOM field for record_id which was added in mapping-commons/sssom#452. This field corresponds to an evidence's reference.
Resolves [#359]