Conversation

@gouttegd
Contributor

This PR implements option "A" mentioned in #628. That is, it replaces the LinkML-based code for serialising/deserialising to/from RDF with completely custom code that implements the new SSSOM/RDF spec exactly as we want it.

All the new code is contained within the new sssom.rdf_internal module, which, as the name implies, is not really supposed to be used outside of SSSOM-Py (though of course people can still do whatever they want!). The public interface to read from RDF is still the from_sssom_rdf function in sssom.parsers, and the public interface to write to RDF is still the to_rdf_graph function in sssom.writers. Both functions are rewritten to use the MappingSetRDFConverter class of the sssom.rdf_internal module, instead of methods from the LinkML runtime.

closes #628

Since the normalised variant of SSSOM/RDF does not match the output
produced (respectively expected) by the LinkML runtime's dumper (resp.
loader), we implement our own serialisation/deserialisation routines for
RDF.

All the code for RDF import/export is contained within the
sssom.rdf_internal module. The only class expected to be used outside of
that module is the MappingSetRDFConverter class.

The `from_sssom_rdf` and `to_rdf_graph` functions in the sssom.parsers
and sssom.writers modules are rewritten to use that class.
The new RDF converter revealed several issues with our existing tests,
which are fixed here.

* test_convert.py (test_to_rdf_hydrated): That test used an incorrect
  "NOT" value for a predicate modifier slot (the correct value is "Not",
  and there is nothing in the spec to suggest it is case-insensitive);
  it also forgot to declare some prefixes;

* test_writers.py (test_rdflib_endpoint): That test used
  "sssom:MappingCuration" as a column name in a MappingSetDataFrame,
  instead of the expected "mapping_justification"; it also forgot to
  declare some prefixes.
Add a couple of tests to check the behaviour of the new RDF parser.
Also fix an issue (revealed by one of those tests) with the RECORD_ID
slot, which must be compressed upon parsing since it is an
EntityReference-typed slot.
Use the latest pre-release of the SSSOM schema, which includes the
changes related to the latest draft of the SSSOM/RDF specification.
@cthoyt
Member

cthoyt commented Oct 29, 2025

@gouttegd there are a few parts of the control flow that I think could be improved. Do you mind if I push some updates directly to your branch? Otherwise I'll open a PR against your branch.

def from_rdf(self, obj: Node) -> str:
"""Convert an RDF node into an entity reference value."""
value = super().from_rdf(obj)
return self.ccp().compress(value, passthrough=True)
Member

I consider passthrough=True a band-aid for end-users of curies who have very poorly formatted data and Just Want It To Work™. I think we should handle more explicitly what happens when the value either 1) isn't a valid URI or 2) can't be compressed for some reason based on the converter. I would prefer to throw exceptions in both scenarios; the calling code could then catch them and decide whether to be strict (bubbling them up) or tolerant (swallowing them).
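To make the distinction concrete, here is a minimal, self-contained sketch of the tolerant vs. strict behaviour being discussed. This is an illustrative stand-in, not the actual curies API: the function name, the plain-dict prefix map, and the ValueError are all assumptions for the example.

```python
def compress(uri: str, prefix_map: dict[str, str], *, passthrough: bool = False) -> str:
    """Compress a URI into a CURIE using a prefix map.

    Illustrative stand-in for what a converter's compress() does;
    not the real curies implementation.
    """
    for prefix, expansion in prefix_map.items():
        if uri.startswith(expansion):
            return f"{prefix}:{uri[len(expansion):]}"
    if passthrough:
        # Tolerant mode: hand the raw URI back unchanged.
        return uri
    # Strict mode: signal the failure and let the caller decide.
    raise ValueError(f"cannot compress {uri!r} with the known prefixes")
```

With passthrough=True, un-compressible values silently flow through; in strict mode, the exception forces the caller to make an explicit choice.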

Contributor Author

In principle I agree, and in fact I initially did not use passthrough=True, but then this caused the test suite to fail, because some of the very sets that we use in our own test suite apparently contain poorly formatted data – which was accepted without errors or warnings by the current RDF parser and writer.

I think the new code should be, at least initially, as tolerant as the old one. If the new code refuses a set that the old one let pass, I’d consider that a bug of the new code (and this is why I had to use passthrough=True).

I wouldn’t be opposed to making the code more strict, but that would belong to another PR IMHO.

Contributor Author

I wouldn’t be opposed to making the code more strict, but that would belong to another PR IMHO.

To be clear: I would in fact be very much in favour of making the code more strict. But:

  • This would break existing code or sets and must therefore be considered carefully (I know SSSOM-Py is still in 0.x and therefore we are allowed to break things as much as we want; doesn’t mean we should do it lightly).
  • This would require changes to more than just the RDF parser/writer.

Those are two reasons why this belongs in a distinct, dedicated PR.

Collaborator

I agree, better handled in a dedicated PR


def __init__(self) -> None:
"""Create a new instance."""
self.constructors = {
Member

this can be a module-level variable

Contributor Author

@gouttegd gouttegd Oct 29, 2025

I like to avoid module-level variables if possible. Such a variable is initialised as soon as the module is loaded.

Suppose you are doing this:

from sssom.parsers import parse_sssom_table
from sssom.writers import write_json

msdf = parse_sssom_table("myset.sssom.tsv")
write_json(msdf, "myset.sssom.json")

Since you are loading the sssom.parsers module, you are also loading the sssom.rdf_internal module, and initialising any module-level variables in that module – even though you never even intended to do anything related to RDF.

Member

I don't think that it's a legitimate concern, given it's a small dictionary with elements that are already interpreted at import time anyway, but if you're insistent, then you can always wrap it in a function call like:

from functools import lru_cache

@lru_cache
def _get_my_data():
     return {}

The point being that this class could just be the create function all by itself

@gouttegd
Contributor Author

do you mind If I push some updates directly to your branch?

Err, as a matter of fact I do mind. I’d like to see your proposed updates before they end up on my branch. Sorry, but distributed revision control systems like Git were invented for a reason after all. :)

# Conversion from RDF
#

def msdf_from_rdf(
Member

@cthoyt cthoyt Oct 29, 2025

I think this function should either be module level, or a classmethod that constructs a MappingSetRDFConverter() inside it

Contributor Author

This would make the MappingSetRDFConverter class a "one-shot" class: you create an instance to convert one graph into a MSDF (or one MSDF into a graph), and then it is spent; if you want to convert another object, you must create another instance.

Yes, for the sssom.parsers.sssom_from_rdf and sssom.writers.to_rdf_graph functions within SSSOM-Py this is all we need, but I’d rather keep the MappingSetRDFConverter class reusable for multiple conversions for those who might need that.

No objection to adding a module-level msdf_from_rdf function that does nothing but create an instance and immediately call the msdf_from_rdf method on it (that's basically what sssom.parsers.sssom_from_rdf is already doing), as a convenience for those who don't want to bother creating an instance themselves.
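The convenience wrapper being proposed could look like the following sketch. The class here is a minimal stand-in stub for the real sssom.rdf_internal.MappingSetRDFConverter, included only so the pattern is self-contained; the real conversion logic is elided.

```python
class MappingSetRDFConverter:
    """Stand-in stub for sssom.rdf_internal.MappingSetRDFConverter."""

    def msdf_from_rdf(self, graph):
        # The real implementation walks the RDF graph and builds a
        # MappingSetDataFrame; the stub just echoes its input.
        return {"graph": graph}


def msdf_from_rdf(graph):
    """Module-level convenience: create a converter, use it once, discard it."""
    return MappingSetRDFConverter().msdf_from_rdf(graph)
```

Callers who want the one-shot behaviour use the function; callers with many sets keep a single converter instance alive and call its method repeatedly.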

Apply fixes suggested by C.T. Hoyt:

* Remove `object` as an explicit superclass.
* Apply standard convention for naming of type aliases.
* Fix incorrect type hint in the constructor of ObjectConverter.
matentzn
matentzn previously approved these changes Oct 31, 2025
Collaborator

@matentzn matentzn left a comment

Sorry it took so long to look at this.

I have reviewed mostly everything apart from src/sssom/rdf_internal.py (which I scanned from an architectural viewpoint). This all looks clean, well designed, and performant!

def from_rdf(self, obj: Node) -> str:
"""Convert an RDF node into an entity reference value."""
value = super().from_rdf(obj)
return self.ccp().compress(value, passthrough=True)
Collaborator

I agree, better handle in dedicated PR

:returns: True if `name` is the name of a valid extension,
otherwise False.
"""
# FIXME: Not implemented yet, ignore all for now
Collaborator

I assume you want a sep. PR for this. In the meantime, maybe logging.warning()?

Contributor Author

I assume you want a sep. PR for this

Yes.

First, extension slots are completely unsupported throughout SSSOM-Py anyway, so not supporting them in RDF import/export is not a big deal.

Second (and most importantly), since extension slots are currently completely unsupported, I think we first need to discuss how we are going to support them in the rest of SSSOM-Py – in particular, how they are going to be represented within a MSDF.

For now, I am assuming that extension slots will be represented in the most direct way possible:

  • a foo extension slot at the mapping set level will be represented as a foo key in the metadata part of the MSDF;
  • a bar extension slot at the mapping level will be represented as a bar column in the data frame part of the MSDF.

I don’t think it needs to be more complicated than that, and it is consistent with the way the other (non-extension) slots are represented.
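A small sketch of what that representation would look like. This is the proposal from the two bullets above, not implemented behaviour; the slot names "foo" and "bar" are the hypothetical extensions from the discussion, and the other values are invented for illustration.

```python
# Set-level extension slot "foo" becomes a plain key in the MSDF metadata:
msdf_metadata = {
    "mapping_set_id": "https://example.org/my-set",  # standard slot
    "foo": "set-level extension value",              # extension slot
}

# Mapping-level extension slot "bar" becomes an extra column in the frame
# (rows shown here as plain dicts rather than a pandas DataFrame):
msdf_rows = [
    {
        "subject_id": "EX:0001",                     # standard column
        "object_id": "EX:0002",                      # standard column
        "bar": "mapping-level extension value",      # extension column
    },
]
```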

Still, this needs discussion, if only because this will require updating (or, in fact, completely rewriting from scratch) the existing parsers and writers (apart from the RDF ones) so that they can support extensions – for now, they cannot do that since they rely heavily on LinkML's dumpers and loaders, which do not allow the presence of unexpected slots.

Extension slots are currently not supported throughout SSSOM-Py, but we
should at least emit a warning when we are ignoring them.
Avoid one- or two-letter argument names (`g`, `cc`). Use longer, more
explicit names instead.
@cthoyt
Member

cthoyt commented Nov 8, 2025

I've made a PR onto this PR gouttegd#3 that improves the control flow

@gouttegd
Contributor Author

gouttegd commented Nov 8, 2025

I've made a PR onto this PR gouttegd#3 that improves the control flow

Commenting here to keep the discussion in a single place.

Your PR "improves" the control flow by making the MappingSetRDFConverter class in effect not reusable (except for sets that all use the same prefix map), thereby negating one of the main benefits of that class.

The class is designed so that you can do something like this:

from rdflib import Graph
from sssom.rdf_internal import MappingSetRDFConverter

# Creating a converter object once and for all
converter = MappingSetRDFConverter()

# Extracting a set from a graph
msdf1 = converter.msdf_from_rdf(Graph().parse("my-set-1.ttl"))

# Extracting another set from another graph
msdf2 = converter.msdf_from_rdf(Graph().parse("my-set-2.ttl"))

# Extracting yet another set from yet another graph
msdf3 = converter.msdf_from_rdf(Graph().parse("my-set-3.ttl"))

# and so on...

and likewise for RDF export (msdf_to_rdf).

The “improved” version requires instead to create a whole new converter for every single set to extract:

g1 = Graph().parse("my-set-1.ttl")
msdf1 = MappingSetRDFConverter(Converter.from_rdflib(g1)).msdf_from_rdf(g1)

g2 = Graph().parse("my-set-2.ttl")
msdf2 = MappingSetRDFConverter(Converter.from_rdflib(g2)).msdf_from_rdf(g2)

# and so on...

And then there is no longer any real point in having a MappingSetRDFConverter class to begin with...

At the very least, such a design would absolutely need a clear warning in the documentation of the MappingSetRDFConverter constructor that a new instance can be used to (de)serialise several mapping sets if and only if all the sets share the same prefix map.

@matentzn
Collaborator

@gouttegd from your perspective this is ready to merge, right?

@cthoyt

  • re the control flow changes you suggested in Update control flow gouttegd/sssom-py#3: since they are not in line with the original vision behind this PR, you will have to file an issue (sorry, I know this sounds like unnecessary extra work) that explains why you believe they are needed and the current design is not enough
  • any other observations? I would like to merge asap

@gouttegd
Contributor Author

from your perspective this is ready to merge right?

Assuming there is agreement on the overall design, yes.

If there is no agreement on the design: I personally think supporting the RDF serialisation as described in the spec is more important than the exact design used to achieve that aim. I am not wedded to my initial design and could consider changing it, but not to the design proposed in gouttegd#3, at least not in its current state.

Collaborator

@matentzn matentzn left a comment

I have considered the design again; it's unfortunate we lost @cthoyt on this one, but I agree with keeping the serialiser class reusable across transformations rather than tying the constructor to a specific converter. I am not entirely sure about the ConverterProvider, but then again I am the worst coder among us, so my judgment counts for little.

THANK YOU @gouttegd for pushing this, I would like to move forward with this whenever you are ready!

@gouttegd gouttegd merged commit ac0f769 into mapping-commons:master Nov 22, 2025
12 checks passed

Development

Successfully merging this pull request may close these issues.

Support for “new” RDF serialisation

3 participants