Use custom code for RDF (de)serialisation #635
Conversation
Since the normalised variant of SSSOM/RDF does not match the output produced (respectively expected) by the LinkML runtime's dumper (resp. loader), we implement our own serialisation/deserialisation routines for RDF. All the code for RDF import/export is contained within the sssom.rdf_internal module. The only class expected to be used outside of that module is the MappingSetRDFConverter class. The `from_sssom_rdf` and `to_rdf_graph` functions in the sssom.parsers and sssom.writers modules are rewritten to use that class.
The new RDF converter revealed several issues with our existing tests, which are fixed here:

- test_convert.py (test_to_rdf_hydrated): that test used an incorrect "NOT" value for a predicate modifier slot (the correct value is "Not", and there is nothing in the spec to suggest it is case-insensitive); it also forgot to declare some prefixes.
- test_writers.py (test_rdflib_endpoint): that test used "sssom:MappingCuration" as a column name in a MappingSetDataFrame, instead of the expected "mapping_justification"; it also forgot to declare some prefixes.
Add a couple of tests to check the behaviour of the new RDF parser. Also fix an issue (revealed by one of those tests) with the RECORD_ID slot, which must be compressed upon parsing since it is an EntityReference-typed slot.
Use the latest pre-release of the SSSOM schema, which includes the changes related to the latest draft of the SSSOM/RDF specification.
@gouttegd there are a few parts of the control flow that I think could be improved. Do you mind if I push some updates directly to your branch? Otherwise I'll PR it to your branch.
```python
def from_rdf(self, obj: Node) -> str:
    """Convert an RDF node into an entity reference value."""
    value = super().from_rdf(obj)
    return self.ccp().compress(value, passthrough=True)
```
I consider `passthrough=True` a bandage for end-users of curies who have very poorly formatted data and Just Want It To Work™. I think we should more explicitly handle what happens when the value either (1) isn't a valid URI or (2) can't be compressed for some reason based on the converter. I would prefer to throw exceptions in both scenarios; they could then be caught by calling code, which can decide whether to be strict (bubbling them up) or lenient (swallowing them).
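As a toy illustration of the strict behaviour being proposed (this is not the curies API, just a hand-rolled sketch using a plain prefix map; the names `strict_compress` and the error messages are invented for the example):

```python
def strict_compress(prefix_map: dict[str, str], uri: str) -> str:
    """Hypothetical strict compressor: raise instead of passing through.

    Raises ValueError both when the input is not a valid URI and when
    no known prefix can compress it, so the caller decides how lenient
    to be by catching (or not catching) the exception.
    """
    if not uri.startswith(("http://", "https://", "urn:")):
        raise ValueError(f"not a valid URI: {uri!r}")
    for prefix, expansion in prefix_map.items():
        if uri.startswith(expansion):
            return f"{prefix}:{uri[len(expansion):]}"
    raise ValueError(f"no known prefix can compress {uri!r}")

prefix_map = {"skos": "http://www.w3.org/2004/02/skos/core#"}
print(strict_compress(prefix_map, "http://www.w3.org/2004/02/skos/core#exactMatch"))
# skos:exactMatch
```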
In principle I agree, and in fact I initially did not use passthrough=True, but then this caused the test suite to fail, because some of the very sets that we use in our own test suite apparently contain poorly formatted data – which was accepted without errors or warnings by the current RDF parser and writer.
I think the new code should be, at least initially, as tolerant as the old one. If the new code refuses a set that the old one let pass, I’d consider that a bug of the new code (and this is why I had to use passthrough=True).
I wouldn’t be opposed to making the code more strict, but that would belong to another PR IMHO.
I wouldn’t be opposed to making the code more strict, but that would belong to another PR IMHO.
To be clear: I would in fact be very much in favour of making the code more strict. But:
- This would break existing code or sets and must therefore be considered carefully (I know SSSOM-Py is still in 0.x and therefore we are allowed to break things as much as we want; doesn’t mean we should do it lightly).
- This would require changes to more than just the RDF parser/writer.
Two reasons why this would belong to a distinct, dedicated PR.
I agree, better handle in dedicated PR
```python
def __init__(self) -> None:
    """Create a new instance."""
    self.constructors = {
```
this can be a module-level variable
I like to avoid module-level variables if possible. Such a variable is initialised as soon as the module is loaded.

Suppose you are doing this:

```python
from sssom.parsers import parse_sssom_table
from sssom.writers import write_json

msdf = parse_sssom_table("myset.sssom.tsv")
write_json(msdf, "myset.sssom.json")
```

Since you are loading the sssom.parsers module, you are also loading the sssom.rdf_internal module, and initialising any module-level variables in that module – even though you never even intended to do anything related to RDF.
I don't think that's a legitimate concern, given it's a small dictionary with elements that are already interpreted at import time anyway. But if you're insistent, you can always wrap it in a function call like:

```python
from functools import lru_cache

@lru_cache
def _get_my_data():
    return {}
```

The point being that this class could just be the create function all by itself.
Err, as a matter of fact I do mind. I’d like to see your proposed updates before they end up on my branch. Sorry, but distributed revision control systems like Git were invented for a reason after all. :)
```python
# Conversion from RDF
#

def msdf_from_rdf(
```
I think this function should either be module level, or a classmethod that constructs a MappingSetRDFConverter() inside it
This would make the MappingSetRDFConverter class a “one-shot” class: you create an instance to convert one graph into a MSDF (or one MSDF into a graph), and then it is spent; if you want to convert another object you must create another instance.

Yes, for the sssom.parsers.sssom_from_rdf and sssom.writers.to_rdf_graph functions within SSSOM-Py this is all we need, but I’d rather keep the MappingSetRDFConverter class reusable for multiple conversions for those who might need that.

No objection to adding a module-level msdf_from_rdf function that does nothing but create an instance and immediately call the msdf_from_rdf method on it (that’s basically what sssom.parsers.sssom_from_rdf is already doing), as a convenience for those who don’t want to bother creating an instance themselves.
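A sketch of that module-level convenience function (with a minimal stand-in for the converter class, since the real one lives in sssom.rdf_internal; the stand-in's behaviour is invented for the example):

```python
class MappingSetRDFConverter:
    """Stand-in for the real converter class, for illustration only."""

    def msdf_from_rdf(self, graph):
        # The real method walks an rdflib Graph and builds a
        # MappingSetDataFrame; here we just tag the input.
        return f"msdf<{graph}>"

def msdf_from_rdf(graph):
    """Module-level convenience: create a converter, use it once."""
    return MappingSetRDFConverter().msdf_from_rdf(graph)

assert msdf_from_rdf("my-graph") == "msdf<my-graph>"
```

The class stays reusable for callers who keep an instance around; the wrapper only trades that reusability for brevity in the one-shot case.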
Apply fixes suggested by C.T. Hoyt:

- Remove `object` as an explicit superclass.
- Apply the standard convention for naming of type aliases.
- Fix an incorrect type hint in the constructor of ObjectConverter.
matentzn left a comment
Sorry it took so long to look at this.
I have reviewed almost everything apart from src/sssom/rdf_internal.py (I scanned it from an architectural viewpoint). This all looks clean, well designed and performant!
```python
    :returns: True if `name` is the name of a valid extension,
        otherwise False.
    """
    # FIXME: Not implemented yet, ignore all for now
```
I assume you want a sep. PR for this. In the meantime, maybe logging.warning()?
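A minimal sketch of that interim behaviour (the function name and message are illustrative, not the actual sssom.rdf_internal code):

```python
import logging

logger = logging.getLogger(__name__)

def is_extension_slot(name: str) -> bool:
    """Interim stub: recognise no extension slots, but say so out loud."""
    # Warn instead of silently dropping the slot, so users know data is
    # being ignored until extension support lands in a dedicated PR.
    logger.warning("Extension slot %r is not supported yet; ignoring it", name)
    return False

assert is_extension_slot("foo") is False
```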
I assume you want a sep. PR for this
Yes.
First because extension slots are completely unsupported throughout SSSOM-Py anyway, so not supporting them in RDF import/export is not a big deal.
Second (and most importantly), since extension slots are currently completely unsupported, I think we need to discuss first about how we are going to support them in the rest of SSSOM-Py – in particular, how they are going to be represented within a MSDF.
For now, I am assuming that extension slots will be represented in the most direct way possible:

- a foo extension slot at the mapping set level will be represented as a foo key in the metadata part of the MSDF;
- a bar extension slot at the mapping level will be represented as a bar column in the data frame part of the MSDF.
I don’t think it needs to be more complicated than that, and it is consistent with the way the other (non-extension) slots are represented.
Still, this needs discussion, if only because it will require updating (or, in fact, completely rewriting from scratch) the existing parsers and writers (apart from the RDF ones) so that they can support extensions – for now, they cannot do that, since they rely heavily on LinkML’s dumpers and loaders, which do not allow the presence of unexpected slots.
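Under that assumption, a hypothetical MSDF carrying extension slots would look like this (a pandas sketch; the "foo"/"bar" extensions and all the values are invented, and none of this is implemented yet):

```python
import pandas as pd

# A set-level "foo" extension slot as a plain key in the metadata part...
metadata = {
    "mapping_set_id": "https://example.org/sets/demo",
    "foo": "set-level extension value",  # extension slot, not in the schema
}

# ...and a mapping-level "bar" extension slot as an extra column in the
# data frame part, sitting next to the standard slots.
df = pd.DataFrame([
    {
        "subject_id": "A:0001",
        "predicate_id": "skos:exactMatch",
        "object_id": "B:0001",
        "mapping_justification": "semapv:ManualMappingCuration",
        "bar": "mapping-level extension value",  # extension column
    }
])

assert "foo" in metadata
assert "bar" in df.columns
```

This mirrors how the non-extension slots are already stored, which is what makes it the least surprising representation.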
Extension slots are currently not supported throughout SSSOM-Py, but we should at least emit a warning when we are ignoring them.
Avoid one- or two-letter argument names (`g`, `cc`). Use longer, more explicit names instead.
I've made a PR onto this PR (gouttegd#3) that improves the control flow.
Commenting here to keep the discussion in a single place.

Your PR “improves” the control flow by making the MappingSetRDFConverter class single-use. The class is designed so that you can do something like this:

```python
from rdflib import Graph
from sssom.rdf_internal import MappingSetRDFConverter

# Creating a converter object once and for all
converter = MappingSetRDFConverter()

# Extracting a set from a graph
msdf1 = converter.msdf_from_rdf(Graph().parse("my-set-1.ttl"))

# Extracting another set from another graph
msdf2 = converter.msdf_from_rdf(Graph().parse("my-set-2.ttl"))

# Extracting yet another set from yet another graph
msdf3 = converter.msdf_from_rdf(Graph().parse("my-set-3.ttl"))

# and so on...
```

and likewise for RDF export. The “improved” version requires instead creating a whole new converter for every single set to extract:

```python
g1 = Graph().parse("my-set-1.ttl")
msdf1 = MappingSetRDFConverter(Converter.from_rdflib(g1)).msdf_from_rdf(g1)

g2 = Graph().parse("my-set-2.ttl")
msdf2 = MappingSetRDFConverter(Converter.from_rdflib(g2)).msdf_from_rdf(g2)

# and so on...
```

And then there is no longer any real point in having a converter class at all. At the very least, such a design would absolutely need a clear warning in the documentation of the class.
@gouttegd from your perspective this is ready to merge, right?
Assuming there is agreement on the overall design, yes.

If there is no agreement on the design: I personally think supporting the RDF serialisation as described in the spec is more important than the exact design used to achieve that aim. I am not wedded to my initial design and could consider changing it, but not for the design proposed in gouttegd#3, at least not in its current state.
matentzn left a comment
I have considered the design again; it's unfortunate we lost @cthoyt on this one, but I agree with keeping the serialiser class reusable across transformations rather than tying the constructor to a specific converter. I am not entirely sure about the ConverterProvider, but then again I am the worst coder among us, so my judgment counts for little.
THANK YOU @gouttegd for pushing this, I would like to move forward with this whenever you are ready!
This PR implements option “A” mentioned in #628. That is, it replaces the LinkML-based code for serialising/deserialising to/from RDF with completely custom code that implements the new SSSOM/RDF spec exactly as we want it.
All the new code is contained within the new `sssom.rdf_internal` module, which as the name implies is not really supposed to be used outside of SSSOM-Py (though of course people can still do whatever they want!). The public interface to read from RDF is still the `from_sssom_rdf` method in `sssom.parsers`, and the public interface to write to RDF is still the `to_rdf_graph` method in `sssom.writers`. Both methods are rewritten to use the MappingSetRDFConverter class of the `sssom.rdf_internal` module, instead of methods from the LinkML runtime.

Closes #628