Use custom code for RDF (de)serialisation #635
Conversation
Since the normalised variant of SSSOM/RDF does not match the output produced (respectively expected) by the LinkML runtime's dumper (resp. loader), we implement our own serialisation/deserialisation routines for RDF. All the code for RDF import/export is contained within the sssom.rdf_internal module. The only class expected to be used outside of that module is the MappingSetRDFConverter class. The `from_sssom_rdf` and `to_rdf_graph` functions in the sssom.parsers and sssom.writers modules are rewritten to use that class.
The new RDF converter revealed several issues with our existing tests, which are fixed here:

- test_convert.py (test_to_rdf_hydrated): that test used an incorrect "NOT" value for a predicate modifier slot (the correct value is "Not", and there is nothing in the spec to suggest it is case-insensitive); it also forgot to declare some prefixes.
- test_writers.py (test_rdflib_endpoint): that test used "sssom:MappingCuration" as a column name in a MappingSetDataFrame, instead of the expected "mapping_justification"; it also forgot to declare some prefixes.
Add a couple of tests to check the behaviour of the new RDF parser. Also fix an issue (revealed by one of those tests) with the RECORD_ID slot, which must be compressed upon parsing since it is an EntityReference-typed slot.
Use the latest pre-release of the SSSOM schema, which includes the changes related to the latest draft of the SSSOM/RDF specification.
@gouttegd there are a few parts of the control flow that I think could be improved. Do you mind if I push some updates directly to your branch? Otherwise I'll PR it to your branch.
```python
def from_rdf(self, obj: Node) -> str:
    """Convert an RDF node into an entity reference value."""
    value = super().from_rdf(obj)
    return self.ccp().compress(value, passthrough=True)
```
I consider `passthrough=True` a bandage for end-users of curies who have very poorly formatted data and Just Want It To Work™. I think we should more explicitly handle what happens when the value either (1) isn't a valid URI or (2) can't be compressed for some reason based on the converter. I would prefer to throw exceptions in both scenarios; they could then be caught by calling code, which can decide whether to be strict (bubbling them up) or lenient (swallowing them).
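As a toy illustration of the strict behaviour being proposed (this is not the curies API, just a hand-rolled sketch using a plain prefix map; the names `strict_compress` and the error messages are invented for the example):

```python
def strict_compress(prefix_map: dict[str, str], uri: str) -> str:
    """Hypothetical strict compressor: raise instead of passing through.

    Raises ValueError both when the input is not a valid URI and when
    no known prefix can compress it, so the caller decides how lenient
    to be by catching (or not catching) the exception.
    """
    if not uri.startswith(("http://", "https://", "urn:")):
        raise ValueError(f"not a valid URI: {uri!r}")
    for prefix, expansion in prefix_map.items():
        if uri.startswith(expansion):
            return f"{prefix}:{uri[len(expansion):]}"
    raise ValueError(f"no known prefix can compress {uri!r}")

prefix_map = {"skos": "http://www.w3.org/2004/02/skos/core#"}
print(strict_compress(prefix_map, "http://www.w3.org/2004/02/skos/core#exactMatch"))
# skos:exactMatch
```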
In principle I agree, and in fact I initially did not use passthrough=True, but then this caused the test suite to fail, because some of the very sets that we use in our own test suite apparently contain poorly formatted data – which was accepted without errors or warnings by the current RDF parser and writer.
I think the new code should be, at least initially, as tolerant as the old one. If the new code refuses a set that the old one let pass, I’d consider that a bug of the new code (and this is why I had to use passthrough=True).
I wouldn’t be opposed to making the code more strict, but that would belong to another PR IMHO.
I wouldn’t be opposed to making the code more strict, but that would belong to another PR IMHO.
To be clear: I would in fact be very much in favour of making the code more strict. But:
- This would break existing code or sets and must therefore be considered carefully (I know SSSOM-Py is still in 0.x and therefore we are allowed to break things as much as we want; doesn’t mean we should do it lightly).
- This would require changes to more than just the RDF parser/writer.
Two reasons why this would belong to a distinct, dedicated PR.
I agree, better handle in dedicated PR
```python
def __init__(self) -> None:
    """Create a new instance."""
    self.constructors = {
```
this can be a module-level variable
I like to avoid module-level variables if possible. Such a variable is initialised as soon as the module is loaded.

Suppose you are doing this:

```python
from sssom.parsers import parse_sssom_table
from sssom.writers import write_json

msdf = parse_sssom_table("myset.sssom.tsv")
write_json(msdf, "myset.sssom.json")
```

Since you are loading the sssom.parsers module, you are also loading the sssom.rdf_internal module, and initialising any module-level variables in that module – even though you never even intended to do anything related to RDF.
I don't think that's a legitimate concern, given it's a small dictionary with elements that are already interpreted at import time anyway. But if you're insistent, you can always wrap it in a function call like:

```python
from functools import lru_cache

@lru_cache
def _get_my_data():
    return {}
```

The point being that this class could just be the create function all by itself.
Err, as a matter of fact I do mind. I’d like to see your proposed updates before they end up on my branch. Sorry, but distributed revision control systems like Git were invented for a reason after all. :)
```python
# Conversion from RDF
#

def msdf_from_rdf(
```
I think this function should either be module level, or a classmethod that constructs a MappingSetRDFConverter() inside it
This would make the MappingSetRDFConverter class a “one-shot” class: you create an instance to convert one graph into a MSDF (or one MSDF into a graph), and then it is spent; if you want to convert another object you must create another instance.

Yes, for the sssom.parsers.sssom_from_rdf and sssom.writers.to_rdf_graph functions within SSSOM-Py this is all we need, but I’d rather keep the MappingSetRDFConverter class reusable for multiple conversions for those who might need that.

No objection to adding a module-level msdf_from_rdf function that does nothing but create an instance and immediately call the msdf_from_rdf method on it (that’s basically what sssom.parsers.sssom_from_rdf is already doing), as a convenience for those who don’t want to bother creating an instance themselves.
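A sketch of that module-level convenience function (with a minimal stand-in for the converter class, since the real one lives in sssom.rdf_internal; the stand-in's behaviour is invented for the example):

```python
class MappingSetRDFConverter:
    """Stand-in for the real converter class, for illustration only."""

    def msdf_from_rdf(self, graph):
        # The real method walks an rdflib Graph and builds a
        # MappingSetDataFrame; here we just tag the input.
        return f"msdf<{graph}>"

def msdf_from_rdf(graph):
    """Module-level convenience: create a converter, use it once."""
    return MappingSetRDFConverter().msdf_from_rdf(graph)

assert msdf_from_rdf("my-graph") == "msdf<my-graph>"
```

The class stays reusable for callers who keep an instance around; the wrapper only trades that reusability for brevity in the one-shot case.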
Apply fixes suggested by C.T. Hoyt:

- Remove `object` as an explicit superclass.
- Apply the standard convention for naming of type aliases.
- Fix an incorrect type hint in the constructor of ObjectConverter.
matentzn left a comment
Sorry it took so long to look at this.
I have reviewed almost everything apart from src/sssom/rdf_internal.py (I scanned it from an architectural viewpoint). This all looks clean, well designed and performant!
```python
    :returns: True if `name` is the name of a valid extension,
        otherwise False.
    """
    # FIXME: Not implemented yet, ignore all for now
```
I assume you want a sep. PR for this. In the meantime, maybe logging.warning()?
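A minimal sketch of that interim behaviour (the function name and message are illustrative, not the actual sssom.rdf_internal code):

```python
import logging

logger = logging.getLogger(__name__)

def is_extension_slot(name: str) -> bool:
    """Interim stub: recognise no extension slots, but say so out loud."""
    # Warn instead of silently dropping the slot, so users know data is
    # being ignored until extension support lands in a dedicated PR.
    logger.warning("Extension slot %r is not supported yet; ignoring it", name)
    return False

assert is_extension_slot("foo") is False
```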
I assume you want a sep. PR for this
Yes.
First because extension slots are completely unsupported throughout SSSOM-Py anyway, so not supporting them in RDF import/export is not a big deal.
Second (and most importantly), since extension slots are currently completely unsupported, I think we need to discuss first about how we are going to support them in the rest of SSSOM-Py – in particular, how they are going to be represented within a MSDF.
For now, I am assuming that extension slots will be represented in the most direct way possible:

- a foo extension slot at the mapping set level will be represented as a foo key in the metadata part of the MSDF;
- a bar extension slot at the mapping level will be represented as a bar column in the data frame part of the MSDF.
I don’t think it needs to be more complicated than that, and it is consistent with the way the other (non-extension) slots are represented.
Still, this needs discussion, if only because it will require updating (or, in fact, completely rewriting from scratch) the existing parsers and writers (apart from the RDF ones) so that they can support extensions – for now, they cannot do that, since they rely heavily on LinkML’s dumpers and loaders, which do not allow the presence of unexpected slots.
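Under that assumption, a hypothetical MSDF carrying extension slots would look like this (a pandas sketch; the "foo"/"bar" extensions and all the values are invented, and none of this is implemented yet):

```python
import pandas as pd

# A set-level "foo" extension slot as a plain key in the metadata part...
metadata = {
    "mapping_set_id": "https://example.org/sets/demo",
    "foo": "set-level extension value",  # extension slot, not in the schema
}

# ...and a mapping-level "bar" extension slot as an extra column in the
# data frame part, sitting next to the standard slots.
df = pd.DataFrame([
    {
        "subject_id": "A:0001",
        "predicate_id": "skos:exactMatch",
        "object_id": "B:0001",
        "mapping_justification": "semapv:ManualMappingCuration",
        "bar": "mapping-level extension value",  # extension column
    }
])

assert "foo" in metadata
assert "bar" in df.columns
```

This mirrors how the non-extension slots are already stored, which is what makes it the least surprising representation.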
Extension slots are currently not supported throughout SSSOM-Py, but we should at least emit a warning when we are ignoring them.
Avoid one- or two-letter argument names (`g`, `cc`). Use longer, more explicit names instead.
I've made a PR onto this PR (gouttegd#3) that improves the control flow.
Commenting here to keep the discussion in a single place.

Your PR “improves” the control flow by making the MappingSetRDFConverter class single-use. The class is designed so that you can do something like this:

```python
from rdflib import Graph
from sssom.rdf_internal import MappingSetRDFConverter

# Creating a converter object once and for all
converter = MappingSetRDFConverter()

# Extracting a set from a graph
msdf1 = converter.msdf_from_rdf(Graph().parse("my-set-1.ttl"))

# Extracting another set from another graph
msdf2 = converter.msdf_from_rdf(Graph().parse("my-set-2.ttl"))

# Extracting yet another set from yet another graph
msdf3 = converter.msdf_from_rdf(Graph().parse("my-set-3.ttl"))

# and so on...
```

and likewise for RDF export. The “improved” version requires instead creating a whole new converter for every single set to extract:

```python
g1 = Graph().parse("my-set-1.ttl")
msdf1 = MappingSetRDFConverter(Converter.from_rdflib(g1)).msdf_from_rdf(g1)

g2 = Graph().parse("my-set-2.ttl")
msdf2 = MappingSetRDFConverter(Converter.from_rdflib(g2)).msdf_from_rdf(g2)

# and so on...
```

And then there is no longer any real point in having a converter class at all. At the very least, such a design would absolutely need a clear warning in the documentation of the class.
@gouttegd from your perspective this is ready to merge, right?
Assuming there is agreement on the overall design, yes.

If there is no agreement on the design: I personally think supporting the RDF serialisation as described in the spec is more important than the exact design used to achieve that aim. I am not wedded to my initial design and could consider changing it, but not for the design proposed in gouttegd#3, at least not in its current state.
matentzn left a comment
I have considered the design again; it's unfortunate we lost @cthoyt on this one, but I agree with keeping the serialiser class reusable across transformations rather than tying the constructor to a specific converter. I am not entirely sure about the ConverterProvider, but then again I am the worst coder among us, so my judgment counts for little.
THANK YOU @gouttegd for pushing this, I would like to move forward with this whenever you are ready!
This PR implements option “A” mentioned in #628. That is, it replaces the LinkML-based code for serialising/deserialising to/from RDF with completely custom code that implements the new SSSOM/RDF spec exactly as we want it.
All the new code is contained within the new `sssom.rdf_internal` module, which as the name implies is not really supposed to be used outside of SSSOM-Py (though of course people can still do whatever they want!). The public interface to read from RDF is still the `from_sssom_rdf` method in `sssom.parsers`, and the public interface to write to RDF is still the `to_rdf_graph` method in `sssom.writers`. Both methods are rewritten to use the MappingSetRDFConverter class of the `sssom.rdf_internal` module, instead of methods from the LinkML runtime.

Closes #628