[PROPOSAL] New Parser and Serializer interface #2897
Comments
Sounds similar to what the folks in the JavaScript RDF ecosystem have standardised on. There's a bunch of specs listed at https://rdf.js.org/, including quad stream, source, sink, and store interfaces. |
Thanks for the link @edmondchuc. I wasn't aware of the work at https://rdf.js.org/ but it makes sense they'd be up to date with the latest new implementation patterns. |
Do you envision this overhaul would give access to a streaming interface to graphs stored in files ( Also, if I missed that such an interface is currently available, I welcome a pointer. |
Yes, that is one of the benefits it would provide for parsers like And you're correct, none of the parser implementations currently in RDFLib provide an interface like that. |
Thank you for clarifying. On the JSON bits you mentioned: theoretically, if context dictionaries could be guaranteed (or agreed upon) to always come first in JSON objects, could a streaming interface be provided for JSON-LD too, using an iterative JSON parser? Is there a name for such a constrained-JSON version of JSON-LD? |
Yes, that's right. If we switched to using an iterative JSON parser, and the context is expected/guaranteed to be the first object in the document, then it would be possible. However, I believe it would still be restricted to JSON-LD v1.0, because v1.1 allows additional embedded contexts deeper in the document. |
I did have the additional embedded contexts twist in mind when I asked if the context dictionary came first in JSON Objects. I'd meant that with tolerance for whatever nesting level. I suppose that specific sort-constrained variant of JSON-LD doesn't have a name, though? Or, a callout in some canonicalization process?
RFC 8785, Section 3.2.3, which canonicalizes JSON (not JSON-LD), has a sort order prescription for object keys. It unfortunately would sort keys starting with a number before any
{
  "0d-boundary-points": [{"@id": "60a0ba8a-0fd0-44bb-8d74-fe926e5d7b0b", "@type": "Point"}],
  "1d-boundary-lines": [{"@id": "c2af7e4b-bcc4-4d28-842e-fb864374b90a", "@type": "Line"}],
  "2d-boundary-surfaces": [{"@id": "fd592932-3f4e-4c2f-8a1e-1bb335228666", "@type": "Surface"}],
  "@context": {
    "@base": "http://example.org/kb/",
    "0d-boundary-points": "http://example.org/ontology/0d-boundary-points",
    "1d-boundary-lines": "http://example.org/ontology/1d-boundary-lines",
    "2d-boundary-surfaces": "http://example.org/ontology/2d-boundary-surfaces",
    "Line": "http://example.org/ontology/Line",
    "Point": "http://example.org/ontology/Point",
    "Surface": "http://example.org/ontology/Surface",
    "label": "http://www.w3.org/2000/01/rdf-schema#label"
  },
  "@id": "4190cd0b-0cee-4b72-a5f5-8a247a76d428",
  "label": "A spatial thing"
}
(That graph does parse like I expect it to; see below for N-Triples form.)
N-Triples render of JSON-LD snippet
Rendered using [1] the command
<http://example.org/kb/4190cd0b-0cee-4b72-a5f5-8a247a76d428> <http://example.org/ontology/0d-boundary-points> <http://example.org/kb/60a0ba8a-0fd0-44bb-8d74-fe926e5d7b0b> .
<http://example.org/kb/4190cd0b-0cee-4b72-a5f5-8a247a76d428> <http://example.org/ontology/1d-boundary-lines> <http://example.org/kb/c2af7e4b-bcc4-4d28-842e-fb864374b90a> .
<http://example.org/kb/4190cd0b-0cee-4b72-a5f5-8a247a76d428> <http://example.org/ontology/2d-boundary-surfaces> <http://example.org/kb/fd592932-3f4e-4c2f-8a1e-1bb335228666> .
<http://example.org/kb/4190cd0b-0cee-4b72-a5f5-8a247a76d428> <http://www.w3.org/2000/01/rdf-schema#label> "A spatial thing" .
<http://example.org/kb/60a0ba8a-0fd0-44bb-8d74-fe926e5d7b0b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/ontology/Point> .
<http://example.org/kb/c2af7e4b-bcc4-4d28-842e-fb864374b90a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/ontology/Line> .
<http://example.org/kb/fd592932-3f4e-4c2f-8a1e-1bb335228666> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/ontology/Surface> .
(I don't recall if there is a restriction on IRI local names preventing leading digits, but Turtle's grammar, particularly the
So, it seems to me that even if JSON-LD were passed as canonicalized JSON, there would still need to be some buffering logic in a "streaming" JSON-LD parser to hold graph portions that need to wait for context dictionaries at their object nesting level. And, this streaming design faces some unfortunate cases if properties can be serialized with leading digits.
Footnotes
|
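The ordering and buffering concerns raised in the comment above can be illustrated with a small sketch. It assumes only the Python standard library; `rfc8785_key_order` and `emit_after_context` are hypothetical names made up for this illustration, not part of RDFLib or of any JSON-LD library. The first function sorts keys by UTF-16 code units, which is the ordering RFC 8785 prescribes for property names. The second shows one possible shape for the buffering: members of a single JSON object are held back until that object's "@context" has been seen.

```python
from typing import Any, Iterable, Iterator, List, Optional, Tuple

Member = Tuple[str, Any]


def rfc8785_key_order(keys: Iterable[str]) -> List[str]:
    """Sort property names by UTF-16 code units (RFC 8785 ordering).

    Comparing the big-endian UTF-16 encoding byte by byte is equivalent
    to comparing the code units themselves.
    """
    return sorted(keys, key=lambda k: k.encode("utf-16-be"))


def emit_after_context(
    members: Iterable[Member],
) -> Iterator[Tuple[str, Any, Optional[Any]]]:
    """Yield (key, value, context) for the members of ONE JSON object.

    Hypothetical helper: members that arrive before "@context" are
    buffered and released as soon as the context is seen. If the object
    has no "@context" at all, the buffer is only flushed at the end of
    the object, paired with None (a real parser would use the inherited
    context instead).
    """
    buffered: List[Member] = []
    context: Optional[Any] = None
    context_seen = False
    for key, value in members:
        if key == "@context":
            context = value
            context_seen = True
            # Release everything that was waiting for this context.
            for buffered_key, buffered_value in buffered:
                yield buffered_key, buffered_value, context
            buffered.clear()
        elif context_seen:
            # Context already known: this member can stream straight through.
            yield key, value, context
        else:
            buffered.append((key, value))
    # End of object reached without a local context.
    for buffered_key, buffered_value in buffered:
        yield buffered_key, buffered_value, None
```

Applied to the top-level keys of the JSON-LD snippet above, `rfc8785_key_order` places the three digit-leading keys ahead of "@context", so in that document a streaming parser would have to buffer all three boundary arrays before it could expand them.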
It has been on my mind for a couple of years that the existing Parser and Serializer interface is a mess.
The current standard Parser interface predates multigraph support (from before Dataset and even before ConjunctiveGraph), so there is a terrible workaround: build a Graph instance using the store backend of a multigraph and pass that to the parser; the parser then creates a new ConjunctiveGraph instance internally using the same store backend as the Graph, and parses into that.
The current interface predates unicode support in Python. All parsers and serializers originally consumed or emitted a ByteStream. Unicode support was later hacked into some but not all parsers and serializers, for use in Python 2.6. When Python 3 changed unicode to str, and old-str to bytes, this complicated things a whole bunch, and the parser/serializer interface still hasn't recovered.
There has been some recent work done to rectify some of these problems (and fix some very long-standing bugs) that will be released as part of RDFLib v7.1.0, but it's hard to make many drastic changes without introducing breaking changes to the Parser/Serializer interface.
Recently I had the privilege of reading some of the Oxigraph source code (it's a very well-written and extremely high-performance Rust-based SPARQL engine with RDF Parser and Serializer support). I noticed a pattern in the Oxigraph source code that inspired some further thought about completely redesigning the RDFLib parser/serializer interface.
Oxigraph implements parsers as a Quad-source, and serializers as a Quad-sink.
Constructing a parser (giving it a file to parse) returns a generator object. Iterating the generator causes the parser to yield quads as it parses the file.
Similarly, invoking a serializer requires passing in an iterator over a set of quads.
The fun byproduct of this pattern is you can implement a format converter simply by piping a parser into a serializer. No Graph needed.
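As a rough illustration of the source/sink pattern described above, here is a minimal sketch using only the Python standard library. Everything in it is hypothetical: `Quad`, `parse_simple_quads`, `serialize_tsv`, and `convert` are names invented for this example and are neither RDFLib nor Oxigraph APIs. The input grammar is also a toy (whitespace-separated terms with a trailing dot, no literals containing spaces), not real N-Quads.

```python
import io
from typing import Iterable, Iterator, NamedTuple, Optional, TextIO


class Quad(NamedTuple):
    subject: str
    predicate: str
    object: str
    graph: Optional[str] = None


def parse_simple_quads(source: TextIO) -> Iterator[Quad]:
    """Quad source: lazily yield one Quad per non-empty input line.

    Toy grammar (an assumption for this sketch): three or four
    whitespace-separated terms followed by a terminating '.'.
    """
    for line in source:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not line.endswith("."):
            raise ValueError(f"missing terminating '.': {line!r}")
        terms = line[:-1].split()
        if len(terms) not in (3, 4):
            raise ValueError(f"expected 3 or 4 terms: {line!r}")
        yield Quad(*terms)


def serialize_tsv(quads: Iterable[Quad], sink: TextIO) -> int:
    """Quad sink: write each quad as a tab-separated line; return the count."""
    count = 0
    for quad in quads:
        sink.write("\t".join(term for term in quad if term is not None) + "\n")
        count += 1
    return count


def convert(source: TextIO, sink: TextIO) -> int:
    """Format converter: pipe the parser straight into the serializer."""
    return serialize_tsv(parse_simple_quads(source), sink)


if __name__ == "__main__":
    out = io.StringIO()
    convert(io.StringIO("<s> <p> <o> .\n<s> <p> <o2> <g> .\n"), out)
    print(out.getvalue(), end="")
```

Because the parser is a generator, `convert` holds only one quad in memory at a time, and no intermediate Graph is built.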
Python supports everything required to make this pattern work in RDFLib. Python even has Async Generators, so we could use this interface to support concurrent async parsing and serializing.
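The async variant could look roughly like the following sketch. It assumes only `asyncio` from the standard library; `parse_async`, `serialize_async`, and `convert_all` are hypothetical names for illustration, and `asyncio.sleep(0)` merely stands in for awaiting real file or network I/O.

```python
import asyncio
from typing import AsyncIterator, Iterable, List, Tuple

Quad = Tuple[str, str, str, str]


async def parse_async(lines: Iterable[str]) -> AsyncIterator[Quad]:
    """Async quad source: yield one quad per whitespace-separated line."""
    for line in lines:
        await asyncio.sleep(0)  # stand-in for awaiting I/O; yields control
        subject, predicate, obj, graph = line.split()
        yield (subject, predicate, obj, graph)


async def serialize_async(quads: AsyncIterator[Quad]) -> List[str]:
    """Async quad sink: consume quads as they arrive and render each as a line."""
    rendered: List[str] = []
    async for subject, predicate, obj, graph in quads:
        rendered.append(f"{subject} {predicate} {obj} {graph} .")
    return rendered


async def convert_all(documents: Iterable[Iterable[str]]) -> List[List[str]]:
    """Run one parse-to-serialize pipeline per document, concurrently."""
    pipelines = [serialize_async(parse_async(doc)) for doc in documents]
    return await asyncio.gather(*pipelines)


if __name__ == "__main__":
    docs = [["<s1> <p> <o> <g>"], ["<s2> <p> <o> <g>", "<s3> <p> <o> <g>"]]
    for lines in asyncio.run(convert_all(docs)):
        print(lines)
```

Each pipeline yields control at every `await`, so several documents can make progress on a single event loop while any one of them is waiting on I/O.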
This post is just the introduction to this idea. I want to write up some example code and see what is required to put together a new "RDFLib Serializer Interface Standard" and an "RDFLib Parser Interface Standard" that allow all parsers and serializers to be implemented in a common manner. I want to get feedback and suggestions from contributors and interested parties, so we can make this as useful as possible for everyone.