
Conversation

@pragya16067

Support added for parsing RDF* graphs:
PR for Issue 955 (#955)
Changes:

  • Additional code added to the RDF (TTL) parser, notation3.py, to convert any embedded triples in RDF* format into traditional reification statements (a sketch of the mapping follows this list).
  • Tested on graphs with multiple embedded statements, with test examples added covering RDF* embedded triples appearing as subject, as object, as both subject and object, and as recursively embedded triples.
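
For illustration, a minimal sketch of the conversion (not the patched parser itself): an embedded triple such as << :s :p :o >> is replaced by a blank node carrying standard rdf:Statement reification triples. The EX namespace here is invented for the example.

    from rdflib import Graph, BNode, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")
    g = Graph()

    # RDF* input, conceptually:  << EX.s EX.p EX.o >> EX.source EX.doc .
    # The conversion substitutes a blank node for the embedded triple ...
    stmt = BNode()
    g.add((stmt, RDF.type, RDF.Statement))
    g.add((stmt, RDF.subject, EX.s))
    g.add((stmt, RDF.predicate, EX.p))
    g.add((stmt, RDF.object, EX.o))
    # ... and asserts the outer triple against that node.
    g.add((stmt, EX.source, EX.doc))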

pragya16067 mentioned this pull request Jun 4, 2020
@nicholascar
Member

@JervenBolleman, could you please review this?

@JervenBolleman
Contributor

@nicholascar I think this is fine conceptually.

However, I would like to see a discussion among the maintainers about the impact of this.

For now, TTL and N3 do not allow RDF*. So if we add it directly where the code is modified, the existing TTL parser can no longer be used to verify that a file is valid TTL, because TTL* input would also pass.

I suspect that this new logic needs to be configurable, so that TTL or TTL* support can be passed in as an option.
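
For concreteness, a minimal sketch of what such an opt-in could look like from the caller's side. The rdfstar keyword is hypothetical, not an existing rdflib option; rdflib does forward extra keyword arguments from Graph.parse to the parser, so a patched notation3.py could inspect such a flag.

    from rdflib import Graph

    ttl_star = """
    @prefix : <http://example.org/> .
    << :alice :knows :bob >> :statedBy :carol .
    """

    g = Graph()
    # Hypothetical opt-in flag; without it, format="ttl" would keep
    # rejecting RDF* syntax, preserving strict TTL validation.
    g.parse(data=ttl_star, format="ttl", rdfstar=True)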

@ghost

ghost commented Jun 16, 2022

Closing because both stale and incomplete

ghost closed this Jun 16, 2022
@blokhin

blokhin commented Jun 16, 2022

Is this something to be addressed in the future, or is there no real interest in that?

@ghost

ghost commented Jun 16, 2022

Is this something to be addressed in the future?

Yes

or is there no real interest in that?

Far from it - https://github.com/RDFLib/rdflib-rdfstar

And see aucampia's comments on issue #1990 about parsing RDF Star.

I had a go at rebasing the work in this PR and exercised it with the W3C's RDF Star test suite, which illustrated in no uncertain terms just how much was left to do. The other approach (referenced above) uses the lark parser as a foundation, so it is likely to be a much more coherent and complete solution.

Just because you asked (serves you right 😄) ... to satisfy my own idle curiosity, I've been exploring Nicholas Pilon and Gavin Carothers' lark-based implementation of a Turtle parser (pymantic) as a potential RDFLib plug-in implementation.

I casually adapted it to parse into an RDFLib graph by redefining it as an RDFLib Parser ...

--- support/pymantic/pymantic/parsers/lark/turtle.py	2022-04-04 20:38:48.990139556 +0100
+++ rdflib-plugins/plugins/parsers/larkturtle.py	2022-06-09 14:30:15.114570685 +0100
@@ -2,7 +2,7 @@
 
 Usage::
 
-  from pymantic.parsers.lark import turtle_parser
+  from plugins.pymantic.parsers.lark import turtle_parser
   graph = turtle_parser.parse(io.open('a_file.ttl', mode='rt'))
   graph2 = turtle_parser.parse(\"\"\"@prefix p: <http://a.example/s>.
   p: <http://a.example/p> <http://a.example/o> .\"\"\")
@@ -12,8 +12,7 @@
 will be read into memory and parsed there.
 """
 
-from __future__ import unicode_literals
-
+import io
 import re
 
 from lark import (
@@ -21,23 +20,32 @@
     Transformer,
     Tree,
 )
+
 from lark.lexer import (
     Token,
 )
 
-from pymantic.compat import (
-    binary_type,
-)
-from pymantic.parsers.base import (
+from plugins.parsers.base import (
     BaseParser,
 )
-from pymantic.primitives import (
+from plugins.parsers.primitives import (
     BlankNode,
     Literal,
     NamedNode,
     Triple,
 )
-from pymantic.util import (
+from rdflib.parser import Parser, FileInputSource, StringInputSource
+
+from rdflib import logger
+import rdflib
+
+# from rdflib import (
+#     BNode,
+#     Literal,
+#     URIRef,
+# )
+
+from plugins.parsers.util import (
     grouper,
     smart_urljoin,
     decode_literal,
@@ -305,28 +313,56 @@
                     yield triple
 
 
-def parse(string_or_stream, graph=None, base=""):
-    if hasattr(string_or_stream, "readline"):
-        string = string_or_stream.read()
-    else:
-        # Presume string.
-        string = string_or_stream
-
-    if isinstance(string_or_stream, binary_type):
-        string = string_or_stream.decode("utf-8")
-    else:
-        string = string_or_stream
-
-    tree = turtle_lark.parse(string)
-    tr = TurtleTransformer(base_iri=base)
-    if graph is None:
-        graph = tr._make_graph()
-    tr._prepare_parse(graph)
+class LarkTurtleParser(Parser):
+    format = None
 
-    graph.addAll(tr.transform(tree))
+    def __init__(self):
+        pass
 
-    return graph
+    def parse(self, string_or_stream, graph=None, base=""):
+        if hasattr(string_or_stream, "readline"):
+            string = string_or_stream.read()
+        else:
+            # Presume string.
+            string = string_or_stream
+
+        if isinstance(string_or_stream, bytes):
+            string = string_or_stream.decode("utf-8")
+        else:
+            string = string_or_stream
+
+        # TODO: stringify the remaining input sources
+        if isinstance(string, FileInputSource):
+            string = string.file.read().decode()
+        elif isinstance(string, StringInputSource):
+            string = string.getCharacterStream().read()
+
+        tree = turtle_lark.parse(string)
+        tr = TurtleTransformer(base_iri=base)
+        if graph is None:
+            graph = tr._make_graph()
+        tr._prepare_parse(graph)
+
+        for p, n in tr.prefixes.items():
+            logger.debug(f"ADDING {p} {n}")
+            graph.bind(p, n)
+
+        for (s, p, o) in tr.transform(tree):
+            triple = [s, p, o]
+            for pos, term in enumerate(triple):
+                if isinstance(term, NamedNode):
+                    triple[pos] = rdflib.URIRef(str(term))
+                elif isinstance(term, BlankNode):
+                    triple[pos] = rdflib.BNode(str(term))
+                elif isinstance(term, Literal):
+                    triple[pos] = rdflib.Literal(
+                        term.value, lang=term.language, datatype=term.datatype
+                    )
+                else:
+                    raise Exception(f"What is term {term} ({type(term)})")
+            graph.add(triple)
 
+        return graph
 
-def parse_string(string_or_bytes, graph=None, base=""):
-    return parse(string_or_bytes, graph, base)
+    def parse_string(string_or_bytes, graph=None, base=""):
+        return parse(string_or_bytes, graph, base)
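
Assuming the module layout from the diff above (work in progress, so treat the paths and names as illustrative), wiring this in is then just a matter of registering the class with rdflib's plugin machinery:

    import rdflib
    from rdflib.parser import Parser

    # Register the adapted lark-based parser under a custom format name.
    rdflib.plugin.register(
        "larkttl", Parser, "plugins.parsers.larkturtle", "LarkTurtleParser"
    )

    g = rdflib.Graph()
    g.parse(
        data="<http://a.example/s> <http://a.example/p> <http://a.example/o> .",
        format="larkttl",
    )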

One useful attribute of the lark parser is the ability to use EBNF grammar descriptions to drive the parsing. So it was reasonably straightforward to adapt it further to handle RDF Star syntax by adding the RDF Star extensions to the Turtle EBNF grammar (kindly provided by the W3C) and adding code to handle quoted triples and quoted subjects (handling quoted objects is still to do) ...

--- rdflib-plugins/plugins/parsers/larkturtle.py	2022-06-09 14:30:15.114570685 +0100
+++ rdflib-plugins/plugins/parsers/larkturtlestar.py	2022-06-10 06:56:11.772314867 +0100
@@ -13,6 +13,7 @@
 """
 
 import io
+from pprint import pformat
 import re
 
 from lark import (
@@ -70,14 +71,13 @@
 base: BASE_DIRECTIVE IRIREF "."
 sparql_base: /BASE/i IRIREF
 sparql_prefix: /PREFIX/i PNAME_NS IRIREF
-triples: subject predicate_object_list
-       | blank_node_property_list predicate_object_list?
+triples: subject predicate_object_list | blank_node_property_list predicate_object_list?
 predicate_object_list: verb object_list (";" (verb object_list)?)*
-?object_list: object ("," object)*
+?object_list: object annotation? ("," object annotation? )*
 ?verb: predicate | /a/
-?subject: iri | blank_node | collection
+?subject: iri | blank_node | collection | quoted_triple
 ?predicate: iri
-?object: iri | blank_node | collection | blank_node_property_list | literal
+?object: iri | blank_node | collection | blank_node_property_list | literal | quoted_triple
 ?literal: rdf_literal | numeric_literal | boolean_literal
 blank_node_property_list: "[" predicate_object_list "]"
 collection: "(" object* ")"
@@ -91,6 +91,10 @@
 iri: IRIREF | prefixed_name
 prefixed_name: PNAME_LN | PNAME_NS
 blank_node: BLANK_NODE_LABEL | ANON
+quoted_triple: "<<" quote_subject verb quote_object ">>"
+quote_subject: iri | blank_node | quoted_triple
+quote_object : iri | blank_node | literal | quoted_triple
+annotation : "{\x7C" predicate_object_list "\x7C}"
 
 BASE_DIRECTIVE: "@base"
 IRIREF: "<" (/[^\x00-\x20<>"{}|^`\\]/ | UCHAR)* ">"
@@ -103,8 +107,8 @@
 DOUBLE: /[+-]?/ (/[0-9]+/ "." /[0-9]*/ EXPONENT
       | "." /[0-9]+/ EXPONENT | /[0-9]+/ EXPONENT)
 EXPONENT: /[eE][+-]?[0-9]+/
-STRING_LITERAL_QUOTE: "\"" (/[^\x22\\\x0A\x0D]/ | ECHAR | UCHAR)* "\""
-STRING_LITERAL_SINGLE_QUOTE: "'" (/[^\x27\\\x0A\x0D]/ | ECHAR | UCHAR)* "'"
+STRING_LITERAL_QUOTE: "\"" (/[^\x22\x5C\x0A\x0D]/ | ECHAR | UCHAR)* "\""
+STRING_LITERAL_SINGLE_QUOTE: "'" (/[^\x27\x5C\x0A\x0D]/ | ECHAR | UCHAR)* "'"
 STRING_LITERAL_LONG_SINGLE_QUOTE: "'''" (/'|''/? (/[^'\\]/ | ECHAR | UCHAR))* "'''"
 STRING_LITERAL_LONG_QUOTE: "\"\"\"" (/"|""/? (/[^"\\]/ | ECHAR | UCHAR))* "\"\"\""
 UCHAR: "\\u" HEX~4 | "\\U" HEX~8
@@ -122,11 +126,11 @@
 PN_LOCAL_ESC: "\\" /[_~\.\-!$&'()*+,;=\/?#@%]/
 
 %ignore WS
-COMMENT: "#" /[^\n]/*
+COMMENT: "#" /[^\r\n]/*
 %ignore COMMENT
 """
 
-turtle_lark = Lark(grammar, start="turtle_doc", parser="lalr")
+turtle_star_lark = Lark(grammar, start="turtle_doc", parser="lalr")
 
 
 LEGAL_IRI = re.compile(r'^[^\x00-\x20<>"{}|^`\\]*$')
@@ -166,7 +170,7 @@
             yield Triple(subject, predicate, object_)
 
 
-class TurtleTransformer(BaseParser, Transformer):
+class TurtleStarTransformer(BaseParser, Transformer):
     def __init__(self, base_iri=""):
         super().__init__()
         self.base_iri = base_iri
@@ -189,14 +193,54 @@
         return children
 
     def triples(self, children):
-        if len(children) == 2:
-            subject = children[0]
-            for triple in unpack_predicate_object_list(subject, children[1]):
+        logger.debug(f"TRIPLES:\n\n{children} {len(children)} {type(children)}\n\n")
+        if not isinstance(children[0], (NamedNode, BlankNode)):
+            logger.debug(f"UNPACKING QUOTEDTRIPLE {children[0]}, {children[1]}")
+            qres = [triple for triple in children[0]]
+            subject = qres[0][0]
+            res = [
+                triple for triple in unpack_predicate_object_list(subject, children[1])
+            ]
+            for triple in qres + res:
                 yield triple
-        elif len(children) == 1:
-            for triple_or_node in children[0]:
-                if isinstance(triple_or_node, Triple):
-                    yield triple_or_node
+        else:
+            if len(children) == 2:
+                subject = children[0]
+                logger.debug(f"UNPACKING PREDOBJ {subject}, {children[1]}")
+                for triple in unpack_predicate_object_list(subject, children[1]):
+                    yield triple
+            elif len(children) == 1:
+                # logger.debug(f"UNPACKING CHILDREN")
+                for triple_or_node in children[0]:
+                    if isinstance(triple_or_node, Triple):
+                        yield triple_or_node
+
+    def quoted_triple(self, children):
+        # logger.debug(f"QUOTEDTRIPLE:\n{children}")
+        quoted_statement_id = self.make_blank_node()
+
+        children += [NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement")]
+
+        preds = [
+            NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#subject"),
+            NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate"),
+            NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#object"),
+            NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
+        ]
+        for pos, term in enumerate(children):
+            yield Triple(quoted_statement_id, preds[pos], term)
+
+    def quote_subject(self, children):
+        logger.debug(f"QUOTEDSUBJECT {children}")
+        # logger.debug(f"Returning subject {repr(children[0])}")
+        return children[0]
+
+    def quote_object(self, children):
+        logger.debug(f"QUOTEDOBJECT {children}")
+        if len(children) == 1 and isinstance(children[0], Literal):
+            # logger.debug(f"Returning Literal {children}")
+            return self.rdf_literal(children[0])
+        return children[0]
 
     def prefixed_name(self, children):
         (pname,) = children
@@ -249,13 +293,22 @@
         for value in reversed(children):
             this_bn = self.make_blank_node()
             if not isinstance(value, (NamedNode, Literal, BlankNode)):
+                logger.debug(f"COLLECTION0 {repr(value)}")
                 for triple_or_node in value:
                     if isinstance(triple_or_node, Triple):
+                        logger.debug(f"COLLECTION YIELDING {repr(triple_or_node)}")
                         yield triple_or_node
                     else:
+                        logger.debug(
+                            f"COLLECTION SETTING value to {repr(triple_or_node)}"
+                        )
                         value = triple_or_node
                         break
-            yield self.make_triple(this_bn, RDF_FIRST, value)
+            logger.debug(f"COLLECTION2 {repr(value)}")
+            if not isinstance(value, (NamedNode, Literal, BlankNode)):
+                yield self.make_triple(this_bn, RDF_FIRST, this_bn)
+            else:
+                yield self.make_triple(this_bn, RDF_FIRST, value)
             yield self.make_triple(this_bn, RDF_REST, prev_node)
             prev_node = this_bn
 
@@ -313,7 +366,7 @@
                     yield triple
 
 
-class LarkTurtleParser(Parser):
+class LarkTurtleStarParser(Parser):
     format = None
 
     def __init__(self):
@@ -337,31 +390,36 @@
         elif isinstance(string, StringInputSource):
             string = string.getCharacterStream().read()
 
-        tree = turtle_lark.parse(string)
-        tr = TurtleTransformer(base_iri=base)
+        tree = turtle_star_lark.parse(string)
+        tr = TurtleStarTransformer(base_iri=base)
         if graph is None:
             graph = tr._make_graph()
         tr._prepare_parse(graph)
 
+        transformed_tree = list(tr.transform(tree))
+
+        logger.debug(f"TRANSFORMTREE\n{pformat(transformed_tree, width=120)}")
+
+        # for (s, p, o) in tr.transform(tree):
+        # for (s, p, o) in transformed_tree:
+        #     triple = [s, p, o]
+        for triple in tr.transform(tree):
+            if isinstance(triple, Triple):
+                _triple = [triple[0], triple[1], triple[2]]
+                for pos, term in enumerate(_triple):
+                    if isinstance(term, NamedNode):
+                        _triple[pos] = rdflib.URIRef(str(term))
+                    elif isinstance(term, BlankNode):
+                        _triple[pos] = rdflib.BNode(str(term))
+                    elif isinstance(term, Literal):
+                        _triple[pos] = rdflib.Literal(
+                            term.value, lang=term.language, datatype=term.datatype
+                        )
+                graph.add(_triple)
+
         for p, n in tr.prefixes.items():
-            logger.debug(f"ADDING {p} {n}")
             graph.bind(p, n)
 
-        for (s, p, o) in tr.transform(tree):
-            triple = [s, p, o]
-            for pos, term in enumerate(triple):
-                if isinstance(term, NamedNode):
-                    triple[pos] = rdflib.URIRef(str(term))
-                elif isinstance(term, BlankNode):
-                    triple[pos] = rdflib.BNode(str(term))
-                elif isinstance(term, Literal):
-                    triple[pos] = rdflib.Literal(
-                        term.value, lang=term.language, datatype=term.datatype
-                    )
-                else:
-                    raise Exception(f"What is term {term} ({type(term)})")
-            graph.add(triple)
-
         return graph
 
     def parse_string(string_or_bytes, graph=None, base=""):
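
Registration of the star variant is analogous. Given the reification mapping in quoted_triple above, a quoted triple should surface as rdf:subject/rdf:predicate/rdf:object/rdf:type statements on a blank node, plus the outer asserting triple. Again a sketch against work-in-progress code, so it may not run as-is:

    import rdflib
    from rdflib.parser import Parser

    rdflib.plugin.register(
        "larkttlstar", Parser, "plugins.parsers.larkturtlestar", "LarkTurtleStarParser"
    )

    data = "<< <http://a/s> <http://a/p> <http://a/o> >> <http://a/q> <http://a/z> ."
    g = rdflib.Graph()
    g.parse(data=data, format="larkttlstar")
    for triple in g:
        print(triple)  # expect the four reification triples plus the outer one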

It's decidedly messy (and, at the moment, incorrect) work in progress, but I hope it illustrates the kind of things that need to happen in order to extend the parsers to handle RDF* syntax. At this stage of the exploration, this more principled approach does seem reasonably promising.

This pull request was closed.