Support added for parsing RDF* Graphs #1111
Conversation
@JervenBolleman, could you please review this?
@nicholascar I think this is fine conceptually. However, I would like to see a discussion among the maintainers about its impact. For now, TTL and N3 do not allow RDF-star, so if we add it by modifying the existing code, you can no longer use the existing TTL parser to verify that a file is valid TTL, because TTL-star input would also pass. I suspect that this new logic needs to be configurable, so that TTL or TTL-star support can be passed in as an option.
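The configurable-mode idea suggested above can be sketched in a few lines. This is a hypothetical illustration, not RDFLib's actual API: the names TurtleMode and parse_turtle are invented here, and the "parser" is a stub. The point is only that strict mode must reject RDF-star tokens, so a successful strict parse still certifies the input as plain Turtle.

```python
# Hypothetical sketch of a strict-vs-star parsing option. "TurtleMode" and
# "parse_turtle" are illustrative names, not part of RDFLib.
from enum import Enum


class TurtleMode(Enum):
    STRICT = "turtle"      # reject RDF-star quoted triples
    STAR = "turtle-star"   # accept << s p o >> syntax


def parse_turtle(data: str, mode: TurtleMode = TurtleMode.STRICT) -> list:
    """Toy dispatcher: strict mode must fail on RDF-star tokens so that a
    successful parse still certifies the input as valid plain Turtle."""
    if mode is TurtleMode.STRICT and "<<" in data:
        raise ValueError("quoted triples are not valid in strict Turtle")
    # A real implementation would delegate to the appropriate parser here;
    # this stub just returns the non-empty lines.
    return [line for line in data.splitlines() if line.strip()]


star_doc = "<< <s> <p> <o> >> <saidBy> <alice> ."

try:
    parse_turtle(star_doc)  # strict by default: rejected
    strict_ok = True
except ValueError:
    strict_ok = False

star_lines = parse_turtle(star_doc, TurtleMode.STAR)  # accepted
```

With this shape, the existing strict parser remains a validator for plain TTL, while callers who want RDF-star opt in explicitly.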
Closing because both stale and incomplete.
Is this something to be addressed in the future? Or is there no real interest in it?
Yes.
Far from it: see https://github.com/RDFLib/rdflib-rdfstar and aucampia's comments on issue #1990 about parsing RDF-star. I had a go at rebasing the work in this PR and exercised it with the W3C's RDF-star test suite, which illustrated in no uncertain terms just how much there was left to do. The other approach (referenced above) uses the lark parser as a foundation, so it is likely to be a much more coherent and complete solution.

Just because you asked (serves you right 😄) ... to satisfy my own idle curiosity, I've been exploring Nicholas Pilon and Gavin Carothers' lark-based implementation of a Turtle parser as a potential RDFLib plug-in. I casually adapted it to parse into an RDFLib graph by redefining it as an RDFLib Parser subclass:

--- support/pymantic/pymantic/parsers/lark/turtle.py 2022-04-04 20:38:48.990139556 +0100
+++ rdflib-plugins/plugins/parsers/larkturtle.py 2022-06-09 14:30:15.114570685 +0100
@@ -2,7 +2,7 @@
Usage::
- from pymantic.parsers.lark import turtle_parser
+ from plugins.pymantic.parsers.lark import turtle_parser
graph = turtle_parser.parse(io.open('a_file.ttl', mode='rt'))
graph2 = turtle_parser.parse(\"\"\"@prefix p: <http://a.example/s>.
p: <http://a.example/p> <http://a.example/o> .\"\"\")
@@ -12,8 +12,7 @@
will be read into memory and parsed there.
"""
-from __future__ import unicode_literals
-
+import io
import re
from lark import (
@@ -21,23 +20,32 @@
Transformer,
Tree,
)
+
from lark.lexer import (
Token,
)
-from pymantic.compat import (
- binary_type,
-)
-from pymantic.parsers.base import (
+from plugins.parsers.base import (
BaseParser,
)
-from pymantic.primitives import (
+from plugins.parsers.primitives import (
BlankNode,
Literal,
NamedNode,
Triple,
)
-from pymantic.util import (
+from rdflib.parser import Parser, FileInputSource, StringInputSource
+
+from rdflib import logger
+import rdflib
+
+# from rdflib import (
+# BNode,
+# Literal,
+# URIRef,
+# )
+
+from plugins.parsers.util import (
grouper,
smart_urljoin,
decode_literal,
@@ -305,28 +313,56 @@
yield triple
-def parse(string_or_stream, graph=None, base=""):
- if hasattr(string_or_stream, "readline"):
- string = string_or_stream.read()
- else:
- # Presume string.
- string = string_or_stream
-
- if isinstance(string_or_stream, binary_type):
- string = string_or_stream.decode("utf-8")
- else:
- string = string_or_stream
-
- tree = turtle_lark.parse(string)
- tr = TurtleTransformer(base_iri=base)
- if graph is None:
- graph = tr._make_graph()
- tr._prepare_parse(graph)
+class LarkTurtleParser(Parser):
+ format = None
- graph.addAll(tr.transform(tree))
+ def __init__(self):
+ pass
- return graph
+ def parse(self, string_or_stream, graph=None, base=""):
+ if hasattr(string_or_stream, "readline"):
+ string = string_or_stream.read()
+ else:
+ # Presume string.
+ string = string_or_stream
+
+ if isinstance(string_or_stream, bytes):
+ string = string_or_stream.decode("utf-8")
+ else:
+ string = string_or_stream
+
+ # TODO: stringify the remaining input sources
+ if isinstance(string, FileInputSource):
+ string = string.file.read().decode()
+ elif isinstance(string, StringInputSource):
+ string = string.getCharacterStream().read()
+
+ tree = turtle_lark.parse(string)
+ tr = TurtleTransformer(base_iri=base)
+ if graph is None:
+ graph = tr._make_graph()
+ tr._prepare_parse(graph)
+
+ for p, n in tr.prefixes.items():
+ logger.debug(f"ADDING {p} {n}")
+ graph.bind(p, n)
+
+ for (s, p, o) in tr.transform(tree):
+ triple = [s, p, o]
+ for pos, term in enumerate(triple):
+ if isinstance(term, NamedNode):
+ triple[pos] = rdflib.URIRef(str(term))
+ elif isinstance(term, BlankNode):
+ triple[pos] = rdflib.BNode(str(term))
+ elif isinstance(term, Literal):
+ triple[pos] = rdflib.Literal(
+ term.value, lang=term.language, datatype=term.datatype
+ )
+ else:
+ raise Exception(f"What is term {term} ({type(term)})")
+ graph.add(triple)
+ return graph
-def parse_string(string_or_bytes, graph=None, base=""):
- return parse(string_or_bytes, graph, base)
+ def parse_string(self, string_or_bytes, graph=None, base=""):
+ return self.parse(string_or_bytes, graph, base)

One useful attribute of the lark parser is the ability to use EBNF grammar descriptions to drive the parsing. So it was reasonably straightforward to adapt it further to handle RDF-star syntax by adding the RDF-star extensions to the Turtle EBNF spec (kindly provided by the W3C) and adding code to handle the quoted triple and quoted subjects (handling quoted objects is still to do) ...

--- rdflib-plugins/plugins/parsers/larkturtle.py 2022-06-09 14:30:15.114570685 +0100
+++ rdflib-plugins/plugins/parsers/larkturtlestar.py 2022-06-10 06:56:11.772314867 +0100
@@ -13,6 +13,7 @@
"""
import io
+from pprint import pformat
import re
from lark import (
@@ -70,14 +71,13 @@
base: BASE_DIRECTIVE IRIREF "."
sparql_base: /BASE/i IRIREF
sparql_prefix: /PREFIX/i PNAME_NS IRIREF
-triples: subject predicate_object_list
- | blank_node_property_list predicate_object_list?
+triples: subject predicate_object_list | blank_node_property_list predicate_object_list?
predicate_object_list: verb object_list (";" (verb object_list)?)*
-?object_list: object ("," object)*
+?object_list: object annotation? ("," object annotation? )*
?verb: predicate | /a/
-?subject: iri | blank_node | collection
+?subject: iri | blank_node | collection | quoted_triple
?predicate: iri
-?object: iri | blank_node | collection | blank_node_property_list | literal
+?object: iri | blank_node | collection | blank_node_property_list | literal | quoted_triple
?literal: rdf_literal | numeric_literal | boolean_literal
blank_node_property_list: "[" predicate_object_list "]"
collection: "(" object* ")"
@@ -91,6 +91,10 @@
iri: IRIREF | prefixed_name
prefixed_name: PNAME_LN | PNAME_NS
blank_node: BLANK_NODE_LABEL | ANON
+quoted_triple: "<<" quote_subject verb quote_object ">>"
+quote_subject: iri | blank_node | quoted_triple
+quote_object : iri | blank_node | literal | quoted_triple
+annotation : "{|" predicate_object_list "|}"
BASE_DIRECTIVE: "@base"
IRIREF: "<" (/[^\x00-\x20<>"{}|^`\\]/ | UCHAR)* ">"
@@ -103,8 +107,8 @@
DOUBLE: /[+-]?/ (/[0-9]+/ "." /[0-9]*/ EXPONENT
| "." /[0-9]+/ EXPONENT | /[0-9]+/ EXPONENT)
EXPONENT: /[eE][+-]?[0-9]+/
-STRING_LITERAL_QUOTE: "\"" (/[^\x22\\\x0A\x0D]/ | ECHAR | UCHAR)* "\""
-STRING_LITERAL_SINGLE_QUOTE: "'" (/[^\x27\\\x0A\x0D]/ | ECHAR | UCHAR)* "'"
+STRING_LITERAL_QUOTE: "\"" (/[^\x22\x5C\x0A\x0D]/ | ECHAR | UCHAR)* "\""
+STRING_LITERAL_SINGLE_QUOTE: "'" (/[^\x27\x5C\x0A\x0D]/ | ECHAR | UCHAR)* "'"
STRING_LITERAL_LONG_SINGLE_QUOTE: "'''" (/'|''/? (/[^'\\]/ | ECHAR | UCHAR))* "'''"
STRING_LITERAL_LONG_QUOTE: "\"\"\"" (/"|""/? (/[^"\\]/ | ECHAR | UCHAR))* "\"\"\""
UCHAR: "\\u" HEX~4 | "\\U" HEX~8
@@ -122,11 +126,11 @@
PN_LOCAL_ESC: "\\" /[_~\.\-!$&'()*+,;=\/?#@%]/
%ignore WS
-COMMENT: "#" /[^\n]/*
+COMMENT: "#" /[^\r\n]/*
%ignore COMMENT
"""
-turtle_lark = Lark(grammar, start="turtle_doc", parser="lalr")
+turtle_star_lark = Lark(grammar, start="turtle_doc", parser="lalr")
LEGAL_IRI = re.compile(r'^[^\x00-\x20<>"{}|^`\\]*$')
@@ -166,7 +170,7 @@
yield Triple(subject, predicate, object_)
-class TurtleTransformer(BaseParser, Transformer):
+class TurtleStarTransformer(BaseParser, Transformer):
def __init__(self, base_iri=""):
super().__init__()
self.base_iri = base_iri
@@ -189,14 +193,54 @@
return children
def triples(self, children):
- if len(children) == 2:
- subject = children[0]
- for triple in unpack_predicate_object_list(subject, children[1]):
+ logger.debug(f"TRIPLES:\n\n{children} {len(children)} {type(children)}\n\n")
+ if not isinstance(children[0], (NamedNode, BlankNode)):
+ logger.debug(f"UNPACKING QUOTEDTRIPLE {children[0]}, {children[1]}")
+ qres = [triple for triple in children[0]]
+ subject = qres[0][0]
+ res = [
+ triple for triple in unpack_predicate_object_list(subject, children[1])
+ ]
+ for triple in qres + res:
yield triple
- elif len(children) == 1:
- for triple_or_node in children[0]:
- if isinstance(triple_or_node, Triple):
- yield triple_or_node
+ else:
+ if len(children) == 2:
+ subject = children[0]
+ logger.debug(f"UNPACKING PREDOBJ {subject}, {children[1]}")
+ for triple in unpack_predicate_object_list(subject, children[1]):
+ yield triple
+ elif len(children) == 1:
+ # logger.debug(f"UNPACKING CHILDREN")
+ for triple_or_node in children[0]:
+ if isinstance(triple_or_node, Triple):
+ yield triple_or_node
+
+ def quoted_triple(self, children):
+ # logger.debug(f"QUOTEDTRIPLE:\n{children}")
+ quoted_statement_id = self.make_blank_node()
+
+ children += [NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement")]
+
+ preds = [
+ NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#subject"),
+ NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate"),
+ NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#object"),
+ NamedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
+ ]
+ for pos, term in enumerate(children):
+ yield Triple(quoted_statement_id, preds[pos], term)
+
+ def quote_subject(self, children):
+ logger.debug(f"QUOTEDSUBJECT {children}")
+ # logger.debug(f"Returning subject {repr(children[0])}")
+ return children[0]
+
+ def quote_object(self, children):
+ logger.debug(f"QUOTEDOBJECT {children}")
+ if len(children) == 1 and isinstance(children[0], Literal):
+ # logger.debug(f"Returning Literal {children}")
+ return self.rdf_literal(children[0])
+ return children[0]
def prefixed_name(self, children):
(pname,) = children
@@ -249,13 +293,22 @@
for value in reversed(children):
this_bn = self.make_blank_node()
if not isinstance(value, (NamedNode, Literal, BlankNode)):
+ logger.debug(f"COLLECTION0 {repr(value)}")
for triple_or_node in value:
if isinstance(triple_or_node, Triple):
+ logger.debug(f"COLLECTION YIELDING {repr(triple_or_node)}")
yield triple_or_node
else:
+ logger.debug(
+ f"COLLECTION SETTING value to {repr(triple_or_node)}"
+ )
value = triple_or_node
break
- yield self.make_triple(this_bn, RDF_FIRST, value)
+ logger.debug(f"COLLECTION2 {repr(value)}")
+ if not isinstance(value, (NamedNode, Literal, BlankNode)):
+ yield self.make_triple(this_bn, RDF_FIRST, this_bn)
+ else:
+ yield self.make_triple(this_bn, RDF_FIRST, value)
yield self.make_triple(this_bn, RDF_REST, prev_node)
prev_node = this_bn
@@ -313,7 +366,7 @@
yield triple
-class LarkTurtleParser(Parser):
+class LarkTurtleStarParser(Parser):
format = None
def __init__(self):
@@ -337,31 +390,36 @@
elif isinstance(string, StringInputSource):
string = string.getCharacterStream().read()
- tree = turtle_lark.parse(string)
- tr = TurtleTransformer(base_iri=base)
+ tree = turtle_star_lark.parse(string)
+ tr = TurtleStarTransformer(base_iri=base)
if graph is None:
graph = tr._make_graph()
tr._prepare_parse(graph)
+ transformed_tree = list(tr.transform(tree))
+
+ logger.debug(f"TRANSFORMTREE\n{pformat(transformed_tree, width=120)}")
+
+ # for (s, p, o) in tr.transform(tree):
+ # for (s, p, o) in transformed_tree:
+ # triple = [s, p, o]
+ for triple in tr.transform(tree):
+ if isinstance(triple, Triple):
+ _triple = [triple[0], triple[1], triple[2]]
+ for pos, term in enumerate(_triple):
+ if isinstance(term, NamedNode):
+ _triple[pos] = rdflib.URIRef(str(term))
+ elif isinstance(term, BlankNode):
+ _triple[pos] = rdflib.BNode(str(term))
+ elif isinstance(term, Literal):
+ _triple[pos] = rdflib.Literal(
+ term.value, lang=term.language, datatype=term.datatype
+ )
+ graph.add(_triple)
+
for p, n in tr.prefixes.items():
- logger.debug(f"ADDING {p} {n}")
graph.bind(p, n)
- for (s, p, o) in tr.transform(tree):
- triple = [s, p, o]
- for pos, term in enumerate(triple):
- if isinstance(term, NamedNode):
- triple[pos] = rdflib.URIRef(str(term))
- elif isinstance(term, BlankNode):
- triple[pos] = rdflib.BNode(str(term))
- elif isinstance(term, Literal):
- triple[pos] = rdflib.Literal(
- term.value, lang=term.language, datatype=term.datatype
- )
- else:
- raise Exception(f"What is term {term} ({type(term)})")
- graph.add(triple)
-
return graph
def parse_string(self, string_or_bytes, graph=None, base=""):

It's decidedly messy (and, at the moment, incorrect) work-in-progress, but I hope it provides an illustration of the kind of things that need to happen in order to extend the parsers to handle RDF-star syntax. At this stage of the exploration, this more principled approach does seem to be reasonably promising.
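For reference, the quoted_triple transformer in the diff above maps each quoted triple onto four plain triples using the standard rdf:Statement reification vocabulary. A minimal pure-Python sketch of that mapping (the real code yields pymantic Triple objects and uses BaseParser's blank-node factory; the names here are stand-ins):

```python
# Sketch of the reification performed by quoted_triple(): << s p o >> becomes
# four triples on a fresh blank node, per standard RDF reification.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

_counter = 0


def make_blank_node() -> str:
    """Fresh blank-node label, standing in for BaseParser.make_blank_node()."""
    global _counter
    _counter += 1
    return f"_:qt{_counter}"


def reify_quoted_triple(s: str, p: str, o: str) -> list:
    """Expand a quoted triple into the four rdf:Statement reification
    triples, mirroring what the quoted_triple() method yields."""
    stmt = make_blank_node()
    return [
        (stmt, RDF + "subject", s),
        (stmt, RDF + "predicate", p),
        (stmt, RDF + "object", o),
        (stmt, RDF + "type", RDF + "Statement"),
    ]


triples = reify_quoted_triple(
    "http://a.example/s", "http://a.example/p", "http://a.example/o"
)
```

The blank node returned for the quoted triple is what then appears as the subject (or object) of the enclosing triple, which is why the triples() method has to unpack the generator before using the quoted-statement id.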
Support added for parsing RDF* graphs:
PR for Issue 955 (#955)
Changes: