Canonical form of SPARQL Patterns #483

@joernhees

I'm currently running well over 1M SPARQL queries as part of a machine learning algorithm. As this takes a while, I thought about caching the results of SPARQL queries. The problem is that different SPARQL queries can contain variables with different names but be otherwise isomorphic. Example:

select * where { ?s foo:bar foo:bla }

is isomorphic to

select * where { ?s2 foo:bar foo:bla }

For quick lookups in a cache it would be cool to have a canonical form of a SPARQL pattern, very much like #441 (rdflib.compare.to_canonical_graph(g1)) for rdflib.Graph.
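To make the cache idea concrete, here is a minimal stdlib-only sketch of a naive cache key. It renames variables in order of first appearance, which is enough for the single-triple example above. It is not a true canonical form: the result depends on the order of the triples, so it would miss isomorphic patterns whose triples are listed differently. The function name rename_variables and the representation of a pattern as tuples of strings are assumptions made for illustration, not part of rdflib.

```python
def rename_variables(triples):
    """Rename variables to ?v0, ?v1, ... in order of first appearance.

    `triples` is a sequence of (s, p, o) string tuples; terms starting
    with "?" are treated as variables. Hypothetical helper, not rdflib API.
    """
    names = {}
    renamed = []
    for triple in triples:
        new_triple = []
        for term in triple:
            if term.startswith("?"):
                # First time we see this variable: assign the next free name.
                names.setdefault(term, "?v%d" % len(names))
                term = names[term]
            new_triple.append(term)
        renamed.append(tuple(new_triple))
    # Tuples are hashable, so the result can be used as a dict key.
    return tuple(renamed)


q1 = [("?s", "foo:bar", "foo:bla")]
q2 = [("?s2", "foo:bar", "foo:bla")]

key1 = rename_variables(q1)
key2 = rename_variables(q2)
print(key1 == key2)
```

Both queries map to the same key, so a dict keyed on it would treat them as one cache entry. Patterns with several triples need a real graph canonicalization, which is what the rest of this issue is about.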

A SPARQL query's pattern part can be represented as an rdflib.Graph that contains Variables. By replacing Variables with BNodes (using the variable name as the BNode id) one gets pretty close to a graph that the to_canonical_graph algorithm could be applied to, with one exception: BNodes can't be used as predicates (RDF Concepts).

As this is out of spec, I guess it's OK that this fails:

In [1]: from rdflib import *
INFO:rdflib:RDFLib Version: 4.2.1-dev

In [2]: from rdflib.compare import *

In [3]: g = Graph()

In [4]: g.add((BNode('v1'), BNode('v2'), URIRef('foo')))

In [5]: to_canonical_graph(g)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[...]
/usr/local/lib/python2.7/site-packages/rdflib/compare.pyc in _canonicalize_bnodes(self, triple, labels)
    456         for term in triple:
    457             if isinstance(term, BNode):
--> 458                 yield BNode(value="cb%s" % labels[term])
    459             else:
    460                 yield term

KeyError: rdflib.term.BNode('v2')

Nevertheless, as this is quite close to a cool feature and graph canonicalization isn't exactly the easiest problem to think about: would it be possible to slightly adapt the RGDA1 algorithm to support BNodes in the predicate position as well, thereby also making it fit for SPARQL patterns? Maybe @jimmccusker has an idea on this?
