Skip to content

Small but nice SPARQL Optimisation fix #547

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 22, 2015
Merged

Conversation

gromgull
Copy link
Member

This reorders the triples in a BGP while evaluating, so that the "most
bound" triples are evaluated first. The algebra transformation also does
this, but "statically", i.e. without looking at the data.

At runtime, we may have bindings from other places
(BIND/VALUES/initBindings) which can massively improve query time.

In a BerkelyDB store with ~250k triples, I had a query with two triples
patterns that retrieved 38 out of 25000 possible matching triples with a
particular property. Adding this fix brought query time from 8s to
800ms.

@gromgull
Copy link
Member Author

As far as I can see the single failing test test_sparqlupdatestore has nothing to do with what I edited :(

@gromgull
Copy link
Member Author

It breaks while trying to parse the reply to the query produced from this: https://github.com/RDFLib/rdflib/blob/master/test/test_sparqlupdatestore.py#L237

I don't know where it goes wrong - on insert or query.

@@ -214,7 +214,10 @@ def evalPart(ctx, part):
pass # the given custome-function did not handle this part

if part.name == 'BGP':
return evalBGP(ctx, part.triples) # NOTE pass part.triples, not part!
# Reorder triples patterns based on 'bindedness' based on current ctx
triples = sorted(part.triples, key=lambda t: len([n for n in t if ctx[n] is None]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have problems understanding this... could you maybe expand the comment a bit more?

my guess is n is a node in triple t, so if ctx[n] is None then n is unbound, so you select those triples first which have least unbound nodes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your guess is spot on. I am not sure when this kicks in, for only "static" triples (with not optional/bindings/values/anything) - the code here: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/algebra.py#L84 will already have sorted them correctly.

This code helps when some other source of bindings has given us more bindings since then - in my base it was using initBindings.

@joernhees
Copy link
Member

btw: the failing test seems related to the recent SPARQLWrapper updates, more in #550

@joernhees joernhees added this to the rdflib 4.2.2 milestone Nov 22, 2015
This reorders the triples in a BGP while evaluating, so that the "most
bound" triples are evaluated first. The algebra transformation also does
this, but "statically", i.e. without looking at the data.

At runtime, we may have bindings from other places
(BIND/VALUES/initBindings) which can massively improve query time.

In a BerkelyDB store with ~250k triples, I had a query with two triples
patterns that retrieved 38 out of 25000 possible matching triples with a
particular property. Adding this fix brought query time from 8s to
800ms.
@gromgull
Copy link
Member Author

I made the comment a bit clearer - if the test-fail is unrelated we can merge :)

joernhees added a commit that referenced this pull request Nov 22, 2015
Small but nice SPARQL Optimisation fix
@joernhees joernhees merged commit fa4172c into RDFLib:master Nov 22, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants