
treat a query as a progressive funnel instead of independent clauses to search and join #467

Draft — wants to merge 2 commits into base: master
Conversation

@latacora-paul commented Apr 24, 2024

Hey!

Let's consider this a work in progress since I broke something around lookup refs (will look into it) and I know you might want to make some different implementation decisions. However, the query tests are passing and this greatly reduces the number of datoms being fetched from storage in many cases.

Your fix for #462 addressed the "constant / one possibility" case but it didn't do anything to constrain the search space in most other cases (like when there are only a few possible values for a binding as determined by an earlier where clause).

Here I'm constraining the search space of where clauses using candidate binding values (which considerably decreases the number of datoms that need to be scanned for clauses appearing later in a query). In other words, this handles both the "constant" case and allows relations from earlier where clauses to turn later searches into "a collection of constant cases".
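To make the idea above concrete, here is a minimal in-memory sketch (not DataScript's actual implementation; the names `full_scan`, `funneled_scan`, and the toy index are all hypothetical) of turning earlier binding values into "a collection of constant cases" instead of scanning every datom for a clause:

```python
# Hypothetical sketch of the "progressive funnel": instead of scanning all
# datoms for a later clause, look up only the candidate values produced by an
# earlier clause. Nothing here mirrors DataScript's real internals.

def full_scan(datoms, attr):
    """Baseline: fetch every datom for an attribute, filter later via join."""
    return [d for d in datoms if d[1] == attr]

def funneled_scan(index, attr, candidate_values):
    """Funnel: one narrow index lookup per candidate value from an earlier clause."""
    out = []
    for v in candidate_values:
        out.extend(index.get((attr, v), []))  # each lookup is a "constant case"
    return out

# toy (attr, value) -> datoms index over [e a v] triples
datoms = [(1, ":name", "a"), (2, ":name", "b"), (3, ":name", "c"), (4, ":age", 30)]
index = {}
for e, a, v in datoms:
    index.setdefault((a, v), []).append((e, a, v))

# an earlier clause bound ?v to {"a", "b"}; only those datoms are fetched
print(funneled_scan(index, ":name", ["a", "b"]))
```

With storage-backed databases, each skipped datom in `full_scan` is a fetch avoided, which is where the savings described above come from.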

@tonsky (Owner) commented Apr 25, 2024

If I understand your approach right:

  • First clause does some basic filtering, e.g. [10 ?a ?v] produces some relation with N amount of [?a ?v] pairs. E.g. [:name 1] [:name 2] [:age 3] [:age 4]
  • If second clause has either ?a or ?v, you will run N small queries against database. E.g. if second clause is [?e :attr ?v], you will run separate lookup-pattern for ?e :attr 1, ?e :attr 2, ?e :attr 3 and ?e :attr 4

I agree that this is more precise and will fetch fewer datoms. I am not sure about the performance of it, though. What if the first clause returned 100 possible values for ?v? It might be that just fetching all [?e :attr ?v] and then doing a hash-join would be faster? Maybe we need a benchmark to prove it works?
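The two strategies being weighed here can be sketched side by side (an illustrative toy, not either library's code; function names are made up). Both produce the same join result; the difference is how many datoms get fetched along the way:

```python
# Strategy (1): run one narrow lookup per candidate binding.
# Strategy (2): fetch every datom for the clause, then hash-join on ?v.

def per_binding_lookups(index, attr, bindings):
    """N small queries: one index lookup per candidate ?v."""
    return [d for v in bindings for d in index.get((attr, v), [])]

def fetch_all_then_hash_join(datoms, attr, bindings):
    """One broad fetch, then a hash-join against the earlier relation."""
    wanted = set(bindings)                          # hash side of the join
    fetched = [d for d in datoms if d[1] == attr]   # fetches everything
    return [d for d in fetched if d[2] in wanted]   # join discards the rest

datoms = [(e, ":attr", v) for e, v in [(1, 1), (2, 2), (3, 3), (4, 4), (5, 1)]]
index = {}
for d in datoms:
    index.setdefault((d[1], d[2]), []).append(d)

bindings = [1, 2]
assert sorted(per_binding_lookups(index, ":attr", bindings)) == \
       sorted(fetch_all_then_hash_join(datoms, ":attr", bindings))
```

For few bindings, strategy (1) touches only matching datoms; for many bindings, strategy (2)'s single pass plus a cheap in-memory hash lookup can win, which is exactly the trade-off in question.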

I am also not sure where to draw the line. Sure, for one value of ?v, adding it to the pattern is faster. For maybe 2-3, too. For 100, not so sure. When to stop?
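One naive answer to "when to stop" is a fixed cutoff on the number of candidate bindings. A sketch, with the caveat that the cutoff value here is an arbitrary placeholder that only a benchmark could justify:

```python
# Hypothetical threshold heuristic; 8 is a placeholder, not a measured value.
LOOKUP_THRESHOLD = 8

def choose_strategy(num_bindings):
    if num_bindings <= LOOKUP_THRESHOLD:
        return "per-binding-lookups"   # few values: narrow lookups likely win
    return "fetch-and-hash-join"       # many values: one scan amortizes better

print(choose_strategy(3))    # per-binding-lookups
print(choose_strategy(100))  # fetch-and-hash-join
```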

Do you have any insight into how Datomic does it? Or XTDB maybe?

@latacora-paul (Author)
Yeah, what you described is correct. I'm pretty sure Datomic does something like this, which is why the ordering of where clauses matters so much there for performance. I see there's some code in Datahike to do it too.

Like you said, I am not sure whether it is better in all cases or where the threshold is; it likely depends on whether the data is already in memory, the size of the set being searched, the number of search patterns remaining from earlier clauses, the ordering of your where clauses, etc.

I will say this considerably reduces the real-world time and the number of datoms fetched for some queries I've been running against a large database using a DynamoDB storage backend. Benchmarking sounds reasonable given this is pretty central to query performance (or perhaps even making it optional behavior).

@galdre (Contributor) commented May 9, 2024

I have also experimented with a fairly similar approach to this one, and found that it was sometimes much faster and sometimes far slower for my app. I also experimented with inserting an arbitrary threshold, but overall I strongly suspect (partly from experience with Eva):

  1. It will greatly depend on the shape and scale of your data.
  2. Performance characteristics will dramatically differ based on whether or not you are using persistence (not to mention the latency of the persistence).

One low-hanging approach that might be beneficial to use cases like those of @latacora-paul could be to switch strategies not based on a threshold, but simply based on whether or not the data is persistently stored.
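That suggestion could look something like the following sketch (the predicate and names are hypothetical; the in-memory fallback threshold is again a placeholder):

```python
# Hypothetical strategy selector: prefer narrow lookups whenever the database
# is backed by (possibly remote) persistent storage, where round-trips for
# unneeded datoms dominate; only use a size heuristic for in-memory data.

def pick_strategy(db_is_persistent, num_bindings, threshold=8):
    if db_is_persistent:
        return "per-binding-lookups"   # avoid fetching datoms the join discards
    # in-memory: a broad scan is cheap, so fall back to a size heuristic
    return "per-binding-lookups" if num_bindings <= threshold else "fetch-and-hash-join"
```

This sidesteps tuning a single universal threshold: the expensive case (persistent storage) always gets the funnel, and only the cheap in-memory case relies on a cutoff.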

Another approach I've been mulling is to use Clojure(Script) metadata to hint specific strategies to the query engine, overriding the defaults.
