-
Notifications
You must be signed in to change notification settings - Fork 6
pssketch
GProM supports a technique called provenance-based data skipping (PBDS). PBDS works as follows: you can instruct GProM to create a provenance sketch for one or more tables accessed by a query Q (we call this sketch capture). Such a sketch encodes at the granularity of a (user provided) range-partition of the table for which the sketch was created which fragments of the partition contain at least one row of provenance for any query result. Intuitively, a sketch encodes compactly a subset of the input database that is sufficient for evaluating the query, i.e., running the query on this subset yields the same result as evaluating the query over the full database. Thus, sketches can be used to speed up query evaluation by filtering data that does not belong to the sketch early-on in the query (we refer to this as using a sketch). A sketch captured for a query Q
can be used to answer Q
again in the future (assuming the data has not changed) or to answer similar queries. The later is of more interest in practice. GProM tracks sketches and can store them in your database and has support for automatically determining whether an existing sketch can be reused to evaluate a query. Currently, reuse is only possible among queries that have the same structure, but differ in constants used in selection conditions. Note that this requires solving provenance containment, a problem closely related to query containment. We use an SMT solver (Z3). That means you have to install Z3 and compile GProM with support for Z3 (compile GProM with --with-libz3=YOUR_Z3_LOCATION
). For more information about the research behind PBDS see here.
Note that there is no requirement that the partitions used to create sketches have to correspond to a physical partitioning of the table. That being said, if a the partition used for building a sketch corresponds to a physical partition of the table, then this can have performance advantages for sketch use.
Theoretically, a sketch can be build on any horizontal partitioning of a table. However, range-based partitioning is preferable because range-based sketches can be transformed into selection conditions during sketch use which enables to database to exploit existing physical design (e.g., zonemaps, indexes, ...) to speed-up filtering out data that does not belong to a sketch.
A range partitioning of a table, splits the table horizontally based on the set of ranges R = {[l1,u1], ..., [ln,un] }
over the domain of values of an attribute A
(or set of attributes) such that all tuples whose A
value falls withing a range r
are grouped into the fragment corresponding to r
. For instance,
| Name | Salary | |--------+--------| | Peter | 35 | | Bob | 5 | | Alice | 10 | | Astrid | 50 |
Using ranges R = { [1,20], [21,100] }
for to partition this table on salary we get two partitions:
Partition for range [1,20]
.
| Name | Salary | |--------+--------| | Bob | 5 | | Alice | 10 |
Partition for range [21,100]
.
| Name | Salary | |--------+--------| | Peter | 35 | | Astrid | 50 |
PROVENANCE WITH COARSE GRAINED RANGESA ()
OF (SELECT ...)
CAPTURE PROVENANCE WITH COARSE GRAINED RANGESB (oc_posts (p_owneruserid 1,1808,3657,5678,7970,10785,13618,16602,19370,22174,25172,13651166 ), oc_users (u_id 1,1808,3657,5678,7970,10785,13618,13651166)
To capture sketches for table R
on attribute A
using the following ranges {[1,10], [11,20], [21,30], [31,40]}
and table S
on attribute C
using the following ranges {[1,50],[51,100]}
CAPTURE PROVENANCE WITH COARSE GRAINED
RANGESB (R (A 1,10,20,30,40 ), S (C 1,50,100)
OF (SELECT ...)
To capture sketches for attributes R.A
and S.B
, creating 32
fragments each:
CAPTURE PROVENANCE WITH COARSE GRAINED
FRAGMENT (R(A) 32, S(B) 32)
OF (SELECT ...)
PROVENANCE WITH COARSE GRAINED HASH ()
OF (SELECT ...)
PROVENANCE WITH COARSE GRAINED PAGE ()
OF (SELECT ...)