-
Notifications
You must be signed in to change notification settings - Fork 6
sql_extensions
Users can ask for the provenance of a query by using the new PROVENANCE OF
construct. E.g.,
PROVENANCE OF (SELECT * FROM r);
Such an expression will be rewritten into a instrumented query that returns the provenance of the input query. The result of a query instrumented for provenance capture is the original query result paired with provenance. Additional attributes are added to the result schema to store the provenance. For example, if SELECT * FROM r
returns
a | b | c
---+---+---
1 | 2 | 3
1 | 3 | 4
2 | 5 | 3
2 | 1 | 7
3 | 4 | 8
3 | 5 | 9
then PROVENANCE OF (SELECT * FROM r);
will return:
a | b | c | prov_r_a | prov_r_b | prov_r_c
---+---+---+------------------+------------------+------------------
1 | 2 | 3 | 1 | 2 | 3
1 | 3 | 4 | 1 | 3 | 4
2 | 5 | 3 | 2 | 5 | 3
2 | 1 | 7 | 2 | 1 | 7
3 | 4 | 8 | 3 | 4 | 8
3 | 5 | 9 | 3 | 5 | 9
Obviously this is not very useful for SELECT * ...
style queries. Consider a slightly more complex query using aggregation over the table from above:
SELECT a, sum(b) AS x
FROM r
GROUP BY a;
This query returns:
a | x
---+---
1 | 5
2 | 6
3 | 9
To compute the provenance of this query run:
PROVENANCE OF (
SELECT a, sum(b) AS x
FROM r
GROUP BY a
);
which returns:
a | x | prov_r_a | prov_r_b | prov_r_c
---+---+---+------------------+------------------+------------------
1 | 5 | 1 | 2 | 3
1 | 5 | 1 | 3 | 4
2 | 6 | 2 | 5 | 3
2 | 6 | 2 | 1 | 7
3 | 9 | 3 | 4 | 8
3 | 9 | 3 | 5 | 9
A PROVENANCE OF
request is treated as a query construct, e.g., it can be used as a subquery.
SELECT *
FROM
PROVENANCE OF (SELECT * FROM r) AS p
WHERE prov_r_a = 3;
Using the AS OF
clause the user can request the provenance of a query as it would have been if the query would have been executed at some time in the past. AS OF
can either be specified as a timestamp TIMESTAMP '2013-12-01 08:00:00'
or using an SCN SCN 12345
.
PROVENANCE AS OF TIMESTAMP '2013-12-01 08:00:00' OF (SELECT * FROM r);
GProM also can retroactively capture the provenance of a transaction. This is currently only supported for Oracle backends and require Oracles Flashback Archive (FBA) and auditing features to be activated. Auditing is used to access the SQL statements executed by the transaction and FBA is used for time travel queries.
To capture the provenance, you have to know the transaction XID (transaction identifier) of a transaction that was executed at some point in the past. To get the internal identifier of a transaction use Oracle's audit logging facility https://blogs.oracle.com/imc/entry/oracle_database_12c_new_unified: For instance, to compute provenance for transaction 12345
run:
PROVENANCE OF TRANSACTION 12345;
The user can also select the table modified by the transaction for which provenance should be tracked. The default is the last table updated by the transaction.
If you want to compute the provenance of a different table modified by the transaction then you have to use the WITH TABLE X
option:
PROVENANCE WITH TABLE S OF TRANSACTION 12345;
To restrict the results to rows modified by the transaction, add the ONLY UPDATED
option:
PROVENANCE WITH ONLY UPDATED OF TRANSACTION 12345;
Furthermore, to show intermediate versions of the tracked table as seen by each update of the transaction add the SHOW INTERMEDIATE
option:
PROVENANCE WITH SHOW INTERMEDIATE OF TRANSACTION 12345;
When the STATEMENT ANNOTATIONS
option is given then additional boolean attributes are added to the schema which store for each row which of the transaction's updates did modify this row:
PROVENANCE WITH STATEMENT ANNOTATIONS OF TRANSACTION 12345;
We enable the user to choose how far to trace back provenance and to instruct our system to use certain attributes as provenance.
By using the BASERELATION
keyword after a FROM
clause item within a provenance statement, the user instructs the system to not trace provenance into the subquery, but instead treat it as an input relation.
PROVENANCE OF (
SELECT month, total
FROM (SELECT month, creditc, SUM(amount) AS total
FROM purchase p
GROUP BY month, creditc) BASERELATION AS monthly
);
The query above would return explain an output tuple based on the tuples from subquery monthly
that were used to derive it.
By using the HAS PROVENANCE (attrList)
expression after a FROM
clause item the user instructs the system to consider these attributes as the provenance of the from clause item
PROVENANCE OF (
SELECT sum(a)
FROM R HAS PROVENANCE (myProvAttr1, myProvAttr2)
GROUP BY b
);
Similarly, if only certain attributes should be duplicated as provenance this can be specified by using USE PROVENANCE (attrList)
:
PROVENANCE OF (
SELECT sum(a)
FROM R USE PROVENANCE (a)
GROUP BY b
);
The user can also ask the system to show provenance for intermediate results of a query.
PROVENANCE OF (
SELECT count(*) AS highTotals
FROM
(SELECT sum(amount) AS total
FROM accounts
GROUP BY user
HAVING total > 10000) SHOW INTERMEDIATE PROVENANCE AS sub
);
In this example, each tuple in the result would be paired with its provenance in the accounts table and with its provenance in the results of the subquery sub
. Consider the following accounts table:
usr | amount
--------+--------
Peter | 6000
Peter | 8000
Peter | 50
Astrid | 100
Astrid | 100
Astrid | 9900
The result would be
hightotals | prov_sub_total | prov_accounts_usr | prov_accounts_amount
------------+----------------+--------------------------+-----------------------------
2 | 14050 | Peter | 6000
2 | 14050 | Peter | 8000
2 | 14050 | Peter | 50
2 | 10100 | Astrid | 100
2 | 10100 | Astrid | 100
2 | 10100 | Astrid | 9900
GProM also supports reenactment of a list of statements provided by the user. These statements can either be executed over the current database state or over a past state (currently this is only works for Oracle backends if Oracle's time travel features are activated). For example, to get the state of relation R
produced by the updates:
UPDATE R SET a = a + 2 WHERE b = 5;
DELETE FROM R WHERE b > 3;)
you would have to use the reenactment request:
REENACT (UPDATE R SET a = a + 2 WHERE b = 5; DELETE FROM R WHERE b > 3;);
The reenact statement can also be used to track provenance for the reenacted operations:
REENACT WITH PROVENANCE (UPDATE R SET a = a + 2 WHERE b = 5; UPDATE R SET b = 6 WHERE b <= 5;);
For reenacting with provenance, we also support the options influencing provenance tracking as described above for provenance capture for transactions, e.g.,
REENACT WITH PROVENANCE ONLY UPDATED SHOW INTERMEDIATE STATEMENT ANNOTATIONS (UPDATE R SET a = 2 WHERE B = 1; INSERT INTO R VALUES (10,20););
GProM also has limited support for DDL commands in reenactment, e.g.,
REENACT (CREATE TABLE mytab (a int, b int); INSERT INTO mytab VALUES (1,1); INSERT INTO mytab VALUES (2,3); UPDATE mytab SET a = a + 2 WHERE b = 3;);
would yield:
A|B|
----
1|1|
4|3|
Note that provenance tracking for DDL commands is currently not supported yet.