Skip to content

sql_extensions

Boris Glavic edited this page Nov 15, 2020 · 8 revisions

Triggering Provenance Capture

Users can ask for the provenance of a query by using the new PROVENANCE OF construct. E.g.,

PROVENANCE OF (SELECT * FROM r);

Such an expression will be rewritten into a instrumented query that returns the provenance of the input query. The result of a query instrumented for provenance capture is the original query result paired with provenance. Additional attributes are added to the result schema to store the provenance. For example, if SELECT * FROM r returns

 a | b | c
---+---+---
 1 | 2 | 3
 1 | 3 | 4
 2 | 5 | 3
 2 | 1 | 7
 3 | 4 | 8
 3 | 5 | 9

then PROVENANCE OF (SELECT * FROM r); will return:

 a | b | c | prov_r_a         | prov_r_b         | prov_r_c
---+---+---+------------------+------------------+------------------
 1 | 2 | 3 |                1 |                2 |                3
 1 | 3 | 4 |                1 |                3 |                4
 2 | 5 | 3 |                2 |                5 |                3
 2 | 1 | 7 |                2 |                1 |                7
 3 | 4 | 8 |                3 |                4 |                8
 3 | 5 | 9 |                3 |                5 |                9

Obviously this is not very useful for SELECT * ... style queries. Consider a slightly more complex query using aggregation over the table from above:

SELECT a, sum(b) AS x
FROM r
GROUP BY a;

This query returns:

 a | x
---+---
 1 | 5
 2 | 6
 3 | 9

To compute the provenance of this query run:

PROVENANCE OF (
    SELECT a, sum(b) AS x
	FROM r
    GROUP BY a
);

which returns:

 a | x | prov_r_a         | prov_r_b         | prov_r_c
---+---+---+------------------+------------------+------------------
 1 | 5 |                1 |                2 |                3
 1 | 5 |                1 |                3 |                4
 2 | 6 |                2 |                5 |                3
 2 | 6 |                2 |                1 |                7
 3 | 9 |                3 |                4 |                8
 3 | 9 |                3 |                5 |                9

A PROVENANCE OF request is treated as a query construct, e.g., it can be used as a subquery.

SELECT *
FROM
    PROVENANCE OF (SELECT * FROM r) AS p
WHERE prov_r_a = 3;

Provenance for Queries

Compute Provenance of a Query Executed in the Past

Using the AS OF clause the user can request the provenance of a query as it would have been if the query would have been executed at some time in the past. AS OF can either be specified as a timestamp TIMESTAMP '2013-12-01 08:00:00' or using an SCN SCN 12345.

PROVENANCE AS OF TIMESTAMP '2013-12-01 08:00:00' OF (SELECT * FROM r);

Provenance for Transactions

GProM also can retroactively capture the provenance of a transaction. This is currently only supported for Oracle backends and require Oracles Flashback Archive (FBA) and auditing features to be activated. Auditing is used to access the SQL statements executed by the transaction and FBA is used for time travel queries.

To capture the provenance, you have to know the transaction XID (transaction identifier) of a transaction that was executed at some point in the past. To get the internal identifier of a transaction use Oracle's audit logging facility https://blogs.oracle.com/imc/entry/oracle_database_12c_new_unified: For instance, to compute provenance for transaction 12345 run:

PROVENANCE OF TRANSACTION 12345;

The user can also select the table modified by the transaction for which provenance should be tracked. The default is the last table updated by the transaction. If you want to compute the provenance of a different table modified by the transaction then you have to use the WITH TABLE X option:

PROVENANCE WITH TABLE S OF TRANSACTION 12345;

To restrict the results to rows modified by the transaction, add the ONLY UPDATED option:

PROVENANCE WITH ONLY UPDATED OF TRANSACTION 12345;

Furthermore, to show intermediate versions of the tracked table as seen by each update of the transaction add the SHOW INTERMEDIATE option:

PROVENANCE WITH SHOW INTERMEDIATE OF TRANSACTION 12345;

When the STATEMENT ANNOTATIONS option is given then additional boolean attributes are added to the schema which store for each row which of the transaction's updates did modify this row:

PROVENANCE WITH STATEMENT ANNOTATIONS OF TRANSACTION 12345;

Custom Provenance and Controlling Provenance Capture

We enable the user to choose how far to trace back provenance and to instruct our system to use certain attributes as provenance.

By using the BASERELATION keyword after a FROM clause item within a provenance statement, the user instructs the system to not trace provenance into the subquery, but instead treat it as an input relation.

PROVENANCE OF (
    SELECT month, total
    FROM (SELECT month, creditc, SUM(amount) AS total
         FROM purchase p
         GROUP BY month, creditc) BASERELATION AS monthly
);

The query above would return explain an output tuple based on the tuples from subquery monthly that were used to derive it.

By using the HAS PROVENANCE (attrList) expression after a FROM clause item the user instructs the system to consider these attributes as the provenance of the from clause item

PROVENANCE OF (
    SELECT sum(a)
    FROM R HAS PROVENANCE (myProvAttr1, myProvAttr2)
    GROUP BY b
);

Similarly, if only certain attributes should be duplicated as provenance this can be specified by using USE PROVENANCE (attrList):

PROVENANCE OF (
    SELECT sum(a)
    FROM R USE PROVENANCE (a)
    GROUP BY b
);

The user can also ask the system to show provenance for intermediate results of a query.

PROVENANCE OF (
  SELECT count(*) AS highTotals
  FROM
             (SELECT sum(amount) AS total
             FROM accounts
             GROUP BY user
             HAVING total > 10000) SHOW INTERMEDIATE PROVENANCE AS sub
);

In this example, each tuple in the result would be paired with its provenance in the accounts table and with its provenance in the results of the subquery sub. Consider the following accounts table:

  usr   | amount
--------+--------
 Peter  |   6000
 Peter  |   8000
 Peter  |     50
 Astrid |    100
 Astrid |    100
 Astrid |   9900

The result would be

 hightotals | prov_sub_total | prov_accounts_usr        | prov_accounts_amount
------------+----------------+--------------------------+-----------------------------
          2 |          14050 | Peter                    |                        6000
          2 |          14050 | Peter                    |                        8000
          2 |          14050 | Peter                    |                          50
          2 |          10100 | Astrid                   |                         100
          2 |          10100 | Astrid                   |                         100
          2 |          10100 | Astrid                   |                        9900

Reenacting Operations

GProM also supports reenactment of a list of statements provided by the user. These statements can either be executed over the current database state or over a past state (currently this is only works for Oracle backends if Oracle's time travel features are activated). For example, to get the state of relation R produced by the updates:

UPDATE R SET a = a + 2 WHERE b = 5;
DELETE FROM R WHERE b > 3;)

you would have to use the reenactment request:

REENACT (UPDATE R SET a = a + 2 WHERE b = 5; DELETE FROM R WHERE b > 3;);

The reenact statement can also be used to track provenance for the reenacted operations:

REENACT WITH PROVENANCE (UPDATE R SET a = a + 2 WHERE b = 5; UPDATE R SET b = 6 WHERE b <= 5;);

For reenacting with provenance, we also support the options influencing provenance tracking as described above for provenance capture for transactions, e.g.,

REENACT WITH PROVENANCE ONLY UPDATED SHOW INTERMEDIATE STATEMENT ANNOTATIONS (UPDATE R SET a = 2 WHERE B = 1; INSERT INTO R VALUES (10,20););

GProM also has limited support for DDL commands in reenactment, e.g.,

REENACT (CREATE TABLE mytab (a int, b int); INSERT INTO mytab VALUES (1,1); INSERT INTO mytab VALUES (2,3); UPDATE mytab SET a = a + 2 WHERE b = 3;);

would yield:

A|B|
----
1|1|
4|3|

Note that provenance tracking for DDL commands is currently not supported yet.