
Error if subqueries contain more than one result #5651

Merged (18 commits) on Apr 2, 2021

Conversation

@frankmcsherry (Contributor) commented Feb 8, 2021

This PR demonstrates (perhaps incorrectly) how subqueries can be checked for multiple results, and yield errors if more than one record is returned for any key. I haven't worked in this code for a while, so it is possible that the random numbers I've picked are no longer the right ones, and there are better ways to do things.

The gist is that before returning results in a SELECT subquery, we also count the number of occurrences of each of the leading attributes. Any counts greater than one result in an error. The error could be enriched to present both the keys and the count, with some more attention.
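The counting scheme described here can be sketched outside the dataflow setting. The following is a hypothetical Rust illustration, not the actual Materialize code; `check_subquery` and its key/value shape are invented for the example.

```rust
use std::collections::HashMap;

// Hypothetical sketch (not the actual Materialize code): mimic the added
// Reduce/Filter stage by counting subquery results per key and raising
// the subquery error if any key produced more than one record.
fn check_subquery(results: &[(i64, i64)]) -> Result<(), String> {
    let mut counts: HashMap<i64, usize> = HashMap::new();
    for (key, _value) in results {
        *counts.entry(*key).or_insert(0) += 1;
    }
    if counts.values().any(|&c| c > 1) {
        Err("more than one record produced in subquery".into())
    } else {
        Ok(())
    }
}

fn main() {
    // At most one result per key: fine.
    assert!(check_subquery(&[(1, 10), (2, 20)]).is_ok());
    // Key 1 produces two records: error.
    assert!(check_subquery(&[(1, 10), (1, 11)]).is_err());
}
```

Enriching the error with the offending key and its count would amount to formatting them into the error string instead of the fixed message.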

This currently makes MZ unrunnable, with the error:

thread 'main' panicked at 'internal error: failed to load bootstrap view:
pg_type
error:
more than one record produced in subquery', src/coord/src/catalog.rs:633:29


@frankmcsherry frankmcsherry requested a review from benesch February 8, 2021 03:12
@frankmcsherry (Contributor, Author):

This approach introduces complexity to subquery plans, but in the case that the subquery can be statically seen to produce only a single record for each key, we should be able to optimize it down to the original plan: reduce calls on distinct keys are turned into maps (and count specifically turns into a 1), and then the filter that sees the constant-propagated count should evaluate to false, so the whole branch is replaced by the empty collection. It might not optimize out, but that would be an optimization bug or two to chase down.

For subqueries that cannot be statically seen to produce single results, this maintains a count for each key, which seems about as harmless as you could make it. The maintained state could get 2x smaller if we could push the filter into the reduce, but that is a ways out still.

@benesch (Contributor) commented Feb 8, 2021

So I looked into the pg_type issue a bit. Commenting out the view definition from builtins.rs gets us a bootable server, so we can manually type EXPLAIN. The EXPLAIN itself hits the subquery error, which is strongly suggestive of something in the optimizer tripping over the error, rather than an actual problem with the subquery.

EXPLAIN PLAN FOR SELECT
    mz_types.oid,
    mz_types.name AS typname,
    mz_schemas.oid AS typnamespace,
    typtype,
    0::pg_catalog.oid AS typrelid,
    NULL::pg_catalog.oid AS typelem,
    coalesce(
        (
            SELECT
                t.oid
            FROM
                mz_catalog.mz_array_types AS a
                JOIN mz_catalog.mz_types AS t ON a.type_id = t.id
            WHERE
                a.element_id = mz_types.id
        ),
        0
    )
        AS typarray,
    NULL::pg_catalog.oid AS typreceive,
    false::pg_catalog.bool AS typnotnull,
    0::pg_catalog.oid AS typbasetype
FROM
    mz_catalog.mz_types
    JOIN mz_catalog.mz_schemas ON mz_schemas.id = mz_types.schema_id
    JOIN (
            SELECT type_id, 'a' AS typtype FROM mz_catalog.mz_array_types
            UNION ALL SELECT type_id, 'b' FROM mz_catalog.mz_base_types
            UNION ALL SELECT type_id, 'l' FROM mz_catalog.mz_list_types
            UNION ALL SELECT type_id, 'm' FROM mz_catalog.mz_map_types
            UNION ALL SELECT type_id, 'p' FROM mz_catalog.mz_pseudo_types
        )
            AS t ON mz_types.id = t.type_id

@benesch (Contributor) commented Feb 8, 2021

Well, the issue is ColumnKnowledge, which just explodes if it sees a literal error. That seems wrong, I guess, though our rules about when we promise not to evaluate something are really murky.

@benesch (Contributor) commented Feb 8, 2021

Ok, I pushed up a fix for that issue. That gets the server booted up properly. There is now another issue, which will be exposed when CI runs SLTs, in which this call to unwrap panics due to this query:

    CREATE VIEW github_2293 AS SELECT 1 AS a;
    SELECT 1::decimal(38,0) / 0 / a + 1 FROM github_2293;

Not quite sure what code is to blame here, but that call to unwrap looks dangerous even if the fault lies elsewhere.
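For illustration, here is a hedged sketch of the hazard (invented function names, not the real call site): once building a dataflow can fail, an unwrap turns a per-query error into a process panic, whereas propagating the Result keeps the server alive.

```rust
// Invented names for illustration; not the real Materialize call site.
// A build step that can now fail (e.g. a literal error in the plan).
fn build_dataflow(plan_contains_error: bool) -> Result<&'static str, &'static str> {
    if plan_contains_error {
        Err("more than one record produced in subquery")
    } else {
        Ok("dataflow")
    }
}

// Propagating the error surfaces it to the client, where an `unwrap()`
// would instead panic the whole process.
fn handle_query(plan_contains_error: bool) -> Result<(), String> {
    let dataflow = build_dataflow(plan_contains_error).map_err(|e| e.to_string())?;
    let _ = dataflow; // shipping the dataflow would go here
    Ok(())
}

fn main() {
    assert!(handle_query(false).is_ok());
    // The error reaches the client rather than panicking the server.
    assert!(handle_query(true).is_err());
}
```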

@benesch (Contributor) commented Feb 8, 2021

So, I'm not really sure what to do about this. The problem is that we used to assume that building a dataflow was an infallible operation, and now it is not. The flow for a, say, CREATE MATERIALIZED VIEW statement looks like this:

  1. Plan query. Can fail.
  2. Optimize query. Can fail.
  3. Atomically insert view and index into catalog. Can fail, but succeeds or fails atomically.
  4. Build dataflow. Must not fail, because once we've inserted into the catalog, we're committed.
  5. Ship dataflow. Must not fail for the same reason.

So, (4) is now a fallible operation. Therefore it needs to happen before we update the catalog. Unfortunately that doesn't work because building a dataflow depends on having an updated catalog.

I think the proper fix for this probably involves interactive catalog transactions. E.g.:

let plan = plan(sql_string)?;
let plan = optimize(plan)?;
let cat_tx = begin_catalog_transaction();
cat_tx.create_view(plan.view_name);
cat_tx.create_index(plan.index_name, plan.view_name);
let dataflow = build_index_dataflow(cat_tx, plan.index_name)?;
cat_tx.commit();
ship_dataflow(dataflow);

That's going to be a heavy lift though. We don't have anything like that cat_tx object at the moment. Perhaps there is something hacky we can do in the DataflowBuilder so it can find the in-progress index/view definitions via a side channel, rather than looking for them in the catalog.

@philip-stoev (Contributor):

@benesch It seems to me that MaterializeInc/database-issues#1715 is an example of either point 4 or point 5 from your list above.

@frankmcsherry (Contributor, Author):

I think MaterializeInc/database-issues#1715 is point 5, in that we ship a dataflow that should be fine to build but we error building it. There is a fix up for it. I'm not sure I'd call it point 4, in that I think the only "corruption" is that the catalog contains an installed dataflow that will error on construction (so, the catalog correctly persists a correct thing that tickles a bug downstream).

@frankmcsherry (Contributor, Author):

panics due to this query:

Is it obvious that it should panic? Like, it is a query that can be statically determined to be an error, but that's fine and we should probably reduce it down to that error and then return the error? Or is the transform operation meant to return evaluation errors itself, rather than optimize the expression down to something that may be a literal error?

@frankmcsherry (Contributor, Author):

Different statement: I don't know what the intent of the Result return from optimize and transform should be. It seems to mean both "something went wrong in optimization" and "we found an eval error along the way". Feels like maybe the latter could be handled more delicately? It doesn't seem to mean that transformation has failed, just that the optimized expression will error.
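As a hypothetical sketch of the distinction being drawn here (invented types, not the real transform API): a pass can separate "the transform itself failed" from "the optimized expression will evaluate to an error" by nesting the two Results.

```rust
// Invented types for illustration; not the actual Materialize definitions.
#[derive(Debug, PartialEq)]
enum EvalError {
    DivisionByZero,
}

#[derive(Debug)]
enum TransformError {
    FixedPointLimitExceeded,
}

// Constant-folding a division: the *transform* succeeds either way; a
// division by zero is recorded as the value of the expression, to be
// raised only if the expression is ever evaluated.
fn fold_division(
    numerator: i64,
    denominator: i64,
) -> Result<Result<i64, EvalError>, TransformError> {
    if denominator == 0 {
        Ok(Err(EvalError::DivisionByZero))
    } else {
        Ok(Ok(numerator / denominator))
    }
}

fn main() {
    assert_eq!(fold_division(6, 2).unwrap(), Ok(3));
    // Optimization did not "fail"; the expression is simply an error.
    assert_eq!(fold_division(1, 0).unwrap(), Err(EvalError::DivisionByZero));
}
```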

@benesch (Contributor) commented Feb 8, 2021

Yeah, agree that those two things are conflated when they ought not to be. We can maybe buy ourselves some time by handling EvalErrors more delicately. That unwrap still stresses me out though! Even if we handle EvalErrors separately, what makes that unwrap safe?

@frankmcsherry (Contributor, Author):

The only remaining error type is a fixed point evaluation error, which .. yeah could still happen but is now unexpected vs EvalError which we do expect. Seems like a decent spot fix, and can then file an issue that other transform errors can take down MZ (and we can think about restringing things so that this could error, and see what that breaks).

@benesch (Contributor) commented Feb 8, 2021

Yeah, my thoughts exactly!

benesch added a commit to benesch/materialize that referenced this pull request Feb 8, 2021
Allow RelationExpr::Constant to represent constant errors. This makes it
possible for the optimizer to stuff any errors it generates during
constant folding back into the RelationExpr so that those errors occur
during execution of the query rather than during optimization.

This unblocks an issue uncovered in MaterializeInc#5651.
@benesch (Contributor) commented Feb 9, 2021

I rebased this on #5669 locally and SLT is largely happy, modulo an errant subquery error or two that may or may not be legit!

frankmcsherry added a commit that referenced this pull request Feb 9, 2021
* transform: don't propagate expression errors in ColumnKnowledge

The ColumnKnowledge transformation looks at every literal in the
expression. If any of those literals are errors, it returns those errors
to the client immediately. This behavior is arguably wrong, because
those errors might never be evaluated.

Rewire ColumnKnowledge to instead pack these errors into the
DatumKnowledge it builds, rather than returning them to the client.

* expr: allow constant errors

Allow RelationExpr::Constant to represent constant errors. This makes it
possible for the optimizer to stuff any errors it generates during
constant folding back into the RelationExpr so that those errors occur
during execution of the query rather than during optimization.

This unblocks an issue uncovered in #5651.

* strip lifted maps from first constant row

* improve err display

* more bugfixes

* ensure all workers build the same dataflow

* fold constant errors too

* correct some pgtests again

Co-authored-by: Frank McSherry <fmcsherry@me.com>
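The "allow constant errors" change in the commits above can be sketched as follows. The types are simplified stand-ins, not the real RelationExpr definition; the point is that the constant variant carries a Result, so constant folding stuffs errors back into the tree instead of aborting optimization.

```rust
// Simplified stand-ins for illustration; not the real Materialize types.
#[derive(Debug, Clone, PartialEq)]
enum RelationExpr {
    Constant(Result<Vec<i64>, String>),
    // ...other variants elided...
}

// Constant folding of a division: on error, return a constant-error
// expression rather than returning the error to the caller, so the
// error occurs at query execution time, not at optimization time.
fn fold_div(numerator: i64, denominator: i64) -> RelationExpr {
    if denominator == 0 {
        RelationExpr::Constant(Err("division by zero".into()))
    } else {
        RelationExpr::Constant(Ok(vec![numerator / denominator]))
    }
}

fn main() {
    assert_eq!(fold_div(6, 3), RelationExpr::Constant(Ok(vec![2])));
    // Optimization completes; the error is deferred to execution.
    assert!(matches!(fold_div(1, 0), RelationExpr::Constant(Err(_))));
}
```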
@frankmcsherry (Contributor, Author) commented Feb 9, 2021

Thanks @benesch. This now starts up and has the following behavior:

materialize=> create table t1 (f1 integer);
CREATE TABLE
materialize=> insert into t1  values (1),(2);
INSERT 0 2
materialize=> select 1, (select * from t1) from t1;
ERROR:  Evaluation error: more than one record produced in subquery
materialize=> 

Related: should we have a policy of requiring PRs to contain any @philip-stoev discovered errors as regression tests?

@frankmcsherry (Contributor, Author):

Okies, now that it is up and running, there are naturally lots of regressions in query plans. Time to check those out.

@frankmcsherry (Contributor, Author) commented Feb 9, 2021

The only regressions are in chbench.slt and tpch.slt around global aggregations, which we cannot easily determine to have a key (because we have to do the tap dance where we put a default value in if there isn't a value: the union/project/negate stuff). It appears all of the other subqueries used key-based results, or were better structured with e.g. EXISTS and IN.

Edit: that wasn't entirely correct. TPCH Q02 seems to have a regression; I'll look into that.

Edit: although the plans are currently gross, I don't foresee performance problems making things around global aggregates gross, because there should only be very few records in there.

@frankmcsherry (Contributor, Author):

I think I see the issue here, which is that the multiplicity counting happens after the "outer join" work that jams on nulls if a group was empty. That isn't helpful, because we lose key information and clearly adding on those nulls only takes the count from zero to one, not to anything beyond. I'm going to look and see if it is easy to intervene and stitch in an error before that happens.

@frankmcsherry (Contributor, Author):

For concreteness, here is the plan fragment that has me worried, from TPCH Q02:

...
%12 = Let l2 =
| Join %7 %8 %9 %10 %11 (= #0 #1) (= #2 #5) (= #8 #13) (= #15 #11)
| | implementation = DeltaQuery
| |   delta %7 %8.(#0) %9.(#0) %10.(#0) %11.(#0)
| |   delta %8 %7.(#0) %9.(#0) %10.(#0) %11.(#0)
| |   delta %9 %10.(#0) %11.(#0) %8.(#1) %7.(#0)
| |   delta %10 %11.(#0) %9.(#3) %8.(#1) %7.(#0)
| |   delta %11 %10.(#2) %9.(#3) %8.(#1) %7.(#0)
| | demand = (#0, #4, #12)
| Filter (#18 = "EUROPE")
| Reduce group=(#0)
| | agg min(#4)

%13 =
| Get %12 (l2)
| Negate
| Project (#0)

%14 =
| Union %13 %6
| Map null

%15 = Let l3 =
| Union %12 %14

%16 =
| Get %5 (l0)
| ArrangeBy (#0, #13)

%17 =
| Get %15 (l3)
| Reduce group=(#0)
| | agg count(true)
| Filter (#1 > 1)
| Map (err: more than one record produced in subquery)
| Project (#0, #2)
...

Notice how %17, which does the counting and error production, depends on %15, which stitches nulls onto things missing from %12; %12 seems to have been the better thing to count for error purposes.

@frankmcsherry (Contributor, Author) commented Feb 9, 2021

Ah, so what is happening here is that TPCH Q02 is a global aggregation, with no key, although it is correlated. As such, it gets its null values (edit: its default values, not necessarily null, which is important and probably why it happens there) packed on when we lower the global aggregation, rather than later on when we call lookup to get out of the SELECT.

@frankmcsherry (Contributor, Author) commented Feb 9, 2021

Random additional thought: this can be implemented at nearly zero overhead, if instead of doing

select
    .project(input.arity())
    .count()
    .filter(|x| x > 1)
    .map(err)

we do

select
    .project(input.arity())
    .concat(input.negate())
    .threshold()
    .map(err)

The stateful operator in here, threshold, only keeps state proportional to the input collection size, and for all subqueries that have exactly one result the key will be absent from this collection. It would instead track only those keys with zero or multiple results.

The threshold state appears sufficient to implement the complexity that is fn lookup, for what it's worth.

edit: the input needs a distinct before it is used, which might scupper the argument.
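To make the threshold argument concrete, here is a hypothetical simulation in plain Rust (an invented function, not actual differential dataflow operators): subtract one record per distinct input key from the subquery output and keep only keys with positive remaining multiplicity, as threshold would. Keys with exactly one result cancel out.

```rust
use std::collections::HashMap;

// Hypothetical simulation; not actual differential dataflow operators.
// Returns the keys whose subqueries produced more than one record.
fn keys_in_error(subquery_keys: &[i64], input_keys: &[i64]) -> Vec<i64> {
    let mut counts: HashMap<i64, i64> = HashMap::new();
    for k in subquery_keys {
        *counts.entry(*k).or_insert(0) += 1;
    }
    // The `distinct` on the input that the edit note above calls for.
    let mut distinct = input_keys.to_vec();
    distinct.sort_unstable();
    distinct.dedup();
    for k in &distinct {
        *counts.entry(*k).or_insert(0) -= 1;
    }
    // `threshold` keeps records with positive multiplicity: keys with one
    // result net to zero; keys with two or more remain and become errors.
    let mut bad: Vec<i64> = counts
        .into_iter()
        .filter(|&(_, c)| c > 0)
        .map(|(k, _)| k)
        .collect();
    bad.sort_unstable();
    bad
}

fn main() {
    // Key 1 returns one row (ok), key 2 two rows (error), key 3 none
    // (ok: the lookup produces NULL there, not an error).
    assert_eq!(keys_in_error(&[1, 2, 2], &[1, 2, 3]), vec![2]);
    assert_eq!(keys_in_error(&[5], &[5]), Vec::<i64>::new());
}
```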

@benesch (Contributor) commented Feb 13, 2021

I haven't been following this discussion super closely, @frankmcsherry. Let me know if I can help rubber duck things, or somesuch!

@frankmcsherry (Contributor, Author):

There are some plan regressions when subqueries use key-less aggregation (pretty common in tpch). Also, testdrive is failing with:

> SHOW DATABASES WHERE (SELECT name = name)
  | rows didn't match; sleeping to see if dataflow catches up 125ms 250ms 500ms 125ms
  | catalog.td:95:1: error: executing query failed: db error: ERROR: Evaluation error: more than one record produced in subquery
  |    |
  | 94 | materialize
  | 95 | > SHOW DATABASES WHERE (SELECT name = name)
  |    | ^
  | Failed! giving up: testdrive-svc --aws-region=us-east-2 --ci-output '*.td' 'esoteric/*.td'

@frankmcsherry (Contributor, Author):

The plan regressions would be pretty easy to fix with a transform analysis, but they are vexed by the fact that we only know how to produce "key" information by calling typ(), and baking in complex analyses there seems like a long term bad call.

@frankmcsherry (Contributor, Author) commented Feb 14, 2021

I put in a fair bit of a hack to MirRelationExpr::typ() to notice the union idiom and provide unique keys in that case. At some point someone will want to do the work to turn this into a transform analysis instead of part of the type, but that sounds like a lot of work. This removes all of the planning regressions, except from joins.slt, where we previously had the wrong plan because there was no unique key structure.

@frankmcsherry (Contributor, Author):

@benesch it's pretty non-obvious what the query plans are for SHOW DATABASES. Searching around in the codebase did not reveal where Statement::ShowDatabases gets turned into a dataflow, and you can't type explain plan for show databases where (select name = name). So, I'm blocked on that.

@philip-stoev (Contributor):

@frankmcsherry the test recorded the situation as it was back then. I will push a fix shortly.

@frankmcsherry frankmcsherry requested review from wangandi and removed request for benesch March 19, 2021 23:57
@frankmcsherry (Contributor, Author):

Hi @wangandi! I am going to try and clean up the conflicts here, but I thought it might be good to have you take a look at it, as it touches several things we've discussed recently. I may try to break it into a few PRs; the commits here were trying to sort out whether it was possible without breaking all the tests, and I think it was. However, it may be worth breaking out a few stages:

  1. Add the Reduce that counts results from a subquery and errors if there are more than one.
  2. Add the key logic that notices global aggregations, and provides a key even in that case (optimizes away the above).
  3. Add various predicate pushdown guards that do not push down literal errors, and that normalize predicate order, sorting literal errors last.

I'll clean up the conflicts, and if it seems appropriate I can re-do as a few PRs with the above structure (until I can't, and then I'll come back explaining how it is hard and I failed).

@frankmcsherry (Contributor, Author) commented Mar 22, 2021

This is now erroring on a test

CREATE TABLE two_rows (f1 INTEGER, f2 INTEGER);
INSERT INTO two_rows VALUES (1, 1), (2, 1);

SELECT * 
FROM two_rows AS o 
WHERE (SELECT f1 FROM two_rows AS i WHERE o.f1 = i.f1) = 1 
  AND f1 = 2;

where it returns the empty set, but the test expects to see an error. The subquery doesn't look like it should return multiple results. I changed it to use the f2 column, which does return an error, as that column does not contain unique values the way f1 does.

It was also erroring on the LIMIT 1 query, so perhaps it was just a bad merge somewhere along the rebase.

@frankmcsherry frankmcsherry marked this pull request as ready for review March 22, 2021 03:30
Comment on lines +477 to +480
// TODO: perhaps ensure that (above) A.proj(key) is a
// subset of B, as otherwise there are negative records
// and who knows what is true (not expected, but again
// who knows what the query plan might look like).
Contributor:
Is this why we are looking for such a complicated pattern instead of just B - A, where B has a unique key and A has the same unique key?

Contributor Author:

I'm not sure I follow exactly, but .. if A is not a subset of B then there could be negative records.

@frankmcsherry frankmcsherry merged commit 49c862d into MaterializeInc:main Apr 2, 2021
@frankmcsherry frankmcsherry deleted the error_subqueries branch March 8, 2022 13:25