Skip to content
This repository was archived by the owner on Feb 20, 2023. It is now read-only.
This repository was archived by the owner on Feb 20, 2023. It is now read-only.

Nested ORDER BY and LIMIT crashes the DB #1423

Open
@jkosh44

Description

@jkosh44

Bug Report

Summary

According to the SQL standard ORDER BY is not allowed in nested queries (I can't actually find the SQL Standard to confirm this, but I found a handful of sites that have said this. The best source I can find is this MariaDB post: https://mariadb.com/kb/en/why-is-order-by-in-a-from-subquery-ignored/). Different systems handle this in different ways. From some brief research, these seem to be the possible options when we encounter an ORDER BY clause in a nested query:

  1. Respond back with an error saying that nested ORDER BYs aren't allowed in nested queries
  2. Ignore the nested ORDER BY
  3. Execute the nested ORDER BY
  4. Only execute the nested ORDER BY if it's accompanied by a LIMIT clause.
  • The rationale behind option 4 (I think) is that often, but not always, an ORDER BY in a nested query would not change the result of the query unless it's accompanied by a LIMIT.

Our system will ignore an ORDER BY clause unless it is in the outermost query:

// Build the QueryInfo object. For SELECTs this may require a bunch of other stuff from the original statement.
// If any more logic like this is needed in the future, we should break this into its own function somewhere since
// this is Optimizer-specific stuff.
const auto type = query->GetStatement(0)->GetType();
if (type == parser::StatementType::SELECT) {
const auto sel_stmt = query->GetStatement(0).CastManagedPointerTo<parser::SelectStatement>();
// Output
output = sel_stmt->GetSelectColumns(); // TODO(Matt): this is making a local copy. Revisit the life cycle and
// immutability of all of these Optimizer inputs to reduce copies.
// PropertySort
if (sel_stmt->GetSelectOrderBy()) {
std::vector<optimizer::OrderByOrderingType> sort_dirs;
std::vector<common::ManagedPointer<parser::AbstractExpression>> sort_exprs;
auto order_by = sel_stmt->GetSelectOrderBy();
auto types = order_by->GetOrderByTypes();
auto exprs = order_by->GetOrderByExpressions();
for (size_t idx = 0; idx < order_by->GetOrderByExpressionsSize(); idx++) {
sort_exprs.emplace_back(exprs[idx]);
sort_dirs.push_back(types[idx] == parser::OrderType::kOrderAsc ? optimizer::OrderByOrderingType::ASC
: optimizer::OrderByOrderingType::DESC);
}
auto sort_prop = new optimizer::PropertySort(sort_exprs, sort_dirs);
property_set.AddProperty(sort_prop);
}
}

or unless it is accompanied by a LIMIT clause:
void ChildPropertyDeriver::Visit(const Limit *op) {
// Limit fulfill the internal sort property
std::vector<PropertySet *> child_input_properties{new PropertySet()};
auto provided_prop = new PropertySet();
if (!op->GetSortExpressions().empty()) {
const std::vector<common::ManagedPointer<parser::AbstractExpression>> &exprs = op->GetSortExpressions();
const std::vector<OrderByOrderingType> &sorts{op->GetSortAscending()};
provided_prop->AddProperty(new PropertySort(exprs, sorts));
}
output_.emplace_back(provided_prop, std::move(child_input_properties));
}

When we actually encounter both an ORDER BY and a LIMIT in a nested query then the database crashes. Specifically with a segfault on this line:

double GetCost(PropertySet *requirements) const { return std::get<0>(lowest_cost_table_.find(requirements)->second); }

lowest_cost_table_ does not contain requirements.

Environment

OS: Ubuntu (LTS) 20.04

Compiler: GCC 7.0+

CMake Profile: Debug

Jenkins/CI: N/A

Steps to Reproduce

  1. Compile with the following args: -DCMAKE_BUILD_TYPE=Debug -DNOISEPAGE_USE_ASAN=ON
  2. Run noisepage with parallel execution turned off noisepage -parallel_execution=false
  3. Execute a nested query with an ORDER BY and a LIMIT

Expected Behavior

postgres=# CREATE TABLE foo (a int);
CREATE TABLE
postgres=# CREATE TABLE bar (a int);
CREATE TABLE
postgres=# SELECT a FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 1);
 a 
---
(0 rows)

Actual Behavior

noisepage=# CREATE TABLE foo (a int);
CREATE TABLE
noisepage=# CREATE TABLE bar (a int);
CREATE TABLE
noisepage=# SELECT a FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 1);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Discussion

ORDER BY Effects

I just briefly wanted to present some examples of when an ORDER BY could influence the result IF it was executed

For all of the following queries assume the following two tables have been created and filled with random data

CREATE TABLE foo (a int);
CREATE TABLE bar (a int);
SELECT * FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a);

Here the ORDER BY makes no difference. The result of the nested query is part of an IN predicate and therefore treated as an unordered list, so it doesn't matter how it's ordered.

SELECT * FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 2);

Here the ORDER BY does make a difference. We only include the top two values of a in the right side of the IN predicate.

SELECT * FROM (SELECT a FROM foo ORDER BY a) AS f LIMIT 2;

This is probably a stupid query, but the ORDER BY does make a difference. The result of this query should be the same as SELECT a FROM foo ORDER BY a LIMIT 2;, however if we ignore the ORDER BY this is not necessarily the case. I actually tested this on postgres and it seems like the ORDER BY is executed.

postgres=# CREATE TABLE foo (a int);
CREATE TABLE
postgres=# INSERT INTO foo VALUES(10);
INSERT 0 1
postgres=# INSERT INTO foo VALUES(15);
INSERT 0 1
postgres=# INSERT INTO foo VALUES(1);
INSERT 0 1
postgres=# SELECT a FROM foo;
 a  
----
 10
 15
  1
(3 rows)

postgres=# SELECT * FROM (SELECT a FROM foo ORDER BY a) AS f LIMIT 2;
 a  
----
  1
 10
(2 rows)

Properties

ORDER BY clauses are implemented with properties in the optimizer. That means that while there is an OrderByPlanNode for physical plans (https://github.com/cmu-db/noisepage/blob/master/src/include/planner/plannodes/limit_plan_node.h), there is no LogicalOrderBy operator in the logical operator trees. Instead we have a PropertySort property (https://github.com/cmu-db/noisepage/blob/master/src/include/optimizer/properties.h).

Currently PropertySort is the only property in the optimizer and the implementation for deriving this property (linked above in this issue) is very specific to ORDER BY and very tightly coupled to the LogicalLimit operator. Understanding this will probably be important to fixing this issue. Also I have the following concerns with this approach (other than this bug that it's causing):

  • I found it pretty confusing and it took me awhile to figure out how and why PropertySort was so tightly coupled with LIMIT. I think a comment or some documentation would go a long way here.
  • This may make adding additional properties more difficult.
    • If every property has it's own unique place and method of being derived then it can be very difficult to maintain and understand.
    • It's possible that not every property will be so tightly coupled to some other node like ORDER BY is. That means that the approach we use in deriving PropertySort, where we save the information in some other node, MIGHT not be possible. I don't really know enough about properties though to know if this is true.
    • In Optimizer does not successfully time out when query has a limit #1414 we want to convert LIMIT into a property. This may break our current derivation of PropertySort.

I don't know enough about the optimizer to know if these are valid concerns, but thought they warranted some discussion. They probably don't need to be resolved for this issue but are somewhat related so I thought I'd bring them up.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working (correctness). Mark issues with this.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions