Skip to content
This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Nested ORDER BY and LIMIT crashes the DB #1423

Open
jkosh44 opened this issue Dec 29, 2020 · 4 comments
Open

Nested ORDER BY and LIMIT crashes the DB #1423

jkosh44 opened this issue Dec 29, 2020 · 4 comments
Labels
bug Something isn't working (correctness). Mark issues with this.

Comments

@jkosh44
Copy link
Contributor

jkosh44 commented Dec 29, 2020

Bug Report

Summary

According to the SQL standard ORDER BY is not allowed in nested queries (I can't actually find the SQL Standard to confirm this, but I found a handful of sites that have said this. The best source I can find is this MariaDB post: https://mariadb.com/kb/en/why-is-order-by-in-a-from-subquery-ignored/). Different systems handle this in different ways. From some brief research, these seem to be the possible options when we encounter an ORDER BY clause in a nested query:

  1. Respond back with an error saying that nested ORDER BYs aren't allowed in nested queries
  2. Ignore the nested ORDER BY
  3. Execute the nested ORDER BY
  4. Only execute the nested ORDER BY if it's accompanied by a LIMIT clause.
  • The rationale behind option 4 (I think) is that often, but not always, an ORDER BY in a nested query would not change the result of the query unless it's accompanied by a LIMIT.

Our system will ignore an ORDER BY clause unless it is in the outermost query:

// Build the QueryInfo object. For SELECTs this may require a bunch of other stuff from the original statement.
// If any more logic like this is needed in the future, we should break this into its own function somewhere since
// this is Optimizer-specific stuff.
const auto type = query->GetStatement(0)->GetType();
if (type == parser::StatementType::SELECT) {
const auto sel_stmt = query->GetStatement(0).CastManagedPointerTo<parser::SelectStatement>();
// Output
output = sel_stmt->GetSelectColumns(); // TODO(Matt): this is making a local copy. Revisit the life cycle and
// immutability of all of these Optimizer inputs to reduce copies.
// PropertySort
if (sel_stmt->GetSelectOrderBy()) {
std::vector<optimizer::OrderByOrderingType> sort_dirs;
std::vector<common::ManagedPointer<parser::AbstractExpression>> sort_exprs;
auto order_by = sel_stmt->GetSelectOrderBy();
auto types = order_by->GetOrderByTypes();
auto exprs = order_by->GetOrderByExpressions();
for (size_t idx = 0; idx < order_by->GetOrderByExpressionsSize(); idx++) {
sort_exprs.emplace_back(exprs[idx]);
sort_dirs.push_back(types[idx] == parser::OrderType::kOrderAsc ? optimizer::OrderByOrderingType::ASC
: optimizer::OrderByOrderingType::DESC);
}
auto sort_prop = new optimizer::PropertySort(sort_exprs, sort_dirs);
property_set.AddProperty(sort_prop);
}
}

or unless it is accompanied by a LIMIT clause:
void ChildPropertyDeriver::Visit(const Limit *op) {
// Limit fulfill the internal sort property
std::vector<PropertySet *> child_input_properties{new PropertySet()};
auto provided_prop = new PropertySet();
if (!op->GetSortExpressions().empty()) {
const std::vector<common::ManagedPointer<parser::AbstractExpression>> &exprs = op->GetSortExpressions();
const std::vector<OrderByOrderingType> &sorts{op->GetSortAscending()};
provided_prop->AddProperty(new PropertySort(exprs, sorts));
}
output_.emplace_back(provided_prop, std::move(child_input_properties));
}

When we actually encounter both an ORDER BY and a LIMIT in a nested query then the database crashes. Specifically with a segfault on this line:

double GetCost(PropertySet *requirements) const { return std::get<0>(lowest_cost_table_.find(requirements)->second); }

lowest_cost_table_ does not contain requirements.

Environment

OS: Ubuntu (LTS) 20.04

Compiler: GCC 7.0+

CMake Profile: Debug

Jenkins/CI: N/A

Steps to Reproduce

  1. Compile with the following args: -DCMAKE_BUILD_TYPE=Debug -DNOISEPAGE_USE_ASAN=ON
  2. Run noisepage with parallel execution turned off noisepage -parallel_execution=false
  3. Execute a nested query with an ORDER BY and a LIMIT

Expected Behavior

postgres=# CREATE TABLE foo (a int);
CREATE TABLE
postgres=# CREATE TABLE bar (a int);
CREATE TABLE
postgres=# SELECT a FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 1);
 a 
---
(0 rows)

Actual Behavior

noisepage=# CREATE TABLE foo (a int);
CREATE TABLE
noisepage=# CREATE TABLE bar (a int);
CREATE TABLE
noisepage=# SELECT a FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 1);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Discussion

ORDER BY Effects

I just briefly wanted to present some examples of when an ORDER BY could influence the result IF it was executed

For all of the following queries assume the following two tables have been created and filled with random data

CREATE TABLE foo (a int);
CREATE TABLE bar (a int);
SELECT * FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a);

Here the ORDER BY makes no difference. The result of the nested query is part of an IN predicate and therefore treated as an unordered list, so it doesn't matter how it's ordered.

SELECT * FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 2);

Here the ORDER BY does make a difference. We only include the top two values of a in the right side of the IN predicate.

SELECT * FROM (SELECT a FROM foo ORDER BY a) AS f LIMIT 2;

This is probably a stupid query, but the ORDER BY does make a difference. The result of this query should be the same as SELECT a FROM foo ORDER BY a LIMIT 2;, however if we ignore the ORDER BY this is not necessarily the case. I actually tested this on postgres and it seems like the ORDER BY is executed.

postgres=# CREATE TABLE foo (a int);
CREATE TABLE
postgres=# INSERT INTO foo VALUES(10);
INSERT 0 1
postgres=# INSERT INTO foo VALUES(15);
INSERT 0 1
postgres=# INSERT INTO foo VALUES(1);
INSERT 0 1
postgres=# SELECT a FROM foo;
 a  
----
 10
 15
  1
(3 rows)

postgres=# SELECT * FROM (SELECT a FROM foo ORDER BY a) AS f LIMIT 2;
 a  
----
  1
 10
(2 rows)

Properties

ORDER BY clauses are implemented with properties in the optimizer. That means that while there is an OrderByPlanNode for physical plans (https://github.com/cmu-db/noisepage/blob/master/src/include/planner/plannodes/limit_plan_node.h), there is no LogicalOrderBy operator in the logical operator trees. Instead we have a PropertySort property (https://github.com/cmu-db/noisepage/blob/master/src/include/optimizer/properties.h).

Currently PropertySort is the only property in the optimizer and the implementation for deriving this property (linked above in this issue) is very specific to ORDER BY and very tightly coupled to the LogicalLimit operator. Understanding this will probably be important to fixing this issue. Also I have the following concerns with this approach (other than this bug that it's causing):

  • I found it pretty confusing and it took me awhile to figure out how and why PropertySort was so tightly coupled with LIMIT. I think a comment or some documentation would go a long way here.
  • This may make adding additional properties more difficult.
    • If every property has it's own unique place and method of being derived then it can be very difficult to maintain and understand.
    • It's possible that not every property will be so tightly coupled to some other node like ORDER BY is. That means that the approach we use in deriving PropertySort, where we save the information in some other node, MIGHT not be possible. I don't really know enough about properties though to know if this is true.
    • In Optimizer does not successfully time out when query has a limit #1414 we want to convert LIMIT into a property. This may break our current derivation of PropertySort.

I don't know enough about the optimizer to know if these are valid concerns, but thought they warranted some discussion. They probably don't need to be resolved for this issue but are somewhat related so I thought I'd bring them up.

@jkosh44 jkosh44 added the bug Something isn't working (correctness). Mark issues with this. label Dec 29, 2020
@thepinetree
Copy link
Contributor

Addressing some things based on my own observations, I agree with the refactor of LIMIT as a property, since it has much the same behavior of an ORDER BY (an additional node in the output, but one which can also be propagated down to lower nodes).

To my knowledge, the only property initially envisoned was a Sort Property, so at the moment I don't think that any new properties would not be able to be coupled to a particular node, or pushed down otherwise. LIMITs in particular should be able to be pushed down to the child Get though I'm not sure how nested queries break this, as it seems the currently do.

Our current rules for PropertySort should not break if a PropertyLimit is added because properties of a specific type are accessed with the GetPropertyOfType function in the PropertySet class, though if this is not done anywhere (i.e. just indexed into the property set), it should be replaced as such. However, this information will be more useful with the idea of an 'optional property,' one that might be satisfied by a child node, and will no longer require the output of a new Limit or OrderBy node where the property existed.

A bunch of property-related ideas are listed on #1421, if ok with you, I'd like to add a link to this issue in that checklist.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 1, 2021

@thepinetree Feel free to link this issue.

I guess I don't fully understand the difference between when we create a PropertySort here:

// Build the QueryInfo object. For SELECTs this may require a bunch of other stuff from the original statement.
// If any more logic like this is needed in the future, we should break this into its own function somewhere since
// this is Optimizer-specific stuff.
const auto type = query->GetStatement(0)->GetType();
if (type == parser::StatementType::SELECT) {
const auto sel_stmt = query->GetStatement(0).CastManagedPointerTo<parser::SelectStatement>();
// Output
output = sel_stmt->GetSelectColumns(); // TODO(Matt): this is making a local copy. Revisit the life cycle and
// immutability of all of these Optimizer inputs to reduce copies.
// PropertySort
if (sel_stmt->GetSelectOrderBy()) {
std::vector<optimizer::OrderByOrderingType> sort_dirs;
std::vector<common::ManagedPointer<parser::AbstractExpression>> sort_exprs;
auto order_by = sel_stmt->GetSelectOrderBy();
auto types = order_by->GetOrderByTypes();
auto exprs = order_by->GetOrderByExpressions();
for (size_t idx = 0; idx < order_by->GetOrderByExpressionsSize(); idx++) {
sort_exprs.emplace_back(exprs[idx]);
sort_dirs.push_back(types[idx] == parser::OrderType::kOrderAsc ? optimizer::OrderByOrderingType::ASC
: optimizer::OrderByOrderingType::DESC);
}
auto sort_prop = new optimizer::PropertySort(sort_exprs, sort_dirs);
property_set.AddProperty(sort_prop);
}
}
and here:
void ChildPropertyDeriver::Visit(const Limit *op) {
// Limit fulfill the internal sort property
std::vector<PropertySet *> child_input_properties{new PropertySet()};
auto provided_prop = new PropertySet();
if (!op->GetSortExpressions().empty()) {
const std::vector<common::ManagedPointer<parser::AbstractExpression>> &exprs = op->GetSortExpressions();
const std::vector<OrderByOrderingType> &sorts{op->GetSortAscending()};
provided_prop->AddProperty(new PropertySort(exprs, sorts));
}
output_.emplace_back(provided_prop, std::move(child_input_properties));
}
I had sort of assumed that they were similar ways of creating a property for the query in different places, but I guess that they are used for different reasons?

For the outer query a PropertySort will be created in both places, but for nested queries a PropertySort will only be created in the second block of code, which might be the cause of this error.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 1, 2021

Ignoring the following method (because I don't really understand it),

void ChildPropertyDeriver::Visit(const Limit *op) {
// Limit fulfill the internal sort property
std::vector<PropertySet *> child_input_properties{new PropertySet()};
auto provided_prop = new PropertySet();
if (!op->GetSortExpressions().empty()) {
const std::vector<common::ManagedPointer<parser::AbstractExpression>> &exprs = op->GetSortExpressions();
const std::vector<OrderByOrderingType> &sorts{op->GetSortAscending()};
provided_prop->AddProperty(new PropertySort(exprs, sorts));
}
output_.emplace_back(provided_prop, std::move(child_input_properties));
}
we only derive properties for the outer query by checking if it has an ORDER BY clause
// Build the QueryInfo object. For SELECTs this may require a bunch of other stuff from the original statement.
// If any more logic like this is needed in the future, we should break this into its own function somewhere since
// this is Optimizer-specific stuff.
const auto type = query->GetStatement(0)->GetType();
if (type == parser::StatementType::SELECT) {
const auto sel_stmt = query->GetStatement(0).CastManagedPointerTo<parser::SelectStatement>();
// Output
output = sel_stmt->GetSelectColumns(); // TODO(Matt): this is making a local copy. Revisit the life cycle and
// immutability of all of these Optimizer inputs to reduce copies.
// PropertySort
if (sel_stmt->GetSelectOrderBy()) {
std::vector<optimizer::OrderByOrderingType> sort_dirs;
std::vector<common::ManagedPointer<parser::AbstractExpression>> sort_exprs;
auto order_by = sel_stmt->GetSelectOrderBy();
auto types = order_by->GetOrderByTypes();
auto exprs = order_by->GetOrderByExpressions();
for (size_t idx = 0; idx < order_by->GetOrderByExpressionsSize(); idx++) {
sort_exprs.emplace_back(exprs[idx]);
sort_dirs.push_back(types[idx] == parser::OrderType::kOrderAsc ? optimizer::OrderByOrderingType::ASC
: optimizer::OrderByOrderingType::DESC);
}
auto sort_prop = new optimizer::PropertySort(sort_exprs, sort_dirs);
property_set.AddProperty(sort_prop);
}
}

If we want to support a PropertyLimit, or support PropertySort for nested queries, or add another property that applies to nested queries I think that we'll need a new way of initially deriving and storing properties. A way that can derive properties for the outer and all inner queries and then stores those properties separately somehow. I'm not sure what the best way to do this is though. Maybe deriving the properties during the binder because we already traverse through all outer and nested queries. Maybe visiting all the statement nodes an additional time to get all the properties.

jkosh44 added a commit to jkosh44/noisepage that referenced this issue Jan 3, 2021
Currently if there's an ORDER BY and a LIMIT in a nested query, then
we get a segfault. The issue is that ORDER BY is converted into a
PropertySort, and properties are not pushed down into nested queries.
Additionally LIMIT nodes look at the sort expressions and order that
is stored in the node itself instead of looking at the required
properties during ChildPropertyDeriver::Visit. This creates a
disconnect between what the LIMIT nodes in nested queries expects
the required properties to be and what they actually are. Relying on
the actual properties instead of what's stored in the LIMIT node
ensures that the output properties and required properties stay in
synce.

The result of this change is that nested queries will ignore ORDER
BYs unless it is accompanied by a LIMIT. This is due to the fact that
the PropertySort will not be pushed down into nested queries, but
LIMITs store their own sort information and generate an OrderBy node
in the PlanGenerator.

Fixes cmu-db#1423
@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 3, 2021

jkosh44@fefcc60 Fixes this issue. Though it's probably still worth considering how to propagate properties down into nested queries.

The change should wait until after #1422 is merged or be integrated into #1422. It's possible depending on the final state of #1422 that this change is no longer necessary (@thepinetree tagging for visibility).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working (correctness). Mark issues with this.
Projects
None yet
Development

No branches or pull requests

2 participants