Skip to content
This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Backend for EXPLAIN (FORMAT TPL) queries. #1589

Merged
merged 8 commits into from
May 26, 2021

Conversation

lmwnshn
Copy link
Contributor

@lmwnshn lmwnshn commented May 23, 2021

Heading

First pass at supplying backend for EXPLAIN (FORMAT TPL) queries.

Description

#1588 provides the frontend for EXPLAIN (FORMAT ...) queries.

This PR attempts to provide the backend to that, structured as follows:

  • ExecutionSettings is currently described as "ExecutionSettings stores settings that are passed down from the upper layers.", so we are passing down a new CompilerSettings that determines what information should be gathered at compile-time. This can be set from the tcop upon detecting that this is an EXPLAIN statement of a relevant type. See comment below.
  • CompilerSettings just describes whether to collect TPL and/or TBC, and is threaded down into compiler.cpp.
  • Information is maintained in CompileTimeModuleMetadata which is currently just a unordered_map<string, string>. Two main considerations went into this decision:
    • You will either need to extend the lifetime of the objects you want metadata for, or grab the relevant information before the objects get destroyed. In this case, the AST for example is destroyed after compilation is done.
    • I cannot foresee any usages of manipulating the AST after compiler.cpp, so it might as well be serialized to string there instead of having its lifetime extended.

@lmwnshn lmwnshn added in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. ready-for-ci Indicate that this build should be run through CI. labels May 23, 2021
@lmwnshn lmwnshn self-assigned this May 23, 2021
Copy link
Contributor Author

@lmwnshn lmwnshn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some highlighted parts.

// Set any compilation settings based on the original query type.
if (portal->GetStatement()->GetQueryType() == network::QueryType::QUERY_EXPLAIN) {
execution::compiler::CompilerSettings settings;
// if (portal->GetStatement()->RootStatement().CastManagedPointerTo<parser::ExplainStatement>().GetFormat()...)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integration with #1588, check if TPL / TBC should be collected.

@@ -358,6 +357,24 @@ TrafficCopResult TrafficCop::ExecuteExplainStatement(
std::vector<planner::OutputSchema::Column> output_columns;
output_columns.emplace_back("QUERY PLAN", type::TypeId::VARCHAR, nullptr);

// TODO(WAN): Integrate with Matt's PR.
#if 0
// Codegen must happen for certain types of EXPLAIN metadata to be collected, e.g., collection of TPL.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integration with #1588, check the FORMAT to decide whether you're dumping the plan node, or codegenning the physical plan and then dumping the compile time metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it when a plan comes together.

std::string GetTBC() const { return map_.find("tbc") == map_.end() ? "" : map_.at("tbc"); }

private:
// TODO(WAN): The value-type may change in the future depending on what tagged dictionaries need.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@turingcompl33t Any thoughts here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the flexibility of the string -> string map. In my current implementation for the profiling subsystem, instances of tagging dictionaries are owned by the Statement for the associated query. This works from a lifetime perspective, but isn't great from a "separation of concerns" perspective - profiling information like the tagging dictionary is obviously far too low-level of a detail to push all the way up to the Statement.

The refactor I was considering goes something like this:

  • Make the profiling subsystem its own "top-level" layer, i.e. add ExecutionProfilingLayer to DBMain. This layer is then responsible for two main things:
    • Starting and stopping the dedicated task that collects samples at query execution time
    • Managing instances of tagging dictionaries that are populated by lower-level components of the subsystem

With this approach, we could then generate a unique identifier for each query that is used during compilation / execution to lookup the appropriate tagging dictionary that needs to be updated. It looks like all queries are already assigned a query_id, but I'm not sure of the uniqueness constraints on this identifier? Could we just invoke std::to_string() on this identifier and store it as the value for the "tags" key in the CompileTimeModuleMetadata? Just one idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how query ids work right now.

And to check that I understand this
CompileTimeModuleMetadata: {"tags": query id}
ExecutionProfilingLayer: {query id -> tagging dict}
?

@noisepage-checks
Copy link

Performance Boost!

Nice job! This PR has increased the throughput of the system.

Could not find any performance results to compare for this commit.

1 similar comment
@noisepage-checks
Copy link

Performance Boost!

Nice job! This PR has increased the throughput of the system.

Could not find any performance results to compare for this commit.

Copy link
Contributor

@turingcompl33t turingcompl33t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, overall a big fan of the design. None of my comments should stop the PR from moving forward so I'm marking as an Approve. I wrote a quick response to the question of the data structure / types for compile-time metadata collection but we can talk more about this at the meeting I imagine.

src/execution/compiler/compilation_context.cpp Outdated Show resolved Hide resolved
src/execution/compiler/compiler.cpp Show resolved Hide resolved
src/execution/compiler/executable_query.cpp Show resolved Hide resolved
src/include/execution/compiler/compilation_context.h Outdated Show resolved Hide resolved
src/include/execution/compiler/compiler_settings.h Outdated Show resolved Hide resolved
std::string GetTBC() const { return map_.find("tbc") == map_.end() ? "" : map_.at("tbc"); }

private:
// TODO(WAN): The value-type may change in the future depending on what tagged dictionaries need.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the flexibility of the string -> string map. In my current implementation for the profiling subsystem, instances of tagging dictionaries are owned by the Statement for the associated query. This works from a lifetime perspective, but isn't great from a "separation of concerns" perspective - profiling information like the tagging dictionary is obviously far too low-level of a detail to push all the way up to the Statement.

The refactor I was considering goes something like this:

  • Make the profiling subsystem its own "top-level" layer, i.e. add ExecutionProfilingLayer to DBMain. This layer is then responsible for two main things:
    • Starting and stopping the dedicated task that collects samples at query execution time
    • Managing instances of tagging dictionaries that are populated by lower-level components of the subsystem

With this approach, we could then generate a unique identifier for each query that is used during compilation / execution to lookup the appropriate tagging dictionary that needs to be updated. It looks like all queries are already assigned a query_id, but I'm not sure of the uniqueness constraints on this identifier? Could we just invoke std::to_string() on this identifier and store it as the value for the "tags" key in the CompileTimeModuleMetadata? Just one idea.

@@ -358,6 +357,24 @@ TrafficCopResult TrafficCop::ExecuteExplainStatement(
std::vector<planner::OutputSchema::Column> output_columns;
output_columns.emplace_back("QUERY PLAN", type::TypeId::VARCHAR, nullptr);

// TODO(WAN): Integrate with Matt's PR.
#if 0
// Codegen must happen for certain types of EXPLAIN metadata to be collected, e.g., collection of TPL.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it when a plan comes together.

@noisepage-checks
Copy link

Performance Boost!

Nice job! This PR has increased the throughput of the system.

Could not find any performance results to compare for this commit.

@noisepage-checks
Copy link

Performance Boost!

Nice job! This PR has increased the throughput of the system.

Could not find any performance results to compare for this commit.

2 similar comments
@noisepage-checks
Copy link

Performance Boost!

Nice job! This PR has increased the throughput of the system.

Could not find any performance results to compare for this commit.

@noisepage-checks
Copy link

Performance Boost!

Nice job! This PR has increased the throughput of the system.

Could not find any performance results to compare for this commit.

@lmwnshn lmwnshn added ready-to-merge This PR is ready to be merged. Mark PRs with this. and removed in-progress This PR is being actively worked on and not ready to be reviewed or merged. Mark PRs with this. labels May 26, 2021
@noisepage-checks
Copy link

Minor Decrease in Performance

Be warned: this PR may have decreased the throughput of the system slightly.

tps (%change) benchmark_type wal_device details
0.63% tpcc RAM disk
Detailsmaster tps=22426.47, commit tps=22568.82, query_mode=extended, benchmark_type=tpcc, scale_factor=32.0000, terminals=32, client_time=60, weights={'Payment': 43, 'Delivery': 4, 'NewOrder': 45, 'StockLevel': 4, 'OrderStatus': 4}, wal_device=RAM disk, max_connection_threads=32
-3.61% tpcc None
Detailsmaster tps=29426.38, commit tps=28363.0, query_mode=extended, benchmark_type=tpcc, scale_factor=32.0000, terminals=32, client_time=60, weights={'Payment': 43, 'Delivery': 4, 'NewOrder': 45, 'StockLevel': 4, 'OrderStatus': 4}, wal_device=None, max_connection_threads=32
-0.79% tpcc HDD
Detailsmaster tps=21850.1, commit tps=21677.48, query_mode=extended, benchmark_type=tpcc, scale_factor=32.0000, terminals=32, client_time=60, weights={'Payment': 43, 'Delivery': 4, 'NewOrder': 45, 'StockLevel': 4, 'OrderStatus': 4}, wal_device=HDD, max_connection_threads=32
4.85% tatp RAM disk
Detailsmaster tps=6537.49, commit tps=6854.35, query_mode=extended, benchmark_type=tatp, scale_factor=1.0000, terminals=16, client_time=60, weights={'GetAccessData': 35, 'UpdateLocation': 14, 'GetNewDestination': 10, 'GetSubscriberData': 35, 'DeleteCallForwarding': 2, 'InsertCallForwarding': 2, 'UpdateSubscriberData': 2}, wal_device=RAM disk, max_connection_threads=32
0.15% tatp None
Detailsmaster tps=7393.76, commit tps=7404.58, query_mode=extended, benchmark_type=tatp, scale_factor=1.0000, terminals=16, client_time=60, weights={'GetAccessData': 35, 'UpdateLocation': 14, 'GetNewDestination': 10, 'GetSubscriberData': 35, 'DeleteCallForwarding': 2, 'InsertCallForwarding': 2, 'UpdateSubscriberData': 2}, wal_device=None, max_connection_threads=32
6.83% tatp HDD
Detailsmaster tps=6406.76, commit tps=6844.42, query_mode=extended, benchmark_type=tatp, scale_factor=1.0000, terminals=16, client_time=60, weights={'GetAccessData': 35, 'UpdateLocation': 14, 'GetNewDestination': 10, 'GetSubscriberData': 35, 'DeleteCallForwarding': 2, 'InsertCallForwarding': 2, 'UpdateSubscriberData': 2}, wal_device=HDD, max_connection_threads=32

@codecov
Copy link

codecov bot commented May 26, 2021

Codecov Report

Merging #1589 (57ed045) into master (f2fd647) will increase coverage by 0.08%.
The diff coverage is 62.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1589      +/-   ##
==========================================
+ Coverage   81.12%   81.21%   +0.08%     
==========================================
  Files         745      747       +2     
  Lines       52414    52453      +39     
==========================================
+ Hits        42523    42599      +76     
+ Misses       9891     9854      -37     
Impacted Files Coverage Δ
src/execution/compiler/executable_query.cpp 50.70% <0.00%> (-0.73%) ⬇️
...c/include/execution/compiler/compilation_context.h 100.00% <ø> (ø)
src/include/execution/compiler/executable_query.h 88.88% <ø> (ø)
...lude/execution/compiler/executable_query_builder.h 100.00% <ø> (ø)
src/include/execution/vm/module.h 93.33% <0.00%> (-2.13%) ⬇️
src/include/traffic_cop/traffic_cop.h 100.00% <ø> (ø)
src/include/execution/exec/execution_settings.h 83.33% <33.33%> (-16.67%) ⬇️
src/traffic_cop/traffic_cop.cpp 73.68% <45.45%> (-1.96%) ⬇️
src/include/execution/compiler/compiler_settings.h 50.00% <50.00%> (ø)
src/include/execution/vm/module_metadata.h 60.00% <60.00%> (ø)
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f2fd647...57ed045. Read the comment docs.

@lmwnshn lmwnshn merged commit dbc4383 into cmu-db:master May 26, 2021
@lmwnshn lmwnshn deleted the explain_tpl_backend branch May 26, 2021 17:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
ready-for-ci Indicate that this build should be run through CI. ready-to-merge This PR is ready to be merged. Mark PRs with this.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants