Optimizer is slow: Avoid too many string cloning in the optimizer #5157
Possibly related - #4680
I looked at the trace and here are my observations. As @tustvold has said, a large amount of the allocations come from the optimizer itself, and a large part of that is how it ignores errors. It also appears there is copying going on in unwrap_cast_in_comparison and common subexpression elimination.
@zeodtr Would you be willing to contribute your benchmark program -- we have https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/benches/sql_planner.rs but maybe we could have a more end-to-end plan creation test 🤔
The result type of the call is relevant here. Given the name, there are three results; the first and third will both return an error. Maybe we can change it.
@alamb

use std::{collections::HashMap, sync::Arc};
use datafusion::{
arrow::datatypes::{DataType, Field, Schema},
common::Result,
config::ConfigOptions,
error::DataFusionError,
logical_expr::{
logical_plan::builder::LogicalTableSource, AggregateUDF, LogicalPlan, ScalarUDF,
TableSource,
},
optimizer::{optimizer::Optimizer, OptimizerContext, OptimizerRule},
sql::{
planner::{ContextProvider, SqlToRel},
sqlparser::{dialect::GenericDialect, parser::Parser},
TableReference,
},
};
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
fn main() {
let sql = "select column1 from table1";
let schema_provider = TestSchemaProvider::new();
let now = std::time::Instant::now();
let dialect = GenericDialect {};
let ast = Parser::parse_sql(&dialect, sql).unwrap();
let statement = &ast[0];
let sql_to_rel = SqlToRel::new(&schema_provider);
let plan = sql_to_rel.sql_statement_to_plan(statement.clone()).unwrap();
println!(
"elapsed time after creating a logical plan: {}",
now.elapsed().as_millis()
);
let config = OptimizerContext::default();
let optimizer = Optimizer::new();
let optimized_plan = optimizer.optimize(&plan, &config, observe).unwrap();
println!(
"elapsed time after optimization: {}\n",
now.elapsed().as_millis()
);
println!("plan:\n{:?}\n", plan);
println!("optimized plan:\n{:?}", optimized_plan);
}
fn observe(_plan: &LogicalPlan, _rule: &dyn OptimizerRule) {}
struct TestSchemaProvider {
options: ConfigOptions,
tables: HashMap<String, Arc<dyn TableSource>>,
}
impl TestSchemaProvider {
pub fn new() -> Self {
let mut tables = HashMap::new();
tables.insert(
"table1".to_string(),
create_table_source({
let mut fields = Vec::new();
for num in 0..700 {
fields.push(Field::new(
format!("column{}", num + 1),
DataType::Int32,
false,
))
}
fields
}),
);
Self {
options: Default::default(),
tables,
}
}
}
fn create_table_source(fields: Vec<Field>) -> Arc<dyn TableSource> {
Arc::new(LogicalTableSource::new(Arc::new(
Schema::new_with_metadata(fields, HashMap::new()),
)))
}
impl ContextProvider for TestSchemaProvider {
fn get_table_provider(&self, name: TableReference) -> Result<Arc<dyn TableSource>> {
match self.tables.get(name.table()) {
Some(table) => Ok(table.clone()),
_ => Err(DataFusionError::Plan(format!(
"Table not found: {}",
name.table()
))),
}
}
fn get_function_meta(&self, _name: &str) -> Option<Arc<ScalarUDF>> {
None
}
fn get_aggregate_meta(&self, _name: &str) -> Option<Arc<AggregateUDF>> {
None
}
fn get_variable_type(&self, _variable_names: &[String]) -> Option<DataType> {
None
}
fn options(&self) -> &ConfigOptions {
&self.options
}
}

And the content of Cargo.toml is as follows:

[package]
name = "simple_optimizer_test"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
datafusion = "17.0.0"
mimalloc = "0.1.34"
[profile.release]
debug = 2

BTW, it seems that the performance of the optimizer worsens more than linearly as the number of columns increases. For example, with 1000 columns the total elapsed time is about 200 msec (30 for creating the logical plan, 170 for the optimizer), but with 2000 columns the total elapsed time is about 800 msec (100 for creating the logical plan, 700 for the optimizer).
Thank you @zeodtr -- I intend to port this into the datafusion benchmark suite eventually.
Proposed adding in #5256
Here is one possible task that would help: #5309
Filed #5637 to track making the optimizer faster in general
TBH string interning would be the best ⭐️ - no cloning, quick comparisons. Instead of a String:

#[derive(Hash, PartialEq, Eq, Copy, Clone)]
struct StringId(usize);

And some interner to map between the two.
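A minimal interner along these lines could look like the following sketch. The `Interner` type and its method names are illustrative only, not a proposed DataFusion API:

```rust
use std::collections::HashMap;

/// A cheap, Copy-able handle to an interned string.
#[derive(Hash, PartialEq, Eq, Copy, Clone, Debug)]
struct StringId(usize);

/// Hypothetical minimal interner: each distinct string is stored once
/// and referred to by its StringId afterwards.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, StringId>,
    strings: Vec<String>,
}

impl Interner {
    /// Return the id for `s`, inserting it on first sight.
    fn intern(&mut self, s: &str) -> StringId {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = StringId(self.strings.len());
        self.strings.push(s.to_string());
        self.ids.insert(s.to_string(), id);
        id
    }

    /// Resolve an id back to its string.
    fn resolve(&self, id: StringId) -> &str {
        &self.strings[id.0]
    }
}
```

Comparing two `StringId`s is then a single `usize` comparison, and cloning one is a trivial copy.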
Agreed, though then we have to thread the HashMap through the code. Another possibility would be to use a cheaply clonable string type.
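The specific alternative mentioned here was lost in formatting; one type often reached for in this situation is `Arc<str>`, whose clone is just an atomic reference-count bump rather than a heap copy. A small illustrative sketch under that assumption (my example, not DataFusion code):

```rust
use std::sync::Arc;

/// Convert an owned or borrowed string into a shared `Arc<str>` once;
/// every subsequent clone is O(1) (a refcount increment), unlike
/// `String::clone`, which copies the whole buffer.
fn share(name: &str) -> Arc<str> {
    Arc::from(name)
}
```

The tradeoff versus interning: no central table to thread through the code, but equality checks remain full string comparisons (unless the two `Arc`s are pointer-equal).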
Looks like a good tradeoff. What do you think about 🎯 precomputed hash + shared string storage? Performance almost like string interning.
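The precomputed-hash idea could be sketched as a wrapper that hashes the string once at construction, then reuses that value both for hashing and as a cheap early-out in equality checks. This is illustrative only, not DataFusion code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A string that computes its hash once at construction, so repeated
/// HashMap lookups and comparisons skip rehashing the bytes.
#[derive(Debug, Clone, Eq)]
struct HashedStr {
    hash: u64,
    s: String,
}

impl HashedStr {
    fn new(s: impl Into<String>) -> Self {
        let s = s.into();
        let mut h = DefaultHasher::new();
        s.hash(&mut h);
        Self { hash: h.finish(), s }
    }
}

impl PartialEq for HashedStr {
    fn eq(&self, other: &Self) -> bool {
        // Cheap reject on the cached hash first, then full compare.
        self.hash == other.hash && self.s == other.s
    }
}

impl Hash for HashedStr {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Feed only the precomputed hash to the hasher.
        state.write_u64(self.hash);
    }
}
```

Unlike full interning this keeps clones O(n) (the `String` is still copied), which is why the comment above pairs it with some form of shared storage.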
Seems reasonable to me. Maybe as some intermediate state we could have a dedicated string type. Then we could thread through a StringInterner when possible (e.g. in the optimizer or the sql planner) but also be able to make these strings easily without one (so we could incrementally update the code). Any such solution, I think, should strive very hard to keep the burden of working in the DataFusion codebase low (e.g. it should be easy to work with for anyone used to String, be well documented, etc.).
apache/arrow-rs#3955 may also be relevant here. I'm optimistic that we can drastically reduce the overheads without needing to reach for exotic solutions like string interning, at least initially.
It is a good call @comphead -- it certainly helps 😅.
(BTW these are the benchmarks that @zeodtr contributed ❤️) I am also quite excited about @matthewmturner's work in #8905. @zeodtr / @mslapek I wonder if you have an opinion about this. Shall we keep it open? Shall we close it in favor of tickets that describe specific improvements? Have you had a chance to test with more recent versions of DataFusion?
@alamb Since there is an epic issue (#5637), this issue can be closed in favor of more specific issues.
I forgot how great that description was. FYI @matthewmturner for your inspiration
Thanks again @zeodtr. BTW, depending on the use case, we have made DataFusion planning about 2x faster between 37.0.0 and 40.0.0 (will be released shortly). There were more pronounced gains for schemas with larger numbers of columns.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I'm not sure this is a feature request, but it is at least not a bug (albeit a performance problem), so I'm filing it as a feature request.
I'm benchmarking the optimizers of DataFusion and Calcite.
I intended to compare the quality of the optimized plans between them, assuming that DataFusion's optimizing speed would be faster (since it's written in Rust).
But to my surprise, I found that Calcite's optimizer is way faster (~ 20x) in some cases.
The case is as follows:
While Calcite finished the optimization in about 7 msec, DataFusion's optimizer took about 120 msec.
At first, the number was worse, but it settled to about 120 msec when I set the global allocator to mimalloc. (I tried snmalloc and it was somewhat faster - about 100 msec - but since snmalloc did not play well with valgrind, I chose mimalloc, at least temporarily.)
I ran the test program with valgrind/callgrind and drew the call graph. The graph showed that about half of the execution time is spent in <alloc::string::String as core::clone::Clone>::clone. The call count was 3,930,814. I ran the optimizer for another table with fewer columns (about 200), and it took much less time - about 12 msec.
So, I suspect that the optimizer becomes slow (at least for a table with many columns) because it clones the strings related to the schema of the table too many times.
Describe the solution you'd like
Perhaps removing unnecessary cloning may help. Or, make the fields immutable and manage them with reference-counted smart pointers.
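The reference-counting suggestion can be sketched with simplified stand-in types (`Field` and `Schema` here are my own illustrative structs, not the actual arrow/DataFusion types): when fields live behind `Arc`, cloning a schema copies pointers rather than field names.

```rust
use std::sync::Arc;

/// Simplified stand-in for an arrow field (name + nullability only).
#[derive(Debug)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
}

/// A schema whose fields are behind `Arc`: cloning it clones n pointers,
/// not n heap-allocated name strings.
#[derive(Clone, Debug)]
pub struct Schema {
    pub fields: Vec<Arc<Field>>,
}

impl Schema {
    /// Build a wide test schema like the 700-column one in the benchmark.
    pub fn with_columns(n: usize) -> Self {
        Self {
            fields: (0..n)
                .map(|i| {
                    Arc::new(Field {
                        name: format!("column{}", i + 1),
                        nullable: false,
                    })
                })
                .collect(),
        }
    }
}
```

With this layout an optimizer pass that rebuilds a plan node can reuse the existing `Arc<Field>`s instead of deep-copying every column name.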
Describe alternatives you've considered
No alternatives.
Additional context
The following attachment is the call graph in SVG format. It was created by gprof2dot.py and dot from callgrind's output data. 'batch_test' is the name of my test program; somewhat contrary to the name, the program only tests one query statement.