feat: optimize table info() and describe() for large column tables efficiently #9639
Comments
Can you separate out the time it takes to execute from the time it takes to construct the expression? Timing:

```python
# how long the expression takes to construct
expr = t.describe()

# how long it takes to compile
ibis.to_sql(expr)

# and how long it takes to execute (time for below, minus time for compile above)
expr.execute()
```
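A minimal sketch of how those three stages could be timed separately (assuming `t` is an Ibis table bound to a backend; `timed` is a hypothetical helper, and `time.perf_counter` is just one way to measure wall-clock time):

```python
import time

import ibis

def timed(label, fn):
    # run fn once and report wall-clock time
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

expr = timed("construct", lambda: t.describe())
sql = timed("compile", lambda: ibis.to_sql(expr))
df = timed("execute", lambda: expr.execute())  # includes compile time
```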
```python
import ibis
import numpy as np
import pandas as pd

num_rows = 1_500_000
num_cols = 450

data = np.random.rand(num_rows, num_cols)
columns = [f'col_{i}' for i in range(num_cols)]
df = pd.DataFrame(data, columns=columns)
memtable = ibis.memtable(df)
```

I tested the above, and the running time is acceptable for this size. The issue I met was in a Kaggle competition, where I ran the same functions. I set the DuckDB configuration, but it did not help.
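For reference, DuckDB resource settings can be passed to the Ibis backend like this (illustrative values only, not necessarily the configuration tried above):

```python
import ibis

# illustrative settings; the actual keys/values used in the report are not shown
con = ibis.duckdb.connect(memory_limit="8GB", threads=4)

# settings can also be changed on an existing connection via raw SQL
con.raw_sql("SET memory_limit='8GB'")
```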
Looking into this, a large source of the Ibis overhead here is the 450 relations we are constructing, each of which must go through binding and dereferencing, neither of which is particularly cheap. I poked around DuckDB's implementation and tried a similar approach here, and after adding some benchmarks I'm seeing about an 11x (!) speed-up in expression construction.

The caveat here is that not all backends support arrays, so we won't be able to see this improvement in every backend.
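To make the contrast concrete, here is a hypothetical sketch (not the actual patch) of the two expression shapes: one aggregate relation per column unioned together, versus per-column statistics packed into array columns of a single aggregate relation:

```python
import ibis

t = ibis.table({f"col_{i}": "float64" for i in range(450)}, name="t")

# Union shape: one aggregate relation per column, each of which Ibis
# must bind and dereference -- 450 relations in total.
union_expr = ibis.union(
    *(
        t.aggregate(
            mean=t[col].mean(),
            null_count=t[col].isnull().sum(),
        )
        for col in t.columns
    )
)

# Array shape: every per-column statistic packed into array columns of a
# single aggregate relation (the backend can later unnest them into rows).
array_expr = t.aggregate(
    names=ibis.array(list(t.columns)),
    means=ibis.array([t[col].mean() for col in t.columns]),
    null_counts=ibis.array([t[col].isnull().sum() for col in t.columns]),
)
```

Unnesting those arrays back into one row per column would then recover the describe-style output from a handful of relations instead of hundreds.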
For the curious, here's the DuckDB plan, where you can see the …
A follow-up to #9684 might actually make a proper …
Even before that, we could probably move this entire thing to compilation time, which would make this much cheaper for both the union case and the array case ... I'll look into it.
Hi @cpcloud, thank you so much for your prompt check and fix.
Is your feature request related to a problem?
We have `table.info()` and `table.describe()` for Ibis tables. These functions loop over each column, perform multiple aggregations per column, and form the output by unioning the stats for all columns; they are often used by data scientists for univariate analysis. When I tried these two functions on a bigger dataset (465 columns, 1.5M rows), I hit two issues:

1. `table.info()` mainly calculates the sum and mean of null rows, but it is very slow. Is it very expensive to calculate this on DuckDB?
2. `table.describe()` throws an out-of-memory exception. It generates 465 small tables for the union, and I'm not sure why that takes so much memory.

A minimal sketch of the two calls follows below.
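For concreteness, the shape of the two calls in question (assuming `memtable` from the reproduction snippet above; the comments describe the output only roughly):

```python
# both methods return Ibis table expressions; .execute() materializes them
t = memtable

info_df = t.info().execute()          # per-column name, type, and null statistics
describe_df = t.describe().execute()  # per-column summary statistics
```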
What is the motivation behind your request?
These two functions are very useful for univariate analysis.
Describe the solution you'd like
Could we do some batched or parallel computation? One possible workaround is sketched below.
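A hypothetical workaround along those lines (not current Ibis behavior): describe the columns in batches so each union stays small, then concatenate the results in pandas. The `batched_describe` helper and the `batch_size` value are illustrative:

```python
import pandas as pd

def batched_describe(t, batch_size=50):
    """Run describe() over column batches and stitch the results together."""
    frames = []
    cols = list(t.columns)
    for i in range(0, len(cols), batch_size):
        subset = t.select(cols[i : i + batch_size])
        frames.append(subset.describe().execute())
    return pd.concat(frames, ignore_index=True)

# e.g. batched_describe(memtable, batch_size=50)
```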
What version of ibis are you running?
9.1.0
What backend(s) are you using, if any?
DuckDB