Our current `datafusion.execution.collect_statistics` config feels like a nuclear option: it blindly collects all statistics even if they are not useful for the query at hand.
Two cases I see as egregious:
## Pointless statistics for all columns on wide tables
```sql
SELECT *
FROM t
ORDER BY ts
LIMIT 10
```
With dynamic filters from the TopK operator there is a real advantage to having file-level statistics: we use them to skip files without ever opening them if we can prove they won't factor into the TopK heap. However, on a wide table we collect and store statistics for all columns in the table even if the only stats used are for `ts`. For `datafusion-cli` the main waste is storing all of those stats in memory. This is not a small problem: there's been a lot of recent work on caching statistics w/ bounded memory (e.g. #19052) in an effort to reduce memory consumption from statistics. In a query like this we could reduce memory consumption by something like 99% if the table is wide, the other columns are strings (larger stats), etc.
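To make that concrete, here is a minimal sketch of why only `ts`'s stats ever matter for this query. All types here are hypothetical stand-ins, not DataFusion's actual `Statistics` machinery:

```rust
/// Hypothetical stand-ins for per-file statistics; DataFusion's real types
/// live in datafusion-common, this shape is purely illustrative.
#[derive(Clone, Debug)]
struct ColumnStats {
    min: Option<i64>, // simplified: real stats hold ScalarValues
    max: Option<i64>,
}

struct FileStats {
    /// One entry per column in the table schema -- on a wide table this is
    /// hundreds of entries even though the query below consults exactly one.
    columns: Vec<ColumnStats>,
}

/// Can this file contribute to an `ORDER BY ts LIMIT k` heap whose current
/// worst (largest) entry is `heap_max`? Only the `ts` column's stats are
/// consulted; everything else in `FileStats` is dead weight for this query.
fn may_contain_topk_candidate(stats: &FileStats, ts_col: usize, heap_max: i64) -> bool {
    match stats.columns[ts_col].min {
        // Every row's ts is already worse than the heap's worst entry, so
        // the file can be skipped without being opened.
        Some(min_ts) => min_ts < heap_max,
        None => true, // unknown stats: must read the file
    }
}
```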
For `ListingTable` memory consumption is the main issue. Since stats are collected from Parquet Thrift footers you at least have to read all of the stats into memory (there are some ideas to skip decoding, but that's not implemented yet). However, for systems like the one we have at Pydantic (which stores stats in Parquet files) or DuckLake (which stores stats in Postgres) it's trivial and beneficial to collect stats for only some columns. Thus there's also an IO price being paid to collect useless stats.
Obviously, for a query like `SELECT * FROM t` there is no point in collecting any stats at all.
## Collecting stats for files that are never touched
Another issue is collecting stats for files that are never touched.
For example:
```sql
SELECT val
FROM t
WHERE val = 'abc'
LIMIT 10
```
This query can push the limit into the scan such that we may stop after reading the first file. Stats on `val` may be useful for skipping entire files, but if there are 100 files and we can satisfy the limit by e.g. skipping 3 files and then reading 2, it was pointless to collect stats for the other 95; we never even needed to read their footers.
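A minimal sketch of what limit-aware laziness could look like. Nothing here is a real DataFusion API; `fetch_stats` stands in for whatever reads a Parquet footer or queries an external catalog:

```rust
/// Placeholder per-file stats, simplified to string min/max for `val`.
struct FileStats {
    min_val: Option<String>,
    max_val: Option<String>,
}

// Stubs standing in for real IO: reading a footer / catalog, and scanning
// data pages. Signatures are illustrative only.
fn fetch_stats(_file: &str) -> FileStats {
    FileStats { min_val: None, max_val: None }
}
fn can_prune(stats: &FileStats) -> bool {
    // Prune if the stats prove no row can satisfy `val = 'abc'`.
    match (&stats.min_val, &stats.max_val) {
        (Some(min), Some(max)) => !(min.as_str() <= "abc" && "abc" <= max.as_str()),
        _ => false,
    }
}
fn scan_file(_file: &str, remaining: usize) -> usize {
    remaining.min(10)
}

/// Fetch stats per file, on demand, stopping as soon as the limit is met:
/// footers of files past that point are never read at all.
fn scan_with_limit(files: &[&str], limit: usize) {
    let mut emitted = 0;
    for file in files {
        if emitted >= limit {
            break; // remaining files: no footer read, no stats stored
        }
        let stats = fetch_stats(file); // fetched lazily, not at plan time
        if can_prune(&stats) {
            continue; // skipped using stats alone
        }
        emitted += scan_file(file, limit - emitted);
    }
}
```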
## Collecting statistics is buried inside of `ListingTable`
There are no generalized APIs for "I have a source of statistics": everyone has to build their own and inject the stats into the `PartitionedFile`s they return from `ListingTable`.
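As a rough illustration, such an API might look something like the trait below. None of these names exist in DataFusion today; this is only a hypothetical sketch:

```rust
use std::collections::HashSet;

/// Illustrative per-column statistics; a real version would use
/// DataFusion's Statistics / ColumnStatistics types.
struct FileColumnStats {
    null_count: Option<usize>,
    min_max: Option<(String, String)>, // simplified to strings
}

/// Hypothetical "I have a source of statistics" trait. A ListingTable-style
/// implementation would read Parquet footers; a DuckLake-style catalog could
/// answer from Postgres; a lakehouse format could answer from its manifests.
trait StatisticsSource {
    /// Return stats for `file` restricted to `columns` (indices into the
    /// table schema), so a wide table never pays for columns the query
    /// cannot use. `None` means "unknown", which callers must tolerate.
    fn file_statistics(
        &self,
        file: &str,
        columns: &HashSet<usize>,
    ) -> Option<Vec<(usize, FileColumnStats)>>;
}
```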
## Proposal
I've started poking at this in #21157 but I think the solution is to be lazier about collecting statistics and to decide what stats to collect / store based on the needs of the query.
One idea is that instead of a single "execute query" step we have:
1. Stats collection based on the query's requirements (which columns, all files or up to some limit within each partition / scan queue). I imagine this could be a good place to introduce things like a join operator requesting a cardinality sample for a column (maybe randomly sample files, row groups, or pages). This is like a mini query in that it requires IO and CPU work, uses memory, etc. (See the sketch after this list.)
2. Actual execution.
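Hypothetically, step 1 could be driven by a request derived from the plan before execution. Again, none of these types exist in DataFusion; this is only a sketch of the shape:

```rust
/// What kind of statistic an operator wants. Hypothetical; sketched from
/// the needs described above (TopK pruning, join ordering).
enum StatsDetail {
    /// Min/max per file, e.g. for TopK / dynamic-filter pruning.
    MinMax,
    /// Null counts, useful for selectivity estimates.
    NullCount,
    /// A cardinality estimate from sampling files, row groups, or pages.
    CardinalitySample { sample_files: usize },
}

/// A stats-collection request derived from the query's requirements.
struct StatsRequest {
    /// (column index, detail) pairs the query can actually exploit;
    /// `SELECT * FROM t` would produce an empty list and skip collection.
    columns: Vec<(usize, StatsDetail)>,
    /// Stop after this many files per partition / scan queue, e.g. when a
    /// pushed-down limit means later files may never be touched.
    max_files_per_partition: Option<usize>,
}
```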
Getting the right APIs and heuristics for (1) is going to be the hard part. For example, is it the TopK operator that asks for stats on the columns it is sorting on? The stats may not make it back up to the TopK (e.g. if there is a join in the middle), and even then the TopK operator itself may not use the stats directly; it just wants the scans to use them for pruning. Joins are a bit different: the join operator wants cardinality stats to choose join order (and this would probably happen in an optimizer rule as well, making it even less clear what the APIs should be).
Something like this is already happening today, but as noted above it's implicitly buried in `ListingTable`: we do a round of stats collection during planning (a mix of IO and CPU) that is ad-hoc, not tied to knowledge of the query or the optimizer's needs, and not thought of as a public API.
## Related issues
- #21443
- #19052
- #19487