feat: Implement the statistics_cache function#19054
Conversation
statistics_cache function
| let sql = "SELECT split_part(path, '/', -1) as filename, file_size_bytes, num_rows, num_columns, table_size_bytes from statistics_cache() order by filename"; | ||
| let df = ctx.sql(sql).await?; | ||
| let rbs = df.collect().await?; | ||
| assert_snapshot!(batches_to_string(&rbs),@r" | ||
| ++ | ||
| ++ | ||
| "); |
There was a problem hiding this comment.
This confirms that the file statistics cache is not populated when the table is created, only after accessing it once.
There was a problem hiding this comment.
Do we want to provide a pre-warming option to it? Similar to metadataCache?
There was a problem hiding this comment.
I think so, it's being implemented here: #18971.
There was a problem hiding this comment.
| num_rows: stats.num_rows, | ||
| num_columns: stats.column_statistics.len(), | ||
| table_size_bytes: stats.total_byte_size, | ||
| statistics_size_bytes: 0, // TODO: set to the real size in the future |
There was a problem hiding this comment.
In the future we need to set this to the real size.
|
Also the |
|
There is a problem with the spell checking in the output of the function example but I don't know how to deal with it since it should be treated verbatim. cc: @alamb |
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
alamb
left a comment
There was a problem hiding this comment.
Thank you @nuno-faria and @martin-g -- this looks great
I filed a follow on ticket to install a default session level statistics cache
I also merged up from main to resolve some conflicts and added another test -- let me know what you think
| /// | ||
| /// If `None`, creates a new [`DefaultFileStatisticsCache`] scoped to this query. | ||
| pub fn with_cache(mut self, cache: Option<FileStatisticsCache>) -> Self { | ||
| pub fn with_cache(mut self, cache: Option<Arc<dyn FileStatisticsCache>>) -> Self { |
There was a problem hiding this comment.
technically this is an API change, but I think it is a good one and makes the FileStatisticsCache consistent with the rest-- will mark the PR as API change
|
|
||
| #[tokio::test] | ||
| async fn test_statistics_cache() -> Result<(), DataFusionError> { | ||
| let file_statistics_cache = Arc::new(DefaultFileStatisticsCache::default()); |
|
Thanks @alamb.
LGTM. |
|
Thanks again @nuno-faria , @martin-g and @alchemist51 |
Which issue does this PR close?
datafusion-cli#18953.Rationale for this change
Allow a way to check the contents of the file statistics cache.
What changes are included in this PR?
statistics_cachefunction todatafusion-cli.FileStatisticsCacheto a trait and implemented thelist_entriesmethod.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes,
FileStatisticsCachehas been changed to a trait. Previous implementations need to implement thelist_entriesmethod.