fix: pre-warm listing file statistics cache during listing table creation#18971
fix: pre-warm listing file statistics cache during listing table creation#18971alamb merged 6 commits intoapache:mainfrom
Conversation
|
|
||
| // Pre-warm statistics cache if collect_statistics is enabled | ||
| if session_state.config().collect_statistics() { | ||
| let _ = table.list_files_for_scan(state, &[], None).await?; |
There was a problem hiding this comment.
- Is it okay to reuse this method to pre-warm as we do couple more things post collecting the statistics ?
- Also is no limit fine ? as list_file_statistics_cache doesn't seem to have any size limit unlike metadata cache ?
cc: @alamb
There was a problem hiding this comment.
2. Also is no limit fine ?
I think it should have a limit.
And maybe it should be done in the background.
If there are many files this may slow down things.
There was a problem hiding this comment.
Thanks @martin-g for reviewing.
Agree on having limit.
But doing it in background will result in inconsistent behavior ?
Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default ListingTableProvider in DataFusion. Defaults to true.
Will a user not expect the statistics to be collected when creating the table and expect any query post that to be optimized based on the above documentation ?
|
|
||
| // Pre-warm statistics cache if collect_statistics is enabled | ||
| if session_state.config().collect_statistics() { | ||
| let _ = table.list_files_for_scan(state, &[], None).await?; |
There was a problem hiding this comment.
2. Also is no limit fine ?
I think it should have a limit.
And maybe it should be done in the background.
If there are many files this may slow down things.
…tion Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
8b8a49e to
50eaaf5
Compare
|
Can you help decide following to move the PR forward ?
|
|
My concern is that if there are many files then the pre-warming of the cache may slow down the main operation. I am not sure how many is too many though. But I guess you could leave it as is for now and optimize it only after there is an evidence that it really causes slow downs for someone. |
|
Thank you @bharath-techie -- I am checking this one out |
|
In our internal, we do stats collection in a separate task. Background task is a good idea, but based on the current situation, the PR looks like the fastest way to have stats while creating table. |
Thanks @martin-g -- I do agree that is a concern. However, I think the idea is that any subsequent query is going to collect the statistics anyways, so this PR simply moves which statement triggers the collection |
|
I took the liberty of pushing some commits that fixed the CI failures and merging up from main |
alamb
left a comment
There was a problem hiding this comment.
Thank you @bharath-techie -- this makes sense to me 🙏
|
Thank you also @xudong963 and @martin-g for the reviews
@xudong963 is this something we should write down / document / implement? Maybe not the actual behavior of pre-fetching statistics in the background but some method or something to make it easier for others to call? |
|
Thanks @alamb for the review and changes :) Also Is that by design , do you think we need to add similar limit for this cache too ? |
Yeah, I would open an issue for this later |
I think it is an oversight and we should add a similar limit. @nuno-faria actually asked the same question here: I will file a subsequent ticket to track that as well |
|
Filed a ticket to track limiting the statistics cache: |
|
Thanks again @bharath-techie and @martin-g and @xudong963 |
Pre-warm listing file statistics cache during create listing table flow as suggested in #18952.
Reused
list_files_for_scanto pre-warm.Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes unit tested.
Are there any user-facing changes?