-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalidate ORC metadata cache based on file modification time #24346
Invalidate ORC metadata cache based on file modification time #24346
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern I have with this solution is that the file modification times are also cached by the directory lister implementation. If the caching directory lister is enabled, then we won't get updated file modification times, and this solution won't solve the original issue.
I traced the call paths to where we populate the fileLastModified time and it comes from the result of directoryLister.list
inside of StoragePartitionLoader.java
. If the listing result is cached, we won't get the updated file modification time. "Fixing" the issue would mean bypassing the listing cache which would also have a pretty significant impact on performance. It might be easier to just provide a procedure to invalidate the cache - either through a JMX procedure or SQL procedure.
presto-orc/src/main/java/com/facebook/presto/orc/OrcReader.java
Outdated
Show resolved
Hide resolved
Please add the PR number to the release note entry, and nit rephrasing following the Order of changes in the Release Notes Guidelines.
|
Agree. All caches which do time based invalidation will come with the caveat that file list caching may interfere with its functioning. There is a stored procedure |
290adc3
to
3138555
Compare
@ZacBlanco Addressed the comments. Please review. @agrawalreetika @pratyakshsharma can you please have a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes. LGTM
@hantangwangd Can you please review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change, overall looks good to me, this can alleviate a significant portion of the problems caused by cache stale.
Bring one question for discussing: when we enable the cached directoryLister and a new file is added into any partition of a table through an external application without modify existing ones, do we still unable to recognize it?
return delegate.getStripeFooterSlice(orcDataSource, stripeId, footerOffset, footerLength, cacheable); | ||
CacheableSlice cacheableSlice = footerSliceCache.getIfPresent(stripeId); | ||
if (cacheableSlice != null) { | ||
if (cacheableSlice.getFileModificationTime() == fileModificationTime) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little question, what is more appropriate when cacheableSlice.getFileModificationTime() > fileModificationTime
? Is it intentional to invalidate and refill the cache in such scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are assuming that new fileModificationTime will always be same or more recent than cacheableSlice.getFileModificationTime().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's ok.
Right, it will not recognize the new file because the CachingDirectoryLister will have the stale file list. |
Thanks, glad to confirm this. |
@hantangwangd @ZacBlanco Can one of you merge this please? I see the merge button is disabled for me even after approvals. |
We need a committer approval - @tdcmeehan could you help? |
Hey guys. We are seeing the following errors due to the conflict between this pr and #24351. @nmahadevuni @tdcmeehan
|
The trivial fix is in #24421 |
Description
Supports invalidation of ORC metadata cache entries based on the file modification time.
Motivation and Context
For traditional Hive ORC tables, after Presto reads and caches ORC metadata, if other tools modify the ORC files in-place. Presto would try to read old offsets in the new file resulting in errors like "Malformed ORC file..". This PR implements support to invalidate the cache entries based on the file modified timestamp.
Impact
No impact.
Test Plan
Added tests to AbstractTestOrcReader.java.
Release Notes
Please follow release notes guidelines and fill in the release notes below.