[Draft][RFC] Multi-tiered file cache #8891
Comments
Thanks for the proposal. Curious how we are deciding the limit on files that get mmapped in Tier-1: is it based on the total system memory available? One thing that is not clear from the lifecycle management of Tier-1 is the memory management and page caching. Are files in Tier-1 guaranteed to be served from memory, or can the kernel decide to page them out to disk? While I understand the lifecycle policy is based on TTL and LFU policies, I couldn't find whether files hosted on the remote store can make their way to Tier-1, what these quotas are, and what the criteria are for choosing these sizes. It's also important to understand how the downloads from the remote store would be done: if we aren't doing Direct-IO, for instance, data from Tier-2 could end up thrashing Tier-1 mmapped files.
Thanks @abiesps! I agree with @Bukhtawar's question about blocks. How do you see the block-based partial file fetch fitting into this scheme? Also agree that it's not clear how files move up the tiers. Presumably at the moment a file is needed by an indexing, merge, or search operation and it is in tier-2 or 3 then it gets promoted to tier-1? If the file was in tier-3 in this scenario then this would be where the block-based partial fetch could make a lot of sense. Is tier-1 strictly mmap or would it reuse the logic in the |
I like the spirit of this, but I think these tiers are hard-coding behavior that makes assumptions about what storage is slower and which is faster, which is more available or which is not, which is strongly consistent and which is not. As an example, until very recently S3 was not strongly consistent across reads and writes, but now is. An S3 bucket in the same AZ may be blazingly fast, while an S3 bucket on the other side of the world may be very slow. Or, should we really always assume that slow local storage (tier 1) is faster than really fast remote storage (tier 3)? This makes me wonder, why do we need tiers at all? Maybe it would be simpler for a store implementation to advertise a caching strategy, and an administrator can override that strategy if needed. Then one can fine-tune depending on their requirements with some sensible defaults. Finally, I think we need to make sure anything we insert in the stack isn't mandatory overhead - the minimum version of a cache is pass-through.
@dblock I think this is actually a key building block for a tier-less user experience. This issue is discussing a pretty low-level component, but I would see this as a part of the "sensible default" for the common scenario of fast local storage (e.g. SSD) and a relatively slower remote object store (e.g. same-region S3). |
Thank you for the feedback @Bukhtawar, @andrross, @dblock. Please find the answers to the questions:
Honestly it's hard to understand where these tiers will be hosted: memory, disk, or just file pointers?
Have we evaluated a block-level cache that works for caching data (maybe 512 KB blocks) across shards and that could be backed both by local disk and by the remote store? Having the cache sharded limits the risk of random assignments thrashing one shard.
I couldn't find whether files hosted on the remote store can make their way to Tier-1, what these quotas are, and what the criteria are for choosing these sizes.
It's also important to understand how the downloads from remote store would be done, if we aren't doing Direct-IO for instance, data from Tier-2 could end up thrashing Tier-1 mmapped files
Is tier-1 strictly mmap or would it reuse the logic in the FSDirectoryFactory to use mmap, NIO, or a mix of the two based on the index configuration?
Introduction
As an extension to the writeable warm tier and the fetch-on-demand composite directory feature in OpenSearch, we propose a new abstraction that lets the composite directory read files without worrying about where each file is located.
Today in OpenSearch (with remote store), a shard can accept reads or writes only after its full data has been downloaded from the remote store to local disk.
With the file cache (and the fetch-on-demand composite directory), shards can accept reads and writes without keeping all of their data on the local store.
Proposed Solution
We are proposing a new FileCache abstraction that manages the lifecycle of all committed (segment) files for the shards present on a node.
FileCache provides three storage tiers for the files.
As the node boots up, the file cache is initialised with a maximum capacity for tier1 and for tier2, where capacity is defined as the total size in bytes of the files associated with entries in the file cache.
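A minimal sketch of how this per-tier capacity accounting could look, assuming capacity is tracked as a running byte total per tier. `CacheTier`, `TierCapacity`, and their methods are hypothetical names for illustration and are not part of the proposal.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical tier identifiers; only tier1 and tier2 have a configured maximum.
enum CacheTier { TIER1, TIER2 }

// Hypothetical helper that tracks bytes used per tier against a fixed maximum.
final class TierCapacity {
    private final Map<CacheTier, Long> maxBytes = new EnumMap<>(CacheTier.class);
    private final Map<CacheTier, Long> usedBytes = new EnumMap<>(CacheTier.class);

    TierCapacity(long tier1MaxBytes, long tier2MaxBytes) {
        maxBytes.put(CacheTier.TIER1, tier1MaxBytes);
        maxBytes.put(CacheTier.TIER2, tier2MaxBytes);
        usedBytes.put(CacheTier.TIER1, 0L);
        usedBytes.put(CacheTier.TIER2, 0L);
    }

    // Accounts for a file only if it still fits within the tier's maximum capacity.
    synchronized boolean tryAdd(CacheTier tier, long fileSizeBytes) {
        long used = usedBytes.get(tier);
        if (fileSizeBytes < 0 || fileSizeBytes > maxBytes.get(tier) - used) {
            return false;
        }
        usedBytes.put(tier, used + fileSizeBytes);
        return true;
    }

    // Releases the bytes of a file that left the tier; never drops below zero.
    synchronized void remove(CacheTier tier, long fileSizeBytes) {
        usedBytes.put(tier, Math.max(0L, usedBytes.get(tier) - fileSizeBytes));
    }

    synchronized long used(CacheTier tier) {
        return usedBytes.get(tier);
    }
}
```

In this sketch a rejected `tryAdd` leaves the accounting unchanged, so the caller decides whether to evict, place the file in another tier, or fail the request.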
Each shard’s primary persists its working set as part of the commit, where the working set contains the names of the segment files and the file cache tier each file is present in. During shard recovery, the working set of a shard is used to populate the file cache with segment files across the different tiers.
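To make the working set concrete, here is a rough sketch that stores it as a map from segment file name to tier number and round-trips it through a plain string-to-string map, the shape that commit user data takes in Lucene. Whether the proposal would persist it this way is an assumption; the `WorkingSet` class and the `filecache.tier.` key prefix are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical working set: segment file name -> tier number (1, 2 or 3).
final class WorkingSet {
    // Hypothetical key prefix so working-set entries can share a map with other data.
    static final String PREFIX = "filecache.tier.";

    private final Map<String, Integer> tierByFile = new TreeMap<>();

    void put(String fileName, int tier) {
        if (tier < 1 || tier > 3) {
            throw new IllegalArgumentException("tier must be 1, 2 or 3");
        }
        tierByFile.put(fileName, tier);
    }

    // Encodes the working set as string pairs that could be stored with a commit.
    Map<String, String> toUserData() {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tierByFile.entrySet()) {
            out.put(PREFIX + e.getKey(), Integer.toString(e.getValue()));
        }
        return out;
    }

    // Rebuilds the working set during recovery, ignoring unrelated keys.
    static WorkingSet fromUserData(Map<String, String> userData) {
        WorkingSet ws = new WorkingSet();
        for (Map.Entry<String, String> e : userData.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                ws.put(e.getKey().substring(PREFIX.length()), Integer.parseInt(e.getValue()));
            }
        }
        return ws;
    }

    Map<String, Integer> entries() {
        return Collections.unmodifiableMap(tierByFile);
    }
}
```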
An entry is created in tier2 and tier3 for all the files after every upload to remote store. Newly written files that are not committed are not managed by FileCache.
The file cache internally has a periodic file lifecycle manager that evaluates the lifecycle policies for each tier and acts on them.
Lifecycle policies for the file cache can be set as part of cluster settings. We are defining the following major cluster settings for this purpose.
Tier1 lifecycle policies
Files which are not actively used in tier1 (0 reference count) are eligible for movement from tier1 to tier2 based on the following policies.
Tier2 lifecycle policies
Files which are not actively used in tier2 (0 reference count) are eligible for movement from tier2 to tier3 based on the following policies.
Tier3 lifecycle policies
Files with stale commits older than X days are eligible for eviction from tier3. This can be done as later work.
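One pass of the periodic lifecycle manager could be sketched roughly as below. It only selects the files in a given tier that are eligible for movement and leaves the destination to the caller. The zero-reference-count condition comes from the text above; the idle-time (TTL) check is an assumption standing in for the policy settings, and `CacheEntry` and `LifecyclePass` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical snapshot of one file cache entry.
final class CacheEntry {
    final String fileName;
    final int tier;              // 1, 2 or 3
    final int refCount;          // number of active users of the file
    final long lastAccessMillis; // time of the last access

    CacheEntry(String fileName, int tier, int refCount, long lastAccessMillis) {
        this.fileName = fileName;
        this.tier = tier;
        this.refCount = refCount;
        this.lastAccessMillis = lastAccessMillis;
    }
}

final class LifecyclePass {
    // Returns the files in the given tier that are unreferenced and have been
    // idle for at least ttlMillis (the TTL policy is an assumption of this sketch).
    static List<String> eligible(List<CacheEntry> entries, int tier, long ttlMillis, long nowMillis) {
        List<String> out = new ArrayList<>();
        for (CacheEntry e : entries) {
            boolean unused = e.refCount == 0;
            boolean expired = nowMillis - e.lastAccessMillis >= ttlMillis;
            if (e.tier == tier && unused && expired) {
                out.add(e.fileName);
            }
        }
        return out;
    }
}
```

Passing the current time in as a parameter keeps the pass deterministic, which makes the policy easy to unit test without a real clock.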
File Cache Stats
The cache also provides metrics on its usage, evictions, hit rate, miss rate, etc., at both shard and node granularity.
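As an illustration of the two granularities, the sketch below keeps counters per shard and derives node-level numbers by summing across shards. The class names and the choice of `LongAdder` counters are assumptions, not the proposed stats API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-shard counters.
final class ShardCacheStats {
    final LongAdder hits = new LongAdder();
    final LongAdder misses = new LongAdder();
    final LongAdder evictions = new LongAdder();

    double hitRate() {
        return rate(hits.sum(), misses.sum());
    }

    // Hit rate is hits / (hits + misses); defined as 0 when there were no lookups.
    static double rate(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}

// Hypothetical node-level view that aggregates all shard counters.
final class NodeCacheStats {
    private final Map<String, ShardCacheStats> byShard = new ConcurrentHashMap<>();

    // Returns the counters for a shard, creating them on first use.
    ShardCacheStats shard(String shardId) {
        return byShard.computeIfAbsent(shardId, id -> new ShardCacheStats());
    }

    double hitRate() {
        long h = 0;
        long m = 0;
        for (ShardCacheStats s : byShard.values()) {
            h += s.hits.sum();
            m += s.misses.sum();
        }
        return ShardCacheStats.rate(h, m);
    }
}
```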
File Cache Admission Control and throttling
We will also add admission control to the file cache, based on its current capacity, its maximum configured capacity, and its watermark settings; this can result in a read/write block on a node.
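A rough sketch of such an admission decision follows. It assumes two watermarks expressed as fractions of the maximum capacity, one that blocks writes and a higher one that blocks reads as well. The two-level split, the names, and any threshold values are assumptions made for illustration; the proposal does not specify them.

```java
// Hypothetical outcomes of an admission check.
enum Admission { ADMIT, BLOCK_WRITES, BLOCK_READS_AND_WRITES }

// Hypothetical admission controller driven by usage relative to max capacity.
final class AdmissionController {
    private final long maxCapacityBytes;
    private final double writeBlockWatermark; // fraction of capacity, e.g. 0.85 (assumed)
    private final double readBlockWatermark;  // fraction of capacity, e.g. 0.95 (assumed)

    AdmissionController(long maxCapacityBytes, double writeBlockWatermark, double readBlockWatermark) {
        if (maxCapacityBytes <= 0 || writeBlockWatermark > readBlockWatermark) {
            throw new IllegalArgumentException("invalid capacity or watermark ordering");
        }
        this.maxCapacityBytes = maxCapacityBytes;
        this.writeBlockWatermark = writeBlockWatermark;
        this.readBlockWatermark = readBlockWatermark;
    }

    // Compares the used fraction of the cache against the watermarks, highest first.
    Admission evaluate(long currentUsageBytes) {
        double usedFraction = (double) currentUsageBytes / maxCapacityBytes;
        if (usedFraction >= readBlockWatermark) {
            return Admission.BLOCK_READS_AND_WRITES;
        }
        if (usedFraction >= writeBlockWatermark) {
            return Admission.BLOCK_WRITES;
        }
        return Admission.ADMIT;
    }
}
```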
We are also proposing to track the total bytes downloaded from the remote store and the total bytes removed from the local store by the lifecycle manager, and to throttle the download of new files from the remote store based on the total bytes downloaded to disk, measured against the disk watermark settings.
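The bookkeeping described above could look roughly like this: the throttle keeps the two running totals and allows a download only while the net bytes on disk (downloaded minus removed) stay under a byte limit derived from the disk watermark. `DownloadThrottle` and the way the watermark is turned into a byte limit are assumptions of this sketch.

```java
// Hypothetical throttle for remote store downloads.
final class DownloadThrottle {
    private final long watermarkBytes; // byte limit derived from the disk watermark (assumed)
    private long totalDownloadedBytes; // bytes fetched from the remote store
    private long totalRemovedBytes;    // bytes removed locally by the lifecycle manager

    DownloadThrottle(long watermarkBytes) {
        this.watermarkBytes = watermarkBytes;
    }

    // Admits a download only if the net bytes on disk would stay within the limit.
    synchronized boolean tryAcquire(long fileSizeBytes) {
        long netOnDisk = totalDownloadedBytes - totalRemovedBytes;
        if (fileSizeBytes < 0 || fileSizeBytes > watermarkBytes - netOnDisk) {
            return false; // caller should delay or reject the download
        }
        totalDownloadedBytes += fileSizeBytes;
        return true;
    }

    // Called when the lifecycle manager removes a file from the local store.
    synchronized void onRemoved(long fileSizeBytes) {
        totalRemovedBytes += fileSizeBytes;
    }

    synchronized long netBytesOnDisk() {
        return totalDownloadedBytes - totalRemovedBytes;
    }
}
```

Because removals are tracked separately from downloads, evictions by the lifecycle manager free up download budget without resetting the download total.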
Potential Issues
Future work