Description
Problem
Current tiering APIs are primarily key/value (PrimeValue) oriented and externalize a single serialized value per key (with optional cooling). This works for strings and the current hash path, but does not provide a first-class API for partial offloading of multi-element data structures (list segments, stream ranges, etc).
We need to support offloading arbitrary object fragments (not only a whole PrimeValue) so commands can fetch/mutate only the relevant parts.
Current state (relevant code)
The TieredStorage API is centered on key/value stash/read/delete and PrimeValue state transitions:
// Enqueue read external value with generic decoder.
template <typename D, typename F>
void Read(DbIndex dbid, std::string_view key, const tiering::DiskSegment& segment,
const D& decoder, F&& f) { ... }
std::optional<StashDescriptor> ShouldStash(const PrimeValue& pv) const;
std::optional<util::fb2::Future<bool>> Stash(DbIndex dbid, std::string_view key,
const StashDescriptor& blobs, bool provide_bp);
void Delete(DbIndex dbid, PrimeValue* value);
void CancelStash(DbIndex dbid, std::string_view key, PrimeValue* value);
OpManager already supports generic pending IDs and async segment ops:
using PendingId = std::variant<unsigned, KeyRef>;
void Enqueue(PendingId id, DiskSegment segment, const Decoder& decoder, ReadCallback cb);
void DeleteOffloaded(DiskSegment segment);
std::error_code PrepareAndStash(PendingId id, size_t length,
const std::function<size_t(io::MutableBytes)>& writer);
SmallBins exists for sub-page packing but should not be used for this feature path:
// Small bins accumulate small values into larger bins that fill up 4kb pages.
class SmallBins { ... };
Goal
Introduce a fragment-tiering API that allows composite objects to offload/fetch/update independent fragments, with these constraints:
- Fragment offload path is direct-to-segment (no SmallBins).
- Offload only fragments with serialized size >= 2KB.
- Preserve existing whole-value tiering behavior for current users.
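To make the size constraint concrete, here is a minimal sketch of the fragment admission check. Only the 2KB threshold and the "no SmallBins" rule come from this issue; `kMinFragmentStashSize`, `FragmentStashDescriptor`, `ShouldStashFragment` and `StashFragment` are hypothetical names, and the actual segment write is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical constant: the 2KB lower bound stated in the constraints.
constexpr std::size_t kMinFragmentStashSize = 2048;

// Hypothetical stand-in for the real StashDescriptor, reduced to one field.
struct FragmentStashDescriptor {
  std::size_t serialized_size;
};

// Returns a descriptor only when the fragment is large enough for the
// direct-to-segment path; smaller fragments are rejected, not bin-packed.
inline std::optional<FragmentStashDescriptor> ShouldStashFragment(std::size_t serialized_size) {
  if (serialized_size < kMinFragmentStashSize)
    return std::nullopt;
  return FragmentStashDescriptor{serialized_size};
}

// Sketch of a stash entry point that enforces the same contract.
inline bool StashFragment(std::size_t serialized_size) {
  auto desc = ShouldStashFragment(serialized_size);
  if (!desc)
    return false;  // Rejected by contract; SmallBins is never consulted.
  assert(desc->serialized_size >= kMinFragmentStashSize);
  // A real implementation would enqueue a direct segment write here.
  return true;
}
```

Under these assumptions, `StashFragment(2047)` is refused while `StashFragment(2048)` is accepted, which matches the "<2KB rejection" acceptance criterion.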
Proposed API / design direction
1) New fragment identity
TieredStorage APIs should not work with PrimeValue directly. Instead they should accept a Fragment, like
std::optional<StashDescriptor> ShouldStash(const Fragment& fragment) const;
Fragment is an adaptor interface that wraps variant<PrimeValue,...> to support different types, such as stream listpacks.
TieredStorage does not operate on or update PrimeValue directly.
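One possible shape for such an adaptor is sketched below. It is an illustration only: `PrimeValueStub` and `ListpackStub` are hypothetical stand-ins for the real types, the variant holds non-owning pointers (an assumption, the issue does not say whether Fragment owns its target), and `SerializedSize` / `IsPrimeValue` are invented accessors.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>

// Hypothetical stand-ins for the real value types.
struct PrimeValueStub { std::string payload; };
struct ListpackStub { std::string bytes; };  // e.g. one stream listpack

// Adaptor that hides which concrete type is being offloaded.
class Fragment {
 public:
  explicit Fragment(PrimeValueStub* pv) : ref_(pv) { assert(pv != nullptr); }
  explicit Fragment(ListpackStub* lp) : ref_(lp) { assert(lp != nullptr); }

  // Dispatches to the per-type size computation via std::visit.
  std::size_t SerializedSize() const {
    return std::visit([](auto* p) { return SizeOf(*p); }, ref_);
  }

  bool IsPrimeValue() const {
    return std::holds_alternative<PrimeValueStub*>(ref_);
  }

 private:
  static std::size_t SizeOf(const PrimeValueStub& pv) { return pv.payload.size(); }
  static std::size_t SizeOf(const ListpackStub& lp) { return lp.bytes.size(); }

  // Non-owning pointers: an assumption of this sketch.
  std::variant<PrimeValueStub*, ListpackStub*> ref_;
};
```

With an adaptor like this, a tiering entry point could take `const Fragment&` and query the size without knowing whether it is looking at a whole value or a listpack.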
2) OpManager pending id support for fragment keys
Extend PendingId to track fragment ops unambiguously while preserving existing key/bin behavior. We do not need to support small-bin packing for collections (other non-string types), and can assume a segment contains a single offloaded value. It may be enough to extend PendingId with a uintptr_t holding the address of the heap-allocated object we are trying to offload. Open question: do we even need to map from PendingId back to the original data structure?
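A rough sketch of the extended id follows. `DbIndex` and `KeyRef` are simplified stand-ins here (their real definitions are not shown in this issue), and `FragmentId`, `MakeFragmentId` and `Describe` are hypothetical. The sketch wraps the address in a small struct rather than adding a bare `uintptr_t`, because on platforms where `uintptr_t` and `unsigned` are the same type a bare alternative would collide with the existing bin id.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <variant>

// Simplified stand-ins; the real definitions may differ.
using DbIndex = std::uint16_t;
using KeyRef = std::pair<DbIndex, std::string_view>;

// Hypothetical wrapper around the address of the heap object being offloaded.
struct FragmentId {
  std::uintptr_t addr;
};

// Existing alternatives (bin id, key) plus the new fragment alternative.
using PendingId = std::variant<unsigned, KeyRef, FragmentId>;

inline PendingId MakeFragmentId(const void* obj) {
  assert(obj != nullptr);
  return FragmentId{reinterpret_cast<std::uintptr_t>(obj)};
}

// Shows that all three kinds of pending op stay distinguishable.
inline std::string Describe(const PendingId& id) {
  struct Visitor {
    std::string operator()(unsigned) const { return "bin"; }
    std::string operator()(const KeyRef&) const { return "key"; }
    std::string operator()(const FragmentId&) const { return "fragment"; }
  };
  return std::visit(Visitor{}, id);
}
```

Note that an address-based id only identifies the object while it stays alive and is not moved, so whether it can also serve as a back-reference is tied to the open question above.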
3) Support fragments in TieredColdRecord (cool_queue_). Not a blocker for the POC
We may want to cache offloaded items, as we do with strings, and wait until we are close to memory limits before starting eviction.
Out of scope
- Full list/stream adapter implementation details.
- Command-level behavior changes (LRANGE, XRANGE, etc).
- RDB/replication support
Acceptance criteria
- New fragment-tiering APIs exist and are callable independently of PrimeValue externalization.
- Fragment stashes <2KB are rejected by API contract and the fragment path never invokes SmallBins.
- Existing string/hash offload behavior remains unchanged.
- Unit tests cover:
- successful fragment stash/read/delete
- <2KB rejection
- pending op cancellation
- no regression in existing tiering tests.