Extend tiering apis to support fragments #6776

@romange

Problem

Current tiering APIs are primarily key/value (PrimeValue) oriented and externalize a single serialized value per key (with optional cooling). This works for strings and the current hash path, but it does not provide a first-class API for partial offloading of multi-element data structures (list segments, stream ranges, etc.).

We need to support offloading arbitrary object fragments (not only whole PrimeValue) so commands can fetch/mutate only relevant parts.

Current state (relevant code)

  • TieredStorage API is centered on key/value stash/read/delete and PrimeValue state transitions:
// Enqueue read external value with generic decoder.
template <typename D, typename F>
void Read(DbIndex dbid, std::string_view key, const tiering::DiskSegment& segment,
          const D& decoder, F&& f) { ... }

std::optional<StashDescriptor> ShouldStash(const PrimeValue& pv) const;
std::optional<util::fb2::Future<bool>> Stash(DbIndex dbid, std::string_view key,
                                             const StashDescriptor& blobs, bool provide_bp);
void Delete(DbIndex dbid, PrimeValue* value);
void CancelStash(DbIndex dbid, std::string_view key, PrimeValue* value);
  • OpManager already supports generic pending IDs and async segment ops:
using PendingId = std::variant<unsigned, KeyRef>;
void Enqueue(PendingId id, DiskSegment segment, const Decoder& decoder, ReadCallback cb);
void DeleteOffloaded(DiskSegment segment);
std::error_code PrepareAndStash(PendingId id, size_t length,
                                const std::function<size_t(io::MutableBytes)>& writer);
  • SmallBins exists for sub-page packing but should not be used for this feature path:
// Small bins accumulate small values into larger bins that fill up 4kb pages.
class SmallBins { ... };

Goal

Introduce a fragment-tiering API that allows composite objects to offload/fetch/update independent fragments, with these constraints:

  • Fragment offload path is direct-to-segment (no SmallBins).
  • Offload only fragments with serialized size >= 2KB.
  • Preserve existing whole-value tiering behavior for current users.
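
The size constraint above could be expressed as a small gate in front of the stash path. This is a minimal sketch, not existing code: `kMinFragmentStashSize`, `FragmentStashDescriptor`, `ShouldStashFragment` and `PagesForFragment` are hypothetical names, 2KB is assumed to mean 2048 bytes, and the 4KB page size is borrowed from the SmallBins comment.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical threshold: fragments below this serialized size stay in memory.
constexpr std::size_t kMinFragmentStashSize = 2048;
// Assumed page size, taken from the SmallBins comment ("4kb pages").
constexpr std::size_t kPageSize = 4096;

// Hypothetical descriptor; the real StashDescriptor has a different shape.
struct FragmentStashDescriptor {
  std::size_t serialized_size;
};

// Returns a descriptor only when the fragment is large enough to be written
// directly to its own segment. Smaller fragments are rejected, not binned.
std::optional<FragmentStashDescriptor> ShouldStashFragment(std::size_t serialized_size) {
  if (serialized_size < kMinFragmentStashSize)
    return std::nullopt;
  return FragmentStashDescriptor{serialized_size};
}

// Number of whole pages a dedicated segment would need for this fragment.
std::size_t PagesForFragment(const FragmentStashDescriptor& desc) {
  assert(desc.serialized_size >= kMinFragmentStashSize);  // contract of the gate above
  return (desc.serialized_size + kPageSize - 1) / kPageSize;
}
```

Because the gate returns `std::nullopt` instead of falling back to bin packing, a caller has no way to reach SmallBins from the fragment path.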

Proposed API / design direction

1) New fragment identity

TieredStorage APIs should not work with PrimeValue directly. Instead, they should accept a Fragment, for example:
std::optional<StashDescriptor> ShouldStash(const Fragment& fragment) const;

Fragment is an adaptor interface that wraps variant<PrimeValue,...> to support different types, such as stream listpacks.
TieredStorage neither operates on nor updates PrimeValue directly.
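
One possible shape for the adaptor described above is a thin wrapper over a variant of pointers. This is only a sketch under assumptions: `PrimeValueStub` and `StreamListpackStub` are hypothetical stand-ins for the real types, and `SerializedSize` and `IsPrimeValue` are illustrative methods, not a proposed final interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>

// Hypothetical stand-ins for the real types; only the payload size matters here.
struct PrimeValueStub { std::string data; };
struct StreamListpackStub { std::string bytes; };

// Adaptor that lets tiering code handle different fragment kinds uniformly
// without touching PrimeValue (or any other concrete type) directly.
class Fragment {
 public:
  using Ref = std::variant<PrimeValueStub*, StreamListpackStub*>;

  explicit Fragment(Ref ref) : ref_(ref) {
    // A fragment must always point at a live object.
    assert(std::visit([](auto* p) { return p != nullptr; }, ref_));
  }

  // Size used by the tiering layer to decide whether to offload.
  std::size_t SerializedSize() const {
    return std::visit([](auto* p) { return SizeOf(p); }, ref_);
  }

  bool IsPrimeValue() const {
    return std::holds_alternative<PrimeValueStub*>(ref_);
  }

 private:
  // One overload per supported fragment kind.
  static std::size_t SizeOf(const PrimeValueStub* p) { return p->data.size(); }
  static std::size_t SizeOf(const StreamListpackStub* p) { return p->bytes.size(); }

  Ref ref_;
};
```

Adding a new fragment kind in this sketch means adding one alternative to `Ref` and one `SizeOf` overload. A forgotten overload fails at compile time rather than at runtime.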

2) OpManager pending id support for fragment keys

Extend the pending ID to track fragment ops unambiguously while preserving existing key/bin behavior. We do not need to support small-bin packing for collections (other non-string types), and we assume a segment contains a single offloaded value. It may be enough to extend PendingId with a uintptr_t holding the address of the heap-based object we are trying to offload.

Question: Do we even need to map from PendingId back to the original data structure?
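
A sketch of what the extended pending ID could look like. Everything except the `PendingId` name is an assumption: `KeyRef` is a simplified stand-in, and `FragmentRef`, `MakeFragmentId` and `ResolveFragment` are hypothetical. The address is wrapped in a small struct rather than added as a raw `uintptr_t`, because on a platform where `uintptr_t` is the same type as `unsigned` a raw alternative could make the variant ambiguous.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <utility>
#include <variant>

using DbIndex = std::uint16_t;                        // assumed width
using KeyRef = std::pair<DbIndex, std::string_view>;  // simplified stand-in

// Hypothetical strong type: identifies a fragment by the address of the
// heap object being offloaded, distinct from the `unsigned` bin id.
struct FragmentRef {
  explicit FragmentRef(std::uintptr_t a) : addr(a) {}
  std::uintptr_t addr;
};

using PendingId = std::variant<unsigned, KeyRef, FragmentRef>;

// Builds a pending id from the address of the object being offloaded.
template <typename T>
PendingId MakeFragmentId(const T* obj) {
  assert(obj != nullptr);
  return FragmentRef{reinterpret_cast<std::uintptr_t>(obj)};
}

// Maps a pending id back to the object. Only valid while the object is
// alive and has not moved, and only if T matches the original type.
template <typename T>
T* ResolveFragment(const PendingId& id) {
  const auto* ref = std::get_if<FragmentRef>(&id);
  return ref ? reinterpret_cast<T*>(ref->addr) : nullptr;
}
```

Whether something like `ResolveFragment` is needed at all depends on the open question above. If callbacks capture the object themselves, the address only has to serve as a unique key.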

3) Support fragments in TieredColdRecord (cool_queue_). Not a blocker for the POC.

We may want to cache offloaded items, as we do with strings, and start evicting them only when we are close to the memory limit.
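
To illustrate the idea, here is a simplified model of a cool queue for fragments: items that are already written to disk keep their in-memory copy until memory needs to be reclaimed. `ColdFragment`, `FragmentCoolQueue` and `Reclaim` are hypothetical names. This is not how TieredColdRecord is implemented, only a sketch of the oldest-first policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical record: a fragment that is on disk but still cached in memory.
struct ColdFragment {
  std::uintptr_t id;  // e.g. address of the owning heap object
  std::size_t bytes;  // size of the cached in-memory copy
};

class FragmentCoolQueue {
 public:
  // Registers a fragment whose stash completed; its memory is kept for now.
  void Push(ColdFragment f) {
    bytes_ += f.bytes;
    queue_.push_back(f);
  }

  // Drops the oldest cached copies until at least `goal` bytes are released
  // or the queue is empty. Returns the ids whose memory can now be freed.
  std::vector<std::uintptr_t> Reclaim(std::size_t goal) {
    std::vector<std::uintptr_t> dropped;
    std::size_t freed = 0;
    while (freed < goal && !queue_.empty()) {
      const ColdFragment f = queue_.front();
      queue_.pop_front();
      freed += f.bytes;
      bytes_ -= f.bytes;
      dropped.push_back(f.id);
    }
    return dropped;
  }

  std::size_t bytes() const { return bytes_; }

 private:
  std::deque<ColdFragment> queue_;
  std::size_t bytes_ = 0;  // total bytes held by cached copies
};
```

In this model a caller would invoke `Reclaim` only when memory usage approaches the limit, so reads of recently offloaded fragments can still be served from memory.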

Out of scope

  • Full list/stream adapter implementation details.
  • Command-level behavior changes (LRANGE, XRANGE, etc.).
  • RDB/replication support.

Acceptance criteria

  1. New fragment-tiering APIs exist and are callable independently of PrimeValue externalization.
  2. Fragment stashes smaller than 2KB are rejected by API contract, and the fragment path never invokes SmallBins.
  3. Existing string/hash offload behavior remains unchanged.
  4. Unit tests cover:
    • successful fragment stash/read/delete
    • <2KB rejection
    • pending op cancellation
    • no regression in existing tiering tests.
