igchor · ss7pro · Jul 6, 2022 · Jun 11, 2022 · Jun 10, 2022 · Jul 12, 2022
diff --git a/MultiTierDataMovement.md b/MultiTierDataMovement.md
@@ -0,0 +1,117 @@
+# Background Data Movement
+
+To reduce the number of online evictions and support asynchronous
+promotion, we have added two periodic workers to handle eviction and promotion.
+
+The diagram below shows a simplified version of how the background evictor
+thread (green) is integrated into the CacheLib architecture.
+
+<p align="center">
+  <img width="640" height="360" alt="BackgroundEvictor" src="cachelib-background-evictor.png">
+</p>
+
+## Synchronous Eviction and Promotion
+
+- `disableEvictionToMemory`: Disables eviction to memory (item is always evicted to NVMe or removed
+on eviction)
+
+## Background Evictors
+
+The background evictors scan each class to see if there are objects to move to the next (lower)
+tier using a given strategy. Here we document the parameters for the different
+strategies and general parameters.
+
+- `backgroundEvictorIntervalMilSec`: The interval at which this thread runs. By default,
+the background evictor threads wake up every 10 ms to scan the AllocationClasses. In addition,
+a background evictor thread is woken up every time an allocation fails (from
+a request handling thread) and the current percentage of free memory for the
+AllocationClass is lower than `lowEvictionAcWatermark`. This can make the interval parameter
+less important when request handling threads perform many allocations.
+
+- `evictorThreads`: The number of background evictors to run. Each thread is assigned
+a set of AllocationClasses to scan and evict objects from. Currently, each thread gets
+an equal number of classes to scan, but because the object size distribution may be unequal, future
+versions will attempt to balance the classes among threads. The range is 1 to the number of AllocationClasses.
+The default is 1.
+
+- `maxEvictionBatch`: The number of objects to remove in a given eviction call. The
+default is 40. The valid range is 10 to 1000. If the value is too low, we might not
+remove objects at a reasonable rate; if it is too high, it might increase contention with user threads.
+
+- `minEvictionBatch`: Minimum number of items to evict at any time (if there are any
+candidates)
+
+- `maxEvictionPromotionHotness`: Maximum number of candidates to consider for eviction. This is similar to `maxEvictionBatch`,
+but it specifies how many candidates are taken into consideration, not the actual number of items to evict.
+This option can be used to control the duration of the critical section held under the LRU lock.
+
+
+### FreeThresholdStrategy (default)
+
+- `lowEvictionAcWatermark`: Triggers the background eviction thread to run
+when the percentage of free memory in the AllocationClass drops below this value.
+The default is `2.0`. To avoid wasting capacity, we don't set this above `10.0`.
+
+- `highEvictionAcWatermark`: Stops evictions from an AllocationClass once this
+percentage of the AllocationClass is free. The default is `5.0`. To avoid wasting capacity, we
+don't set this above `10.0`.
+
+
+## Background Promoters
+
+The background promoters scan each class to see if there are objects to move to the next (upper)
+tier using a given strategy. Here we document the parameters for the different
+strategies and general parameters.
+
+- `backgroundPromoterIntervalMilSec`: The interval at which this thread runs. By default,
+the background promoter threads wake up every 10 ms to scan the AllocationClasses for
+objects to promote.
+
+- `promoterThreads`: The number of background promoters to run. Each thread is assigned
+a set of AllocationClasses to scan and promote objects from. Currently, each thread gets
+an equal number of classes to scan, but because the object size distribution may be unequal, future
+versions will attempt to balance the classes among threads. The range is `1` to the number of AllocationClasses. The default is `1`.
+
+- `maxPromotionBatch`: The number of objects to promote in a given promotion call. The
+default is 40. The valid range is 10 to 1000. If the value is too low, we might not
+promote objects at a reasonable rate; if it is too high, it might increase contention with user threads.
+
+- `minPromotionBatch`: Minimum number of items to promote at any time (if there are any
+candidates)
+
+- `numDuplicateElements`: This allows us to promote items that have existing handles (read-only) since
+we won't need to modify the data when a user is done with the data. Therefore, for a short time
+the data could reside in both tiers until it is evicted from its current tier. The default is to
+not allow this (0). Setting the value to 100 will enable duplicate elements in tiers.
+
+### Background Promotion Strategy (only one currently)
+
+- `promotionAcWatermark`: Promote items if at least this
+percentage of the AllocationClass is free. The promotion thread will attempt to move `maxPromotionBatch` objects
+to that tier. The objects are chosen from the head of the LRU. The default is `4.0`.
+This value should correlate with `lowEvictionAcWatermark`, `highEvictionAcWatermark`, `minAcAllocationWatermark`, and `maxAcAllocationWatermark`.
+- `maxPromotionBatch`: The number of objects to promote in a batch during background promotion. Analogous to
+`maxEvictionBatch`. Its value should be lower to decrease contention on hot items.
+
+## Allocation policies
+
+- `maxAcAllocationWatermark`: Item is always allocated in the topmost tier if at least this
+percentage of the AllocationClass is free.
+- `minAcAllocationWatermark`: Item is always allocated in the bottom tier if only this percentage
+of the AllocationClass is free. If the percentage of free memory in the AllocationClass is between `maxAcAllocationWatermark`
+and `minAcAllocationWatermark`, then extra checks (described below) are performed to decide where to put the element.
+
+By default, allocation will always be performed from the upper tier.
+
+- `acTopTierEvictionWatermark`: If there is less than this percentage of free memory in the topmost tier, CacheLib will attempt to evict from the top tier. This option takes precedence over the allocation watermarks.
+
+### Extra policies
+Used only when the percentage of free memory in the AllocationClass is between `maxAcAllocationWatermark` and `minAcAllocationWatermark`.
+- `sizeThresholdPolicy`: If the item is smaller than this value, always allocate it in the upper tier.
+- `defaultTierChancePercentage`: Chance (0-100%) of allocating the item in the top tier.
+
+## MMContainer options
+
+- `lruInsertionPointSpec`: Can be set per tier when LRU2Q is used. Determines where new items are
+inserted. 0 = insert to hot queue, 1 = insert to warm queue, 2 = insert to cold queue
+- `markUsefulChance`: Per-tier, determines chance of moving item to the head of LRU on access
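The allocation policies documented in MultiTierDataMovement.md can be read as a small decision function. The following is a minimal sketch of that reading, not the CacheLib implementation: `Tier`, `AllocationPolicy`, and `chooseTier` are hypothetical names, the free percentage is assumed to refer to the AllocationClass in the top tier, the inclusive comparisons at the watermarks are an assumption, and `acTopTierEvictionWatermark` is left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical types for illustration only; not part of the CacheLib API.
enum class Tier { Top, Bottom };

struct AllocationPolicy {
  double maxAcAllocationWatermark;       // percent free: at or above, use top tier
  double minAcAllocationWatermark;       // percent free: at or below, use bottom tier
  std::size_t sizeThresholdPolicy;       // smaller items always go to the top tier
  unsigned defaultTierChancePercentage;  // 0-100 chance of the top tier otherwise
};

// `roll` is a caller-supplied number in [0, 100), for example drawn from a
// random generator. Passing it in keeps the function deterministic.
Tier chooseTier(const AllocationPolicy& p,
                double freePercent,
                std::size_t itemSize,
                unsigned roll) {
  assert(roll < 100);
  if (freePercent >= p.maxAcAllocationWatermark) {
    return Tier::Top;
  }
  if (freePercent <= p.minAcAllocationWatermark) {
    return Tier::Bottom;
  }
  // Between the two watermarks: apply the extra policies.
  if (itemSize < p.sizeThresholdPolicy) {
    return Tier::Top;
  }
  return roll < p.defaultTierChancePercentage ? Tier::Top : Tier::Bottom;
}
```

Taking the random draw as a parameter is a choice made here so that the decision logic can be checked without a random source.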
diff --git a/cachelib-background-evictor.png b/cachelib-background-evictor.png
diff --git a/cachelib/allocator/BackgroundEvictor-inl.h b/cachelib/allocator/BackgroundEvictor-inl.h
@@ -0,0 +1,110 @@
+/*
+ * Copyright (c) Intel and its affiliates.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+namespace facebook {
+namespace cachelib {
+
+
+template <typename CacheT>
+BackgroundEvictor<CacheT>::BackgroundEvictor(Cache& cache,
+                               std::shared_ptr<BackgroundEvictorStrategy> strategy)
+    : cache_(cache),
+      strategy_(strategy)
+{
+}
+
+template <typename CacheT>
+BackgroundEvictor<CacheT>::~BackgroundEvictor() { stop(std::chrono::seconds(0)); }
+
+template <typename CacheT>
+void BackgroundEvictor<CacheT>::work() {
+  try {
+    checkAndRun();
+  } catch (const std::exception& ex) {
+    XLOGF(ERR, "BackgroundEvictor interrupted due to exception: {}", ex.what());
+  }
+}
+
+template <typename CacheT>
+void BackgroundEvictor<CacheT>::setAssignedMemory(std::vector<std::tuple<TierId, PoolId, ClassId>> &&assignedMemory)
+{
+  XLOG(INFO, "Class assigned to background worker:");
+  for (auto [tid, pid, cid] : assignedMemory) {
+    XLOGF(INFO, "Tid: {}, Pid: {}, Cid: {}", tid, pid, cid);
+  }
+
+  mutex.lock_combine([this, &assignedMemory]{
+    this->assignedMemory_ = std::move(assignedMemory);
+  });
+}
+
+// Ask the strategy how many items to evict from each assigned class,
+// then evict up to that many items from each class and update the stats
+template <typename CacheT>
+void BackgroundEvictor<CacheT>::checkAndRun() {
+  auto assignedMemory = mutex.lock_combine([this]{
+    return assignedMemory_;
+  });
+
+  unsigned int evictions = 0;
+  std::set<ClassId> classes{};
+  auto batches = strategy_->calculateBatchSizes(cache_,assignedMemory);
+
+  for (size_t i = 0; i < batches.size(); i++) {
+    const auto [tid, pid, cid] = assignedMemory[i];
+    const auto batch = batches[i];
+
+    classes.insert(cid);
+    const auto& mpStats = cache_.getPoolByTid(pid,tid).getStats();
+
+    if (!batch) {
+      continue;
+    }
+
+    stats.evictionSize.add(batch * mpStats.acStats.at(cid).allocSize);
+
+    // try evicting `batch` items from the class in order to reach the free target
+    auto evicted =
+        BackgroundEvictorAPIWrapper<CacheT>::traverseAndEvictItems(cache_,
+            tid,pid,cid,batch);
+    evictions += evicted;
+    evictions_per_class_[tid][pid][cid] += evicted;
+  }
+
+  stats.numTraversals.inc();
+  stats.numEvictedItems.add(evictions);
+  stats.totalClasses.add(classes.size());
+}
+
+template <typename CacheT>
+BackgroundEvictionStats BackgroundEvictor<CacheT>::getStats() const noexcept {
+  BackgroundEvictionStats evicStats;
+  evicStats.numEvictedItems = stats.numEvictedItems.get();
+  evicStats.runCount = stats.numTraversals.get();
+  evicStats.evictionSize = stats.evictionSize.get();
+  evicStats.totalClasses = stats.totalClasses.get();
+
+  return evicStats;
+}
+
+template <typename CacheT>
+std::map<TierId, std::map<PoolId, std::map<ClassId, uint64_t>>>
+BackgroundEvictor<CacheT>::getClassStats() const noexcept {
+  return evictions_per_class_;
+}
+
+} // namespace cachelib
+} // namespace facebook
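The control flow of `checkAndRun()` in BackgroundEvictor-inl.h (snapshot the assignment under a lock, ask the strategy for per-class batch sizes, evict, then update counters) can be tried in isolation. The sketch below is a simplified analogue under stated assumptions: `EvictorSketch`, `ClassKey`, `BatchFn`, and `EvictFn` are hypothetical names, `std::mutex` is swapped in for `folly::DistributedMutex`, the ids are plain `int`s, and the counters are not atomic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <tuple>
#include <utility>
#include <vector>

// Stand-in for (TierId, PoolId, ClassId); plain ints are an assumption.
using ClassKey = std::tuple<int, int, int>;

// Hypothetical, simplified analogue of BackgroundEvictor. The strategy and
// the cache are replaced by two callbacks so the sketch compiles on its own.
class EvictorSketch {
 public:
  using BatchFn =
      std::function<std::vector<std::size_t>(const std::vector<ClassKey>&)>;
  using EvictFn = std::function<std::size_t(const ClassKey&, std::size_t)>;

  EvictorSketch(BatchFn batchSizes, EvictFn evict)
      : batchSizes_(std::move(batchSizes)), evict_(std::move(evict)) {}

  void setAssigned(std::vector<ClassKey> assigned) {
    std::lock_guard<std::mutex> guard(mutex_);
    assigned_ = std::move(assigned);
  }

  // One pass over the assigned classes; returns the items evicted in this pass.
  std::size_t runOnce() {
    std::vector<ClassKey> assigned;
    {
      // Snapshot under the lock so eviction itself runs without holding it.
      std::lock_guard<std::mutex> guard(mutex_);
      assigned = assigned_;
    }
    const std::vector<std::size_t> batches = batchSizes_(assigned);
    assert(batches.size() == assigned.size());

    std::size_t evicted = 0;
    for (std::size_t i = 0; i < batches.size(); ++i) {
      if (batches[i] == 0) {
        continue;  // the strategy asked for no eviction from this class
      }
      const std::size_t n = evict_(assigned[i], batches[i]);
      evicted += n;
      perClass_[assigned[i]] += n;
    }
    totalEvicted_ += evicted;
    ++runs_;
    return evicted;
  }

  // Counters are plain integers here: read them from the worker thread only.
  std::uint64_t totalEvicted() const { return totalEvicted_; }
  std::uint64_t runs() const { return runs_; }
  std::uint64_t evictedFrom(const ClassKey& key) const {
    const auto it = perClass_.find(key);
    return it == perClass_.end() ? 0 : it->second;
  }

 private:
  BatchFn batchSizes_;
  EvictFn evict_;
  std::mutex mutex_;
  std::vector<ClassKey> assigned_;
  std::map<ClassKey, std::uint64_t> perClass_;
  std::uint64_t totalEvicted_{0};
  std::uint64_t runs_{0};
};
```

The sketch asserts that the strategy returns one batch size per assigned class, because the loop indexes both vectors with the same `i`. It also shows why per-class counters read from another thread would need their own synchronization.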
diff --git a/cachelib/allocator/BackgroundEvictor.h b/cachelib/allocator/BackgroundEvictor.h
@@ -0,0 +1,99 @@
+/*
+ * Copyright (c) Intel and its affiliates.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <gtest/gtest_prod.h>
+#include <folly/concurrency/UnboundedQueue.h>
+
+#include "cachelib/allocator/CacheStats.h"
+#include "cachelib/common/PeriodicWorker.h"
+#include "cachelib/allocator/BackgroundEvictorStrategy.h"
+#include "cachelib/common/AtomicCounter.h"
+
+
+namespace facebook {
+namespace cachelib {
+
+// wrapper that exposes the private APIs of CacheType that are specifically
+// needed for the eviction.
+template <typename C>
+struct BackgroundEvictorAPIWrapper {
+
+  static size_t traverseAndEvictItems(C& cache,
+          unsigned int tid, unsigned int pid, unsigned int cid, size_t batch) {
+    return cache.traverseAndEvictItems(tid,pid,cid,batch);
+  }
+};
+
+struct BackgroundEvictorStats {
+  // items evicted
+  AtomicCounter numEvictedItems{0};
+
+  // traversals
+  AtomicCounter numTraversals{0};
+
+  // number of classes scanned, accumulated across runs
+  AtomicCounter totalClasses{0};
+
+  // item eviction size
+  AtomicCounter evictionSize{0};
+};
+
+// Periodic worker that evicts items from tiers in batches
+// The primary aim is to reduce insertion times for new items in the
+// cache
+template <typename CacheT>
+class BackgroundEvictor : public PeriodicWorker {
+ public:
+  using Cache = CacheT;
+  // @param cache               the cache interface
+  // @param strategy            the strategy that computes, for each
+  //                            assigned class, how many items to evict
+  //                            in one run
+  BackgroundEvictor(Cache& cache,
+                    std::shared_ptr<BackgroundEvictorStrategy> strategy);
+
+  ~BackgroundEvictor() override;
+
+  BackgroundEvictionStats getStats() const noexcept;
+  std::map<TierId, std::map<PoolId, std::map<ClassId, uint64_t>>> getClassStats() const noexcept;
+
+  void setAssignedMemory(std::vector<std::tuple<TierId, PoolId, ClassId>> &&assignedMemory);
+
+ private:
+  std::map<TierId, std::map<PoolId, std::map<ClassId, uint64_t>>> evictions_per_class_;
+
+  // cache allocator's interface for evicting
+
+  using Item = typename Cache::Item;
+
+  Cache& cache_;
+  std::shared_ptr<BackgroundEvictorStrategy> strategy_;
+
+  // implements the actual logic of running the background evictor
+  void work() override final;
+  void checkAndRun();
+
+  BackgroundEvictorStats stats;
+
+  std::vector<std::tuple<TierId, PoolId, ClassId>> assignedMemory_;
+  folly::DistributedMutex mutex;
+};
+} // namespace cachelib
+} // namespace facebook
+
+#include "cachelib/allocator/BackgroundEvictor-inl.h"
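`setAssignedMemory()` takes the list of `(TierId, PoolId, ClassId)` tuples one worker is responsible for, and the documentation states that each thread currently gets an equal number of classes. How the PR actually partitions the classes is not shown in this diff. The sketch below is one possible way to do it, a round-robin split, using the hypothetical names `MemoryKey` and `splitAmongWorkers` and plain `int` ids.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Stand-in for (TierId, PoolId, ClassId); plain ints are an assumption.
using MemoryKey = std::tuple<int, int, int>;

// Hypothetical helper: deal the classes out round-robin so that the number
// of classes per worker differs by at most one.
std::vector<std::vector<MemoryKey>> splitAmongWorkers(
    const std::vector<MemoryKey>& all, std::size_t numWorkers) {
  assert(numWorkers > 0);
  std::vector<std::vector<MemoryKey>> parts(numWorkers);
  for (std::size_t i = 0; i < all.size(); ++i) {
    parts[i % numWorkers].push_back(all[i]);
  }
  return parts;
}
```

Each resulting part could then be moved into one worker, for example `worker.setAssignedMemory(std::move(parts[i]))`, assuming the element type matches the real id types.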
diff --git a/cachelib/allocator/BackgroundEvictorStrategy.h b/cachelib/allocator/BackgroundEvictorStrategy.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright (c) Facebook, Inc. and its affiliates.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include "cachelib/allocator/Cache.h"
+
+namespace facebook {
+namespace cachelib {
+
+// Base class for background eviction strategy.
+class BackgroundEvictorStrategy {
+
+public:
+  virtual std::vector<size_t> calculateBatchSizes(const CacheBase& cache,
+                                       std::vector<std::tuple<TierId, PoolId, ClassId>> acVec) = 0;
+};
+
+} // namespace cachelib
+} // namespace facebook
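`calculateBatchSizes()` returns one batch size per assigned class. To illustrate how a watermark-based strategy could fill that vector, here is a minimal, stateless sketch. It is not the PR's `FreeThresholdStrategy`: `ClassUsage`, `ThresholdParams`, and `calculateBatchSizesSketch` are hypothetical, the real interface receives the cache and the class tuples rather than precomputed usage, and sizing the batch to reach the high watermark is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-class input; the real strategy would query the cache.
struct ClassUsage {
  double freePercent;      // percentage of the class that is currently free
  std::size_t totalSlots;  // total number of item slots in the class
};

// Names mirror the documented options, but the struct itself is an assumption.
struct ThresholdParams {
  double lowWatermark;   // start evicting below this free percentage
  double highWatermark;  // aim for this free percentage
  std::size_t minBatch;  // lower clamp for a non-zero batch
  std::size_t maxBatch;  // upper clamp for a batch
};

std::vector<std::size_t> calculateBatchSizesSketch(
    const std::vector<ClassUsage>& classes, const ThresholdParams& p) {
  assert(p.lowWatermark <= p.highWatermark);
  assert(p.minBatch <= p.maxBatch);
  std::vector<std::size_t> batches;
  batches.reserve(classes.size());
  for (const ClassUsage& c : classes) {
    if (c.freePercent >= p.lowWatermark) {
      batches.push_back(0);  // enough free memory, nothing to evict
      continue;
    }
    // Slots that must be freed to reach the high watermark, rounded up.
    const double missing = (p.highWatermark - c.freePercent) *
                           static_cast<double>(c.totalSlots) / 100.0;
    std::size_t want = static_cast<std::size_t>(std::ceil(missing));
    if (want < p.minBatch) {
      want = p.minBatch;
    }
    if (want > p.maxBatch) {
      want = p.maxBatch;
    }
    batches.push_back(want);
  }
  return batches;
}
```

The output has exactly one entry per input class, in the same order, which is what `checkAndRun()` relies on when it indexes `batches[i]` and `assignedMemory[i]` together.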