intel · jsydir · Jun 12, 2025
diff --git a/README.md b/README.md
@@ -61,6 +61,28 @@ Following environment variables control the behavior of DTO library:
 	DTO_LOG_LEVEL=0/1/2 controls the log level. higher value means more verbose logging (default 0).
 ```
 
+The following are some common usage models of DTO (others are possible):
+   Latency reduction - the goal is to minimize the latency of offloaded operations. Use the following settings:
+      DTO_AUTO_ADJUST_KNOBS=1 (the CPU fraction setting is critical in this mode; its optimal value is dynamic, so the auto-tuning algorithm must be enabled)
+      DTO_WAIT_METHOD=busypoll
+
+   Power reduction - the goal is to reduce power by offloading memory operations to DSA, allowing the CPU core to enter a lower power state. This mode may reduce or increase the latency of operations, depending on the load on the DSA devices.
+      DTO_AUTO_ADJUST_KNOBS=0
+      DTO_CPU_SIZE_FRACTION=0.0  (offload the entire operation to DSA)
+      DTO_WAIT_METHOD=umwait
+
+   Cycle count reduction - the goal is to reduce CPU cycles by offloading memory operations to DSA and letting the OS schedule other work while DSA performs the operation. This mode may reduce or increase the latency of operations, depending on the load on the DSA devices and on interaction with the OS scheduler and other threads.
+      DTO_AUTO_ADJUST_KNOBS=0
+      DTO_CPU_SIZE_FRACTION=0.0  (offload the entire operation to DSA)
+      DTO_WAIT_METHOD=yield
+
+   Avoiding cache pollution - the goal is to avoid polluting the cache with data from the given process.
+      DTO_DSA_CC=0
+      DTO_AUTO_ADJUST_KNOBS=0
+      DTO_CPU_SIZE_FRACTION=0.0  (offload the entire operation to DSA so none of the data is pulled into the cache)
+      DTO_WAIT_METHOD=yield or umwait (yield saves CPU cycles; umwait saves power)
+
+
 ## Build
 
 Pre-requisite packages:

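Note on the dto.c hunk that follows: it stops failed completions from feeding the auto-tuning counters, because a completion that fails almost immediately (for example on a page fault) would otherwise look like a very fast DSA operation. The sketch below only illustrates that filtering idea and is not the code in dto.c. The names `wait_stats`, `record_completion`, `avg_waits` and the `SKETCH_COMP_*` status values are hypothetical and exist only for this example; the real code compares the completion record against `DSA_COMP_SUCCESS`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch only; dto.c checks DSA_COMP_SUCCESS. */
#define SKETCH_COMP_SUCCESS    1
#define SKETCH_COMP_PAGE_FAULT 3

/* Running totals that an auto-tuner would average over. */
struct wait_stats {
	uint64_t num_descs;  /* completions counted */
	uint64_t num_waits;  /* wait iterations summed over counted completions */
};

/* Fold one completion into the totals, skipping anything that did not succeed. */
static void record_completion(struct wait_stats *s, uint8_t status, uint64_t waits)
{
	assert(s != NULL);
	if (status != SKETCH_COMP_SUCCESS)
		return;  /* a fast failure says nothing about real DSA latency */
	s->num_descs++;
	s->num_waits += waits;
}

/* Average waits per counted completion; 0.0 when nothing has been counted. */
static double avg_waits(const struct wait_stats *s)
{
	assert(s != NULL);
	return s->num_descs ? (double)s->num_waits / (double)s->num_descs : 0.0;
}
```

For example, two successful completions with 10 and 20 waits plus one page-faulted completion with 0 waits average to 15 waits per descriptor with the filter in place. Without the filter the same inputs would average to 10, which would bias a tuner toward sending more work to DSA than the measured latency justifies.
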
diff --git a/dto.c b/dto.c
@@ -44,7 +44,7 @@
  */
 #define MAX_WQS 32
 #define MAX_NUMA_NODES 32
-#define DTO_DEFAULT_MIN_SIZE 16384
+#define DTO_DEFAULT_MIN_SIZE 65536
 #define DTO_INITIALIZED 0
 #define DTO_INITIALIZING 1
 
@@ -429,6 +429,13 @@ static __always_inline void dsa_wait_and_adjust(const volatile uint8_t *comp)
 		__dsa_wait(comp);
 		local_num_waits++;
 	}
+
+	// Operations that fail (mostly due to page faults) return very quickly and would make the algorithm
+	// think the DSA operation was faster than it really was, so we exclude them from the calculation.
+	if (*comp != DSA_COMP_SUCCESS) {
+		return;
+	}
+
 	adjust_num_descs++;
 	adjust_num_waits += local_num_waits;