DSA dedicated mode #8

Open
wants to merge 27 commits into
base: main
f01d9b1
Binaries added to ignore list
Grzegorz-Rys Mar 1, 2024
c9fbe21
Numa awareness implementation
Grzegorz-Rys Mar 1, 2024
4a040a0
dsa_init_from_wq_list: when dedicated wq found skip it and continue
Grzegorz-Rys Mar 11, 2024
e811f20
Static code analysis with clangd
Grzegorz-Rys Mar 11, 2024
faf59a3
Code review changes
Grzegorz-Rys Mar 18, 2024
8e79017
Required packages list updated
Grzegorz-Rys Mar 20, 2024
af28b39
DSA dedicated mode added (shared mode is default)
Grzegorz-Rys Mar 25, 2024
729250e
Buffer centric and cpu centric numa wareness added:
Grzegorz-Rys Apr 5, 2024
60a7247
buffer centric/cpu centric changed to buffer-centric/cpu-centric
Grzegorz-Rys Apr 8, 2024
dee2e3d
Comment corrected
Grzegorz-Rys Apr 9, 2024
f7034ab
numa_aware branch merged
Grzegorz-Rys Apr 9, 2024
c1a204f
'is_numa_aware:' changed to: 'numa_awareness'
Grzegorz-Rys Apr 9, 2024
959df5f
Merge branch 'numa_aware' into dedicated_mode
Grzegorz-Rys Apr 9, 2024
a4ce39c
enum DSA_MODE introduced
Grzegorz-Rys Apr 9, 2024
11ccd8c
Rdundant buf != NULL check removed
Grzegorz-Rys Apr 9, 2024
8a483cf
Whitespaces removed
Grzegorz-Rys Apr 10, 2024
fb41ad5
More whitespaces removed
Grzegorz-Rys Apr 10, 2024
078093f
Removed unneeded comment
Grzegorz-Rys Apr 10, 2024
6108552
Merge branch 'numa_aware' into dedicated_mode
Grzegorz-Rys Apr 11, 2024
3aaea91
Whitespaces removed
Grzegorz-Rys Apr 11, 2024
a2340e1
Removed unneeded code
Grzegorz-Rys Apr 12, 2024
3ba305f
Merge branch 'main' into dedicated_mode
Grzegorz-Rys Apr 19, 2024
c8d1ed7
Outstanding descriptors updated for dedicated mode
Grzegorz-Rys Apr 29, 2024
d17120e
Added:
Grzegorz-Rys Apr 29, 2024
817e678
Updated description
Grzegorz-Rys Apr 29, 2024
1130de0
LOG_TRACE formatted
Grzegorz-Rys Apr 29, 2024
d48a129
LOG_TRACE formatted more
Grzegorz-Rys Apr 29, 2024
22 changes: 14 additions & 8 deletions README.md
@@ -6,8 +6,8 @@ The library intercepts standard memcpy, memmove, memset, and memcmp standard API
 and transparently uses DSA to perform those operations using DSA's memory move, fill, and compare operations. DTO is limited to
 synchronous offload model since these APIs have synchronous semantics.
 
-DTO library works with DSA's Shared Work Queues (SWQs). DTO also works with multiple DSAs and uses them in round robin manner.
-During initialization, DTO library can either auto-discover all configured SWQs (potentially on multile DSAs), or a list of specific SWQs that is
+DTO library works with DSA's Shared and Dedicated Work Queues (SWQs). DTO also works with multiple DSAs and uses them in round robin manner.
+During initialization, DTO library can either auto-discover all configured SWQs (potentially on multile DSAs), or a list of specific SWQs that is
 specified using an environment variable DTO_WQ_LIST.
 
 DTO library falls back to using standard APIs on CPU under following scenarios:
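
The round-robin use of work queues named in DTO_WQ_LIST could be sketched roughly as follows. This is a minimal illustration only, not the code in this pull request: `struct wq_list`, `wq_list_parse`, `wq_list_next`, and the `MAX_WQS` / `MAX_WQ_NAME` limits are hypothetical names, the semicolon separator is taken from the README's description of DTO_WQ_LIST, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_WQS     32   /* hypothetical upper bound on listed WQs */
#define MAX_WQ_NAME 64   /* hypothetical upper bound on a WQ name  */

/* Hypothetical container for the parsed WQ names and a rotating cursor. */
struct wq_list {
    char   names[MAX_WQS][MAX_WQ_NAME];
    size_t count;
    size_t next;
};

/* Split a semicolon-separated spec such as "wq0.0;wq2.0" into names.
 * Empty or over-long entries are skipped. Returns the number stored. */
size_t wq_list_parse(struct wq_list *l, const char *spec)
{
    assert(l != NULL);
    l->count = 0;
    l->next = 0;
    if (spec == NULL)
        return 0;

    while (*spec != '\0' && l->count < MAX_WQS) {
        size_t len = strcspn(spec, ";");
        if (len > 0 && len < MAX_WQ_NAME) {
            memcpy(l->names[l->count], spec, len);
            l->names[l->count][len] = '\0';
            l->count++;
        }
        spec += len;
        if (*spec == ';')
            spec++;
    }
    return l->count;
}

/* Return the next WQ name in round-robin order, or NULL if the list is empty. */
const char *wq_list_next(struct wq_list *l)
{
    assert(l != NULL);
    if (l->count == 0)
        return NULL;
    const char *name = l->names[l->next];
    l->next = (l->next + 1) % l->count;
    return name;
}
```

In a real process the spec string would come from `getenv("DTO_WQ_LIST")`; a multi-threaded library would also need an atomic cursor or per-thread state, which this sketch leaves out.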
@@ -18,18 +18,18 @@ DTO library falls back to using standard APIs on CPU under following scenarios:
 
 To improve throughput for synchronous offload, DTO uses "pseudo asynchronous" execution using following steps.
 1) After intercepting the API call, DTO splits the API job into two parts; 1) CPU job and 2) DSA job. For example, a 64 KB memcpy may
-be split into 20 KB CPU job and 44 KB DSA job. The split fraction can be configured using an environment variable DTO_CPU_SIZE_FRACTION.
-2) DTO submits the DSA portion of the job to DSA.
-If DTO_IS_NUMA_AWARE=1 DTO uses work queues of DSA device located on the same numa node as
+be split into 20 KB CPU job and 44 KB DSA job. The split fraction can be configured using an environment variable DTO_CPU_SIZE_FRACTION.
+2) DTO submits the DSA portion of the job to DSA.
+If DTO_IS_NUMA_AWARE=1 DTO uses work queues of DSA device located on the same numa node as
 buffer (memcpy/memmove - dest buffer, memcmp - ptr2) delivered to method - buffer-centric numa awareness.
-If DTO_IS_NUMA_AWARE=2 DTO uses work queues of DSA device located on the same numa node as
+If DTO_IS_NUMA_AWARE=2 DTO uses work queues of DSA device located on the same numa node as
 calling thread cpu - cpu-centric numa awareness.
 3) In parallel, DTO performs the CPU portion of the job using std library on CPU.
 4) DTO waits for DSA to complete (if it hasn't completed already). The wait method can be configured using an environment variable DTO_WAIT_METHOD.
 
 DTO also implements a heuristic to auto tune dsa_min_bytes and cpu_size_fraction parameters based on current DSA load. For example, if DSA is heavily loaded,
 DTO tries to reduce the DSA load by increasing cpu_size_fraction and dsa_min_bytes. Conversely, if DSA is lightly loaded, DTO tries to increase the DSA load by
-decreasing cpu_size_fraction and dsa_min_bytes. The goal of the heuristic is to minimize the wait time in step 4 above while maximizing throughput. The auto-tuning
+decreasing cpu_size_fraction and dsa_min_bytes. The goal of the heuristic is to minimize the wait time in step 4 above while maximizing throughput. The auto-tuning
 can be enabled or disabled using an environment variable DTO_AUTO_ADJUST_KNOBS.
 
 DTO can also be used to learn certain application characterics by building histogram of various API types and sizes. The histogram can be built using an environment variable DTO_COLLECT_STATS.
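
The arithmetic behind the CPU/DSA split in step 1 of the README text can be illustrated with a short sketch. It assumes the split is a plain fraction of the byte count with no alignment or minimum-size adjustment, which the README does not confirm; `struct job_split` and `split_job` are hypothetical names, not identifiers from the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: how many bytes each side handles. */
struct job_split {
    size_t cpu_bytes;
    size_t dsa_bytes;
};

/* Divide an n-byte job according to cpu_size_fraction (0.0 to 1.0).
 * Assumption: the CPU share is simply n * fraction, truncated toward zero,
 * and the DSA takes whatever remains. */
struct job_split split_job(size_t n, double cpu_size_fraction)
{
    assert(cpu_size_fraction >= 0.0 && cpu_size_fraction <= 1.0);

    struct job_split s;
    s.cpu_bytes = (size_t)((double)n * cpu_size_fraction);
    s.dsa_bytes = n - s.cpu_bytes;
    return s;
}
```

With a fraction of 0.3125, `split_job(64 * 1024, 0.3125)` yields 20480 CPU bytes and 45056 DSA bytes, which matches the 20 KB / 44 KB example in the README text. The DSA part would then be submitted first, the CPU part copied while the device works, and the caller would wait for the device completion last.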
@@ -51,6 +51,12 @@ Following environment variables control the behavior of DTO library:
 DTO_IS_NUMA_AWARE=0/1/2 (disables/buffer-centric/cpu-centric numa awareness. 0 -- disable (default), 1 -- buffer-centric, 2 - cpu-centric)
 DTO_WQ_LIST="semi-colon(;) separated list of DSA WQs to use". The WQ names should match their names in /dev/dsa/ directory (see example below).
 If not specified, DTO will try to auto-discover and use all available WQs.
+DTO_DSA_MEMCPY=0/1, 1 (default) - DTO uses DSA to process memcpy, 0 - DTO uses system memcpy
+DTO_DSA_MEMMOVE=0/1, 1 (default) - DTO uses DSA to process memmove, 0 - DTO uses system memmove
+DTO_DSA_MEMSET=0/1, 1 (default) - DTO uses DSA to process memset, 0 - DTO use system memset
+DTO_DSA_MEMCMP=0/1, 1 (default) - DTO uses DSA to process memcmp, 0 - DTO use system memcmp
+DTO_ENQCMD_MAX_RETRIES=xxxx defines maximal number of retries for enquing command into DSA queue, default is 3
+DTO_UMWAIT_DELAY=xxxx defines delay for umwait command (check max possible value at: /sys/devices/system/cpu/umwait_control/max_time), default is 100000
 DTO_LOG_FILE=<dto log file path> Redirect the DTO output to the specified file instead of std output (useful for debugging and statistics collection). file name is suffixed by process pid.
 DTO_LOG_LEVEL=0/1/2 controls the log level. higher value means more verbose logging (default 0).
 ```
@@ -138,7 +144,7 @@ Byte Range -- set cpy mov cmp bytes set c
 >=2093056 -- 0 1 0 0 1975911 0 0 0 0 0 0 1 0 0 973209 0 1 0
 
 ******** Average Memory Operation Latency (us) ********
-<******** stdc calls ********> <******** dsa (success) ********> <******** dsa (failed) ********>
+<******** stdc calls ********> <******** dsa (success) ********> <******** dsa (failed) ********>
 Byte Range -- set cpy mov cmp set cpy mov cmp set cpy mov cmp
 0-4095 -- 0.01 0.02 0.01 0.04 0 0 0 0 0 0 0 0
 4096-8191 -- 0.07 0.42 0.47 0 0 0 0 0 0 0 0 0