Add documentation for CPU Affinity usage #6922

Merged: 26 commits, merged on Mar 20, 2023

Changes shown from 1 commit.

Commits (26):
- 40bf25d [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Mar 15, 2023)
- 5f378df add manual for cpu affinity (JakubPietrakIntel, Mar 15, 2023)
- 1d0430f [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Mar 16, 2023)
- 780053f added references and improved intro (JakubPietrakIntel, Mar 16, 2023)
- 5f23a69 merge (JakubPietrakIntel, Mar 16, 2023)
- fbeb8f5 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Mar 16, 2023)
- b79547b update results (JakubPietrakIntel, Mar 16, 2023)
- 4d35a91 Merge branch 'affinity_docs' of https://github.com/JakubPietrakIntel/… (JakubPietrakIntel, Mar 16, 2023)
- 05a1ef4 update (rusty1s, Mar 16, 2023)
- 3c8425c Merge branch 'affinity_docs' of github.com:JakubPietrakIntel/pytorch_… (rusty1s, Mar 16, 2023)
- f877d10 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Mar 16, 2023)
- 1af5708 update (rusty1s, Mar 16, 2023)
- b454420 formatting (JakubPietrakIntel, Mar 16, 2023)
- fa92353 Merge branch 'affinity_docs' of https://github.com/JakubPietrakIntel/… (JakubPietrakIntel, Mar 16, 2023)
- fa0b344 update (rusty1s, Mar 16, 2023)
- 14868d2 Merge branch 'affinity_docs' of github.com:JakubPietrakIntel/pytorch_… (rusty1s, Mar 16, 2023)
- bbc6c4e Merge branch 'affinity_docs' of github.com:JakubPietrakIntel/pytorch_… (rusty1s, Mar 16, 2023)
- 437ba33 update docs for AffinityMixin and display (JakubPietrakIntel, Mar 16, 2023)
- d7eaaea Merge branch 'affinity_docs' of https://github.com/JakubPietrakIntel/… (JakubPietrakIntel, Mar 16, 2023)
- 80a7800 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Mar 16, 2023)
- e868757 rm not used inference_affinity.png (JakubPietrakIntel, Mar 16, 2023)
- c19f5d2 changelog (rusty1s, Mar 20, 2023)
- 794ca1b changelog (rusty1s, Mar 20, 2023)
- 51773c5 changelog (rusty1s, Mar 20, 2023)
- 092a505 changelog (rusty1s, Mar 20, 2023)
- c3e725f Merge branch 'master' into affinity_docs (rusty1s, Mar 20, 2023)
Commit fa0b3444f9fe3a0638c74fce9a59817576ff02e7: update (rusty1s committed Mar 16, 2023)
46 changes: 31 additions & 15 deletions docs/source/advanced/cpu_affinity.rst
@@ -1,37 +1,53 @@
CPU Affinity for PyG Workloads
==============================

The performance of :pyg:`PyG` workloads using CPU can be significantly improved by setting a proper affinity mask.
Processor affinity, or core binding, is a modification of the native OS queue scheduling algorithm that enables an application to assign a specific set of cores to processes or threads launched during its execution on the CPU.
As a consequence, it is possible to increase overall effective hardware utilization by minimizing core stalls and memory bounds.
It also secures CPU resources for critical processes or threads, even if the system is under a heavy load.
CPU affinity targets the two main performance-critical regions:

* Execution bind: indicates the core on which a process/thread will run.

* Memory bind: indicates the preferred memory area to which memory pages will be bound (local areas on a NUMA machine); see the sketch below.
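
On Linux, the execution bind can be inspected and set directly from Python via the standard-library :obj:`os.sched_getaffinity` and :obj:`os.sched_setaffinity` calls.
The snippet below is a minimal, Linux-only sketch of this mechanism (the memory bind additionally requires NUMA-aware tooling and is not shown):

.. code-block:: python

    import os

    # Query the set of core IDs the current process may run on:
    print(os.sched_getaffinity(0))  # e.g., {0, 1, 2, ..., 15}

    # Pin the current process (PID 0 refers to the calling process)
    # to cores 0-3:
    os.sched_setaffinity(0, {0, 1, 2, 3})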

The following article discusses readily available tools and environment settings that one can use to maximize the performance of Intel CPUs with :pyg:`PyG`.

.. note::
    Overall, CPU affinity can be a useful tool for improving the performance and predictability of certain types of applications, but one configuration does not fit all cases.
    It is important to carefully consider whether CPU affinity is appropriate for your use case, and to test and measure the impact of any changes you make.

Using :attr:`cpu-affinity` and :attr:`filter-per-worker`
---------------------------------------------------------

Each :pyg:`PyG` workload can be parallelized using the :pytorch:`PyTorch` iterator class :class:`MultiProcessingDataLoaderIter`, which is automatically enabled when :obj:`num_workers > 0` is passed to a :class:`torch.utils.data.DataLoader`.
Under the hood, it creates :obj:`num_workers` many sub-processes that run in parallel to the main process.
Setting a CPU affinity mask for the data loading processes places :class:`~torch.utils.data.DataLoader` worker threads on specific CPU cores.
In effect, it allows for more efficient data batch preparation by allocating pre-fetched batches in local memory.
Every time a process or thread moves from one core to another, registers and caches need to be flushed and reloaded.
This can become very costly if it happens often, and threads may then also no longer be close to their data, or be able to share data in a cache.
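
To make this concrete, the following is a minimal, hand-rolled sketch (not the :pyg:`PyG` implementation) that pins each :class:`torch.utils.data.DataLoader` worker to its own core via a :obj:`worker_init_fn`, again relying on the Linux-only :obj:`os.sched_setaffinity` call; :obj:`dataset` is a placeholder for any map-style dataset:

.. code-block:: python

    import os

    import torch

    def pin_worker(worker_id: int):
        # Each worker sub-process runs this at startup and pins
        # itself to the core whose ID matches its worker ID:
        os.sched_setaffinity(0, {worker_id})

    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=128,
        num_workers=4,
        worker_init_fn=pin_worker,
    )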

As of :pyg:`PyG` 2.3, the :class:`~torch_geometric.loader.NodeLoader` and :class:`~torch_geometric.loader.LinkLoader` classes officially support a native solution for CPU affinity via the :class:`~torch_geometric.loader.AffinityMixin` context manager.
CPU affinity can be enabled via the :func:`enable_cpu_affinity()` method for :obj:`num_workers > 0` use-cases, and will guarantee that a separate core is assigned to each worker at initialization.
A user-defined list of core IDs may be assigned using the :attr:`loader_cores` argument.
Otherwise, cores will be assigned automatically, starting at core ID 0.
As of now, only a single core can be assigned to a worker, hence multi-threading is disabled in worker processes by default.
The recommended number of workers to start with lies between :obj:`[2, 4]`, and the optimum may vary based on workload characteristics:

.. code-block:: python

    loader = NeighborLoader(
        data,
        num_workers=3,
        filter_per_worker=True,
        ...,
    )

    with loader.enable_cpu_affinity(loader_cores=[0, 1, 2]):
        for batch in loader:
            pass

It is generally advisable to use :obj:`filter_per_worker=True` when enabling multi-process data loaders.
The workers prepare each :obj:`input_data` tensor: first by sampling the node indices with a pre-defined sampler in :func:`collate_fn()`, and secondly by triggering :func:`filter_fn()`.
The filtering function selects node feature vectors from the complete input :class:`~torch_geometric.data.Data` tensor loaded into DRAM.
This is a memory-expensive call which takes up a significant share of each :class:`~torch.utils.data.DataLoader` iteration.
By default, :obj:`filter_per_worker` is set to :obj:`False`, in which case the :func:`filter_fn()` execution is sent back to the main process.
This can cause performance issues, because the main process may not be able to serve all requests efficiently, especially with a larger number of workers.
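
As a quick way to verify this on a given machine, one can time a full pass over the loader for both settings.
The snippet below is a hypothetical micro-benchmark sketch, in which :obj:`data` and the :class:`~torch_geometric.loader.NeighborLoader` arguments (:obj:`num_neighbors`, :obj:`batch_size`) are illustrative placeholders:

.. code-block:: python

    import time

    from torch_geometric.loader import NeighborLoader

    for filter_per_worker in [True, False]:
        loader = NeighborLoader(
            data,
            num_neighbors=[10, 5],
            batch_size=1024,
            num_workers=3,
            filter_per_worker=filter_per_worker,
        )
        start = time.perf_counter()
        for batch in loader:  # Iterate once over the full dataset.
            pass
        print(f'filter_per_worker={filter_per_worker}: '
              f'{time.perf_counter() - start:.2f}s')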