[Roadmap] CPU Performance Optimization for PyG #4891
Thanks for this detailed issue and roadmap. PyTorch recently released
OK, I see. Then it is better to optimize scatter_reduce in torch. I just checked the code: scatter_add and scatter_reduce share the same kernel in torch, so they have the same performance issues. Will have it fixed.
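For context, a small hedged example (shapes made up) of why the two ops coincide: `scatter_reduce` with `reduce="sum"` computes the same result as `scatter_add_`, so they exercise the same kernel path:

```python
import torch

src = torch.randn(6, 4)
index = torch.tensor([0, 2, 2, 1, 0, 2])
idx2d = index.view(-1, 1).expand_as(src)   # broadcast index over features

out_add = torch.zeros(3, 4).scatter_add_(0, idx2d, src)
out_red = torch.zeros(3, 4).scatter_reduce(0, idx2d, src, reduce="sum")
assert torch.allclose(out_add, out_red)    # same result, same performance profile
```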
Hey @mingfeima, this issue is great, a lot of detail.
I'm a complete newbie to this, so my question is to learn, not to suggest something. Can you explain what you're intending to change here? It sounds like you want to keep it sequential, but maybe make the actual sampling itself parallel?
Yes, that's the idea! The data loader from PyTorch is more suitable for GPU (by setting num_workers=N, it will launch data-loading threads that run asynchronously with the main computation thread). On CPU, it is probably better to run data loading and computation sequentially, while parallelizing the sampler inside the data loader with OpenMP threads.
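As a rough sketch of the two configurations (the stand-in dataset and thread count below are placeholders, not from this thread):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 128))  # stand-in dataset

torch.set_num_threads(20)  # e.g. one OpenMP thread per physical core

# GPU-style: overlap loading with compute using asynchronous workers.
gpu_style = DataLoader(dataset, batch_size=1024, num_workers=4)

# CPU-style: keep loading on the main thread, sequential with compute,
# and let the sampler/operators parallelize internally via OpenMP.
cpu_style = DataLoader(dataset, batch_size=1024, num_workers=0)
```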
That makes complete sense. Let me know if you'd like any help (I'd likely not be quick though 😅)
@mingfeima I assume the benchmarks have been run with
The current benchmark profiling results use the default setting. Some scripts, for example to_hetero_mag, would explicitly set the
The second part takes more time, so it is still possible to improve it with a single worker + parallel OpenMP. If we use

Actually, the data loader optimization is a rather complex issue, perhaps more complex than optimizing the kernels :( since it is more of a tuning job to achieve the most balanced situation between the workload payload (memory footprint, computation complexity, etc.) and the hardware capacity (IO, memory bandwidth, ALU flops). Usually we do not do data loading optimizations, since the real case in deployment would probably be even more complex (some vendors have mechanisms like prefetching and batching to improve overall user experience and efficiency). But the thing is, DGL has done some optimizations here, so we need to do at least something similar, otherwise the out-of-box performance of PyG would look bad. Anyway, we will make sure that OpenMP has the correct settings either
Updates on the initiative

Depending on the type of the

Problem description

A typical input shape for the reddit dataset looks like:

So we pick rows from 477k indices and update the dst index.

Algorithm

There exist a couple of algorithms to fix the write conflict, such as: a) sorting; b) segment mutex; c) atomic; d) shared buffer, ... So,
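For intuition, here is a minimal Python sketch of approach a), sorting (the function name and shapes are illustrative; the actual kernel is native C++ parallelized with OpenMP):

```python
import torch

def scatter_add_by_sorting(src, index, num_rows):
    # Sort the destination index so that all updates to the same output
    # row become one contiguous segment; segments can then be reduced
    # independently, with no write conflict between them.
    sorted_index, perm = index.sort()
    src = src[perm]
    # CSR-style row pointers marking segment boundaries.
    counts = torch.bincount(sorted_index, minlength=num_rows)
    row_ptr = torch.zeros(num_rows + 1, dtype=torch.long)
    row_ptr[1:] = counts.cumsum(0)
    out = torch.zeros(num_rows, src.size(1), dtype=src.dtype)
    # In the native kernel this loop is OpenMP-parallel; each thread owns
    # disjoint output rows, so no atomics are needed.
    for row in range(num_rows):
        lo, hi = row_ptr[row].item(), row_ptr[row + 1].item()
        if lo < hi:
            out[row] = src[lo:hi].sum(dim=0)
    return out
```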
Result

I used SAGE+reddit from https://github.com/pyg-team/pytorch_geometric/blob/master/examples/reddit.py for inference, on an ICX Xeon single socket, 20 cores @ 2.50GHz. End-to-end inference time reduced from
There are still some TODOs to follow up on, which will bring additional performance improvements:
Just to understand: Does this mean that we first sort
Yes, that's the idea. The overhead is not only the sorting; we also have to calculate the row_ptr indices...
As far as I understand, this would refer to a
Yes, the current

Because of the semantics limitation, we cannot skip the sorting, since on the PyTorch side there is no guarantee that

As for the optimization techniques of

The overhead is actually not mainly from the sorting itself but from memory allocation; I will switch to the c10 CPU allocator to see if it helps. Aside from that, we probably have a chance to cache the sorted index so as to save the sorting for consecutive layers. Maybe add an attribute called

Only a rough idea at the moment; my point is that we first clear the performance bottlenecks from torch (so we optimize
Got it, thanks for clarifying!
Update on

Port

Now only

Select benchmark from
To break down the optimization scheme a little bit:
The
We can see that the first 2 threads have more payload than the others; we need to balance the thread payload here. Normally we could use dynamic scheduling for OpenMP, but this won't fit into PyTorch's
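For illustration, one common static alternative is to cut the row range at equal shares of the total edge count rather than equal row counts; a hedged sketch (the helper below is illustrative, not the actual kernel code):

```python
import bisect

def balanced_row_chunks(row_ptr, num_threads):
    # row_ptr[i] is the running edge count up to row i (CSR offsets).
    # Cutting at equal shares of the edge count keeps a few high-degree
    # rows from overloading one thread.
    total_edges = row_ptr[-1]
    cuts = [0]
    for t in range(1, num_threads):
        target = total_edges * t // num_threads
        cuts.append(bisect.bisect_left(row_ptr, target))
    cuts.append(len(row_ptr) - 1)
    return list(zip(cuts[:-1], cuts[1:]))  # [(row_begin, row_end), ...]

# Example: 6 rows whose edge counts are [40, 30, 5, 5, 10, 10].
row_ptr = [0, 40, 70, 75, 80, 90, 100]
print(balanced_row_chunks(row_ptr, 4))
```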
Update on Training optimization

Optimize

When the

GCN-Reddit (before)
GCN-Reddit (after)
In the classic PyG use case for message passing, `gather` has an `index` tensor in a broadcasted shape, e.g. shape `[5000, 128]` with stride `[1, 0]`. That indicates the gather is done on each row of the self tensor. The current implementation tries to parallelize on the inner dimension, which performs badly on CPU and cannot be vectorized. This PR addresses this use case and optimizes it in a manner similar to `index_select`: parallelize on the outer dimension of `index` and do a vectorized copy on the inner dimension. Performance benchmarking on Xeon Ice Lake single socket on `GCN`: `gather` reduced from `150.787ms` to `10.926ms`. After this optimization, `gather` will no longer be the major bottleneck for training GNN models when `EdgeIndex` is in COO format. For more details, please refer to pyg-team/pytorch_geometric#4891 (comment)
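To make the shape/stride pattern concrete, here is a small hedged example (sizes are illustrative):

```python
import torch

x = torch.randn(10000, 128)             # node features
row = torch.randint(0, 10000, (5000,))  # source node of each edge
# Broadcasted index: shape [5000, 128], stride [1, 0]; every column holds
# the same edge list, so gather effectively selects whole rows of x.
index = row.view(-1, 1).expand(5000, 128)
print(index.stride())                   # (1, 0)
out = x.gather(0, index)
assert torch.equal(out, x.index_select(0, row))
```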
Sort out the 2nd stage optimization a little bit:
Pull Request resolved: #87586. Approved by: https://github.com/rusty1s, https://github.com/malfet
🚀 The feature, motivation and pitch
The goal of this roadmap is to optimize CPU performance for PyG (including torch_scatter, torch_sparse).

For the first step, we will start with single-node inference performance optimization on:
The next step will extend the optimization effort to (distributed) training.
Performance Profiling
CPU platform: Icelake Xeon
Generic benchmarking
- (torch_sparse::spmm_sum 96.04%)
- (DataLoader 83.49%, aten::scatter_add_ 8.47%)
- (DataLoader 59.83%, aten::scatter_add_ 24.76%)
- (aten::scatter_add_ 27.61%, DataLoader 25.70%, aten::index 20.26%)
- (aten::scatter_add_ 30.91%, torch_scatter::scatter_max 24.54%, aten::mm 10.34%, aten::index_select 6.71%); most of the models under pytorch_geometric/benchmark/citation have similar behavior from a performance perspective.
- (aten::addmm 21.69%, aten::scatter_add_ 20.60%, aten::index_select 13.48%, DataLoader 12.31%)
- (torch_scatter::scatter_max 39.34%, torch_scatter::scatter_min 39.25%); need follow-up: get the scatter_reduce tensor shape/stride (similar issue as aten::scatter_add_?)
- (torch_scatter::scatter_max 66.91%, torch_cluster::knn 23.56%); source: benchmark/points/edge_cnn.py
- (torch_scatter::scatter_max 53.61%, aten::index_select 21.73%, DataLoader 16.11%); source: [Benchmark] Adding inference benchmark suite #4915

Large dataset benchmarking
DataLoader (with preprocessing of the input data) is the major bottleneck here, mostly from_numpy (246s) and to (169s), triggered by data type conversion; source: convert_batch.
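A hedged illustration of this conversion pattern (the array names and sizes below are made up, not taken from convert_batch):

```python
import numpy as np
import torch

# Per-batch collation cost: NumPy arrays arrive as float64 and get
# converted for every batch.
feats = np.random.rand(4096, 128)               # float64 by default
x = torch.from_numpy(feats).to(torch.float32)   # from_numpy + to -> extra copy

# Cheaper: store the source arrays in the target dtype up front, so
# from_numpy is a zero-copy view and the .to() call goes away.
feats32 = feats.astype(np.float32)              # done once, offline
x = torch.from_numpy(feats32)
```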
Performance Hotspots

Major hotspot: the data loader (NeighborSampler).

Python level API upgrade in model scripts
The DataLoader is a major hotspot, so the first step is to upgrade the DataLoader from NeighborSampler to NeighborLoader, which has a native C++ implementation:
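A minimal sketch of the upgrade (the toy Data object below is a stand-in; any graph with edge_index and a node mask works the same way):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader, NeighborSampler

# Stand-in graph with random features and edges.
data = Data(x=torch.randn(100, 16),
            edge_index=torch.randint(0, 100, (2, 400)))
data.train_mask = torch.ones(100, dtype=torch.bool)

# Before: NeighborSampler drives the sampling loop from Python.
sampler = NeighborSampler(data.edge_index, sizes=[25, 10],
                          node_idx=data.train_mask, batch_size=32)

# After: NeighborLoader delegates sampling to the native C++ routine
# (torch_sparse::neighbor_sample).
loader = NeighborLoader(data, num_neighbors=[25, 10],
                        input_nodes=data.train_mask, batch_size=32)
```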
Native level kernel optimization
Phase One Optimizations
- NeighborLoader parallelization: the current impl is sequential (probably to avoid oversubscription with multiple workers on the data loader). Unlike GPU runs, running the data loading thread asynchronously with the computation thread does not always make sense. On some occasions it makes more sense to run the data loading step and computation step sequentially while making each torch operator parallel with OpenMP (the intra-parallelism case). Hotspot: torch_sparse::neighbor_sample.
- aten::sort: GCN + ogbn-products spends roughly 1/3 of its time on sort in the preprocessing step (which is not covered by the profiler result for model inference), introduced by indexing from a sparse tensor at gnn.py#L123. Root cause: aten::sort(dim) can only be parallelized on dimensions != dim, and the grain size is not correctly set. Fixed by #74897.
- spmm_{max|mean|sum} (torch_sparse): add vectorization and prefetching (for indirect memory access) and apply blocking on M and K (if necessary).
- scatter_add and scatter_max (torch_scatter): optimized scatter_add (with extended index) in #82703; still needs more polishing work. The current impl of scatter_add tries to parallelize on the inner dimension to avoid write conflicts, while ideally we should parallelize on the outer dimension and vectorize on the inner dimension, which requires resolving the write conflict on the output tensor. Experiment with different impls for the given input range.
- index_select: optimized via #76868.
- index: directly optimizing index would be difficult; maybe we can change it to more performant ops like index_select in NeighborLoader, or customize its kernel for NeighborLoader.

Phase Two Optimizations
- scatter_reduce / segment_reduce: align the reduce types between scatter_reduce, spmm_reduce, and segment_reduce.
- TensorTypeId of CPU and SparseCPU.
- scatter_add: cache the sorted index.
- knn (torch_cluster): need follow-up shape info to determine the proper method to parallelize the kernel. Evaluate knn from oneAPI dal.

Design option for vectorization
To vectorize kernels from torch-sparse and torch-scatter, we have multiple options:
1. Use #pragma omp simd and add the compiler flag march=skylake-avx512, but this won't apply to bfloat16 (bfloat16 is an overload of uint16 and won't be vectorized properly by the compiler).
2. Use at::vec::Vectorized<scalar_t>; this will apply to bfloat16, but we need to customize the cmake scripts to make them compatible with PyTorch's CPU build flags: _DEFAULT (scalar code), _AVX2 and _AVX512.
3. at::vec::Vectorized<scalar_t> will work without any change, but we need to move the operators from torch-sparse/torch-scatter into torch. Makes more sense for the fused kernel of GAS.

(The current decision is to go with option 3 as much as we can.)
Bfloat16 enabling in torch-sparse/torch-scatter
(highly related to the vectorization method chosen)
Validation