☕Implement of Parallel Matrix Multiplication Methods Using FOX Algorithm on Peking University's High-performance Computing System
- Reference Documents
- Source Codes
- Code Test
- Report
- Imagines
- 规约计算 (Reduction)
- 拥有者计算原则 (Owner Computing Rule)
- 流水并行(Pipeline Parallelism):
- 在一个进程上,矩阵计算被划分为P个阶段 (P Supercomputing Steps in a Process)
- 数据并行 (Data Parallelism):
- 在每个进程上同时计算局部的矩阵乘积 (Local Matrix Multiplications are computing on every processess at the same Computing Step)
-
Mathematical Modeling of Matrix Multiplication
-
Time Complexity
-
Storage Complexity
-
Example Implementation in C Language
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
for (k = 0; k < n; k++)
C(i,j) = C(i,j) + A(i,k)*B(k,j);
- Basic Flow
- Matrix
's Dimension is
, and Matirx
's Dimension is a
.
- Compute Matrix
in parallel.
- Let
is the number of processors, and
be an integer such that it devides
and
.
- Create a Cartesian topology with process mesh
, and
,
.
- Denote
,
,
.
- Distribute
and
by blocks on p processess such that
is
block and
is
block, stored on process
.
- Details
-
Partitions of Matrices A, B and C. (Index syntax in Mathematical form: start from 1)
-
Data Distribution on the 2-D Cartesian Topology Processes Mesh (Index syntax in Mathematical formulars: start from 1)
-
Data Mapping
-
Partition may not perfect such that every sub-matrix is a square matrix. Yet, that's not a problem, except load unbalance on each process!
-
Unbalanced Partition
-
Mathematical Modeling of Sub-Matirx Multiplication
-
Parallelism type: Data parallelism with Pipeline parallelism
-
Rewrite the formula of Sub-Matirx Multiplication as q−1 Supercomputing Steps
-
Parallel Modeling Algorithm Operations on each step:
-
Stage Algorithm Operation 0 1. Process has
,
but needs
(for each index
)
2. Processbroadcast
across process mesh row
3. Processcomputes
1 1. has
and
but needs
and
1.1 Shift theblock column of
by one block up (block
goes to block
) (period)
1.2broadcast
across process mesh row
2. ProcessCompute
2 1. has
and
but needs
and
1.1 Shift theblock column of
by one block up (block
goes to block
) (period)
1.2broadcast
across process mesh row
2. ProcessCompute <img src="https://tex.s2cms.ru/svg/%5Cmathbf%7BC_%7Bij%7D%3DC_%7Bij%7D%2BA_%7Bii%2B2%7DB_%7Bi%2B2j%7D%7D" alt="\mathbf{C_{ij}=C_{ij}+A_{ii+2}B_{i+2j}}"
... ... q-2-i q-1-i ... ... ... ... q-1 -
Pipe parallelism: The (0 ~ q−1) Computing Steps for each process P_ij.
-
-