Name	Name	Last commit message	Last commit date
Latest commit History 46 Commits
Code Test	Code Test
Imagines	Imagines
Reference Documents	Reference Documents
Report	Report
Source Codes	Source Codes
README.md	README.md

Parallel-Matrix-Multiplication-FOX-Algorithm

☕Implement of Parallel Matrix Multiplication Methods Using FOX Algorithm on Peking University's High-performance Computing System

Reference Documents
Source Codes
Code Test
Report
Imagines

Brief Introduction to Parallel Matrix Multiplication FOX Algorithm

Basic Concepts

规约计算 (Reduction)
拥有者计算原则 (Owner Computing Rule)
流水并行(Pipeline Parallelism):
- 在一个进程上，矩阵计算被划分为P个阶段 (P Supercomputing Steps in a Process)
数据并行 (Data Parallelism):
- 在每个进程上同时计算局部的矩阵乘积 (Local Matrix Multiplications are computing on every processess at the same Computing Step)

Serial Matrix Multiplication

Mathematical Modeling of Matrix Multiplication
- $C_{ij}=\sum_{k=0}^{K-1} A_{ik}B_{kj}; \quad (i=0,N-1), \quad (j=0,M-1)$
Time Complexity
- $O\left ( N^{3} \right )$
Storage Complexity
- $O\left ( N^{3} \right )$
Example Implementation in C Language

for (i = 0; i < n; i++)                                      
        for (j = 0; j < n; j++)              
            for (k = 0; k < n; k++)
                C(i,j) = C(i,j) + A(i,k)*B(k,j);

Parallel Computing Modeling Design

Basic Flow

Matrix $\mathbf{A}$ 's Dimension is $M \times K$ , and Matirx $\mathbf{B}$ 's Dimension is a $K \times N$ .
Compute Matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ in parallel.
Let is the number of processors, and $q=\sqrt{p}$ be an integer such that it devides and .
Create a Cartesian topology with process mesh $P_{ij}$ , and , .
Denote $\hat{M} = \frac{M}{q}$ , $\hat{K}=\frac{K}{q}$ , $\hat{N}=\frac{N}{q}$ .
Distribute $\mathbf{A}$ and $\mathbf{B}$ by blocks on p processess such that $A_{ij}$ is $\hat{M} \times \hat{K}$ block and $B_{ij}$ is $\hat{K} \times \hat{N}$ block, stored on process $P_{ij}$ .

Details

Partitions of Matrices A, B and C. (Index syntax in Mathematical form: start from 1)
- Matrix A
- Matirx B
- Matrix C
Data Distribution on the 2-D Cartesian Topology Processes Mesh (Index syntax in Mathematical formulars: start from 1)
- Data Mapping
  - Data Mesh Mapping Process Mesh
- Partition may not perfect such that every sub-matrix is a square matrix. Yet, that's not a problem, except load unbalance on each process!
- Unbalanced Partition
  - Item Object
    
    Data Partition
    
    Data Partition
    
    Process Mesh
- Mathematical Modeling of Sub-Matirx Multiplication

Item	Object
Data Partition
Data Partition
Process Mesh

Parallel Algorithm Design on BSP

Parallelism type: Data parallelism with Pipeline parallelism

Rewrite the formula of Sub-Matirx Multiplication as q−1 Supercomputing Steps
- Stage Mathematical Operation
  
  0 $\mathbf{C_{ij}}=\mathbf{A_{ii}}\mathbf{B_{ij}}$
  
  1 $\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii+1}}\mathbf{B_{i+1j}}$
  
  2 $\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii+2}}\mathbf{B_{i+2j}}$
  
  ... ...
  
  q-2-i $\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{iq-2}}\mathbf{B_{q-2j}}$
  
  q-1-i $\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{iq-1}}\mathbf{B_{q-1j}}$
  
  ... $\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i1}}\mathbf{B_{1j}}$
  
  ... $\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i2}}\mathbf{B_{2j}}$
  
  ... ...
  
  q-1 $\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii-1}}\mathbf{B_{i-1j}}$
- Data parallelism: Local Matrix Multiplication operation in each processes for each supercomputing step.

Stage	Mathematical Operation
0	$\mathbf{C_{ij}}=\mathbf{A_{ii}}\mathbf{B_{ij}}$
1	$\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii+1}}\mathbf{B_{i+1j}}$
2	$\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii+2}}\mathbf{B_{i+2j}}$
...	...
q-2-i	$\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{iq-2}}\mathbf{B_{q-2j}}$
q-1-i	$\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{iq-1}}\mathbf{B_{q-1j}}$
...	$\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i1}}\mathbf{B_{1j}}$
...	$\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i2}}\mathbf{B_{2j}}$
...	...
q-1	$\mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii-1}}\mathbf{B_{i-1j}}$

Parallel Modeling Algorithm Operations on each step:

Stage	Algorithm Operation
0	1. Process $P_{ij}$ has $\mathbf{A_{ij}}$ , $\mathbf{B_{ij}}$ but needs $\mathbf{A_{ii}}$ (for each index ) 2. Process $\mathbf{P_{ii}}$ broadcast $\mathbf{A_{ii}}$ across process mesh row 3. Process $P_{ij}$ computes $\mathbf{C_{ij}=A_{ii}B_{ij}}$
1	1. $P_{ij}$ has $\mathbf{A_{ij}}$ and $\mathbf{B_{ij}}$ but needs $\mathbf{A_{ii+1}}$ and $\mathbf{B_{i+1j}}$ 1.1 Shift the block column of $\mathbf{B_{ij}}$ by one block up (block goes to block ) (period) 1.2 $P_{ii+1}$ broadcast $\mathbf{A_{ii+1}}$ across process mesh row 2. Process $P_{ij}$ Compute $\mathbf{C_{ij}=C_{ij}+A_{ii+1}B_{i+1j}}$
2	1. $P_{ij}$ has $\mathbf{A_{ij}}$ and $\mathbf{B_{ij}}$ but needs $\mathbf{A_{ii+2}}$ and $\mathbf{B_{i+2j}}$ 1.1 Shift the block column of $\mathbf{B_{ij}}$ by one block up (block goes to block ) (period) 1.2 $P_{ii+2}$ broadcast $\mathbf{A_{ii+2}}$ across process mesh row 2. Process $P_{ij}$ Compute <img src="https://tex.s2cms.ru/svg/%5Cmathbf%7BC_%7Bij%7D%3DC_%7Bij%7D%2BA_%7Bii%2B2%7DB_%7Bi%2B2j%7D%7D" alt="\mathbf{C_{ij}=C_{ij}+A_{ii+2}B_{i+2j}}"
...	...
q-2-i	$\mathbf{C_{ij}=C_{ij}+A_{iq-2}B_{q-2j}}$
q-1-i	$\mathbf{C_{ij}=C_{ij}+A_{iq-1}B_{q-1j}}$
...	$\mathbf{C_{ij}=C_{ij}+A_{i1}B_{1j}}$
...	$\mathbf{C_{ij}=C_{ij}+A_{i2}B_{2j}}$
...	...
q-1	$\mathbf{C_{ij}=C_{ij}+A_{ii-1}B_{i-1j}}$