Skip to content

☕Implement of Parallel Matrix Multiplication Methods Using FOX Algorithm on Peking University's High-performance Computing System

License

Notifications You must be signed in to change notification settings

yangyang14641/Parallel-Matrix-Multiplication-FOX-Algorithm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Parallel-Matrix-Multiplication-FOX-Algorithm

☕Implement of Parallel Matrix Multiplication Methods Using FOX Algorithm on Peking University's High-performance Computing System

Contents

  1. Reference Documents
  2. Source Codes
  3. Code Test
  4. Report
  5. Imagines

Brief Introduction to Parallel Matrix Multiplication FOX Algorithm

Basic Concepts

  • 规约计算 (Reduction)
  • 拥有者计算原则 (Owner Computing Rule)
  • 流水并行(Pipeline Parallelism):
    • 在一个进程上,矩阵计算被划分为P个阶段 (P Supercomputing Steps in a Process)
  • 数据并行 (Data Parallelism):
    • 在每个进程上同时计算局部的矩阵乘积 (Local Matrix Multiplications are computing on every processess at the same Computing Step)

Serial Matrix Multiplication

  • Mathematical Modeling of Matrix Multiplication

    • C_{ij}=\sum_{k=0}^{K-1} A_{ik}B_{kj}; \quad (i=0,N-1), \quad (j=0,M-1)
  • Time Complexity

    • O\left ( N^{3} \right )
  • Storage Complexity

    • O\left ( N^{3} \right )
  • Example Implementation in C Language

for (i = 0; i < n; i++)                                      
        for (j = 0; j < n; j++)              
            for (k = 0; k < n; k++)
                C(i,j) = C(i,j) + A(i,k)*B(k,j);

Parallel Computing Modeling Design

  1. Basic Flow
  • Matrix \mathbf{A}'s Dimension is M \times K, and Matirx \mathbf{B}'s Dimension is a K \times N.
  • Compute Matrix \mathbf{C} = \mathbf{A}\mathbf{B} in parallel.
  • Let p=num(processors) is the number of processors, and q=\sqrt{p} be an integer such that it devides M and N.
  • Create a Cartesian topology with process mesh P_{ij}, and i=0..q-1, j=0..q-1.
  • Denote \hat{M} = \frac{M}{q}, \hat{K}=\frac{K}{q}, \hat{N}=\frac{N}{q}.
  • Distribute \mathbf{A} and \mathbf{B} by blocks on p processess such that A_{ij} is \hat{M} \times \hat{K} block and B_{ij} is \hat{K} \times \hat{N} block, stored on process P_{ij}.
  1. Details
  • Partitions of Matrices A, B and C. (Index syntax in Mathematical form: start from 1)

    • Matrix A

    • Matirx B

    • Matrix C

  • Data Distribution on the 2-D Cartesian Topology Processes Mesh (Index syntax in Mathematical formulars: start from 1)

    • Data Mapping

      • Data Mesh Mapping Process Mesh
    • Partition may not perfect such that every sub-matrix is a square matrix. Yet, that's not a problem, except load unbalance on each process!

    • Unbalanced Partition

      • Item Object
        Data Partition
        Data Partition
        Process Mesh
    • Mathematical Modeling of Sub-Matirx Multiplication

Parallel Algorithm Design on BSP

Parallelism type: Data parallelism with Pipeline parallelism

  1. Rewrite the formula of Sub-Matirx Multiplication as q−1 Supercomputing Steps

    • Stage Mathematical Operation
      0 \mathbf{C_{ij}}=\mathbf{A_{ii}}\mathbf{B_{ij}}
      1 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii+1}}\mathbf{B_{i+1j}}
      2 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii+2}}\mathbf{B_{i+2j}}
      ... ...
      q-2-i \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{iq-2}}\mathbf{B_{q-2j}}
      q-1-i \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{iq-1}}\mathbf{B_{q-1j}}
      ... \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i1}}\mathbf{B_{1j}}
      ... \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i2}}\mathbf{B_{2j}}
      ... ...
      q-1 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{ii-1}}\mathbf{B_{i-1j}}
    • Data parallelism: Local Matrix Multiplication operation in each processes for each supercomputing step.

  2. Parallel Modeling Algorithm Operations on each step:

    • Stage Algorithm Operation
      0 1. Process P_{ij} has \mathbf{A_{ij}}, \mathbf{B_{ij}} but needs \mathbf{A_{ii}} (for each index i)
      2. Process \mathbf{P_{ii}} broadcast \mathbf{A_{ii}} across process mesh row i
      3. Process P_{ij} computes \mathbf{C_{ij}=A_{ii}B_{ij}}
      1 1. P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{ii+1}} and \mathbf{B_{i+1j}}
      1.1 Shift the j-th block column of \mathbf{B_{ij}} by one block up (block 0 goes to block q-1) (period)
      1.2 P_{ii+1} broadcast \mathbf{A_{ii+1}} across process mesh row i
      2. Process P_{ij} Compute \mathbf{C_{ij}=C_{ij}+A_{ii+1}B_{i+1j}}
      2 1. P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{ii+2}} and \mathbf{B_{i+2j}}
      1.1 Shift the j-th block column of \mathbf{B_{ij}} by one block up (block 0 goes to block q−1) (period)
      1.2 P_{ii+2} broadcast \mathbf{A_{ii+2}} across process mesh row i
      2. Process P_{ij} Compute <img src="https://tex.s2cms.ru/svg/%5Cmathbf%7BC_%7Bij%7D%3DC_%7Bij%7D%2BA_%7Bii%2B2%7DB_%7Bi%2B2j%7D%7D" alt="\mathbf{C_{ij}=C_{ij}+A_{ii+2}B_{i+2j}}"
      ... ...
      q-2-i \mathbf{C_{ij}=C_{ij}+A_{iq-2}B_{q-2j}}
      q-1-i \mathbf{C_{ij}=C_{ij}+A_{iq-1}B_{q-1j}}
      ... \mathbf{C_{ij}=C_{ij}+A_{i1}B_{1j}}
      ... \mathbf{C_{ij}=C_{ij}+A_{i2}B_{2j}}
      ... ...
      q-1 \mathbf{C_{ij}=C_{ij}+A_{ii-1}B_{i-1j}}
    • Pipe parallelism: The (0 ~ q−1) Computing Steps for each process P_ij.

Algorithm Analysis

About

☕Implement of Parallel Matrix Multiplication Methods Using FOX Algorithm on Peking University's High-performance Computing System

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published