High-Performance Computing (HPC) - Course Summary #1 (hpc_01.pdf)
- Hiroyuki TAKIZAWA (滝沢寛之)
- Professor at Cyberscience Center, Tohoku University
- Research Interests: HPC software and systems, computer architectures, parallel and distributed computing systems, and applications
- Contact: takizawa@tohoku.ac.jp
- Class Schedule
- Introduction to HPC
- Supercomputers and Parallel Computing
- The Modern Scientific Method
- Target Applications
- Floating-Point Operations (FLOPS)
- Parallel Computing
- Parallel Architectures
- Power Calculation Example
- System Overview
- Software Overview
- Benefits of Parallel Computing
- Importance of Application Parallelism
- Seeking Concurrency
- Parallelism in Graphs
- Parallelization Ratio and Amdahl’s Law
- Parallel Programming
- Accelerators and Heterogeneous Computing
- System Performance Metrics
- Heterogeneous Computing
- Latency-Oriented Features
- Cache Memory
- Branch Prediction
- Strategies for High Throughput
- Conclusion
- 1-Oct: Introduction to HPC (1)
- 1-Oct: Introduction to HPC (2)
- 8-Oct: Parallel Architectures
- 8-Oct: How to use Supercomputer AOBA
- 15-Oct: Parallel Algorithm Design (1)
- 15-Oct: Parallel Algorithm Design (2)
- 5-Nov: MPI Programming (1)
- 5-Nov: MPI Programming (2)
- 12-Nov: OpenMP Programming (1)
- 12-Nov: OpenMP Programming (2)
- 26-Nov: Hybrid Programming
- 26-Nov: Performance Modeling and Analysis
- HPC systems, often called supercomputers, use parallel computing to perform many operations at once.
- The importance of parallelism in modern hardware and software.
- Methods for exploiting parallelism and the impact of software parallelism.
- A supercomputer is designed to perform a vast number of floating-point operations per second (FLOPS).
- Key applications include simulation, AI, and machine learning, which have a high demand for computing power.
- Traditional scientific methods involve hypothesis, experimentation, and observation.
- Modern methods incorporate simulation and data science to predict and analyze complex systems.
- Scientific problems like boundary value problems and heat conduction simulations.
- Growth in AI and ML applications driving demand for larger datasets and computational power.
- A supercomputer's performance is measured by the number of floating-point operations per second (FLOPS).
- A floating-point number is represented by a mantissa (significand) and an exponent, similar to scientific notation (see the sketch after this list).
- This extends the representable range of values.
- However, it requires more complicated processing for arithmetic operations.
- High performance in numerical simulations requires the ability to handle a large number of floating-point calculations quickly.
- The performance of a supercomputer is discussed based on how many floating-point operations the supercomputer can execute per unit time.
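As a brief illustration of the mantissa-exponent representation (my own example, not from the slides), the C library function `frexp` splits a double into its normalized significand and binary exponent:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 123.456;           /* arbitrary example value */
    int exp;
    double mant = frexp(x, &exp); /* x == mant * 2^exp, with 0.5 <= |mant| < 1 */
    printf("%g = %g * 2^%d\n", x, mant, exp);
    return 0;
}
```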
Chinese Characters (Kanji) | Value | SI Prefixes (International System of Units) | Value |
---|---|---|---|
万 (man) | 10^4 | K (kilo) | 10^3 |
億 (oku) | 10^8 | M (mega) | 10^6 |
兆 (chou) | 10^12 | G (giga) | 10^9 |
京 (kei) | 10^16 | T (tera) | 10^12 |
垓 (gai) | 10^20 | P (peta) | 10^15 |
秭/𥝱 (jo) | 10^24 | E (exa) | 10^18 |
穣 (jou) | 10^28 | Z (zetta) | 10^21 |
... | ... | Y (yotta) | 10^24 |
無量大数 (muryoutaisuu) | 10^68 | | |
Note: the K computer delivers about 10 Pflop/s = 10^16 flop/s (京, kei).
- Parallel computing is the practice of executing many operations simultaneously by identifying and exposing parallelism in algorithms. Expressing that parallelism in the software implementation requires understanding the costs, benefits, and limitations of the chosen approach.
- Parallel computing focuses on performance with different goals such as speedup, problem size scaling, and energy efficiency.
- Modern computers no longer rely on increasing clock speed for performance, but instead, add more processing cores (parallelism).
- Parallel computing systems save power by distributing tasks across multiple processors running at lower frequencies.
- Dynamic power consumption can be modeled as $P_d = \alpha \times f \times V_{dd}^2 \times N_{core}$, where:
  - $P_d$: dynamic power consumption
  - $\alpha$: proportionality constant (switching activity and capacitance)
  - $f$: clock frequency
  - $V_{dd}$: power-supply voltage
  - $N_{core}$: number of cores
- To illustrate the power-saving benefits of parallel computing, consider the following two system configurations (a small C sketch of the same arithmetic follows below):
  - System 1: a single processor with a clock frequency of 2 GHz and a power-supply voltage of 1.5 V.
    - $f = 2\,GHz$, $V_{dd} = 1.5\,V$, $N_{core} = 1$: $P_d = \alpha \times 2000 \times 1.5^2 \times 1 = 4500\,\alpha$
  - System 2: four processors, each with a clock frequency of 500 MHz and a power-supply voltage of 0.8 V.
    - $f = 0.5\,GHz$, $V_{dd} = 0.8\,V$, $N_{core} = 4$: $P_d = \alpha \times 500 \times 0.8^2 \times 4 = 1280\,\alpha$
  - If the four processors of System 2 work perfectly in parallel, they achieve the same performance as System 1 with only about 28% of the power consumption. This demonstrates that parallel computing is power-effective.
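A minimal C sketch of the same arithmetic; the function name and the decision to report power in units of $\alpha$ are mine, not part of the lecture material.

```c
#include <stdio.h>

/* Dynamic power model from above: P_d = alpha * f * Vdd^2 * N_core.
 * The constant alpha is factored out, so results are in units of alpha;
 * f is given in MHz. */
static double dynamic_power(double f_mhz, double vdd, int cores)
{
    return f_mhz * vdd * vdd * cores;
}

int main(void)
{
    double p1 = dynamic_power(2000.0, 1.5, 1);  /* System 1 -> 4500 alpha */
    double p2 = dynamic_power(500.0, 0.8, 4);   /* System 2 -> 1280 alpha */
    printf("System 1: %.0f alpha\n", p1);
    printf("System 2: %.0f alpha (%.0f%% of System 1)\n", p2, 100.0 * p2 / p1);
    return 0;
}
```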
- CPU-GPU Heterogeneous Systems: Systems are composed of multiple CPUs and GPUs, connected by high-speed networks.
- CPUs are designed for latency-oriented tasks while GPUs are optimized for throughput-oriented tasks.
- Contains multiple cores for processing.
- Vector unit:
  - Capable of loading and storing $X$ values at a time.
  - Performs parallel operations on these values using SIMD (Single Instruction, Multiple Data); see the sketch after this list.
- Array processing:
  - Processes arrays by loading and storing chunks of values (e.g., 4 values at a time).
- AOBA:
  - A vector computer that can process 256 values with a single instruction.
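As a hedged sketch of how such data parallelism is expressed in code (the `omp simd` hint and the array size are illustrative assumptions, not taken from the slides), the loop below can be compiled into vector instructions that load, add, and store several elements at a time:

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* initialize the inputs */
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    /* Hint to the compiler that this loop may be executed with SIMD
     * instructions, processing several elements per instruction. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %g\n", c[10]);
    return 0;
}
```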
- Processes and Threads: Software is executed by multiple processes, each assigned to a node and run on one or more cores.
- Software must be written to exploit thread-level parallelism (OpenMP) and process-level parallelism (MPI).
When a program is launched, several kinds of resources such as CPU time and memory are allocated for the execution. A unit of allocated resource is called a process. An application (user program) is executed by multiple processes.
Each process is assigned to a node, which runs the OS, such as Linux.
The OS on each node decides the core(s) to execute the process. A process can be executed by multiple cores. The execution sequence on each core is called a thread.
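A minimal sketch of this process/thread structure, assuming an MPI library and an OpenMP-capable compiler are available (e.g., building with `mpicc -fopenmp`); each MPI process is one unit of allocated resources, and the OpenMP threads are the execution sequences on the cores assigned to it:

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    /* Each process spawns a team of threads on the cores it was given. */
    #pragma omp parallel
    {
        printf("process %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Launching it with, for example, `mpirun -np 4 ./a.out` would create four processes, each printing one line per thread.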
- Faster runtime: More cores reduce execution time.
- Larger problem sizes: More compute nodes allow solving larger problems.
- Energy efficiency: More but slower cores can perform computations while consuming less power.
Energy = (number of CPUs) × (power per CPU) × (execution time in hours)
- Successful parallel computing depends on well-parallelized applications. Applications need to be optimized to take advantage of hardware parallelism.
- Vertex:
  - Represents a task that needs to be completed.
- Edge:
  - A directed edge from task u to task v means that task u must be completed before task v begins.
- Concurrency:
  - If there is no edge between tasks u and v, they are independent and may be performed concurrently.
- Parallel execution:
  - Multiple tasks are executed physically at the same time.
- Concurrent execution:
  - Multiple tasks can be in progress at the same time, but are not necessarily executed at the same moment.
- Data parallelism:
  - Multiple instances of the same task (e.g., B, B, B) that depend only on a common parent task (A), and not on each other, can be processed in parallel before moving on to subsequent dependent tasks (e.g., C).
- Functional parallelism (task parallelism):
  - Different tasks (e.g., C, D, E) stemming from different parts of the graph can be executed in parallel (see the OpenMP task sketch after this list).
- No parallelism:
  - A linear dependency structure (A → B → C) allows no parallelism, because each task depends on the completion of the previous one.
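A hedged OpenMP sketch of the data-parallel and functional-parallel patterns above; the task names mirror the example graphs, and expressing them with OpenMP tasks is my choice of illustration rather than the lecture's:

```c
#include <stdio.h>
#include <omp.h>

/* Placeholder "task" bodies; a real code would do actual work here. */
static void work(const char *name)
{
    printf("task %s ran on thread %d\n", name, omp_get_thread_num());
}

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        work("A");                  /* A has no predecessors */

        /* Data parallelism: three instances of B depend only on A,
         * not on each other, so they may run concurrently. */
        for (int i = 0; i < 3; i++) {
            #pragma omp task
            work("B");
        }
        #pragma omp taskwait        /* join: all B tasks must finish (edges B -> C) */

        /* Functional (task) parallelism: C, D, and E are different tasks
         * with no edges among them, so they may also run concurrently. */
        #pragma omp task
        work("C");
        #pragma omp task
        work("D");
        #pragma omp task
        work("E");
        #pragma omp taskwait
    }
    return 0;
}
```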
Amdahl’s Law is a fundamental principle in parallel computing that describes the potential speedup from parallelizing a task. The formula is:

$$S(n) = \frac{1}{(1 - \alpha) + \dfrac{\alpha}{n}}$$

Where:
- $\alpha$ is the fraction of the program that can be parallelized.
- $n$ is the number of processors.
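A small C sketch of this formula; the value $\alpha = 0.8$ is an illustrative assumption (it matches the 80% example at the end of this summary):

```c
#include <stdio.h>

/* Amdahl's law: S(n) = 1 / ((1 - alpha) + alpha / n),
 * where alpha is the parallelizable fraction and n the processor count. */
static double amdahl_speedup(double alpha, int n)
{
    return 1.0 / ((1.0 - alpha) + alpha / (double)n);
}

int main(void)
{
    double alpha = 0.8;   /* example: 80% of the work is parallelizable */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d  ->  speedup = %.2f\n", n, amdahl_speedup(alpha, n));
    /* The speedup approaches 1 / (1 - 0.8) = 5 as n grows. */
    return 0;
}
```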
- Parallel programming involves writing software that explicitly indicates the parallel portions of the program that can be executed concurrently.
- Standards like OpenMP and MPI are essential for writing parallel programs.
- Accelerators like GPUs and many-core processors (e.g., Intel Xeon Phi) are used to enhance computational performance.
- These accelerators must be programmed to exploit both parallelism and the heterogeneity of hardware resources.
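One possible way to program such accelerators from C is OpenMP target offload. The sketch below is a minimal, hedged example; it assumes a compiler and runtime with offloading support, and otherwise the loop simply runs on the host:

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Offload the loop to an accelerator (e.g., a GPU) if one is
     * available; data movement is described with map clauses. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %g\n", y[0]);   /* expected: 4 */
    return 0;
}
```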
- Latency: The time to complete a single task (shorter is better).
- Throughput: The number of tasks completed in a given time (higher is better).
In the example comparison of two systems, System 1 is better in terms of latency, while System 2 is better in terms of throughput.
- CPUs have a large cache memory and a control unit.
- Latency-oriented design is focused on minimizing delays and managing speculative tasks.
- GPUs allocate more hardware resources to ALUs (Arithmetic Logic Units).
- Throughput-oriented design is aimed at maximizing the number of parallel tasks executed at once.
- CPU:
- Contains Control, Cache, ALUs, and DRAM.
- GPU:
- Focuses heavily on ALUs with large amounts of resources allocated to parallel processing, with DRAM for memory.
- Cache Memory: Stores data that is likely to be accessed soon, reducing latency.
- Out-of-Order (OoO) Execution and Branch Prediction:
  - The processor predicts upcoming branches and executes instructions speculatively, so it can keep working instead of waiting for the branch outcome.
- Memory is much slower than the CPU:
  - To mitigate this, CPUs use cache memory to prepare for future memory accesses.
- A typical computation repeatedly performs three steps:
  - Read data from memory.
  - Process the data.
  - Write the data back to memory.
- Caching process:
  - Without a cache, the speed is limited by memory access times.
  - With a cache, the speed improves because the slower main memory is accessed less often.
- Data to be cached:
  - Determined based on the locality of reference, which predicts that recently accessed memory locations, and locations near them, are likely to be accessed again soon (see the sketch after this list).
- Cache Hit Ratio:
- Increasing the cache hit ratio is critical for improving performance in modern processors.
- Cache memory occupies a significant portion of the chip to ensure high hit ratios.
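A hedged C sketch of locality of reference: both loops below compute the same sum over a matrix, but the first traverses it in row-major order (matching C's storage layout, so consecutive accesses fall in the same cache lines), while the second traverses it column-wise and therefore tends to miss the cache far more often. The matrix size is an arbitrary choice for illustration.

```c
#include <stdio.h>

#define N 2048
static double m[N][N];   /* stored row by row (row-major) in C */

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = 1.0;

    /* Good locality: consecutive j accesses touch adjacent addresses,
     * so most of them are served from the cache. */
    double sum_row = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum_row += m[i][j];

    /* Poor locality: consecutive i accesses are N*8 bytes apart,
     * so many of them miss the cache. */
    double sum_col = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum_col += m[i][j];

    printf("%g %g\n", sum_row, sum_col);
    return 0;
}
```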
- Goal: Predict the branch target to execute either A or B speculatively.
- This allows the CPU to continue working without waiting for the branch decision to be finalized.
```c
if ( i == 0 ){
    A;
}
B;
```

```c
for ( i = 0; i < N; i++ ){
    A;
}
B;
```
- Conditional Branch:
  - If a condition is met (e.g., `i == 0`), the processor executes A; otherwise it executes B.
  - There is often a statistical bias in the branch targets that can be leveraged to improve prediction accuracy.
- Loop:
  - For loop structures (e.g., `for (i = 0; i < N; i++)`), A is executed repeatedly, followed by B once the loop ends.
- Speculative Execution:
- Reduces latency by avoiding pipeline stalls when the processor predicts which branch will be taken.
- This makes the processor more complex and increases the hardware cost of each core.
- No speculation features:
- These cores are designed to be simple, avoiding complex speculative execution to increase the total number of cores on a chip.
- Many tasks (threads) executed in parallel:
- Multiple threads are executed simultaneously, though the execution speed of an individual thread may be slower.
- Context Switching:
- When a thread is stalled (e.g., waiting for memory access), another thread is executed on the core to keep it active. This is referred to as a context switch.
- Hiding Memory Access Latency:
- Memory access latency is hidden by allowing another thread to execute while one is waiting for memory.
- Use of Registers:
- A large number of registers are used to reduce the overhead caused by context switching, allowing for smoother transitions between threads.
- A timeline illustration shows how Core A performs a mix of computations and memory accesses, switching between threads as needed to maintain high throughput over time.
- This course introduces HPC and covers both theoretical and practical aspects of high-performance computing.
- Topics include parallelism in modern hardware, application optimization for parallel execution, and the use of tools like OpenMP and MPI.
- Key focus areas include power efficiency, system architectures, and parallel programming techniques.
If 80% of a task ($\alpha = 0.8$) can be parallelized, Amdahl’s Law limits the overall speedup to at most $1/(1 - 0.8) = 5\times$, no matter how many processors are used.