Computer Architecture Design (CAD) - Course Summary #3 (chapter3.pdf)
- How computers handle machine instructions
- What is Pipelining?
- Pipelining: A Mechanism to Increase the Throughput in General-Purpose Register Architecture Processors
- Example of a Pipelined RISC Processor (Cont’d)
- A RISC Data Path Drawn in a Pipeline Fashion
- A Pipeline with Pipeline Registers
- Basic Performance Issues in Pipelining
- Major Hurdle of Pipelining: Pipeline Hazards
- Performance of Pipelines with Stalls
- Performance of Pipelines with Stalls (cont’d)
- The Major Hurdle of Pipelining-Pipeline Hazards
- Structural Hazards
- A Processor with Only One Memory Port
- A Pipeline Stalled for a Structural Hazard
- Consideration about Structural Hazard
- Data Hazards
- Minimizing Data Hazard Stalls by Forwarding
- Data forwarding
- Implementation of Data Forwarding
- Data Hazards Requiring Stalls
- Pipeline Interlocking to Preserve Correct Execution
- Control (Branch) Hazards
- Reducing Pipeline Branch Penalties
- Scheduling the Branch Delay Slot
- Implementation of the MIPS Data Path (non-pipelined)
- How is Pipelining Implemented?
- Extending the MIPS Pipeline to Handle Multicycle Operations
- Example: MIPS R4000 Pipeline
- Conclusion
- Machine instructions are stored in memory
- The program counter specifies the currently executing instruction
- The computer has to read instructions from memory and decode them to determine their operations
- The sizes of data and instructions are fixed, i.e., aligned in memory
- Easy to identify the location of the next instruction without decoding the current instruction
- Data to be processed are stored in a memory or internal memory devices named Registers
- Necessary data are specified in machine instructions explicitly
- More flexibility in data placement in memory or registers
- Operations applied to data in registers to avoid long latency of memory accesses
- Load data from register and/or memory before operations
- Register-Register architecture
- All the data should be brought from Memory to internal registers before operations
- Data for operations are read from registers and the result is stored in a register.
- RISC Architecture
- MIPS uses fixed-size instructions (4 bytes wide); memory is byte-addressed and accesses must be aligned.
- PC-relative addressing for branches and jumps
- Condition registers used
- Fixed-length instruction to gain implementation benefits while sacrificing average code size
- Position and their meanings of fields of a machine instruction are fixed
- Easy to implement a pipeline mechanism
- Computer takes an action based on the decoded instruction
- Load/store operations
- Data movement between register and memory
- Memory address calculation (addition/subtraction) is performed by ALU
- Logical/arithmetic operations
- Logical/arithmetic operation for designated data
- For add, sub, mul, div, and, or, and similar operations, the ALU is, of course, used!
- Control operations
- Change the content of program counter to change the execution point on the memory for branch and jump operations
- Address calculation is also needed by ALU.
- ALU is a key player in execution of any types of machine instructions
- Step 1: Instruction Fetch
- Read an instruction from the memory specified by program counter
- Step 2: Instruction Decode:
- Decode the instruction fetched
- Step 3: Execution (ALU operation)
- Take necessary actions based on the decoded instruction
- If the instruction is a load/store instruction, perform address calculation to obtain load/store memory address.
- If the instruction is a logic/arithmetic instruction, perform necessary operations for designated data
- If the instruction is a control instruction, perform address calculation to obtain the branch/jump target address
- Step 4: Memory access (only for load/store instructions)
- Store data to the memory, or load data from memory
- Step 5: Write result (only for logical/arithmetic&load instructions)
- Write the result of the ALU operation or the loaded data to the register file
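As a concrete illustration, the five steps above can be walked through for a single load instruction. The sketch below is hypothetical Python, not a real datapath model; the instruction encoding and memory layout are invented.

```python
# Hypothetical sketch: the five execution steps applied to a load
# instruction, e.g. "lw r1, 8(r2)". Encoding and addresses are illustrative.
def execute_load(mem, regs, pc):
    # Step 1: Instruction Fetch -- read the instruction at PC
    inst = mem[pc]                      # e.g. ("lw", 1, 2, 8)
    # Step 2: Instruction Decode -- identify operation and operands
    op, rt, rs, offset = inst
    assert op == "lw"
    # Step 3: Execution -- ALU computes the effective address
    addr = regs[rs] + offset
    # Step 4: Memory access -- load the data from memory
    data = mem[addr]
    # Step 5: Write result -- write the loaded value to the register
    regs[rt] = data
    return pc + 4                       # fixed 4-byte instructions

mem = {0: ("lw", 1, 2, 8), 108: 42}     # instruction at 0, data at 108
regs = {1: 0, 2: 100}
next_pc = execute_load(mem, regs, 0)
print(regs[1], next_pc)                 # 42 4
```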
- Key properties of GPRP instruction set
- All operations on data apply to data in registers and typically change the entire register
- The only operations that affect memory are load and store operations that move data from memory to a register or to memory from a register, respectively.
- The instruction formats are few in number with all instructions typically being one size
- Five steps to execute an instruction
- Instruction fetch cycle (IF)
- Instruction decode/register fetch cycle (ID)
- Execution/effective address cycle (EX)
- Memory access (MEM)
- Write-back cycle (WB)
- All the instructions should be processed in 5 Steps
- Remember 5 Quantitative Principles of Computer Design!
- Take Advantage of Parallelism
- Principle of Locality
- Focus on the Common Case
- Amdahl’s Law
- The Processor Performance Equation
Pipelining: A Mechanism to Increase the Throughput in General-Purpose Register Architecture Processors
- Multiple instructions are overlapped in execution
- Process of instruction execution is divided into two or more steps, called pipe stages or pipe segments, and
- Different stages complete different parts of different instructions in parallel
- The stages are connected one to the next to form a pipe
- Instructions enter at one end, progress through the stages, and exit at the other end
- It takes advantage of parallelism that exists among the actions needed to execute an instruction in a sequential instruction stream.
- Unlike some speedup techniques, it is not visible to the programmer/compiler.
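The overlap described above can be visualized by computing which stage each instruction occupies in each clock cycle. A small Python sketch, assuming an ideal 5-stage pipeline with no stalls:

```python
# Hypothetical sketch: which stage each instruction occupies per clock
# cycle in an ideal 5-stage pipeline with no stalls.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = []
        for c in range(total_cycles):
            s = c - i                 # stage index for instruction i at cycle c
            row.append(STAGES[s] if 0 <= s < len(STAGES) else "  ")
        rows.append(row)
    return rows

for row in pipeline_diagram(3):
    print(" ".join(f"{s:>3}" for s in row))
# After the pipeline fills, one instruction exits (completes WB) every cycle.
```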
- Throughput
- How often an instruction exits the pipeline
- Processor Cycle (Pipeline Cycle)
- The time required for moving an instruction one step down the pipeline
- Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage; the longest stage sets the time between advancing instructions down the line.
- Ideal time per instruction on the pipeline processor = $ \frac{ \text{Time per instruction on unpipelined machine}}{ \text{Number of pipe stages}} $
- Ideally, n times faster on n-stage pipeline, but usually the stages will not be perfectly balanced!
- In addition, Pipeline overhead: Latch delay and skew
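A small numeric illustration of why unbalanced stages and latch overhead keep the speedup below the stage count. All delay values below are assumed for illustration only:

```python
# Hypothetical numbers: stage delays (ns) for an unbalanced 5-stage
# pipeline, plus pipeline-register (latch) overhead per stage.
stage_delays = [200, 150, 250, 200, 100]   # ns, assumed values
latch_overhead = 20                        # ns per stage, assumed

unpipelined_time = sum(stage_delays)                  # one long cycle
pipelined_cycle = max(stage_delays) + latch_overhead  # slowest stage rules

actual_speedup = unpipelined_time / pipelined_cycle

print(unpipelined_time, pipelined_cycle)   # 900 270
print(round(actual_speedup, 2))            # 3.33, well below the ideal 5
```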
Example on slide 15
- Structural hazards
- Arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution
- Data hazards
- Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
- Control hazards
- Arise from the pipelining of branches and other instructions that change the PC
- Hazards in pipelines can make it necessary to stall the pipelines!
$ \text{Speedup from pipelining} = \frac{ \text{Average instruction time unpipelined} }{ \text{Average instruction time pipelined} } = \frac{ \text{CPI unpipelined} \times \text{Clock cycle unpipelined} }{ \text{CPI pipelined} \times \text{Clock cycle pipelined} } = \frac{ \text{CPI unpipelined} }{ \text{CPI pipelined} } \times \frac{ \text{Clock cycle unpipelined} }{ \text{Clock cycle pipelined} } $
Because the Ideal CPI on a pipelined processor is almost always 1,
$ \text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} = 1 + \text{Pipeline stall clock cycles per instruction} $
If the clock cycles of the pipelined and unpipelined versions are the same, and there is no pipeline overhead,
In the simple case, the unpipelined CPI is equal to the depth of the pipeline
$ \text{Speedup} = \frac{ \text{Pipeline depth} }{ 1 + \text{ Pipeline stall cycles per instruction} } $
If pipelining improves the clock cycle time,
$ \text{Speedup} = \frac{ \text{CPI unpipelined} }{ \text{CPI pipelined} } \times \frac{ \text{Clock cycle unpipelined} }{ \text{Clock cycle pipelined} } = \frac{ 1 }{ 1 + \text{Pipeline stall cycles per instruction} } \times \frac{ \text{Clock cycle unpipelined} }{ \text{Clock cycle pipelined} } $
In cases where the pipe stages are perfectly balanced and there is no overhead,
$ \text{Clock cycle pipelined} = \frac{ \text{Clock cycle unpipelined} }{ \text{Pipeline depth} } $
$ \text{Pipeline depth} = \frac{ \text{Clock cycle unpipelined} }{ \text{Clock cycle pipelined} } $
Finally,
$ \text{Speedup} = \frac{ 1 }{ 1 + \text{Pipeline stall cycles per instruction} } \times \frac{ \text{Clock cycle unpipelined} }{ \text{Clock cycle pipelined} } = \frac{ 1 }{ 1 + \text{ Pipeline stall cycles per instruction} } \times \text{Pipeline depth} $
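Plugging assumed numbers into the final formula shows how stalls erode the ideal speedup:

```python
# Worked example of the final speedup formula, with assumed numbers:
# a 5-stage pipeline where hazards add 0.4 stall cycles per instruction.
pipeline_depth = 5
stalls_per_instruction = 0.4     # assumed average

speedup = pipeline_depth / (1 + stalls_per_instruction)
print(round(speedup, 2))         # 3.57 instead of the ideal 5.0
```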
- Structural hazards occur when some combination of instructions cannot be accommodated because of resource conflicts.
- Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
- Examples:
- Read access in ID and write access in WB to the register file
- A single-memory pipeline for data and instructions
Example "How much the load structural hazard might cost?" on slide 24.
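The cost of a single-memory-port structural hazard can be estimated with a simple CPI model. The frequencies below are assumed for illustration, not taken from the slide:

```python
# Hypothetical cost estimate for the single-memory-port structural
# hazard: every load/store steals the memory port from instruction
# fetch for one cycle. Frequencies are assumed, not from the slides.
load_store_fraction = 0.4        # assumed fraction of instructions
stall_per_conflict = 1           # one-cycle bubble per load/store

cpi_two_ports = 1.0              # ideal CPI with separate I/D ports
cpi_one_port = 1.0 + load_store_fraction * stall_per_conflict

slowdown = cpi_one_port / cpi_two_ports
print(cpi_one_port, slowdown)    # 1.4 1.4
```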
- A processor without structural hazards will always have a lower CPI.
- Why would a designer allow structural hazards?
- Tradeoff between cost and performance gains
- Pipelining all the functional units, or duplicating them, may be too costly.
- Processors that support both an instruction and a data cache access every cycle require twice as much total memory bandwidth, and often higher bandwidth at the pins.
- Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.
Example:
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10,R1, R11
Operation Destination, Source1, Source2
- If the result can be moved from the pipeline register where the DADD stores it to where the DSUB needs it, the need for a stall can be avoided.
- Data forwarding (bypassing) mechanism:
- The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs
- If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
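The selection logic just described can be sketched as a comparison against the destination fields held in the EX/MEM and MEM/WB pipeline registers. This is a simplified, hypothetical Python sketch; real hardware also checks that the producing instruction actually writes a register:

```python
# Hypothetical sketch of forwarding control: compare the destination
# registers in EX/MEM and MEM/WB against a source of the instruction in EX.
def forward_select(src_reg, ex_mem_rd, mem_wb_rd):
    """Return which value should feed the ALU input for src_reg."""
    if ex_mem_rd is not None and ex_mem_rd == src_reg:
        return "EX/MEM.ALUOutput"     # most recent result wins
    if mem_wb_rd is not None and mem_wb_rd == src_reg:
        return "MEM/WB.ALUOutput"
    return "register file"            # no hazard: use the ID-stage read

# DADD R1,R2,R3 is in MEM (EX/MEM.rd = 1) when DSUB R4,R1,R5 reaches EX:
print(forward_select(src_reg=1, ex_mem_rd=1, mem_wb_rd=None))  # EX/MEM.ALUOutput
```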
Example:
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
CPI increases by the length of the stall
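The load-use case that forwarding cannot fix can be detected with a simple comparison, sketched below in hypothetical Python:

```python
# Hypothetical sketch of the load-use interlock: if the instruction in
# EX is a load whose destination matches a source of the instruction
# in ID, a one-cycle bubble must be inserted (forwarding cannot help,
# because the loaded value only exists after MEM).
def needs_load_use_stall(ex_inst, id_inst):
    if ex_inst is None or ex_inst["op"] != "load":
        return False
    return ex_inst["dest"] in id_inst["sources"]

ld   = {"op": "load", "dest": 1, "sources": [2]}    # LD   R1, 0(R2)
dsub = {"op": "alu",  "dest": 4, "sources": [1, 5]} # DSUB R4, R1, R5
print(needs_load_use_stall(ld, dsub))               # True
```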
- Execution of branch instructions may or may not change the PC to something other than its current execution sequence.
- Delayed Branch
- Branch instruction
- Sequential successor (branch delay slot)
- Simple (Non-Pipelined) Implementation of MIPS in 5 cycles
- Instruction Fetch (IF)
- IR←Mem[PC]
- NPC←PC+4
- Instruction decode/register fetch (ID)
- A←Regs[rs]
- B←Regs[rt]
- Imm←sign-extended immediate field of IR
- Execution/effective address calculation (EX)
- Memory reference
- ALUOutput ← A + Imm
- Register-Register ALU inst.
- ALUOutput ← A func B
- Register-Imm. ALU inst.
- ALUOutput ← A op Imm
- Branch
- ALUOutput ← NPC + (Imm << 2)
- Cond ← (A == 0)
- Simple (Non-Pipelined) Implementation (Cont’d)
- Memory access/branch completion (MEM)
- LMD ← Mem[ALUOutput] or
- Mem[ALUOutput] ← B
- Branch
- if (cond) PC ← ALUOutput
- Write-back (WB)
- Register-Register ALU inst.
- Regs[rd] ← ALUOutput
- Register-Imm. ALU inst.
- Regs[rt] ← ALUOutput
- Load Instruction
- Regs[rt] ← LMD
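The RTL above for a register-register ALU instruction can be mirrored in a short, hypothetical Python sketch using the same temporary names (A, B, ALUOutput):

```python
# Hypothetical sketch mirroring the RTL above for a register-register
# ALU instruction, using the same temporary names (A, B, ALUOutput).
import operator

def run_rr_alu(regs, inst):
    op, rd, rs, rt = inst                 # e.g. ("add", 3, 1, 2)
    funcs = {"add": operator.add, "sub": operator.sub}
    # ID: A <- Regs[rs]; B <- Regs[rt]
    A, B = regs[rs], regs[rt]
    # EX: ALUOutput <- A func B
    ALUOutput = funcs[op](A, B)
    # WB: Regs[rd] <- ALUOutput
    regs[rd] = ALUOutput

regs = {1: 7, 2: 5, 3: 0}
run_rr_alu(regs, ("add", 3, 1, 2))
print(regs[3])                            # 12
```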
Stage | Any Instruction | ALU Instruction | Load/Store Instruction | Branch Instruction
---|---|---|---|---
IF | IF/ID.IR ← Mem[PC]; IF/ID.NPC, PC ← (if ((EX/MEM.opcode == branch) & EX/MEM.cond) {EX/MEM.ALUOutput} else {PC+4}) | | |
ID | ID/EX.A ← Regs[IF/ID.IR[rs]]; ID/EX.B ← Regs[IF/ID.IR[rt]]; ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR; ID/EX.Imm ← sign-extend(IF/ID.IR[Immediate field]) | | |
EX | | EX/MEM.IR ← ID/EX.IR; EX/MEM.ALUOutput ← ID/EX.A func ID/EX.B; or EX/MEM.ALUOutput ← ID/EX.A op ID/EX.Imm | EX/MEM.IR ← ID/EX.IR; EX/MEM.ALUOutput ← ID/EX.A + ID/EX.Imm; EX/MEM.B ← ID/EX.B | EX/MEM.ALUOutput ← ID/EX.NPC + (ID/EX.Imm << 2); EX/MEM.cond ← (ID/EX.A == 0)
MEM | | MEM/WB.IR ← EX/MEM.IR; MEM/WB.ALUOutput ← EX/MEM.ALUOutput | MEM/WB.IR ← EX/MEM.IR; MEM/WB.LMD ← Mem[EX/MEM.ALUOutput]; or Mem[EX/MEM.ALUOutput] ← EX/MEM.B |
WB | | Regs[MEM/WB.IR[rd]] ← MEM/WB.ALUOutput; or Regs[MEM/WB.IR[rt]] ← MEM/WB.ALUOutput | For load only: Regs[MEM/WB.IR[rt]] ← MEM/WB.LMD |
Situation | Example Code Sequence | Action
---|---|---
No dependence | LD R1, 45(R2); DADD R5, R6, R7; DSUB R8, R6, R7; OR R9, R6, R7 | No hazard possible because no dependence exists on R1 in the immediately following three instructions.
Dependence requiring stall | LD R1, 45(R2); DADD R5, R1, R7; DSUB R8, R6, R7; OR R9, R6, R7 | Comparators detect the use of R1 in the DADD and stall the DADD (and DSUB and OR) before the DADD begins EX.
Dependence overcome by forwarding | LD R1, 45(R2); DADD R5, R6, R7; DSUB R8, R1, R7; OR R9, R6, R7 | Comparators detect the use of R1 in the DSUB and forward the result of the load to the ALU in time for the DSUB to begin EX.
Dependence with accesses in order | LD R1, 45(R2); DADD R5, R6, R7; DSUB R8, R6, R7; OR R9, R1, R7 | No action required because the read of R1 by the OR occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half.
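The four situations in the table can be summarized as a small decision procedure (hypothetical Python; register operands are given as plain lists):

```python
# Hypothetical classifier for the four load-hazard situations: given a
# load's destination and the source registers of the next three
# instructions, decide what the hazard hardware must do.
def classify(load_dest, next_sources):
    # next_sources[i] = source registers of the i-th following instruction
    if load_dest in next_sources[0]:
        return "stall"           # value not ready until after MEM
    if load_dest in next_sources[1]:
        return "forward"         # load result forwarded to EX in time
    if load_dest in next_sources[2]:
        return "no action"       # write in first half of WB,
                                 # read in second half of ID
    return "no dependence"

# LD R1, 45(R2) followed by DSUB R8, R1, R7 in the second slot:
print(classify(1, [[6, 7], [1, 7], [6, 7]]))   # forward
```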
Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
---|---|---|---|---|---|---|---|---|---|---|---
MUL.D | IF | ID | M1 | M2 | M3 | M4 | M5 | M6 | M7 | MEM | WB
ADD.D | | IF | ID | A1 | A2 | A3 | A4 | MEM | WB | |
L.D | | | IF | ID | EX | MEM | WB | | | |
S.D | | | | IF | ID | EX | MEM | WB | | |
Bold: where data are needed
Italic: where a result is available
- Because the divide unit is not fully pipelined, structural hazards can occur.
- These will need to be detected and issuing instructions will need to be stalled.
- Because the instructions have varying running times, the number of register writes required in a cycle can be larger than 1.
- WAW (Write After Write) hazards are possible, since instructions no longer reach WB in order.
- Note that WAR (Write After Read) hazards are not possible, since the register reads always occur in ID
- Instructions can complete in a different order than they were issued, causing problems with exceptions
- Because of longer latency of operations, stalls for RAW (Read After Write) hazards will be more frequent.
- Check for structural hazards
- Wait until the required functional unit is not busy and make sure the register write port is available when it will be needed.
- Check for a RAW data hazard
- Wait until the source registers of the instruction in ID are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result.
- Check for a WAW data hazard
- Determine if any instruction in A1, ..., A4, D, M1, ..., M7 has the same register destination as this instruction. If so, stall the issue of the instruction in ID.
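The WAW check just described can be sketched as follows (hypothetical Python; stage names follow the multicycle FP pipeline above):

```python
# Hypothetical sketch of the WAW check at issue time: stall the
# instruction in ID if any instruction already in the FP pipeline
# stages (A1..A4, D, M1..M7) writes the same destination register.
def waw_hazard(id_dest, in_flight):
    """in_flight: list of (stage, dest_reg) for issued FP instructions."""
    fp_stages = {"A1", "A2", "A3", "A4", "D",
                 "M1", "M2", "M3", "M4", "M5", "M6", "M7"}
    return any(stage in fp_stages and dest == id_dest
               for stage, dest in in_flight)

# MUL.D writing F0 is in M3 when another instruction targeting F0 reaches ID:
print(waw_hazard(0, [("M3", 0), ("A2", 4)]))   # True
```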
- A deeper pipeline to achieve a higher clock rate
- Superpipelining:
- IF: First half of Instruction Fetch (PC selection, initial cache access)
- IS: Second half of Instruction Fetch (cache access completed)
- RF: Instruction Decode and Register Fetch, hazard checking, I$ hit detection
- EX: Execution, effective address calculation, ALU operation, branch target computation & condition evaluation
- DF: 1st half of Data Fetch (D$ access started)
- DS: 2nd half of data fetch
- TC: Tag check, determine whether D$ hit
- WB: Write back
- Pitfall: Unexpected execution sequences may cause unexpected hazards
- Ex.
      BNEZ  R1, foo
      DIV.D F0, F2, F4   ; moved into delay slot
                         ; from fall through
      ...
  foo: L.D  F0, qrs
- Pitfall: Extensive pipelining can impact other aspects of a design, leading to overall worse cost-performance
- Ex. VAX8600/8650/8700 implementations
- A simple implementation of 8700 realizes a smaller CPU with a faster clock cycle (but a bit longer CPI), resulting in the same performance with much less hardware.
- Pitfall: Evaluating dynamic or static scheduling on the basis of unoptimized code
- Unoptimized code is much easier to schedule than “tight” optimized code.
- Pipelining is a simple and easy way to improve performance.
- To maximize the effect of pipelining, avoid hazards that make the pipeline stall
- Enrich hardware resources to reduce resource conflicts
- Data forwarding between pipeline registers
- Fill the delay slots of load and branch with independent instructions
- Instruction reordering
- Branch prediction/speculation