Skip to content

Anand-786/tomasulo-algorithm-implementation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

81 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

👉 Related project : Victim Cache Analysis in gem5

Tomasulo Simulator

An Out-of-Order CPU simulator implementing Tomasulo’s Algorithm with LSQ, ROB, L1 Data Cache, and Victim Cache. Features configurable parameters, cycle-by-cycle tracing, statistics generation, and Gantt chart visualization. Designed for education and computer architecture research.

Project Showcase

Simulator Startup Output

Gantt Chart Visualization Cycle-by-Cycle Trace


System Architecture

System Architecture Diagram

Source: ReViCe: Reusing Victim Cache to Prevent Speculative Cache Leakage - Scientific Figure on ResearchGate

Components and Features

Component / Feature Config Parameter(s)
Reorder Buffer (ROB) rob_size
Load/Store Queue (LSQ) lsq_size
L1 Data Cache num_sets, associativity, block_size
Victim Cache victim_cache_size, vc_access_latency
ALU RS num_alu_rs, alu_latency
MUL RS num_mul_rs, mul_latency
DIV RS num_div_rs, div_latency
Memory Subsystem mem_fetch_latency, wb_latency
Program Input program_file_path
Cycle-by-Cycle Trace trace_file_enabled
Gantt Chart gantt_chart_enabled
Statistics Generation (auto-generated)
Register File (fixed)

Experiments & Results

I performed 3 main experiments on the Simulator. Here are their objectives, setup, results and key takeaways.

1. Impact of ROB Size vs. IPC

Objective:

To study how the benefit of increasing the Re-Order Buffer (ROB) size changes under workloads with different instruction latencies.

Experimental Setup:

Parameter Value
MUL Latency 1 cycle(low), 5 cycles(high)
Reservation Stations 32
L1-D size 32 KB
#iterations 100

Key Takeaway:

When instruction latencies are high (e.g., long MUL operations), a larger ROB allows more overlap and better latency hiding, leading to clear performance gains. For low-latency instructions, however, the performance improvement saturates quickly, and increasing the ROB further offers little additional benefit.


2. Victim Cache Effectiveness and Thrashing Point

Objective:

To evaluate the effectiveness of a Victim Cache (VC) in reducing conflict misses penalties and to identify the point at which it stops being useful.

Experimental Setup:

Parameter Value
L1 D-Cache 4KB, Direct-Mapped, 64B Block
Victim Cache 4 Entries
Mem Access Penalty 20 cycles
VC Hit Penalty 3 cycles

Key Takeaway:

The VC substantially reduced the Average Memory Access Time (AMAT) up to the point where the number of conflicting addresses exceeded its capacity (around N=6). Beyond this point, thrashing occurred and the benefit disappeared. This indicates that the VC works well in mitigating conflict misses, but only within the limits of its small storage capacity.


3. LSQ Effectiveness: Store-to-Load Forwarding

Objective:

To test whether the Load-Store Queue (LSQ) correctly performs Store-to-Load Forwarding (STLF), using a workload with explicit memory dependencies.

Experimental Setup:

Parameter Value
L1 D-Cache 1KB, Direct-Mapped, 32B Block
LSQ Size 32 Entries
Victim Cache Disabled
STLF Enabled

Key Takeaway:

With STLF enabled, dependent loads were able to receive data directly from earlier stores, avoiding cache access delays. This resulted in a 2.93× increase in IPC compared to the baseline workload. The result shows that STLF is not just a correctness feature but also an important optimization for workloads with memory-based dependencies.


Usage

For full setup and run instructions, see Usage Guide.

Sample Outputs

Below is an example of the simulator’s terminal output when starting a run :



    $$$$$$$$\  $$$$$$\  $$\      $$\  $$$$$$\   $$$$$$\  $$\   $$\ $$\       $$$$$$\         $$$$$$\  $$$$$$\ $$\      $$\ $$\   $$\ $$\        $$$$$$\ $$$$$$$$\  $$$$$$\  $$$$$$$\ 
    \__$$  __|$$  __$$\ $$$\    $$$ |$$  __$$\ $$  __$$\ $$ |  $$ |$$ |     $$  __$$\       $$  __$$\ \_$$  _|$$$\    $$$ |$$ |  $$ |$$ |      $$  __$$\\__$$  __|$$  __$$\ $$  __$$\ 
       $$ |   $$ /  $$ |$$$$\  $$$$ |$$ /  $$ |$$ /  \__|$$ |  $$ |$$ |     $$ /  $$ |      $$ /  \__|  $$ |  $$$$\  $$$$ |$$ |  $$ |$$ |      $$ /  $$ |  $$ |   $$ /  $$ |$$ |  $$ |
       $$ |   $$ |  $$ |$$\$$\$$ $$ |$$$$$$$$ |\$$$$$$\  $$ |  $$ |$$ |     $$ |  $$ |      \$$$$$$\    $$ |  $$\$$\$$ $$ |$$ |  $$ |$$ |      $$$$$$$$ |  $$ |   $$ |  $$ |$$$$$$$  |
       $$ |   $$ |  $$ |$$ \$$$  $$ |$$  __$$ | \____$$\ $$ |  $$ |$$ |     $$ |  $$ |       \____$$\   $$ |  $$ \$$$  $$ |$$ |  $$ |$$ |      $$  __$$ |  $$ |   $$ |  $$ |$$  __$$< 
       $$ |   $$ |  $$ |$$ |\$  /$$ |$$ |  $$ |$$\   $$ |$$ |  $$ |$$ |     $$ |  $$ |      $$\   $$ |  $$ |  $$ |\$  /$$ |$$ |  $$ |$$ |      $$ |  $$ |  $$ |   $$ |  $$ |$$ |  $$ |
       $$ |    $$$$$$  |$$ | \_/ $$ |$$ |  $$ |\$$$$$$  |\$$$$$$  |$$$$$$$$\ $$$$$$  |      \$$$$$$  |$$$$$$\ $$ | \_/ $$ |\$$$$$$  |$$$$$$$$\ $$ |  $$ |  $$ |    $$$$$$  |$$ |  $$ |
       \__|    \______/ \__|     \__|\__|  \__| \______/  \______/ \________|\______/        \______/ \______|\__|     \__| \______/ \________|\__|  \__|  \__|    \______/ \__|  \__|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                       =========================[ An Out-of-Order CPU Simulator based on Tomasulo's Algorithm with L1 & Victim Cache ]=========================


       
 [ CONFIGURATION ]
 +------------------------------------------------------------+
 | > L1 Cache Properties :                                    |
 |     - Size                  : 4 KB                         |
 |     - #Sets                 : 32                           |
 |     - Associativity         : 2-way                        |
 |     - Block Size            : 64 B                         |
 |     - Physical Address      : 32 bits                      |
 |     - Victim Cache Size     : 8                            |
 |     - Victim Cache Access   : 3 cycles                     |
 |     - Memory Fetch Latency  : 10 cycles                    |
 |     - Writeback Latency     : 5 cycles                     |
 |                                                            |
 | > Simulator Properties :                                   |
 |     - Program File          : program.asm                  |
 |     - #Iterations           : 1                            |
 |     - Victim Cache          : Enabled                      |
 |     - Logs                  : Enabled                      |
 |     - Gantt Chart           : Enabled                      |
 |     - Show Status Table     : On                           |
 |     - ROB entries           : 16                           |
 |     - LSQ entries           : 8                            |
 |     - Reservation Stations  : 16 (ALU), 8 (MUL), 4 (DIV)   |
 |     - Latencies (cycles)    : 1 (ALU), 5 (MUL), 10 (DIV)   |
 +------------------------------------------------------------+


 [ INFO ] Executing Program...
 [ DONE ] Simulation Completed Successfully.


 [ FINAL INSTRUCTION STATUS ]
 +-------------------------------------------------------------------------------------+
 |  Instruction               | Issue  | ExecStart  | ReadMem  | WriteCDB  | Commit    | 
 |-------------------------------------------------------------------------------------| 
 |  LOAD  R2, 0(R10)          | 1      | 2          | 12       | 13        | 14        | 
 |  LOAD  R3, 0(R12)          | 2      | 3          | 13       | 14        | 15        | 
 |  ADD   R4, R10, R11        | 3      | 4          | -        | 5         | 16        | 
 |  MUL   R5, R12, R13        | 4      | 5          | -        | 10        | 17        | 
 |  LOAD  R6, 0(R14)          | 5      | 6          | 14       | 15        | 18        | 
 |  LOAD  R7, 0(R16)          | 6      | 7          | 15       | 16        | 19        | 
 |  ADD   R8, R14, R15        | 7      | 8          | -        | 9         | 20        | 
 |  MUL   R9, R16, R17        | 8      | 10         | -        | 17        | 21        | 
 |  LOAD  R18, 0(R18)         | 9      | 10         | 16       | 18        | 22        | 
 |  LOAD  R19, 0(R20)         | 10     | 11         | 17       | 19        | 23        | 
 |  ADD   R20, R18, R19       | 11     | 20         | -        | 21        | 24        | 
 |  MUL   R21, R20, R21       | 12     | 22         | -        | 27        | 28        | 
 |  LOAD  R22, 0(R22)         | 13     | 14         | 18       | 20        | 29        | 
 |  LOAD  R23, 0(R24)         | 14     | 15         | 19       | 22        | 30        | 
 |  ADD   R24, R22, R23       | 15     | 23         | -        | 24        | 31        | 
 |  MUL   R25, R24, R25       | 16     | 27         | -        | 32        | 33        | 
 |  BEQ   R1, R0, LOOP        | 17     | 18         | -        | -         | 34        | 
 +-------------------------------------------------------------------------------------+


 [ SIMULATION SUMMARY ]
 +----------------------------------------------------------+
 |  - Total Execution Cycles          : 34                  |
 |  - Total Instructions Committed    : 17                  |
 |  - IPC (Instructions/Cycle)        : 0.50                |
 |  - Wall-Clock Time                 : 5.00 ms             |
 |  - Avg. Memory Access Time[AMAT]   : 2.12 cycles         |
 |  - L1 Cache Miss Rate              : 12.50 %             |
 |  - Avg. L1 Miss Penalty            : 9.00 cycles         |
 |  - Avg. Instruction Latency        : 14.18 cycles        |
 +----------------------------------------------------------+


 [ SIMULATION OUTPUT FILES ]
 +----------------------------------------------------------+
 |  > Detailed trace log:      results/trace.log            |
 |  > Full statistics report:  results/stats.txt            |
 +----------------------------------------------------------+



Gantt Chart Visualization

The simulator generates a Gantt chart to visualize instruction progress through pipeline stages.

Sample Gantt Chart

Each row corresponds to an instruction, and colored bars denote its active stage per cycle, making pipeline overlap and stalls easy to inspect.

Trace Log (Excerpt)

Below is a snippet from the generated trace.log file, showing the cycle-by-cycle execution of instructions :

Integer-ALU Reservation Stations
Busy  | Op      | Qj      | Qk      | Vj      | Vk      | Dest  
-----------------------------------------------------------------

Multiplier Reservation Stations
Busy  | Op      | Qj      | Qk      | Vj      | Vk      | Dest  
-----------------------------------------------------------------
Yes   | MUL     | ROB 14  | -       | -       | 0       | ROB 15

Divider Reservation Stations
Busy  | Op      | Qj      | Qk      | Vj      | Vk      | Dest  
-----------------------------------------------------------------

Reorder Buffer [ROB : (Head => Tail) OR (Oldest => Newest)]
Entry   | Type    | Dest      | Value     | Done  
--------------------------------------------------
ROB 10  | ADD     | R20       | 0         | Yes   
ROB 11  | MUL     | R21       | -         | No    
ROB 12  | LOAD    | R22       | 0         | Yes   
ROB 13  | LOAD    | R23       | 0         | Yes   
ROB 14  | ADD     | R24       | -         | No    
ROB 15  | MUL     | R25       | -         | No    
ROB 0   |         | R0        | 0         | Yes   

Load/Store Queue [LSQ : (Head => Tail) OR (Oldest => Newest)]
Entry   | Type    | EA    | Value     | Dest      | Offset   | Base Reg Value    | Ready   | CDB Write   
---------------------------------------------------------------------------------------------------------
LSQ 6   | LOAD    | 0     | 0         | ROB 12    | 0        | 0                 | YES     | DONE        
LSQ 7   | LOAD    | 0     | 0         | ROB 13    | 0        | 0                 | YES     | DONE        

Register File & Status
 R0     | R1     | R2     | R3     | R4     | R5     | R6     | R7     | R8     
--------------------------------------------------------------------------------
 Ready  | Ready  | Ready  | Ready  | Ready  | Ready  | Ready  | Ready  | Ready  
 0      | 0      | 0      | 0      | 0      | 0      | 0      | 0      | 0      

 Memory State
-

----------------------------------------------------------------------------------------------- Cycle 24 ---


Future Work

Several directions can extend this simulator for deeper exploration:

  • Branch Prediction: Add static/dynamic prediction schemes and study misprediction penalties.
  • Multi-Core Support: Extend the simulator to model multi-core with a shared cache hierarchy and coherence.
  • Visualization Enhancements: Web-based dashboards for pipeline analysis.
  • Superscalar: Add multi-issue and multiple CDB write support for increasing IPC>1.