Commit 09b7c2b

Added concluding remarks.
1 parent 7373542 commit 09b7c2b

2 files changed: +78 -6 lines changed

FemtoRV/TUTORIALS/FROM_BLINKER_TO_RISCV/PIPELINE.md

Lines changed: 77 additions & 5 deletions
@@ -2210,7 +2210,7 @@ to `DIV` instructions), then press `<return>` to run cycle by cycle.
22102210
The displayed division mask shows progress.
22112211

22122212

2213-
## Step X: optimizing for fmax
2213+
## Step 10: optimizing for fmax
22142214

22152215
So we have seen that we have gained something regarding CPI, but where do
22162216
we stand for fmax (and also LUTs and FFs) ? Here are the values for
@@ -2228,7 +2228,7 @@ the ULX3S:
22282228
- the "sequential pipeline" validates at 80 MHz. Then we see the
22292229
benefit of having simple stages
22302230
- Fmax quickly drops when pipeline control logic is added (pipeline4)
2231-
- ... but it gets higher with pipeline5 and pipeline5_vis that use
2231+
- ... but it gets higher with pipeline5 and pipeline5_bis that use
22322232
a combinatorial register file (emulated or real) where a written
22332233
value can be read in the same cycle (hence pipeline control logic
22342234
is simpler)
@@ -2237,9 +2237,11 @@ the ULX3S:
22372237
But we made no effort optimizing Fmax, our goal up to now was mainly to
22382238
reduce CPI. Let us see what we can do now.
22392239

2240+
There are several things that we can do:
2241+
22402242
### Read `DATARAM` in `E` stage, sign-extend and align in `M`
22412243

2242-
_WIP_
2244+
Splitting memory operations over multiple stages reduces the critical path.
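
To make the second half of that split concrete, here is a minimal software sketch in C of the align and sign-extend step that would sit in `M`, once the raw 32-bit word has been read from `DATARAM` in `E`. The function name `load_align_and_extend` and its parameters are hypothetical and this is only a model of the logic, not the tutorial's Verilog. It assumes the standard RISC-V `funct3` encodings for loads (`LB`=0, `LH`=1, `LW`=2, `LBU`=4, `LHU`=5) and little-endian byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the M-stage step.
 * word   : raw 32-bit value read from DATARAM in E
 * addr   : byte address of the load (only the two low bits are used)
 * funct3 : RISC-V load funct3 (LB=0, LH=1, LW=2, LBU=4, LHU=5) */
uint32_t load_align_and_extend(uint32_t word, uint32_t addr, uint32_t funct3) {
    uint32_t size = funct3 & 3u;                 /* 0: byte, 1: halfword, 2: word */
    int is_unsigned = (int)((funct3 >> 2) & 1u); /* LBU and LHU have bit 2 set */

    if (size == 0u) {
        /* Select one of the four bytes, then sign-extend if needed. */
        uint32_t b = (word >> (8u * (addr & 3u))) & 0xFFu;
        if (!is_unsigned && (b & 0x80u)) b |= 0xFFFFFF00u;
        return b;
    }
    if (size == 1u) {
        /* Select the lower or upper halfword, then sign-extend if needed. */
        uint32_t h = (word >> (16u * ((addr >> 1) & 1u))) & 0xFFFFu;
        if (!is_unsigned && (h & 0x8000u)) h |= 0xFFFF0000u;
        return h;
    }
    return word; /* LW: nothing to align */
}

/* Usage example: bytes at addresses 0..3 are 01 7F FF 80. */
void demo_load_align(void) {
    uint32_t word = 0x80FF7F01u;
    assert(load_align_and_extend(word, 0, 0) == 0x00000001u); /* LB  */
    assert(load_align_and_extend(word, 2, 0) == 0xFFFFFFFFu); /* LB  */
    assert(load_align_and_extend(word, 2, 4) == 0x000000FFu); /* LBU */
    assert(load_align_and_extend(word, 2, 1) == 0xFFFF80FFu); /* LH  */
    assert(load_align_and_extend(word, 2, 5) == 0x000080FFu); /* LHU */
    assert(load_align_and_extend(word, 0, 2) == word);        /* LW  */
}
```

In hardware, the shifts and the sign extension are muxes that would otherwise sit right behind the RAM read, so moving them to the next stage is what shortens the path.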
22432245

22442246
### `wbEn` register pipeline
22452247

@@ -2280,10 +2282,80 @@ do the `!isStore(I) & !isBranch(I)` test, then the execute test will test that
22802282
`rdId` is not 0.
22812283

22822284
### `D` decodes, and remove `instr` and `PC` from subsequent stages
2283-
_WIP_
2285+
2286+
We propagate `instr` from stage to stage, and decode it each time we
2287+
need it. This is suboptimal: it is better to recognize the different
2288+
instructions in `D`, and propagate `is_xxxx` flags that recognize
2289+
each instruction.
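
As an illustration, here is a small C model of what "decode once in `D`" could look like. The struct `DecodedFlags`, the function `decode_in_D` and the helper `writes_rd` are hypothetical names, and the real design is Verilog, so take this as a sketch of the idea only. It assumes the standard RV32I opcodes and decodes `instr[6:2]`.

```c
#include <assert.h>
#include <stdint.h>

/* One flag per instruction class; the field names are hypothetical. */
typedef struct {
    unsigned is_alu_reg, is_alu_imm, is_branch, is_jalr, is_jal;
    unsigned is_auipc, is_lui, is_load, is_store, is_system;
} DecodedFlags;

/* Done once, in D. Later stages only look at the flags. */
DecodedFlags decode_in_D(uint32_t instr) {
    uint32_t op = (instr >> 2) & 0x1Fu;  /* instr[6:2] */
    DecodedFlags f;
    f.is_alu_reg = (op == 0x0Cu);  /* 01100 */
    f.is_alu_imm = (op == 0x04u);  /* 00100 */
    f.is_branch  = (op == 0x18u);  /* 11000 */
    f.is_jalr    = (op == 0x19u);  /* 11001 */
    f.is_jal     = (op == 0x1Bu);  /* 11011 */
    f.is_auipc   = (op == 0x05u);  /* 00101 */
    f.is_lui     = (op == 0x0Du);  /* 01101 */
    f.is_load    = (op == 0x00u);  /* 00000 */
    f.is_store   = (op == 0x08u);  /* 01000 */
    f.is_system  = (op == 0x1Cu);  /* 11100 */
    return f;
}

/* Example of a later-stage test that no longer needs instr at all. */
int writes_rd(const DecodedFlags *f) {
    return !f->is_store && !f->is_branch;
}

void demo_decode(void) {
    DecodedFlags addi = decode_in_D(0x00000513u); /* addi a0, zero, 0 */
    DecodedFlags jal  = decode_in_D(0x0000006Fu); /* jal  zero, 0     */
    DecodedFlags ret  = decode_in_D(0x00008067u); /* jalr zero, 0(ra) */
    assert(addi.is_alu_imm && writes_rd(&addi));
    assert(jal.is_jal && !jal.is_jalr);
    assert(ret.is_jalr && !ret.is_jal);
}
```

Once the flags travel down the pipeline, `instr` itself (32 flipflops per stage) no longer needs to be carried past `D`.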
22842290

22852291
### Careful optimizations in `D`
2286-
_WIP_
2292+
2293+
The choice for the bit pattern encoding of the 10 main instruction classes
2294+
is not arbitrary: it was designed with efficiency in mind. The most obvious
2295+
thing is the two least significant bits, which are always `11`, so one does not need
2296+
to test, but there is also some more subtle structure:
2297+
2298+
For instance,
2299+
`JAL` is the only instruction that has its bit 3 set, so `D` does not need
2300+
to test the other bits. It is interesting, because it simplifies the PC
2301+
prediction logic.
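
The claim about bit 3 is easy to check with a few lines of C. The sketch below lists the 7-bit opcodes of the ten instruction classes (standard RV32I values) and compares a full 7-bit test for `JAL` with the single-bit test. The names `MAIN_OPCODES`, `is_jal_full`, `is_jal_bit3` and `bit3_test_is_exact` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 7-bit opcodes of the ten main instruction classes. */
static const uint32_t MAIN_OPCODES[10] = {
    0x33u, /* ALU register-register */
    0x13u, /* ALU register-immediate */
    0x63u, /* Branch */
    0x67u, /* JALR   */
    0x6Fu, /* JAL    */
    0x17u, /* AUIPC  */
    0x37u, /* LUI    */
    0x03u, /* Load   */
    0x23u, /* Store  */
    0x73u  /* SYSTEM */
};

/* Full test: compare all seven opcode bits. */
int is_jal_full(uint32_t instr) { return (instr & 0x7Fu) == 0x6Fu; }

/* Cheap test: look at bit 3 only. */
int is_jal_bit3(uint32_t instr) { return (int)((instr >> 3) & 1u); }

/* Returns 1 if both tests agree on all ten classes. */
int bit3_test_is_exact(void) {
    for (int i = 0; i < 10; i++) {
        if (is_jal_full(MAIN_OPCODES[i]) != is_jal_bit3(MAIN_OPCODES[i])) {
            return 0;
        }
    }
    return 1;
}

void demo_jal_bit3(void) {
    assert(bit3_test_is_exact());
}
```

Note that the shortcut is only valid within these ten classes: `FENCE` (opcode `0001111`), for instance, also has bit 3 set, so a core that decodes it needs an additional test.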
2302+
2303+
### Pipelined register Id comparator for register forwarding
2304+
2305+
The two three-way-muxes at the beginning of `E` are driven by comparisons between
2306+
the source register ids (`rs1Id` and `rs2Id`) and the destination register
2307+
Id (`rdId`) in (`DE`,`EM`) and (`DE`,`MW`). The result of these tests can be
2308+
computed one cycle in advance in `D` and stored in 4 flipflops
2309+
`DE_rs1Id_eq_EM_rdId`, `DE_rs1Id_eq_MW_rdId`, `DE_rs2Id_eq_EM_rdId` and
2310+
`DE_rs2Id_eq_MW_rdId`.
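
Why is it legal to compute these comparisons one cycle early? The C model below is a sketch under simplifying assumptions: no stalls, no flushes, and no `rdId != 0` or `wbEn` test. The instruction that will sit in `E` at the next cycle is the one in `D` now, and the ones that will sit in `M` and `W` are the ones in `E` and `M` now. The four flag names come from the text; `Ids`, `Pipe`, `pipe_clock` and `flags_match_direct_compare` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t rs1Id, rs2Id, rdId; } Ids;

typedef struct {
    Ids FD, DE, EM, MW;  /* register ids held in each pipeline register */
    int DE_rs1Id_eq_EM_rdId, DE_rs1Id_eq_MW_rdId;
    int DE_rs2Id_eq_EM_rdId, DE_rs2Id_eq_MW_rdId;
} Pipe;

/* One clock edge. The comparisons use the values before the edge. */
void pipe_clock(Pipe *p, Ids fetched) {
    /* In D: compare with what will be in EM and MW after the edge. */
    p->DE_rs1Id_eq_EM_rdId = (p->FD.rs1Id == p->DE.rdId);
    p->DE_rs1Id_eq_MW_rdId = (p->FD.rs1Id == p->EM.rdId);
    p->DE_rs2Id_eq_EM_rdId = (p->FD.rs2Id == p->DE.rdId);
    p->DE_rs2Id_eq_MW_rdId = (p->FD.rs2Id == p->EM.rdId);
    /* Everything moves one stage forward. */
    p->MW = p->EM;
    p->EM = p->DE;
    p->DE = p->FD;
    p->FD = fetched;
}

/* Returns 1 if the stored flags equal the comparisons E would have done. */
int flags_match_direct_compare(const Pipe *p) {
    return p->DE_rs1Id_eq_EM_rdId == (p->DE.rs1Id == p->EM.rdId)
        && p->DE_rs1Id_eq_MW_rdId == (p->DE.rs1Id == p->MW.rdId)
        && p->DE_rs2Id_eq_EM_rdId == (p->DE.rs2Id == p->EM.rdId)
        && p->DE_rs2Id_eq_MW_rdId == (p->DE.rs2Id == p->MW.rdId);
}

void demo_forwarding(void) {
    Pipe p = {0};
    /* {rs1Id, rs2Id, rdId}: each instruction reads the previous result. */
    Ids prog[5] = { {1, 2, 3}, {3, 1, 4}, {4, 3, 5}, {5, 4, 3}, {0, 0, 0} };
    for (int i = 0; i < 5; i++) {
        pipe_clock(&p, prog[i]);
        assert(flags_match_direct_compare(&p));
    }
}
```

With this, the three-way muxes in `E` are driven directly by flipflops instead of by 5-bit comparators, which takes the comparators off the critical path of `E`. A real pipeline would also have to keep the flags consistent when a stage is stalled or flushed, which this model ignores.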
2311+
2312+
The "final product", called `TordBoyau`,
2313+
is available in [this project](https://github.com/BrunoLevy/TordBoyau). On an ARTY
2314+
using Vivado, it validates at 100-125 MHz (and can be successfully overclocked up
2315+
to 140 MHz).
2316+
- With the RV32I configuration, the raytracing test achieves 7.375 raystones
2317+
(and 1.092 CPI, gshare + RAS works very well !).
2318+
- With the RV32IM configuration, it reaches 18.215 raystones.
2319+
2320+
## Epilogue
2321+
2322+
Hope you enjoyed this series. There are many other topics to study, and I will prepare
2323+
(one day) tutorials on the following topics (as soon as I understand them !)
2324+
2325+
- **optimizations**: there are several things we can do to make our processor even faster.
2326+
First thing is `STORE`->`LOAD` register forwarding, to make `memcpy()` run at 1 cycle
2327+
per word. Second thing is with the `RV32IM` version of `TordBoyau` that validates around
2328+
80 MHz (whereas it can be safely overclocked up to 140 MHz), so there are probably some
2329+
false paths that need to be eliminated.
2330+
2331+
- **cache**: for now, our pipelined processor has a `PROGROM` and a `DATARAM`. We should plug
2332+
on them a cache interface, connected to an SDRAM controller, that fetches and stores data
2333+
to the SDRAM as needed. I plan to develop that in the LiteX system, that has super complete
2334+
functionalities (SDRam controller, frame buffer etc...). Then we will be able to run DOOM
2335+
on our pipelined processor (DOOM already runs with the core that we built in Episode I because
2336+
LiteX folks already interfaced it to their cache controller).
2337+
2338+
- **interrupts**: the RISC-V ISA has several chapters. There is a "privileged ISA", with
2339+
special registers and instructions to control interrupts and traps. However, the official
2340+
documentation of the ISA is very difficult to read (I think), because it lists every
2341+
possibility. I plan to write a small description of the bare minimum needed to run
2342+
for instance Linux-noMMU.
2343+
2344+
- **MMU**: talking about MMU, it is already an interesting topic. @ultraembedded told me it
2345+
is super simple to add.
2346+
2347+
- **Out-of-order**: at the end of this episode, thanks to the combined effect of
2348+
gshare branch prediction and the return address stack, we achieved a CPI of 1.092,
2349+
which is very near the "speed of light" (1 CPI). In fact, it is possible to design
2350+
"faster than light" processors, with several execution units, that pick instructions
2351+
and execute them out of order (OoO), executing **several instructions per cycle**.
2352+
There are several cores that do that, such as NaxRiscV by @dolu1990 and
2353+
BiRiscV by @ultraembedded. The micro-architecture is completely different, still
2354+
a set of pipelines, but organized as a tree, with a "central authority" that
2355+
routes instructions (Tomasulo algorithm). Would be great to have a generic
2356+
design, where one could create his own tree of pipelines, and have code to
2357+
automatically generate the "central authority". A framework like LiteX could
2358+
be used for that.
22872359

22882360
## References
22892361

LiteX/software/DemoBundle/demos/oled_test.c

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
66
#include <libbase/console.h>
77

88
static void oled_test(int nb_args, char** args) {
9-
uint32_t frame;
9+
uint32_t frame=0;
1010
puts("Press any key to exit");
1111
oled_init();
1212
oled_write_window(0,0,OLED_WIDTH-1,OLED_HEIGHT-1);
