@@ -2210,7 +2210,7 @@ to `DIV` instructions), then press `<return>` to run cycle by cycle.
The displayed division mask shows progress.
- ## Step X : optimizing for fmax
+ ## Step 10 : optimizing for fmax
So we have seen that we have gained something regarding CPI, but where do
we stand for fmax (and also LUTs and FFs)? Here are the values for
@@ -2228,7 +2228,7 @@ the ULX3S:
- the "sequential pipeline" validates at 80 MHz. Then we see the
benefit of having simple stages
- Fmax quickly drops when pipeline control logic is added (pipeline4)
- - ... but it gets higher with pipeline5 and pipeline5_vis that use
+ - ... but it gets higher with pipeline5 and pipeline5_bis that use
a combinatorial register file (emulated or real) where a written
value can be read in the same cycle (hence pipeline control logic
is simpler)
@@ -2237,9 +2237,11 @@ the ULX3S:
But we made no effort to optimize Fmax; our goal up to now was mainly to
reduce CPI. Let us see what we can do now.
+ There are several things that we can do:
+
### Read ` DATARAM ` in ` E ` stage, sign-extend and align in ` M `
- _ WIP _
+ Reading ` DATARAM ` in ` E ` and doing the sign extension and alignment in ` M ` splits
+ the load datapath over two stages, which shortens the critical path.
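+
+ Here is a minimal sketch of the idea, with illustrative signal names (not
+ necessarily the ones used in the actual sources):
+
+ ```verilog
+ // The RAM read is launched with the address computed in E; the raw word is
+ // captured in the EM pipeline register, so only the cheap byte/halfword
+ // selection and sign-extension remain on M's critical path.
+ reg [31:0] EM_Mdata;
+ always @(posedge clk) begin
+    EM_Mdata <= DATARAM[E_word_addr];   // synchronous read during E
+ end
+
+ // M stage: align and sign-extend the registered word.
+ wire [15:0] M_half = EM_addr[1] ? EM_Mdata[31:16] : EM_Mdata[15:0];
+ wire  [7:0] M_byte = EM_addr[0] ? M_half[15:8]    : M_half[7:0];
+ wire [31:0] M_loadData =
+      EM_isByte ? {{24{EM_loadSigned & M_byte[7]}},  M_byte} :
+      EM_isHalf ? {{16{EM_loadSigned & M_half[15]}}, M_half} :
+                  EM_Mdata;
+ ```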
### ` wbEn ` register pipeline
@@ -2280,10 +2282,80 @@ do the `!isStore(I) & !isBranch(I)` test, then the execute test will test that
` rdId ` is not 0.
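+
+ A hedged sketch of the corresponding flag pipeline (register names are
+ illustrative, and flush/bubble handling is omitted):
+
+ ```verilog
+ // Compute the write-back-enable flag as early as possible and pipeline it,
+ // so that the later stages only have to look at a single bit.
+ wire D_wbEnable = !isStore(FD_instr) & !isBranch(FD_instr); // cheap test in D
+
+ always @(posedge clk) begin
+    DE_wbEnable <= D_wbEnable;
+    EM_wbEnable <= DE_wbEnable & (DE_rdId != 0); // E adds the rdId != 0 test
+    MW_wbEnable <= EM_wbEnable;                  // M and W just carry the bit
+ end
+ ```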
### ` D ` decodes, and removes ` instr ` and ` PC ` from subsequent stages
- _ WIP_
+
+ We propagate ` instr ` from stage to stage, and decode it again each time we
+ need it. This is suboptimal; it is better to recognize the different
+ instructions once in ` D ` , and propagate ` is_xxxx ` flags, one per
+ instruction class, to the subsequent stages, as sketched below.
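+
+ A possible shape for this, as a hedged sketch with illustrative names (only a
+ few of the flags are shown):
+
+ ```verilog
+ // Decode once in D, using instr[6:2] (the two LSBs are always 2'b11) ...
+ wire D_isALUreg = (FD_instr[6:2] == 5'b01100);
+ wire D_isALUimm = (FD_instr[6:2] == 5'b00100);
+ wire D_isLoad   = (FD_instr[6:2] == 5'b00000);
+ wire D_isStore  = (FD_instr[6:2] == 5'b01000);
+ // ... same for the other instruction classes ...
+
+ // ... and propagate the flags in the pipeline registers instead of instr.
+ always @(posedge clk) begin
+    DE_isALUreg <= D_isALUreg;
+    DE_isALUimm <= D_isALUimm;
+    DE_isLoad   <= D_isLoad;
+    DE_isStore  <= D_isStore;
+ end
+ ```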
### Careful optimizations in ` D `
- _ WIP_
+
+ The choice of the bit pattern encoding of the 10 main instruction classes
+ is not arbitrary, it was designed with efficiency in mind. The most obvious
+ thing is the two least significant bits, that are always ` 11 ` (for uncompressed
+ instructions), so one does not need to test them. But there is also some more
+ subtle structure:
+
+ For instance,
+ ` JAL ` is the only one of these instructions that has its bit 3 set, so ` D `
+ does not need to test the other bits to recognize it. This is interesting,
+ because it simplifies the PC prediction logic, as in the sketch below.
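+
+ A hedged sketch of what this buys us, assuming only the 10 uncompressed base
+ instructions can occur (illustrative names; the real predictor also handles
+ branches and ` JALR ` , omitted here):
+
+ ```verilog
+ // JAL's opcode is 7'b1101111; among the 10 recognized instruction classes
+ // it is the only one with bit 3 set, so a single-bit test suffices:
+ wire D_isJAL = FD_instr[3];
+
+ // ... which keeps the JAL part of the next-PC prediction in D very cheap
+ // (D_Jimm is the J-type immediate decoded in D):
+ wire [31:0] D_PCprediction = D_isJAL ? FD_PC + D_Jimm : FD_PC + 4;
+ ```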
+
+ ### Pipelined register Id comparator for register forwarding
+
+ The two three-way muxes at the beginning of ` E ` are driven by comparisons between
+ the source register ids (` rs1Id ` and ` rs2Id ` ) and the destination register
+ id (` rdId ` ) in (` DE ` ,` EM ` ) and (` DE ` ,` MW ` ). The result of these comparisons can be
+ computed one cycle in advance, in ` D ` , and stored in 4 flipflops:
+ ` DE_rs1Id_eq_EM_rdId ` , ` DE_rs1Id_eq_MW_rdId ` , ` DE_rs2Id_eq_EM_rdId ` and
+ ` DE_rs2Id_eq_MW_rdId ` , as sketched below.
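+
+ A hedged sketch of how these flipflops can be computed (illustrative
+ expressions; stall/flush handling and the write-back-enable conditions are
+ omitted for clarity):
+
+ ```verilog
+ // In D, compare against the ids that will be in EM and MW during the next
+ // cycle, i.e. the ones currently in DE and EM, and register the results:
+ reg DE_rs1Id_eq_EM_rdId, DE_rs1Id_eq_MW_rdId;
+ reg DE_rs2Id_eq_EM_rdId, DE_rs2Id_eq_MW_rdId;
+ always @(posedge clk) begin
+    DE_rs1Id_eq_EM_rdId <= (D_rs1Id == DE_rdId);
+    DE_rs1Id_eq_MW_rdId <= (D_rs1Id == EM_rdId);
+    DE_rs2Id_eq_EM_rdId <= (D_rs2Id == DE_rdId);
+    DE_rs2Id_eq_MW_rdId <= (D_rs2Id == EM_rdId);
+ end
+
+ // The forwarding muxes in E then read single bits instead of comparing
+ // 5-bit ids, e.g. for rs1:
+ wire [31:0] E_rs1 = DE_rs1Id_eq_EM_rdId ? EM_Eresult :
+                     DE_rs1Id_eq_MW_rdId ? MW_WBdata  :
+                                           DE_rs1;
+ ```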
+
+ The "final product", called ` TordBoyau ` ,
2313
+ is available in [ this project] ( https://github.com/BrunoLevy/TordBoyau ) . On an ARTY
2314
+ using Vivado, it validates at 100-125 MHz (and can be successfully overclocked up
2315
+ to 140 MHz).
2316
+ - With the RV32I configuration, the raytracing test achieves 7.375 raystones
2317
+ (and 1.092 CPI, gshare + RAS works very well !).
2318
+ - With the RV32IM configuration, it reaches 18.215 raystones.
+
+ ## Epilogue
+
+ Hope you enjoyed this series. There are many other topics to study, and I will prepare
+ (one day) tutorials on the following topics (as soon as I understand them!):
+
+ - ** optimizations** : there are several things we can do to make our processor even faster.
+ The first thing is ` STORE ` ->` LOAD ` register forwarding, to make ` memcpy() ` run at 1 cycle
+ per word. The second thing concerns the ` RV32IM ` version of ` TordBoyau ` , which validates around
+ 80 MHz (whereas it can be safely overclocked up to 140 MHz), so there are probably some
+ false paths that need to be eliminated.
+
+ - ** cache** : for now, our pipelined processor has a ` PROGROM ` and a ` DATARAM ` . We should plug
+ a cache interface onto them, connected to an SDRam controller, that fetches and stores data
+ to the SDRAM as needed. I plan to develop that in the LiteX system, which has very complete
+ functionality (SDRam controller, frame buffer, etc.). Then we will be able to run DOOM
+ on our pipelined processor (DOOM already runs with the core that we built in Episode I, because
+ the LiteX folks already interfaced it to their cache controller).
+
+ - ** interrupts** : the RISC-V ISA has several chapters. There is a "privileged ISA", with
+ special registers and instructions to control interrupts and traps. However, the official
+ documentation of the ISA is very difficult to read (I think), because it lists every
+ possibility. I plan to write a small description of the bare minimum needed to run,
+ for instance, Linux-noMMU.
+
+ - ** MMU** : speaking of the MMU, it is an interesting topic in itself. @ultraembedded told me it
+ is super simple to add.
+
+ - ** Out-of-order** : at the end of this episode, thanks to the combined effect of
+ gshare branch prediction and the return address stack, we achieved a CPI of 1.092,
+ which is very near the "speed of light" (1 CPI). In fact, it is possible to design
+ "faster than light" processors, with several execution units, that pick instructions
+ and execute them out of order (OoO), executing ** several instructions per cycle** .
+ There are several cores that do that, such as NaxRiscV by @dolu1990 and
+ BiRiscV by @ultraembedded . The micro-architecture is completely different: it is still
+ a set of pipelines, but organized as a tree, with a "central authority" that
+ routes instructions (Tomasulo algorithm). It would be great to have a generic
+ design, where one could create one's own tree of pipelines, and have code to
+ automatically generate the "central authority". A framework like LiteX could
+ be used for that.
## References