- RasteriCEr
- Working games
- How to build for the iCE40UP5K
- How to build for the Xilinx Artix-7 35 (Arty)
- How to build the simulation
- Add the driver to your IDE
- Design
- Port to a new platform
- Missing features
- Technical limitations
- Possible performance improvements in the software part
The RasteriCEr is a rasterizer implementation mainly for the iCE40 FPGA. It can be used with microcontrollers and offers an API similar to OpenGL. The main goals are to implement the whole fixed-function pipeline of OpenGL 1.5 / OpenGL ES 1.1 and to be faster than a software implementation. I am confident that at least the second goal is achieved, but the first one requires a bigger FPGA. Currently, features such as the stencil buffer, fog, a second texture unit and texture filtering are missing. At this stage the project is more of a prototype or a proof of concept. Hopefully it is useful to someone, or at least good enough for education or inspiration.
I tried tuxracer in a simulation. It works and produces a playable image. Because smooth shading is missing, texture transitions have a hard cut (see the cut between ice and snow). Fog planes and lighting also suffer from this. The missing texture filtering causes a pixelated look.
First you have to install the following tools:
- Clifford's IceStorm
- Yosys 0.9 (git sha1 UNKNOWN, clang 11.0.3 -fPIC -Os)
- nextpnr-ice40 -- Next Generation Place and Route (Version 4512a9de)
Now execute the following steps:
git clone https://github.com/ToNi3141/RasteriCEr.git
cd RasteriCEr/rtl/top/iCE40up5k
make
make flash
This will by default build a bin file which expects a display with the resolution of 240x320. To use other resolutions you can set the following variables:
X_RESOLUTION
Y_RESOLUTION
Y_LINE_RESOLUTION
X_RESOLUTION and Y_RESOLUTION define the resolution of the whole screen. The iCE40 internally uses a 32kB memory for the color buffer and a 32kB memory for the depth buffer, so only small screen resolutions fit into the buffers. To support higher resolutions, Y_LINE_RESOLUTION can be used to divide the frame into several smaller frame lines. For instance, a screen resolution of 240x320 with 2 bytes per pixel requires a frame buffer roughly 5 times bigger than the iCE40 can handle. If we divide Y_RESOLUTION by 5 and use the result as Y_LINE_RESOLUTION, we get away with smaller local buffers and are still able to render a high resolution image. An example configuration could look like this:
make X_RESOLUTION=240 Y_RESOLUTION=320 Y_LINE_RESOLUTION=64
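The arithmetic behind this configuration can be sketched in a few lines of C++. This is only an illustration: the constant and function names are hypothetical and not part of the build system, and the sketch assumes 2 bytes per pixel and a 32kB color buffer as described above.

```cpp
#include <cassert>
#include <cstdint>

// Assumptions taken from the text: 32kB local color buffer, 16 bit per pixel.
constexpr std::uint32_t kBufferBytes = 32 * 1024;
constexpr std::uint32_t kBytesPerPixel = 2;

// Number of display lines needed so that one line fits into the local buffer.
constexpr std::uint32_t displayLineCount(std::uint32_t xRes, std::uint32_t yRes)
{
    return (xRes * yRes * kBytesPerPixel + kBufferBytes - 1) / kBufferBytes; // round up
}

// Height of one display line, i.e. a candidate value for Y_LINE_RESOLUTION.
std::uint32_t yLineResolution(std::uint32_t xRes, std::uint32_t yRes)
{
    const std::uint32_t lines = displayLineCount(xRes, yRes);
    assert(yRes % lines == 0); // this sketch only handles evenly divisible heights
    return yRes / lines;
}
```

For a 240x320 display, displayLineCount() yields 5 and yLineResolution() yields 64, which matches the make invocation above. One display line then needs 240 * 64 * 2 = 30720 bytes and fits into the 32kB buffer.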
Please note that omitting glClear() now behaves differently. Normally, when you omit it, you would see the last frame. The RasteriCEr will instead show you an echo of the previous line.
Other iCE40 FPGAs do not work because they lack the required internal memory.
- 32kB Texture buffer, supported resolutions: 128x128, 64x64, 32x32.
- Perspective correct textures
- Transparency effects
- Runs at 24MHz, maximum fill rate: 12MPixel/s. The fill rate is limited by the SPRAM: we can only read and write every other pixel. With dual port RAM, we could increase the fill rate to one pixel per clock. The fill rate does not depend on the render mode: transparency, depth buffering and so on do not reduce it. It does suffer from many small triangles (each triangle has a setup time of a few clock cycles) and from switching textures (each texture has to be completely uploaded to the hardware before rendering can continue).
- Displays can be directly connected to the iCE40. The top module implements hardware which serializes the color buffer and sends it via SPI to the display.
- Command SPI speed: up to 48MHz
- Display SPI speed: up to 24MHz
- 16 bit color (RGBA4444)
- 16 bit depth buffer (w buffer)
Info: Device utilisation:
Info: ICESTORM_LC: 4901/ 5280 92%
Info: ICESTORM_RAM: 24/ 30 80%
Info: SB_IO: 13/ 96 13%
Info: SB_GB: 8/ 8 100%
Info: ICESTORM_PLL: 0/ 1 0%
Info: SB_WARMBOOT: 0/ 1 0%
Info: ICESTORM_DSP: 3/ 8 37%
Info: ICESTORM_HFOSC: 1/ 1 100%
Info: ICESTORM_LFOSC: 0/ 1 0%
Info: SB_I2C: 0/ 2 0%
Info: SB_SPI: 0/ 2 0%
Info: IO_I3C: 0/ 2 0%
Info: SB_LEDDA_IP: 0/ 1 0%
Info: SB_RGBA_DRV: 0/ 1 0%
Info: ICESTORM_SPRAM: 4/ 4 100%
...
Info: Max frequency for clock 'clk': 25.04 MHz (PASS at 24.00 MHz)
Info: Max frequency for clock 'serial_sck$SB_IO_IN_$glb_clk': 102.01 MHz (PASS at 12.00 MHz)
PMOD A command signals
Signal | PMOD | IceBreaker | Raspberry Pi Pico | Arduino |
---|---|---|---|---|
resetn | 9 | 46 | 12 | 6 |
mosi | 2 | 2 | 3 | 11 |
sck | 1 | 4 | 2 | 10 |
cs | 8 | 48 | 11 | 5 |
cts | 10 | 44 | 13 | 7 |
tft mux | 7 | 3 | 10 | 4 |
tft cs | 4 | 45 | 9 | 9 |
tft dc | 3 | 47 | 8 | 8 |
If tft mux is asserted, the mosi, sck, tft cs and tft dc signals are multiplexed through to the PMOD B signals. The MCU can then communicate directly with the display, for instance to configure and initialize it.
PMOD B display signals
Signal | PMOD | IceBreaker |
---|---|---|
sck | 1 | 43 |
mosi | 2 | 38 |
reset | 8 | 36 |
dc | 9 | 32 |
cs | 7 | 42 |
cd rtl/top/Xilinx
/Xilinx/Vivado/2020.1/bin/vivado -mode batch -source build.tcl
Note: You will probably use a different Vivado version than the one in the path above. The build script uses a clock wizard which creates a 90MHz clock from the 100MHz oscillator on the Arty board. The design could probably also run at 100MHz and produce 100MPixel/s, but the timing is really tight. As a quick and dirty solution, you can remove the clk_wiz_0 from build.tcl and top.v and use the 100MHz clock directly. It could work.
By default it runs at 90MHz and renders with 90MPixel/s. Besides that, it uses the same configuration as the iCE40 build and the same pinout (PMOD JA for the command signals and PMOD JB for the display).
The simulation uses Verilator 4.200 2021-03-12 rev v4.110-43-g5022e81af (mod) to generate C++ code.
cd rtl/top/Verilator
make
After that, you can go to unittest/qtRasterizer and open the Qt project. This is a small simulation which renders an image (similar to the Arduino example) onto your screen.
It is likely that your Verilator installation has a different path than the one configured in the qtRasterizer.pro file. Let the variable VERILATOR_PATH point to your Verilator installation and rebuild the project.
You can find an example of how to use the driver under arduino/rasterizer. Before you can use the driver, you have to copy all files in the lib/gl directory into the Arduino library directory (if you are using the Arduino IDE). If you use another IDE, add these files to your build system.
mkdir /<pathToArduinoLibDir>/Arduino/libraries/IceGL
cp -R /<pathToLib>/RasteriCEr/lib/gl/* /<pathToArduinoLibDir>/Arduino/libraries/IceGL
Or you can create a symbolic link:
ln -s /<pathToLib>/RasteriCEr/lib/gl/ /<pathToArduinoLibDir>/Arduino/libraries/IceGL
The following diagram shows roughly the flow a triangle takes, until it is seen on the screen.
For the s_cmd_axis command specification, please refer to rtl/RasteriCEr/RegisterAndDescriptorDefines.vh
The driver is built from the following components:
- IceGLWrapper: Wraps the IceGL API to a C API.
- IceGL: The library which implements an OpenGL-like API. The API is not complete, but everything which is programmed for IceGL should be compatible with OpenGL. If not, then it is a bug.
- TnL: Implements the geometry transformation, clipping and lighting.
- Rasterizer: This basically is the rasterizer. It implements the edge equation to calculate barycentric coordinates and also calculates increments which are later used in the hardware to rasterize the triangle. This is also done for texture coordinates and w.
- IRenderer: Defines an interface from IceGL to a renderer. You can use this interface to implement a software renderer or a better version of the Renderer.
- Renderer: Implements the IRenderer interface, executes the rasterization, compiles display lists and sends them via the IBusConnector to the RasteriCEr.
- DisplayList: Contains all render commands produced by the Renderer and buffers them before they are streamed to the RasteriCEr.
- IBusConnector: This is an interface from the driver to the RasteriCEr. Implement your SPI driver here. Alternatively, if you implement the RasteriCEr in your own FPGA system, implement here how to stream data from and to the RasteriCEr's AXIS ports.
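To illustrate what the Rasterizer component computes, here is a minimal sketch of an edge equation, its per-pixel increments and the resulting barycentric coordinates. It is not the code of the Rasterizer class: all names are hypothetical, and it uses plain floats instead of the fixed point formats of the library.

```cpp
#include <cassert>

struct Vec2 { float x; float y; };

// Edge function: twice the signed area of the triangle (a, b, p).
float edge(const Vec2& a, const Vec2& b, const Vec2& p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Change of the edge function when p moves one pixel in x or y.
// With these increments the hardware only needs additions per pixel.
struct EdgeIncrement { float dx; float dy; };

EdgeIncrement edgeIncrement(const Vec2& a, const Vec2& b)
{
    return { -(b.y - a.y), b.x - a.x };
}

struct Barycentric { float w0; float w1; float w2; };

// Barycentric weights of p relative to the triangle (v0, v1, v2).
Barycentric barycentric(const Vec2& v0, const Vec2& v1, const Vec2& v2, const Vec2& p)
{
    const float area = edge(v0, v1, v2);
    assert(area != 0.0f); // degenerate triangles have no barycentric coordinates
    return { edge(v1, v2, p) / area, edge(v2, v0, p) / area, edge(v0, v1, p) / area };
}
```

The three weights sum to 1 inside the triangle and can be used to interpolate any vertex attribute, such as texture coordinates or w.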
Note that the blue boxes are specific to the iCE40 build. If you want to integrate it in your own system, only the parts in the RasteriCEr are relevant for you.
- SPI_Slave (3rd party): Implements an SPI slave to deserialize the data from the SPI bus.
- Serial2AXIS: Converts the deserialized data from the SPI_Slave into an AXIS data stream.
- RasteriCEr: Basically implements the RasteriCEr. It has a CMD_AXIS port where it receives the commands to render triangles, set render modes, upload textures and so on. It also has a FRAMEBUFFER_AXIS port where it streams out the data from the color buffer. Alternatively, both AXIS ports can also be connected to other devices like DMAs, if you want to integrate the RasteriCEr in your own project.
- CommandParser: Reads the data from the CMD_AXIS port, decodes the commands and controls the RasteriCEr.
- Rasterizer: Takes the triangle parameters from the Rasterizer class (see the section on the software) and rasterizes the triangle by using the precalculated values/increments.
- FragmentPipeline: Consumes the fragments from the Rasterizer, does perspective correction, depth test, blend and texenv calculations, texture clamping and so on.
- TextureBuffer: Buffers the texture.
- ColorBuffer: Contains the color buffer.
- FrameBuffer: Contains the depth buffer.
- DisplayControllerSPI: Contains an internal buffer with the size of the FrameBuffer and serializes the data for an SPI display.
To port the driver to a new MCU, only a few steps are required.
- Create a new class which is derived from IBusConnector and implement the virtual methods. This is also performance critical: use non-blocking methods if possible. Otherwise, rendering is slowed down because the data transfer blocks further processing.
- Use this class for the Renderer.
- Add the whole lib/gl directory to your build system.
- Build.
Now you are done with your port.
Note: An FPU can improve the performance drastically (around 5 times more triangles are possible).
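The first porting step can be sketched as follows. The interface shown here is a hypothetical stand-in: the method names startTransfer() and transferDone() are assumptions for illustration, so please check the real IBusConnector declaration in lib/gl for the methods you actually have to implement. The loopback class stores the bytes in a vector instead of driving a real SPI peripheral.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real IBusConnector interface in lib/gl.
class IBusConnectorSketch
{
public:
    virtual ~IBusConnectorSketch() = default;
    // Starts a transfer and returns immediately (non-blocking).
    virtual void startTransfer(const std::uint8_t* data, std::size_t size) = 0;
    // Tells the caller whether the previous transfer has finished.
    virtual bool transferDone() const = 0;
};

// Example implementation which "sends" into a vector.
class LoopbackBusConnector : public IBusConnectorSketch
{
public:
    void startTransfer(const std::uint8_t* data, std::size_t size) override
    {
        assert(data != nullptr || size == 0);
        // A real port would kick off a DMA or interrupt driven SPI transfer here.
        m_sent.insert(m_sent.end(), data, data + size);
    }

    bool transferDone() const override
    {
        return true; // the loopback finishes instantly; real hardware would poll a flag
    }

    const std::vector<std::uint8_t>& sent() const { return m_sent; }

private:
    std::vector<std::uint8_t> m_sent;
};
```

Splitting the transfer into a start call and a completion query is one way to keep the CPU free for the next display list while the previous one is still on the bus.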
- Connect the command AXIS channel (s_cmd_axis) to a device which streams the data from the IBusConnector.
- Connect the AXIS channel from the color buffer (m_framebuffer_axis) to a device which can handle the color buffer.
- Connect nreset to your reset line and aclk to your clock domain.
- Add everything in rtl/RasteriCEr to your build system.
- Synthesize.
The following features are currently missing compared to a real OpenGL implementation:
- Texture filtering
- Smooth shading
- Stencil buffer
- Fog
- Rasterization of lines and points
The rasterizer uses only the w component for the depth buffer. It does not remap w to z. That means glDepthRange() has no effect.
The zNear plane must be bigger than 0.5. Lower values will overflow.
The zFar plane must be smaller than 128.0. Higher values will overflow.
The reason: The interpolated w component is the reciprocal of the vertex depth, and for this we use an S1.30 number. The maximum value of this number is 1.999... If zNear is equal to or smaller than 0.5, the reciprocal will easily exceed the number range (it will be 2.0 or bigger).
During perspective correction on the FPGA, the S1.30 (reciprocal of the vertex depth) number will be converted into a U1.15 number. This U1.15 number is the divisor of a Un.24 number (calculating now the reciprocal of the interpolated reciprocal vertex depth), the result will be a U7.9 number (because the multipliers in the iCE40 can only handle 16 bit multiplications). The maximum value of this number is 127.9999... Therefore the zFar plane should never exceed 128.0 because then it will overflow the corrected w.
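The two limits can be checked with a small sketch. It only models the number ranges named above (S1.30 for the reciprocal depth, U7.9 for the corrected w), not the actual hardware data path. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// S1.30: 1 sign bit, 1 integer bit, 30 fractional bits in a 32 bit word.
constexpr double kS1_30One = 1073741824.0; // 2^30

// True if the value can be stored as S1.30 without overflow.
constexpr bool fitsS1_30(double v)
{
    return (v * kS1_30One >= -2147483648.0) && (v * kS1_30One <= 2147483647.0);
}

// U7.9: 7 integer bits, 9 fractional bits in a 16 bit word.
constexpr bool fitsU7_9(double v)
{
    return (v >= 0.0) && (v * 512.0 <= 65535.0);
}

// Reciprocal of the vertex depth, as it is interpolated by the rasterizer.
constexpr double reciprocalDepth(double w)
{
    return 1.0 / w;
}

// Convert to the raw S1.30 representation.
std::int32_t toS1_30(double v)
{
    assert(fitsS1_30(v)); // the caller has to stay inside the number range
    return static_cast<std::int32_t>(v * kS1_30One);
}
```

A depth of 1.0 gives a reciprocal of 1.0, which fits. A depth of 0.5 gives 2.0, which no longer fits into S1.30. In the same way, a corrected w of 127.5 fits into U7.9 while 128.0 does not.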
Besides that, the texture interpolation can also be affected by precision. For perspective correct textures, the texture coordinates have to be divided by w. If we have an object which is far away, its texture coordinates will be divided by a big w. That means the less significant bits become more important. The texture coordinate itself is interpolated as an S1.30 value, but during perspective correction it is converted to an S1.14 number (the iCE40 only has 16 bit hardware multipliers), and here we lose a lot of precision. This value is then multiplied with w again and produces an imprecise value.
The rasterizer currently uses S1.30 numbers for the vertex attributes. This has a big influence on GL_REPEAT and GL_CLAMP_TO_EDGE because texture coordinates bigger than 2.0 will overflow. Overflowing or underflowing is not a real issue when we use GL_REPEAT on the hardware, because GL_REPEAT is defined to just cut off the integer part and only use the fractional part. It looks different when we use GL_CLAMP_TO_EDGE. To clamp, we have to know where we are coming from and whether we are in our valid range (0.0 - 1.0). If an overflow occurs, we have no clue whether we are coming from e.g. 10.0 or 2.0, so we have no clue when to draw the border. In other words: for clamping we require the integer part, which right now has only one bit.
It is not really possible to reduce the fractional part of the number and add more integer bits. We would lose precision on the texture interpolation, especially when we have a big object which is far away. This would result in a drifting texture. We could fix this by adding a wider register with a wider multiplication, but at least with the iCE40, this is not possible. This problem is similar to the one with the zNear and zFar planes.
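The overflow described above can be illustrated with a small sketch. It stores a texture coordinate in a 32 bit word with 30 fractional bits and lets it wrap like a register would. The function names are hypothetical and the code only models the number format, not the FragmentPipeline.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Store a texture coordinate as S1.30 and let it wrap like a 32 bit register.
std::uint32_t toWrappedS1_30(double coord)
{
    assert(std::fabs(coord) < 1.0e6); // keep the scaled value inside the 64 bit range
    const std::int64_t scaled = static_cast<std::int64_t>(std::floor(coord * 1073741824.0));
    return static_cast<std::uint32_t>(scaled); // keeps only the lowest 32 bits
}

// GL_REPEAT only needs the 30 fractional bits, so wrapping does no harm.
double repeatCoord(std::uint32_t raw)
{
    return static_cast<double>(raw & 0x3FFFFFFFu) / 1073741824.0;
}
```

The coordinates 2.25 and 10.25 end up as the same 32 bit word. For GL_REPEAT this is fine because both yield the fractional part 0.25. For GL_CLAMP_TO_EDGE the information about how far the coordinate is outside of the range 0.0 - 1.0 is gone.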
Currently there is a bug when we use lots of repetitions on big objects (objects which are bigger than 1.0 per axis in object space). I don't know yet what happens there, but the textures start to move. I still have to investigate that. What is a bit confusing: the behavior differs between the fixed point rasterization and the floating point one. That suggests that it has something to do with the parameter calculation and not with the hardware. Use the following code to reproduce:
float planeS[4] = {8.5, 0, 0, 0};
float planeT[4] = {0, 8.5, 0, 0};
glTexGeni(GL_S, GL_TEXTURE_GEN_MODE, GL_OBJECT_LINEAR);
glTexGeni(GL_T, GL_TEXTURE_GEN_MODE, GL_OBJECT_LINEAR);
glTexGenfv(GL_S, GL_OBJECT_PLANE, planeS);
glTexGenfv(GL_T, GL_OBJECT_PLANE, planeT);
glEnable(GL_TEXTURE_GEN_S);
glEnable(GL_TEXTURE_GEN_T);
Probably this is a precision problem.
The hardware right now only supports a w buffer. When we have a matrix which does not use the w component of a vector for the depth, the depth buffer will not work. This can be observed with glOrtho(). This matrix sets the w component of the vector to 1: no matter which z value the vertex has, w stays 1.
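The difference between the two projections can be seen in a short sketch. It builds the matrices as they are defined for glOrtho() and glFrustum() and looks at the w component after the transformation. The helper names and the row major layout are choices made for this example only, not the matrix code of the library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>; // row major: m[row][column]

// Matrix as defined for glOrtho().
Mat4 ortho(float l, float r, float b, float t, float n, float f)
{
    assert(r != l && t != b && f != n);
    Mat4 m{};
    m[0][0] = 2.0f / (r - l);  m[0][3] = -(r + l) / (r - l);
    m[1][1] = 2.0f / (t - b);  m[1][3] = -(t + b) / (t - b);
    m[2][2] = -2.0f / (f - n); m[2][3] = -(f + n) / (f - n);
    m[3][3] = 1.0f; // w_clip = w_eye, independent of z
    return m;
}

// Matrix as defined for glFrustum().
Mat4 frustum(float l, float r, float b, float t, float n, float f)
{
    assert(r != l && t != b && f != n);
    Mat4 m{};
    m[0][0] = 2.0f * n / (r - l); m[0][2] = (r + l) / (r - l);
    m[1][1] = 2.0f * n / (t - b); m[1][2] = (t + b) / (t - b);
    m[2][2] = -(f + n) / (f - n); m[2][3] = -2.0f * f * n / (f - n);
    m[3][2] = -1.0f; // w_clip = -z_eye
    return m;
}

Vec4 transform(const Mat4& m, const Vec4& v)
{
    Vec4 out{};
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t col = 0; col < 4; ++col)
            out[row] += m[row][col] * v[col];
    return out;
}
```

With the frustum matrix, vertices at z = -2 and z = -10 get w = 2 and w = 10, so a w buffer can distinguish them. With the orthographic matrix, both vertices get w = 1 and the depth information is lost.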
- TnL Vertex Transformations: Normally, when we use vertex arrays, we could take the arrays, transform all vertices and normals and then start calculating light and so on. The reason why this is not done is memory. Currently I have embedded systems in mind and I don't want to allocate too much memory. The tradeoff is that we have to calculate the transformation several times, but we save memory.
- Renderer DisplayList handling: Currently, we have one display list which represents the whole screen. But as written earlier, we have so-called display lines. Typically an image is composed of several of these lines (similar to a tiled renderer). Right now, every time there is some space in the FIFO, we iterate through the screen display list, search for geometry for the current display line and compile a new display list. This is slow and it gets slower when the scene uses a lot of geometry. Also, textures are loaded blindly, even if no object in the current line uses the texture. To improve that, we could discard the approach of one display list per screen and implement a display list for each line. These lists would be filled simultaneously whenever there is geometry which affects the line. Then we don't have to iterate several times through a big display list. This saves CPU time but increases the memory consumption (and again, we don't want this). To improve that even more, when we use hardware flow control and a scatter/gather DMA, we can completely offload the display list upload from the CPU.
- Matrix handling: The handling of the matrices is not as good as it could be. Currently, if one matrix has changed, we recalculate all other matrices including the inverse of the model view matrix (which is used for the normal interpolation). This takes a lot of time and could sometimes be omitted.
- Float vs fixed point: Currently the library makes extensive use of floating point arithmetic. On an MCU with an FPU this is not a problem, but on MCUs without an FPU it slows down the calculations several times.
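The per-line display lists proposed in the list above could look roughly like the following sketch. It sorts triangles into one list per display line based on their vertical extent. The types and names are hypothetical and much simpler than the real DisplayList. Textures and render state changes are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical triangle record; only the vertical extent matters for binning.
struct TriangleBounds
{
    std::uint16_t id;
    std::uint16_t yMin; // first screen row covered by the triangle
    std::uint16_t yMax; // last screen row covered by the triangle (inclusive)
};

// One list of triangle ids per display line.
using LineLists = std::vector<std::vector<std::uint16_t>>;

LineLists binTriangles(const std::vector<TriangleBounds>& triangles,
                       std::uint16_t yResolution,
                       std::uint16_t yLineResolution)
{
    assert(yLineResolution > 0 && yResolution % yLineResolution == 0);
    LineLists lists(yResolution / yLineResolution);
    for (const TriangleBounds& t : triangles)
    {
        if (t.yMin > t.yMax || t.yMin >= yResolution)
            continue; // degenerate or completely off screen
        const std::size_t first = t.yMin / yLineResolution;
        const std::size_t last = std::min<std::size_t>(t.yMax / yLineResolution, lists.size() - 1);
        for (std::size_t line = first; line <= last; ++line)
            lists[line].push_back(t.id); // a triangle can touch several lines
    }
    return lists;
}
```

Each triangle is visited once, instead of once per display line. The price is the memory for the additional lists, which is the tradeoff mentioned above.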