MEMZERO and MEMCOPY instructions proposal
MEMZERO RD:size/offset, RS2:to_address
MEMCOPY RD:size/offset, RS1:from_address, RS2:to_address
Used in loops as follows
memzero:
x1 := nbytes
x2 := start_address
loop:
MEMZERO x1, x2
BGTZ x1, loop
memcopy:
x1 := nbytes
x2 := copy_from_start_address
x3 := copy_to_start_address
loop:
MEMCOPY x1, x2, x3
BGTZ x1, loop
This uses the form suggested by Krste Asanovic, defined below.
Depending on the details of the instruction, the BGTZ above can be a BEQ (BEQZ).
The simpler-to-explain form, which uses ranges defined as [end-size,end), requires one or two more instructions before the loop starts.
IMHO the easiest form of address based range instructions to explain is the one that uses ranges defined as [end-size,end).
- single address range, like MEMZERO or the various cache flushes:
  - SINGLE_ADDRESS_RANGE( src/dest RD, src RS1, ... )
    - RS1 contains the highest address of the range (exclusive)
    - RD contains the number of bytes in the range
    - RD is a source/dest register, both input and output
    - i.e. the range is [RS1-RD,RS1)
    - yes, this seems strange
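A minimal software sketch of the [RS1-RD,RS1) semantics, written in Python rather than RISC-V assembly. The names `memzero_step`, `memzero`, `MAX_CHUNK` and the `mem` bytearray are hypothetical; `MAX_CHUNK` stands in for however many bytes an implementation manages to complete before it stops (an interrupt, a cache line boundary, and so on). The point it illustrates is that the range is processed from low address to high address while only RD is written back.

```python
MAX_CHUNK = 8  # assumed: bytes one execution may complete (implementation-defined)


def memzero_step(mem, rd, rs1):
    """Model one execution of a [RS1-RD, RS1) single address range instruction.

    rd  -- number of bytes still to process (source/dest operand)
    rs1 -- exclusive end address of the range (source operand only)
    Returns the value written back to RD.
    """
    lo = rs1 - rd               # lowest address not yet processed
    n = min(rd, MAX_CHUNK)      # partial completion is allowed
    mem[lo:lo + n] = bytes(n)   # zero n bytes, low address to high address
    return rd - n               # the only register that changes


def memzero(mem, start, nbytes):
    """Software loop around the instruction, in the spirit of the BGTZ loop."""
    end = start + nbytes        # the extra setup instruction before the loop
    rd = nbytes
    while rd > 0:
        rd = memzero_step(mem, rd, end)


mem = bytearray(b"\xff" * 32)
memzero(mem, 4, 20)             # zero the 20 bytes starting at address 4
```

Because RS1 never changes and RD only shrinks, re-executing the instruction after an interruption picks up at the first unprocessed byte.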
- double address range, like MEMCOPY:
  - DOUBLE_ADDRESS_RANGE( src/dest RD, src RS1, src RS2 ... )
    - RS1 contains the highest address of the first range (exclusive)
    - RS2 contains the highest address of the second range (exclusive)
    - RD contains the number of bytes in both ranges, which is assumed to be equal
    - RD is a source/dest register, both input and output
    - i.e. the two ranges are [RS1-RD,RS1), [RS2-RD,RS2)
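The double address range form can be sketched the same way. This is a Python model under stated assumptions, not a definition: `memcopy_step`, `memcopy` and `MAX_CHUNK` are hypothetical names, RS1 is taken as the "from" range and RS2 as the "to" range (following the MEMCOPY operand order at the top of this page), and the two ranges are assumed not to overlap.

```python
MAX_CHUNK = 8  # assumed: bytes one execution may complete (implementation-defined)


def memcopy_step(mem, rd, rs1, rs2):
    """Model one execution copying [rs1-rd, rs1) to [rs2-rd, rs2).

    rd is the shared byte count (source/dest); rs1 and rs2 are the exclusive
    end addresses of the two ranges and are never written.
    """
    n = min(rd, MAX_CHUNK)               # partial completion is allowed
    src = rs1 - rd                       # lowest uncopied source address
    dst = rs2 - rd                       # lowest unwritten destination address
    mem[dst:dst + n] = mem[src:src + n]  # copy low address to high address
    return rd - n                        # a single register tracks both ranges


def memcopy(mem, src, dst, nbytes):
    """Loop until RD reaches zero; returns how many executions were needed."""
    rd, rs1, rs2 = nbytes, src + nbytes, dst + nbytes  # two setup additions
    steps = 0
    while rd > 0:
        rd = memcopy_step(mem, rd, rs1, rs2)
        steps += 1
    return steps


mem = bytearray(range(64))
memcopy(mem, 4, 40, 20)  # copy 20 bytes from address 4 to address 40
```

One decrementing count is enough to locate the current position in both ranges, which is why only one register needs to be written.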
Yes, it seems strange that the ranges are defined as [hi-size,hi), rather than [lo,lo+size) or [lo,hi). But AFAIK this is the only way to obtain
- exception transparency for a state machine implementation of a two address range instruction
- while allowing the memory range to be processed from low address to high address
  - [lo,lo+size) would require processing from high address to low address, which is often not desirable
- and requiring only a single register to be written by such an instruction
  - albeit a source/dest register, which is unavoidable for a state machine implementation with exception transparency.
If you are willing to tolerate inconsistency between the single address and double address forms, then a slightly more palatable version is possible, but only for single address forms:
- SINGLE_ADDRESS_RANGE( src/dest RD, src RS1, ... )
- RS1 contains the highest address of the range
- RD contains the low address of the range
- RD is a source/dest register, both input and output
- i.e. the range is [RD,RS1)
Me, I like consistency, and I hope to see a two address MEMCOPY instruction eventually defined for RISC-V.
Krste Asanovic suggests the form [start+offset,start+size). Initially, offset is zero, giving the familiar [start,start+size) form of range. However, as a result of partial completion, offset may be updated to indicate how much of the block memory operation has already been performed, so that you can resume in the middle.
It would be straightforward to define [start+offset,start+size) using three register operands: RD=source/dest offset, RS1=start, RS2=size.
However, RISC machines like RISC-V prefer to avoid three input register operands. Moreover, MEMCOPY would require four operands in that case.
Therefore Krste suggests packing both the size and the offset into a single register operand.
RD=source/dest
- count in lower half, bits RD[XLEN/2-1:0]
- offset in upper half, bits RD[XLEN-1:XLEN/2]
RS1=start of address range to write
RS2=start of address range to read from and copy to the RS1 based address range.
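A minimal sketch of the packed count/offset operand, shown for the single range (MEMZERO) case to keep it short. It is written in Python and rests on assumptions: XLEN is taken as 64, `pack`, `unpack`, `memzero_step`, `memzero` and `MAX_CHUNK` are hypothetical names, and since the text does not say how completion is signalled to the branch, the loop here simply compares the two fields.

```python
XLEN = 64               # assumed register width
HALF = XLEN // 2
MASK = (1 << HALF) - 1
MAX_CHUNK = 8           # assumed: bytes one execution may complete


def pack(count, offset):
    """count in bits [XLEN/2-1:0], offset in bits [XLEN-1:XLEN/2]."""
    return ((offset & MASK) << HALF) | (count & MASK)


def unpack(rd):
    """Return (count, offset) from a packed RD value."""
    return rd & MASK, (rd >> HALF) & MASK


def memzero_step(mem, rd, start):
    """Model one execution over [start+offset, start+count); returns new RD."""
    count, offset = unpack(rd)
    n = min(count - offset, MAX_CHUNK)  # partial completion is allowed
    lo = start + offset
    mem[lo:lo + n] = bytes(n)           # zero low address to high address
    return pack(count, offset + n)      # only the offset field advances


def memzero(mem, start, nbytes):
    """Loop until offset == count (assumed completion test, see lead-in)."""
    rd = pack(nbytes, 0)                # offset starts at zero
    while True:
        count, offset = unpack(rd)
        if offset == count:
            return rd
        rd = memzero_step(mem, rd, start)


mem = bytearray(b"\xff" * 32)
final_rd = memzero(mem, 4, 20)
```

Packing keeps the start address register unmodified and still leaves a single source/dest register, at the cost of limiting count and offset to XLEN/2 bits each.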
Classic RISC instruction sets are load/store. They do not have instructions that scan over or copy memory regions. CISCs, on the other hand, have such instructions, like Intel REP STOS and REP MOVS, IBM mainframe MVC (Move Characters).
A classic RISC approach to instruction set design is to define cache line oriented instructions, e.g. that might fetch, evict, or even zero an entire cache line at a time. If you want to apply this to a range, iterate. However, this exposes the cache line size, and is particularly important when zeroing memory.
However, sometimes range oriented instructions can be more efficient than doing things a cache line at a time. Sometimes people will build efficient state machines or external copy engines. Sometimes these range operations can use cache protocol features that would be unsafe to expose as arbitrary user instructions (e.g. Intel's "fast strings").
Yes: I mean implementations, microarchitectures, in the sense that we want to define an instruction set architecture facility that permits at least these three reasonable implementations.
1. enable per address, per cache line operations, like IBM POWER DCBZ
2. trap and emulate
3. possibly allow state machine-based implementations
   - whether inside or outside the CPU
Instruction design issues
- anything that needs to be updated so that the instruction can pick up where left off must be source/dest.
- Which basically means that anything that the loop in #1 needs to update must be considered a source/dest operand if a state machine implementation is to be enabled
I would definitely like to enable #3 state machine implementations. Perhaps not right now, but perhaps in the future. Perhaps not for CMOs, but perhaps for ZALLOC/MEMZERO and possibly for MEMCOPY.
I mention MEMCOPY, which involves two address ranges of the same size, as well as the single address range CMOs and MEMZERO, because maintaining "compatibility" between single address range MEMZERO and double address range MEMCOPY motivates a particular binding of register operands. Many people seem to dislike this design of address range instructions, although I hope it is just unfamiliarity, and wishing for an imaginary and impossible alternative.
I would like to enable #3 state machine implementations as cheaply as possible, which, to me, means that it would be nice for them to stay in the RISC mentality of only having a single destination register. As far as I know there are two definitions of single address range operations, but only one definition of double address range operations like MEMCOPY: