- machine instruction
- assembly source code file
- relocatable file
- executable file
- The CPU can only decode and execute machine instructions.
- An assembly source code file consists of assembly instructions, which are symbolic representations of machine instructions.
- The assembler can translate an assembly source file into a relocatable file.
- A relocatable file contains machine instructions and some meta information.
- A relocatable file can't be loaded and executed directly, because it may reference some symbols whose address values are not known yet.
- Multiple relocatable files can be linked to resolve unknown symbols and form an executable file.
- An executable file contains machine instructions to be executed, and its meta information specifies the memory address where the program should be loaded.
In this section, we learn how an assembly source file is converted into an executable file step by step, and why symbols with unknown address values exist.
Take the following assembly source code file test.s as an example.
.text
/* load an immediate value to t0 */
lui t0, %hi(0x12345678) /* higher 20 bits */
addi t0, t0, %lo(0x12345678) /* lower 12 bits */
/* load a symbol value to t0 */
lui t0, %hi(hello_string)
addi t0, t0, %lo(hello_string)
/* define a string */
hello_string:
.string "Hello, RISC-V!"
The source file performs three operations:
- Load an immediate number into the t0 register with two instructions, lui and addi
- Load the value of the symbol hello_string into the t0 register with two instructions, lui and addi
- Define a string with the symbol hello_string
In short, assembly is the process of reading assembly instructions from an assembly source file one by one and translating them into machine instructions.
What should the result look like after assembly? Let's analyse it.
- The third operation is just a string definition; the assembler only needs to convert it to ASCII bytes (the .string directive also appends a terminating zero byte).
- For the first operation, whose operand is an immediate number, the assembler can generate the corresponding machine instruction exactly.
However, the second operation contains the symbol hello_string. The value of a symbol is the memory address at which the data or instructions following the symbol in the assembly source will be loaded at execution time.
The problem is that, at assembly time, the assembler doesn't know where the data or instructions following a symbol will be loaded, and thus doesn't know the symbol's value.
In addition to the local symbols, an assembly file can also reference symbols in other assembly files. These symbols obviously cannot be resolved when assembling an assembly source file.
At assembly time, some symbols are not immediately resolved. When the assembler meets an assembly instruction with an unresolved symbol, it treats the symbol value as zero, then translates the assembly instruction into a machine instruction. Consequently, the bits that encode the unresolved symbol are also zero in the translated machine instruction.
To ensure that the unresolved symbols are finally resolved correctly, the assembler places additional auxiliary information in the generated target file. This auxiliary information is used to assist in symbol resolution, also called symbol relocation, so the target files generated by the assembler are called relocatable files.
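The placeholder behaviour described above can be sketched in a few lines of Python. This is only an illustration, assuming the standard RISC-V U-type and I-type instruction layouts; the helper names (`encode_lui`, `encode_addi`, `assemble_symbol_load`) and the tuple layout of the relocation records are hypothetical and are not how the GNU assembler is implemented. The relocation type names are standard RISC-V ELF relocation types.

```python
T0 = 5  # t0 is register x5 in the RISC-V ABI

def encode_lui(rd, imm20):
    """U-type layout: imm[31:12] | rd | opcode 0x37."""
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | 0x37

def encode_addi(rd, rs1, imm12):
    """I-type layout: imm[11:0] | rs1 | funct3=000 | rd | opcode 0x13."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13

def assemble_symbol_load(offset, rd, symbol):
    """Emit lui/addi with zeroed operands plus relocation records.

    The record layout (offset, type, symbol) is illustrative only.
    """
    words = [encode_lui(rd, 0), encode_addi(rd, rd, 0)]
    relocs = [
        (offset, "R_RISCV_HI20", symbol),        # patch the lui operand later
        (offset + 4, "R_RISCV_LO12_I", symbol),  # patch the addi operand later
    ]
    return words, relocs

# Known operand: the immediate can be encoded right away.
print(f"{encode_lui(T0, 0x12345):08x}", f"{encode_addi(T0, T0, 0x678):08x}")

# Unknown operand: zeros are encoded and relocation records are kept.
words, relocs = assemble_symbol_load(0x8, T0, "hello_string")
print([f"{w:08x}" for w in words])
print(relocs)
```

The zeroed instruction words and the two records are what a linker would later need in order to fill in the real address.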
We can assemble the example file into a relocatable file with riscv64-linux-gnu-as.
riscv64-linux-gnu-as test.s -o test.o
The above command assembles test.s and generates the relocatable file test.o.
Let's explore test.o!
With the command riscv64-linux-gnu-objdump -S test.o, we can disassemble test.o.
$ riscv64-linux-gnu-objdump -S test.o
test.o: file format elf64-littleriscv
Disassembly of section .text:
0000000000000000 <hello_string-0x10>:
0: 123452b7 lui t0,0x12345
4: 67828293 addi t0,t0,0x678 # 12345678 <hello_string+0x12345668>
8: 000002b7 lui t0,0x00000
c: 00028293 addi t0,t0,0x000 # mv t0,t0
0000000000000010 <hello_string>: (omitted, because it is just ASCII data)
As we can see, the first operation, which loads the immediate number 0x12345678, has been translated as follows. The operand 0x12345678 is split into its upper 20 bits 0x12345 and lower 12 bits 0x678, which are encoded in 123452b7 (lui t0,0x12345) and 67828293 (addi t0,t0,0x678) respectively.
0: 123452b7 lui t0,0x12345
4: 67828293 addi t0,t0,0x678 # 12345678 <hello_string+0x12345668>
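The split can be reproduced with a little arithmetic. The sketch below assumes 32-bit values and uses hypothetical helper names (`hi20`, `lo12`). One detail is easy to miss: addi sign-extends its 12-bit immediate, so the upper part is computed as (value + 0x800) >> 12 rather than by a plain shift. For 0x12345678 the lower part 0x678 is below 0x800, so both approaches give 0x12345. The second test value, 0x12345FFF, is an extra example (not from the text) where the rounding matters.

```python
def hi20(value):
    """Upper 20 bits, rounded so that (hi << 12) + signed(lo) == value."""
    return ((value + 0x800) >> 12) & 0xFFFFF

def lo12(value):
    """Lower 12 bits, interpreted as the signed immediate that addi adds."""
    lo = value & 0xFFF
    return lo - 0x1000 if lo >= 0x800 else lo

def rebuild(hi, lo):
    """What lui followed by addi computes, truncated to 32 bits."""
    return ((hi << 12) + lo) & 0xFFFFFFFF

for value in (0x12345678, 0x12345FFF):
    hi, lo = hi20(value), lo12(value)
    print(f"value={value:#x} hi={hi:#x} lo={lo} rebuilt={rebuild(hi, lo):#x}")
```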
However, for the second operation, which loads the symbol hello_string, the operand has been replaced with 0x0, as we predicted.
8: 000002b7 lui t0,0x00000
c: 00028293 addi t0,t0,0x000 # mv t0,t0
Specifically, compared to the first operation:
- 0x123452b7 -> 0x000002b7, or lui t0,0x12345 -> lui t0,0x00000
- 0x67828293 -> 0x00028293, or addi t0,t0,0x678 -> addi t0,t0,0x000 (the same as mv t0,t0)
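To check that only the operand bits differ between the two pairs, the immediate fields can be decoded by hand. This is a minimal sketch assuming the standard U-type and I-type layouts; the function names are hypothetical.

```python
def decode_lui_imm(word):
    """U-type immediate: bits 31..12 of the instruction word."""
    return (word >> 12) & 0xFFFFF

def decode_addi_imm(word):
    """I-type immediate: bits 31..20 of the instruction word (unsigned view)."""
    return (word >> 20) & 0xFFF

def same_apart_from_imm(a, b, imm_mask):
    """True when two instruction words differ only inside the immediate field."""
    return (a & ~imm_mask) == (b & ~imm_mask)

U_IMM_MASK = 0xFFFFF000  # immediate bits of a U-type instruction
I_IMM_MASK = 0xFFF00000  # immediate bits of an I-type instruction

print(hex(decode_lui_imm(0x123452B7)), hex(decode_lui_imm(0x000002B7)))
print(hex(decode_addi_imm(0x67828293)), hex(decode_addi_imm(0x00028293)))
print(same_apart_from_imm(0x123452B7, 0x000002B7, U_IMM_MASK))
print(same_apart_from_imm(0x67828293, 0x00028293, I_IMM_MASK))
```

The opcode and register fields are identical in each pair; only the immediate field has been zeroed.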
Where is the relocation information? Just execute riscv64-linux-gnu-objdump -r test.o.
$ riscv64-linux-gnu-objdump -r test.o
test.o: file format elf64-littleriscv
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000008 R_RISCV_HI20 hello_string
0000000000000008 R_RISCV_RELAX *ABS*
000000000000000c R_RISCV_LO12_I hello_string
000000000000000c R_RISCV_RELAX *ABS*
0000000000000008 R_RISCV_HI20 hello_string means that the operand of the machine instruction at offset 0x8 of the .text section of the relocatable file test.o needs to be replaced by the upper 20 bits of the value of the symbol hello_string when linking.
000000000000000c R_RISCV_LO12_I hello_string means that the operand of the machine instruction at offset 0xc of the .text section of the relocatable file test.o needs to be replaced by the lower 12 bits of the value of the symbol hello_string when linking.
The linker accepts multiple relocatable files, reads the data and machine instructions from each relocatable file and arranges all needed data and instructions in the final executable file according to specific rules.
The linker needs to know where the executable's data and machine instructions will be loaded at execution time, so that it can infer the address values of all symbols.
Then, according to the relocation entries of each relocatable file, the linker modifies those machine instructions associated with symbols that were unresolved during assembly but are now resolved.
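The patching step can be sketched as follows. This is a simplified model, not the GNU linker's implementation: the `text` dictionary, the record tuples, and the function names are hypothetical, the R_RISCV_RELAX records are ignored, and the symbol value 0x10 is an assumption corresponding to loading the program at address 0x0.

```python
def patch_hi20(word, value):
    """R_RISCV_HI20: replace the U-type immediate (bits 31..12)."""
    hi = ((value + 0x800) >> 12) & 0xFFFFF  # rounded upper 20 bits
    return (word & 0x00000FFF) | (hi << 12)

def patch_lo12_i(word, value):
    """R_RISCV_LO12_I: replace the I-type immediate (bits 31..20)."""
    return (word & 0x000FFFFF) | ((value & 0xFFF) << 20)

PATCHERS = {"R_RISCV_HI20": patch_hi20, "R_RISCV_LO12_I": patch_lo12_i}

def apply_relocations(text, relocs, symbols):
    """Return a patched copy of `text`, a dict of offset -> 32-bit word."""
    patched = dict(text)
    for offset, rtype, name in relocs:
        patched[offset] = PATCHERS[rtype](patched[offset], symbols[name])
    return patched

# Instruction words and relocation records taken from test.o.
text = {0x8: 0x000002B7, 0xC: 0x00028293}
relocs = [
    (0x8, "R_RISCV_HI20", "hello_string"),
    (0xC, "R_RISCV_LO12_I", "hello_string"),
]

# Assumed symbol value: hello_string at 0x10 when the program loads at 0x0.
linked = apply_relocations(text, relocs, {"hello_string": 0x10})
print({hex(k): f"{v:08x}" for k, v in linked.items()})
```

Only the immediate fields change; the opcode and register bits of each instruction are preserved by the masks.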
First, we need to prepare a linker script, test.ld.
test.ld contains just one line, . = 0x0; which tells the linker that the program load address is 0x0.
. = 0x0;
Link test.o with the following command. Note that we add the --no-relax option to prevent the linker from performing relaxation optimizations, so that we can compare the machine instructions before and after linking.
riscv64-linux-gnu-ld -T test.ld --no-relax test.o -o test
Then we get an executable file, test. Just like before, disassemble test with the command riscv64-linux-gnu-objdump -S test.
The instructions of test are as follows:
$ riscv64-linux-gnu-objdump -S test
test: file format elf64-littleriscv
Disassembly of section .text:
0000000000000000 <hello_string-0x10>:
0: 123452b7 lui t0,0x12345
4: 67828293 addi t0,t0,0x678 # 12345678 <hello_string+0x12345668>
8: 000002b7 lui t0,0x00000
c: 01028293 addi t0,t0,0x010 # 10 <hello_string>
0000000000000010 <hello_string>:
Compare
8: 000002b7 lui t0,0x00000
c: 01028293 addi t0,t0,0x010 # 10 <hello_string>
with before:
8: 000002b7 lui t0,0x00000
c: 00028293 addi t0,t0,0x000 # mv t0,t0
We can see that the operand encoding the value of the symbol hello_string has been modified to 0x00000010.
The point is that the symbol hello_string has been resolved and the associated machine instructions have been fixed!
With the command riscv64-linux-gnu-objdump -t test, we can print the symbol table and see that the value of the symbol hello_string is now 0x10.
$ riscv64-linux-gnu-objdump -t test
test: file format elf64-littleriscv
SYMBOL TABLE:
0000000000000000 l d .text 0000000000000000 .text
0000000000000000 l df *ABS* 0000000000000000 test.o
0000000000000010 l .text 0000000000000000 hello_string
If we change the load address to 0x10000000, the disassembly becomes:
$ riscv64-linux-gnu-objdump -S test
test: file format elf64-littleriscv
Disassembly of section .text:
0000000010000000 <hello_string-0x10>:
10000000: 123452b7 lui t0,0x12345
10000004: 67828293 addi t0,t0,0x678 # 12345678 <hello_string+0x2345668>
10000008: 100002b7 lui t0,0x10000
1000000c: 01028293 addi t0,t0,0x010 # 10000010 <hello_string>
0000000010000010 <hello_string>:
And the symbol table is:
$ riscv64-linux-gnu-objdump -t test
test: file format elf64-littleriscv
SYMBOL TABLE:
0000000010000000 l d .text 0000000000000000 .text
0000000000000000 l df *ABS* 0000000000000000 test.o
0000000010000010 l .text 0000000000000000 hello_string
We can see that the value of the symbol hello_string has been modified accordingly.
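The relationship between the load address and the symbol value is simple addition: the symbol sits at a fixed offset inside .text, and the linker adds the load address to it. A minimal sketch, assuming the 0x10 offset seen in the disassembly and using a hypothetical helper name (`resolve`):

```python
HELLO_STRING_OFFSET = 0x10  # offset of hello_string inside .text

def resolve(load_address, offset=HELLO_STRING_OFFSET):
    """Return (symbol value, lui operand, addi operand) for a load address."""
    value = load_address + offset
    hi = ((value + 0x800) >> 12) & 0xFFFFF  # operand patched into lui
    lo = value & 0xFFF                      # operand patched into addi
    return value, hi, lo

for base in (0x0, 0x10000000):
    value, hi, lo = resolve(base)
    print(f"load={base:#x} hello_string={value:#x} lui={hi:#x} addi={lo:#x}")
```

For both load addresses the addi operand stays 0x10, while the lui operand changes from 0x0 to 0x10000, matching the two disassemblies.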