Assembly and Linking Basics

Related file types and concepts

  • machine instruction
  • assembly source code file
  • relocatable file
  • executable file

Assembly and linking workflow

  • The CPU can only decode and execute machine instructions.
  • An assembly source code file consists of assembly instructions, which are symbolic representations of machine instructions.
  • The assembler can translate an assembly source file into a relocatable file.
  • A relocatable file contains machine instructions and some meta information.
  • A relocatable file can't be loaded and executed directly, because it may reference some symbols whose address values are not known yet.
  • Multiple relocatable files can be linked to resolve unknown symbols and form an executable file.
  • An executable file contains machine instructions to be executed, and its meta information specifies the memory address where the program should be loaded.

Why does a symbol with an unknown address value exist?

In this section, we follow an assembly source file step by step as it is converted into an executable file, and see why symbols with unknown address values exist.

Example assembly source file

Take the following assembly source code file test.s as an example.

.text

/* load an immediate value into t0 */
lui t0, %hi(0x12345678)        /* higher 20 bits */
addi t0, t0, %lo(0x12345678)   /* lower 12 bits  */

/* load a symbol value to t0 */
lui t0, %hi(hello_string)
addi t0, t0, %lo(hello_string)

/* define a string */
hello_string:
.string "Hello, RISC-V!"

The source file performs three operations:

  1. Load an immediate number into the t0 register with the two instructions lui and addi
  2. Load the value of the symbol hello_string into the t0 register with the two instructions lui and addi
  3. Define a string labelled with the symbol hello_string

Assembly

In short, assembly is the process of reading the instructions in an assembly source file one by one and translating them into machine instructions.

Example Analysis

What should the result look like after assembly? Let's analyse it.

  • The third operation is just a string definition, so the assembler only needs to convert it to ASCII bytes followed by a terminating zero byte.
  • For the first operation, whose operand is an immediate number, the assembler can generate the exact machine instructions.

However, the second operation references the symbol hello_string. The value of a symbol is the memory address at which the data or instructions that follow the symbol in the source file will be loaded at execution time.

The problem is that, at assembly time, the assembler doesn't know where the data or instructions following a symbol will be loaded, so it doesn't know the symbol's value.

In addition to local symbols, an assembly file can also reference symbols defined in other assembly files. These symbols obviously cannot be resolved while assembling a single source file.

Unresolved symbols

At assembly time, some symbols cannot be resolved immediately. When the assembler meets an assembly instruction with an unresolved symbol, it treats the symbol value as zero and then translates the assembly instruction into a machine instruction. As a result, the bits that represent the unresolved symbol are also zero in the translated machine instruction.

To ensure that the unresolved symbols are eventually resolved correctly, the assembler places additional auxiliary information in the file it generates. This auxiliary information assists in symbol resolution, also called symbol relocation, which is why the files generated by the assembler are called relocatable files.
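The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not how the GNU assembler is implemented: the Relocation class and the assemble_symbol_load helper are hypothetical names, and the two encoders cover just the lui and addi formats used in test.s.

```python
from dataclasses import dataclass

@dataclass
class Relocation:
    offset: int   # byte offset of the instruction inside .text
    rtype: str    # relocation type, e.g. "R_RISCV_HI20"
    symbol: str   # name of the symbol whose value is still unknown

T0 = 5  # t0 is register x5

def encode_lui(rd, imm20):
    # U-type layout: imm[31:12] | rd | opcode 0x37
    return (imm20 << 12) | (rd << 7) | 0x37

def encode_addi(rd, rs1, imm12):
    # I-type layout: imm[11:0] | rs1 | funct3=000 | rd | opcode 0x13
    return (imm12 << 20) | (rs1 << 15) | (rd << 7) | 0x13

def assemble_symbol_load(offset, rd, symbol):
    """Hypothetical helper: emit lui/addi with zero operands and record
    one relocation per instruction so a linker can patch them later."""
    words = [encode_lui(rd, 0), encode_addi(rd, rd, 0)]
    relocs = [
        Relocation(offset, "R_RISCV_HI20", symbol),
        Relocation(offset + 4, "R_RISCV_LO12_I", symbol),
    ]
    return words, relocs

words, relocs = assemble_symbol_load(0x8, T0, "hello_string")
for word, reloc in zip(words, relocs):
    print(f"{reloc.offset:x}: {word:08x}  {reloc.rtype} {reloc.symbol}")
```

The two printed words are 000002b7 and 00028293, that is, lui and addi with all operand bits set to zero.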

Assemble example file and explore its content

We can assemble the example file into a relocatable file with riscv64-linux-gnu-as.

riscv64-linux-gnu-as test.s -o test.o

The above command assembles test.s and generates the relocatable file test.o.

Let's explore test.o!

With the command riscv64-linux-gnu-objdump -S test.o, we can disassemble test.o.

$ riscv64-linux-gnu-objdump -S test.o

test.o:     file format elf64-littleriscv


Disassembly of section .text:

0000000000000000 <hello_string-0x10>:
   0:   123452b7                lui     t0,0x12345
   4:   67828293                addi    t0,t0,0x678 # 12345678 <hello_string+0x12345668>
   8:   000002b7                lui     t0,0x00000
   c:   00028293                addi    t0,t0,0x000 # mv t0,t0

0000000000000010 <hello_string>: (Omitted, because here is just some ASCII data)

As we can see, the first operation, which loads the immediate number 0x12345678, has been translated as follows. The operand 0x12345678 is split into its higher 20 bits 0x12345 and lower 12 bits 0x678, which are encoded in 123452b7 (lui t0,0x12345) and 67828293 (addi t0,t0,0x678) respectively.

   0:   123452b7                lui     t0,0x12345
   4:   67828293                addi    t0,t0,0x678 # 12345678 <hello_string+0x12345668>
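As a cross-check, the split and the two encodings can be reproduced by hand. The following Python sketch is not part of any toolchain; the function names are chosen for illustration. Note that %hi adds 0x800 before shifting, because addi sign-extends its 12-bit immediate; for 0x12345678 bit 11 is clear, so this rounding has no effect here.

```python
def hi20(value):
    """Operand produced by %hi: the upper 20 bits, rounded up when
    bit 11 is set to compensate for addi's sign extension."""
    return ((value + 0x800) >> 12) & 0xFFFFF

def lo12(value):
    """Operand produced by %lo, as the raw 12-bit field."""
    return value & 0xFFF

def encode_lui(rd, imm20):
    # U-type layout: imm[31:12] | rd | opcode 0x37
    return (imm20 << 12) | (rd << 7) | 0x37

def encode_addi(rd, rs1, imm12):
    # I-type layout: imm[11:0] | rs1 | funct3=000 | rd | opcode 0x13
    return (imm12 << 20) | (rs1 << 15) | (rd << 7) | 0x13

T0 = 5  # t0 is register x5

value = 0x12345678
print(hex(hi20(value)), hex(lo12(value)))            # 0x12345 0x678
print(f"{encode_lui(T0, hi20(value)):08x}")          # 123452b7
print(f"{encode_addi(T0, T0, lo12(value)):08x}")     # 67828293
```

The two encoded words match the bytes shown by objdump for offsets 0x0 and 0x4.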

However, for the second operation, which loads the symbol hello_string, the operand has been replaced with 0x0, as we analysed before.

   8:   000002b7                lui     t0,0x00000
   c:   00028293                addi    t0,t0,0x000 # mv t0,t0

Specifically, compared to the first operation:

  • 0x123452b7 -> 0x000002b7 OR lui t0,0x12345 -> lui t0,0x00000
  • 0x67828293 -> 0x00028293 OR addi t0,t0,0x678 -> addi t0,t0,0x000 (the same as mv t0,t0)

Where is the relocation information? Just run riscv64-linux-gnu-objdump -r test.o.

$ riscv64-linux-gnu-objdump -r test.o

test.o:     file format elf64-littleriscv

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000008 R_RISCV_HI20      hello_string
0000000000000008 R_RISCV_RELAX     *ABS*
000000000000000c R_RISCV_LO12_I    hello_string
000000000000000c R_RISCV_RELAX     *ABS*

0000000000000008 R_RISCV_HI20 hello_string means that the operand of the machine instruction at offset 0x8 of the relocatable file test.o needs to be replaced by the higher 20 bits of the value of the symbol hello_string when linking.

000000000000000c R_RISCV_LO12_I hello_string means that the operand of the machine instruction at offset 0xc of the relocatable file test.o needs to be replaced by the lower 12 bits of the value of the symbol hello_string when linking.

The R_RISCV_RELAX entries mark instructions that the linker is allowed to optimize (relax) during linking.

Linking

The linker accepts multiple relocatable files, reads the data and machine instructions from each relocatable file and arranges all needed data and instructions in the final executable file according to specific rules.

The linker needs to know where the executable's data and machine instructions will be loaded at execution time, so that it can infer the address values of all symbols.

Then, according to the relocation entries of each relocatable file, the linker modifies the machine instructions associated with symbols that were unresolved during assembly but are now resolved.
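The patching step can be sketched as follows. This is a simplified illustration, not the GNU linker's actual algorithm: it handles only the two relocation types seen in test.o, ignores addends and R_RISCV_RELAX, and assumes the linker has already decided that hello_string lives at address 0x10. The apply_relocation function and the dictionaries are hypothetical names.

```python
def hi20(value):
    # Upper 20 bits, rounded up when bit 11 is set (addi sign-extends).
    return ((value + 0x800) >> 12) & 0xFFFFF

def lo12(value):
    # Lower 12 bits as a raw field.
    return value & 0xFFF

def apply_relocation(word, rtype, symbol_value):
    """Return the instruction word with its operand bits replaced."""
    if rtype == "R_RISCV_HI20":
        # Keep rd and opcode (bits 11:0), replace imm[31:12].
        return (word & 0x00000FFF) | (hi20(symbol_value) << 12)
    if rtype == "R_RISCV_LO12_I":
        # Keep rs1, funct3, rd and opcode (bits 19:0), replace imm[11:0].
        return (word & 0x000FFFFF) | (lo12(symbol_value) << 20)
    raise ValueError(f"unsupported relocation type: {rtype}")

# Instruction words and relocation records taken from test.o.
text = {0x8: 0x000002B7, 0xC: 0x00028293}
relocs = [
    (0x8, "R_RISCV_HI20", "hello_string"),
    (0xC, "R_RISCV_LO12_I", "hello_string"),
]
symbols = {"hello_string": 0x10}  # assumed result of address assignment

for offset, rtype, name in relocs:
    text[offset] = apply_relocation(text[offset], rtype, symbols[name])

for offset, word in text.items():
    print(f"{offset:x}: {word:08x}")
```

With hello_string at 0x10, the word at offset 0x8 stays 000002b7 and the word at offset 0xc becomes 01028293.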

Link and explore the executable file

First, we need to prepare a linker script, test.ld.
test.ld contains just one line, . = 0x0; which tells the linker that the program's load address is 0x0.

. = 0x0;

Link test.o with the following command. Note that we add the --no-relax option to prevent the linker from performing relaxation optimizations, so that we can compare the machine instructions before and after linking.

riscv64-linux-gnu-ld -T test.ld --no-relax test.o -o test

This produces an executable file test. Just like before, we disassemble test with the command riscv64-linux-gnu-objdump -S test.

The instructions of test are as follows:

$ riscv64-linux-gnu-objdump -S test

test:     file format elf64-littleriscv


Disassembly of section .text:

0000000000000000 <hello_string-0x10>:
   0:   123452b7                lui     t0,0x12345
   4:   67828293                addi    t0,t0,0x678 # 12345678 <hello_string+0x12345668>
   8:   000002b7                lui     t0,0x00000
   c:   01028293                addi    t0,t0,0x010 # 10 <hello_string>

0000000000000010 <hello_string>:

Compare

   8:   000002b7                lui     t0,0x00000
   c:   01028293                addi    t0,t0,0x010 # 10 <hello_string>

with before:

   8:   000002b7                lui     t0,0x00000
   c:   00028293                addi    t0,t0,0x000 # mv t0,t0

We can see that the operand that stands for the value of the symbol hello_string has changed from 0x0 to 0x00000010.

The point is that the symbol hello_string has been resolved and the associated machine instructions have been fixed up!

With the command riscv64-linux-gnu-objdump -t test, we can print the symbol table and see that the value of the symbol hello_string is now 0x10.

$ riscv64-linux-gnu-objdump -t test

test:     file format elf64-littleriscv

SYMBOL TABLE:
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 l    df *ABS*  0000000000000000 test.o
0000000000000010 l       .text  0000000000000000 hello_string

What happens if we change the load address?

If we change the load address in test.ld to 0x10000000 and link again, the disassembly becomes:

$ riscv64-linux-gnu-objdump -S test

test:     file format elf64-littleriscv


Disassembly of section .text:

0000000010000000 <hello_string-0x10>:
    10000000:   123452b7                lui     t0,0x12345
    10000004:   67828293                addi    t0,t0,0x678 # 12345678 <hello_string+0x2345668>
    10000008:   100002b7                lui     t0,0x10000
    1000000c:   01028293                addi    t0,t0,0x010 # 10000010 <hello_string>

0000000010000010 <hello_string>:

And the symbol table is:

$ riscv64-linux-gnu-objdump -t test

test:     file format elf64-littleriscv

SYMBOL TABLE:
0000000010000000 l    d  .text  0000000000000000 .text
0000000000000000 l    df *ABS*  0000000000000000 test.o
0000000010000010 l       .text  0000000000000000 hello_string

We can see that the value of the symbol hello_string has been modified accordingly.
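The relationship between the load address and the patched operands can be summarised with a small calculation. This is a sketch under the assumption that .text starts exactly at the load address and that hello_string sits 0x10 bytes into it, as in the example; the resolve function is a hypothetical name.

```python
def resolve(load_address, section_offset):
    """Return (symbol value, lui operand, addi operand) for a symbol
    located section_offset bytes after the load address."""
    value = load_address + section_offset
    hi = ((value + 0x800) >> 12) & 0xFFFFF  # operand patched into lui
    lo = value & 0xFFF                      # operand patched into addi
    return value, hi, lo

for base in (0x0, 0x10000000):
    value, hi, lo = resolve(base, 0x10)
    print(f"load {base:#x}: hello_string = {value:#x}, "
          f"lui operand {hi:#x}, addi operand {lo:#x}")
```

For the load address 0x10000000 this gives the lui operand 0x10000 and the addi operand 0x10, matching the disassembly above.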

References