- Introduction
- ✨Some Basics✨ (Very Very Important)
- Compiling and Execution
- Chapters
- Important topics
- Assembly program for x86_64 processor (64-bit)
- Examples
- Leetcode
- Tools
- Why learn assembly language?
- References
This repository contains the assembly programs I implemented while studying assembly programming from many different sources, including the book "Programming from the Ground Up" by Jonathan Bartlett.
This book is highly recommended if you want to understand how a computer runs programs, how memory is allocated, and how data moves back and forth between RAM and the CPU (registers) to do calculations and save results. Additionally, you'll gain a thorough grasp of how high-level programs like C/C++ compile down to machine code which computers can understand and execute.
The assembly programs in this repository are 32-bit and 64-bit programs for x86 processors and the Linux operating system. They are written in AT&T syntax and can be built with the GNU toolchain (GCC).
Important: As I am new to assembly programming, the information in this repository may contain inaccuracies or errors despite my best efforts to avoid them. It is the result of what I have learned from a variety of sources.
If you discover any incorrect information, your feedback is very much appreciated; you can open a PR for it. See the contribution guidelines for details.
The reader is assumed to have a basic understanding of C/C++ in order to follow this guide. The better you know C/C++, the more easily you can relate to the ideas discussed here. We will use the C programming language as the reference language for the concepts that follow.
The programs in this repository can be built and run with the GNU toolchain (GCC) on Linux, or on Windows through WSL.
What does a basic program look like in C?
- variable definition (char, int, long, array).
- some if/else conditions.
- some loops.
- calling functions.
That's it. Almost every programming language offers at least these four things, but the question is how the computer interprets and runs them.
Every program needs a CPU and RAM to run. Other components are involved too, but to understand the basics we only need to focus on these two.
Every CPU has some general-purpose registers and some special-purpose registers. We can think of these registers as small storage locations inside the CPU. For example, a 32-bit x86 processor has general-purpose registers such as %eax, %ebx, %ecx, %edx, %edi and %esi.
In addition to these, there are also some special-purpose registers, including:
- %ebp: Base pointer register
- %esp: Stack pointer register
- %eip: Instruction pointer register
- %eflags: Status register
Depending on the CPU architecture, each register holds 32 bits on a 32-bit processor or 64 bits on a 64-bit processor. Since a program can use more variables than there are registers, we need to store data somewhere else.
This is where RAM enters the picture. Data can be kept in RAM and reached through its address, which a register can hold. During program execution, data moves from RAM to the CPU and, after processing, the results may be written back to RAM.
✨Note: In the upcoming sections, you will learn assembly for a 32-bit processor. Once you are comfortable with 32-bit assembly programming, you can move on to 64-bit assembly programming.
Operating system features are accessed through system calls. A system call is invoked by setting up registers in a specific way and issuing the instruction int $0x80, where int stands for interrupt. The kernel knows which system call you want from the value stored in the %eax register. Each system call has its own requirements for what must be stored in the other registers.
Conditional statements, loops and functions are discussed in detail in upcoming topics.
The code can be assembled, linked, and executed as follows:
# assemble
as --32 exit.s -o exit.out
# link
ld -m elf_i386 -s exit.out -o exit
# execute
./exit
# check if the exit code is correct
echo $? # prints exit code which the above program emits
These chapters walk you through assembly programming step by step, using concepts you likely already know from other programming languages, such as conditions, loops, functions, printing to the console, and many others.
Each concept is presented alongside an example of how you would write it in C.
In most programming languages you have learned so far, your first program was probably Hello World. Writing Hello World in assembly is a bit more involved, so it is covered in later chapters.
So what will be the first assembly program we write?
We will write a very simple program that exits with a specific exit code. In C, you can write it as:
int main() {
    return 10;
}
To accomplish the same in assembly:
.section .text
.globl _start
_start:
movl $1, %eax # system call number for exit
movl $10, %ebx # exit status code
int $0x80 # interrupt the kernel to run the system call
- The first line in the code above tells the assembler that what follows belongs to the .text section, where the program's instructions live. A program can have other sections, such as the .data section and the .bss section, which are discussed later.
- .globl _start makes the _start symbol visible to the linker, which uses it as the point where execution of the program starts.
- _start: marks where the _start block begins.
- movl is an assembly instruction with two operands; it copies the data from the first operand to the second. See the assembly instructions section for more instructions.
- The $ sign before the first operand indicates that we want immediate addressing mode, which embeds the data in the instruction itself. See the data accessing methods section for a detailed discussion.
- 1 is placed in the %eax register to select the exit system call, and %ebx is set to the exit code, which is 10.
- int $0x80 interrupts the kernel to issue the system call.
The l suffix on an instruction tells the assembler that the operands are 32 bits wide. It is not always necessary, but it becomes important when writing 64-bit assembly.
To verify the exit code of the above program, assemble and run it, then check the exit status:
echo $? # will output 10
So far the exit status code has served as the program's output, but it is not meant for that purpose. As you have seen, any kind of I/O requires a system call, like the exit call you just made by setting the %eax register to 1. In the same way, you need a system call to print something to the console.
To do this, set the %eax register to 4 to select the write system call. Set the %ebx register to the file descriptor, which is 1 for stdout, the %ecx register to the address of the buffer, and the %edx register to the size of the message to print.
For a thorough description of opening, closing, reading, and writing files, refer to the files section.
Here is an example of an assembly program that prints a message:
.globl _start
.section .data
msg:
.ascii "Hello World\n"
.section .text
_start:
movl $4, %eax # sys call for write
movl $1, %ebx # set fd which is 1 for stdout
movl $msg, %ecx # set buffer address
movl $12, %edx # set msg size
int $0x80 # interrupt kernel to make sys call
# exit program with successful status code
movl $1, %eax
movl $0, %ebx
int $0x80
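As a rough cross-check from the C side, the same write call can be issued through libc's generic syscall() wrapper instead of the usual write() function. This is only a sketch: raw_write is a hypothetical helper name, and SYS_write is used rather than the literal 4 because the number differs on 64-bit builds.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issues the write system call by number.
   SYS_write expands to the correct number for the build target
   (4 on 32-bit x86, as used in the assembly above). */
long raw_write(int fd, const char *buf, size_t len) {
    /* Arguments mirror the registers: fd -> %ebx, buf -> %ecx, len -> %edx. */
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(1, "Hello World\n", 12) should print the message to stdout and return the number of bytes written, just like the value the kernel leaves in %eax.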
Programming relies heavily on conditional statements such as if, else if, and else. In assembly, you implement them with jump instructions that transfer control to another part of the program. With jumps you can create several branches that control the program's flow.
Before we can jump conditionally, we must first compare two values with the cmpl instruction, which records the outcome in a special register called %eflags. Below is the list of jump instructions that act on the result of a comparison:
- je: Jump if the values were equal.
- jg: Jump if the second value is greater than the first.
- jge: Jump if the second value is greater than or equal to the first.
- jl: Jump if the second value is less than the first.
- jle: Jump if the second value is less than or equal to the first.
- jmp: Jump no matter what. It does not need to be preceded by a comparison.
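The operand order is easy to get backwards, so here is a small C model of it. This is a sketch, not how the CPU implements flags: each hypothetical *_taken function answers whether the jump would be taken after `cmpl first, second`, treating both values as signed 32-bit integers.

```c
#include <stdbool.h>

/* Model of `cmpl first, second` followed by a conditional jump (AT&T order).
   The function names are illustrative only. */
bool je_taken(int first, int second)  { return second == first; }
bool jg_taken(int first, int second)  { return second >  first; }
bool jge_taken(int first, int second) { return second >= first; }
bool jl_taken(int first, int second)  { return second <  first; }
bool jle_taken(int first, int second) { return second <= first; }
```

For instance, with 9 as the first operand and 10 as the second, jge_taken(9, 10) is true, because the second value (10) is greater than or equal to the first (9).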
For example:
int main() {
int x = 10;
if (x >= 9)
return x;
return 0;
}
We can write the corresponding assembly code as:
.globl _start
.section .text
_start:
movl $1, %eax # exit sys call
movl $10, %ebx # put x in %ebx, which also holds the exit status
movl $9, %ecx # put the value to compare against in %ecx
cmpl %ecx, %ebx # compare %ebx (10) with %ecx (9)
jge end_block # if %ebx >= %ecx, jump to the end
movl $0, %ebx # otherwise set the exit status to 0
end_block:
int $0x80 # interrupt kernel
Similar to if/else conditions you can use jump instruction to create a loop in the assembly.
For example:
int main() {
int sum = 0;
int i = 10;
while (i > 0) {
sum += i;
i--;
}
return sum;
}
The corresponding assembly program can be written as:
.globl _start
.section .text
_start:
movl $1, %eax # exit system call number
movl $0, %ebx # initialize %ebx (exit status register) to 0
movl $10, %ecx # store 10 in %ecx
loop:
addl %ecx, %ebx # add current value to %ebx
decl %ecx # decrease %ecx by 1
end:
cmpl $0, %ecx # compare %ecx with 0
jg loop # jump back to loop while %ecx is greater than 0
int $0x80 # interrupt kernel
Functions are a crucial part of programming: they help you write code that is reusable, modular, and maintainable. While studying functions in other programming languages, you were probably taught that a stack is used internally to keep track of where a function was called from and of the data it uses. You will now see how this is done in assembly.
We only have a limited number of registers, so before calling a function you must keep local values (held in registers) somewhere they won't be lost, because the function may use and overwrite the same registers. You also need a structured way to store data so that you can easily restore it after the function returns. %esp, a special register called the stack pointer, helps with this.
Before a function is called, the program should use the pushl instruction to push all of the function's parameters onto the stack, in the reverse of the order in which they are documented. Then it issues a call instruction naming the function to call. The call instruction first pushes the return address, which is the address of the next instruction, onto the stack. It then changes the instruction pointer %eip to point to the start of the function.
Note: The stack grows downward in memory. If it grows beyond the space available to it, the program crashes with a stack overflow.
Inside a function, you can access all of this data through the base pointer, using different offsets from %ebp. %ebp was made specifically for this purpose, which is why it is called the base pointer.
✨Important✨: The following steps set up the function parameters, define a function, call it, and return control to the point where it was called.
- Push the function parameters in reverse order using the pushl instruction.
- Call the function by issuing a call instruction with the function name.
- Declare the function anywhere in the program file as .type <function_name>, @function.
- Start the function block with the function name as a label.
- The first two instructions should save the old base pointer on the stack and make the current stack position your base pointer:
  pushl %ebp # save old base pointer
  movl %esp, %ebp # make stack pointer the base pointer
- Read the function parameters through the %ebp register using base pointer addressing mode.
- Do the calculations.
- Store the return value in %eax.
- Restore the stack to the state it was in when the function was called, using the instructions movl %ebp, %esp and popl %ebp.
- Return control to the caller using the ret instruction.
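To see why the first parameter ends up at 8(%ebp) and the second at 12(%ebp), the steps above can be traced with a toy stack in C. This is a simplified sketch under stated assumptions: the Machine struct, the helper names, and the placeholder return address are all made up for illustration, and the stack is just a small array addressed in bytes.

```c
#include <stdint.h>

enum { STACK_WORDS = 64 };

/* Toy machine: a tiny stack plus the two registers that manage it. */
typedef struct {
    uint32_t mem[STACK_WORDS]; /* one slot per 4-byte word */
    uint32_t esp;              /* byte offset of the top of the stack */
    uint32_t ebp;              /* byte offset used as the frame base */
} Machine;

static void push(Machine *m, uint32_t v) { m->esp -= 4; m->mem[m->esp / 4] = v; }
static uint32_t pop(Machine *m) { uint32_t v = m->mem[m->esp / 4]; m->esp += 4; return v; }
static uint32_t load(const Machine *m, uint32_t addr) { return m->mem[addr / 4]; }

/* Traces a call to sum(a, b) step by step and returns what %eax would hold. */
uint32_t simulate_sum_call(uint32_t a, uint32_t b) {
    Machine m = { .esp = STACK_WORDS * 4, .ebp = 0 };

    push(&m, b);                  /* pushl b: last parameter first */
    push(&m, a);                  /* pushl a */
    push(&m, 0xDEADBEEFu);        /* call: pushes a (placeholder) return address */

    push(&m, m.ebp);              /* pushl %ebp */
    m.ebp = m.esp;                /* movl %esp, %ebp */

    uint32_t eax = load(&m, m.ebp + 8);   /* movl 8(%ebp), %eax: first parameter */
    eax += load(&m, m.ebp + 12);          /* addl 12(%ebp), %eax: second parameter */

    m.esp = m.ebp;                /* movl %ebp, %esp */
    m.ebp = pop(&m);              /* popl %ebp */
    (void)pop(&m);                /* ret: pops the return address */
    m.esp += 8;                   /* addl $8, %esp: caller removes the parameters */
    return eax;
}
```

The offsets follow from what sits on the stack at the time: the saved %ebp is at 0(%ebp), the return address at 4(%ebp), and the parameters start at 8(%ebp).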
For example:
int sum(int a, int b) {
return a + b;
}
int main() {
int s = sum(10, 20);
return s;
}
The corresponding assembly program will look like this:
.section .text
.globl _start
_start:
pushl $20 # push second arg
pushl $10 # push first arg
call fun_name # call function, which stores the result in %eax register
addl $8, %esp # reset stack (caller removes the two args)
# use the result as exit code
movl %eax, %ebx
movl $1, %eax
int $0x80
# tell the assembler that fun_name is a function
.type fun_name, @function
# function definition
fun_name:
pushl %ebp # save old base pointer
movl %esp, %ebp # make stack pointer the base pointer
movl 8(%ebp), %ebx # get first arg
movl 12(%ebp), %ecx # get second arg
movl %ebx, %eax # copy first val to res
addl %ecx, %eax # add second val to res
# restore stack before returning
movl %ebp, %esp # restore stack pointer
popl %ebp # restore base pointer
ret
The "Hello World" program above uses the write system call to print a message to the console. This becomes impractical for dynamic messages, because you would have to convert every value to ASCII and fill the buffer byte by byte yourself.
Instead of making a direct system call, we can use the standard printf function provided by libc, whose signature is int printf(const char *restrict format, ...);
For example:
#include <stdio.h>
int main() {
printf("Value of x is %d\n", 10);
return 0;
}
We can write the corresponding assembly code as:
.globl _start
.section .data
msg:
.ascii "Value of x is %d\n\0"
.section .text
_start:
pushl $10 # set second param as int value
pushl $msg # set first param as formatted string
call printf # call printf function
# exit program with success exit code
movl $1, %eax
movl $0, %ebx
int $0x80
To build the above code, you need to tell the linker to link libc so that you can use the printf function it provides. We can assemble, link, and execute the code as:
as --32 main.asm -o main.out
ld -dynamic-linker /lib/ld-linux.so.2 main.out -m elf_i386 -s -o run -lc
./run
# output: Value of x is 10
Notice that the format string ends with a null byte (\0): printf expects a string buffer and assumes that its end is marked by a null byte.
Similarly, you can use other functions provided by libc; check their man pages for their signatures.
There are several ways to write inline assembly in the C programming language, but the one shown here is sufficient for our current needs.
You can write assembly code to implement a function. However, you must tell the compiler that the function body is raw assembly code. To do so, add __attribute__((naked)) before the function definition, which makes the compiler omit its own prologue and epilogue. Then use __asm__ to wrap the assembly code.
For example:
// add.c
#include <stdio.h>

__attribute__((naked))
int sum(int a, int b) {
    __asm__(
        "pushl %ebp;"
        "movl %esp, %ebp;"
        "movl 8(%ebp), %eax;"   // s = a
        "addl 12(%ebp), %eax;"  // s += b
        "movl %ebp, %esp;"
        "popl %ebp;"
        "ret;"
    );
}

int main() {
    printf("%d\n", sum(10, 20));
    return 0;
}
To compile and run this code
# using -m32 to compile in 32-bit mode
gcc -m32 add.c -o add.out
./add.out
# output: 30
- movl: Copies data and takes two operands, source and destination, i.e. movl src, dest. For example, movl %eax, %ebx.
- addl: Add the source operand to the destination.
- subl: Subtract the source operand from the destination.
- imull: Multiply the destination by the source operand and store the result in the destination.
- incl: Increase the value by 1, like i++.
- decl: Decrease the value by 1, like i--.
- idivl: Divides the 64-bit value held in %edx:%eax by the operand. For a non-negative 32-bit dividend, put it in %eax and set %edx to zero. The quotient is then stored in %eax and the remainder in %edx. The divisor can be any register or memory location.
The l suffix on an instruction tells the assembler that the operands are 32 bits wide.
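The register pairing used by idivl is the least obvious item in the list above, so here is a rough C model of it. It is a sketch only: DivResult, idivl_model, and cltd_model are hypothetical names, and the model ignores the fault that real hardware raises on division by zero or when the quotient does not fit in 32 bits.

```c
#include <stdint.h>

typedef struct {
    int32_t eax; /* quotient, as left in %eax */
    int32_t edx; /* remainder, as left in %edx */
} DivResult;

/* Models cltd: sign-extends %eax into %edx before a signed division. */
int32_t cltd_model(int32_t eax) { return eax < 0 ? -1 : 0; }

/* Models `idivl divisor`: the dividend is the 64-bit value %edx:%eax. */
DivResult idivl_model(int32_t edx, int32_t eax, int32_t divisor) {
    uint64_t bits = ((uint64_t)(uint32_t)edx << 32) | (uint32_t)eax;
    int64_t dividend = (int64_t)bits;
    DivResult r;
    r.eax = (int32_t)(dividend / divisor); /* truncates toward zero, like idivl */
    r.edx = (int32_t)(dividend % divisor);
    return r;
}
```

With %edx set to 0 and %eax set to 17, dividing by 5 leaves 3 in %eax and 2 in %edx. For a negative dividend, %edx has to hold the sign extension instead of zero, which is what cltd_model stands in for.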
The general form of memory address reference is:
address_Or_offset(base_Or_offset,index,multiplier)
All of the fields above are optional. To calculate the address, use the formula:
final_address = address_Or_offset + base_Or_offset + multiplier * index
multiplier and address_Or_offset must both be constants, while the other two must be registers. The multiplier can only be 1, 2, 4, or 8. A piece that is left out is treated as zero, except the multiplier, which defaults to 1.
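The formula can be checked with a few lines of C. This is only an illustrative sketch: effective_address is a hypothetical helper, and the addresses in the usage note are made-up numbers.

```c
#include <stdint.h>

/* Computes the address referenced by
   address_or_offset(base_or_offset, index, multiplier).
   Pass 0 for a field that is left out (and 1 for an omitted multiplier). */
uint32_t effective_address(uint32_t address_or_offset, uint32_t base_or_offset,
                           uint32_t index, uint32_t multiplier) {
    return address_or_offset + base_or_offset + multiplier * index;
}
```

For example, if an array of 4-byte integers starts at address 0x1000, element 2 is at effective_address(0x1000, 0, 2, 4), which is 0x1008. Likewise 4(%eax) with %eax holding 0x2000 corresponds to effective_address(4, 0x2000, 0, 1), which is 0x2004.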
You can access data in different ways.
- Immediate mode: This is the simplest mode, in which the data to access is embedded in the instruction itself. Example:
  movl $0, %eax
  This loads register %eax with the value 0. The $ indicates that you want to use immediate mode addressing.
- Register addressing mode: The instruction contains a register to access rather than a memory location. Example:
  movl %eax, %ebx
  This copies the value stored in register %eax to register %ebx.
- Direct addressing mode: The instruction contains the memory address to access. For example, you can say: load this register with the data at address 200. Example:
  movl ADDR, %eax
  This loads register %eax with the value stored at memory address ADDR.
- Indexed addressing mode: The instruction contains a memory address along with an index register that specifies the offset from that address. Example:
  .section .data
  data_start:
  .int 1,2,3,4
  ...
  movl data_start(,%ecx,4), %eax
  The multiplier is 4 here because each .int value is 4 bytes wide. %ecx contains the index of the element to access.
- Indirect addressing mode: The instruction contains a register that holds a pointer to where the data should be accessed. If %eax holds an address, you can move the value at that address to %ebx as movl (%eax), %ebx.
- Base pointer addressing mode: Similar to indirect addressing mode, but it also includes a number, called the offset, that is added to the register's value before the lookup. For example, if you have a record where the value is 4 bytes into the record, and you have the address of the record in %eax, you can retrieve the value into %ebx as movl 4(%eax), %ebx.
- %eax holds 5, the number of the open system call.
- The address of the first character of the filename should be stored in %ebx.
- The read/write intentions, represented as a number, should be stored in %ecx. You can use 0 for files you want to read from and 03101 for files you want to write to.
- The file permissions should be stored as a number in %edx. In general, you can use 0666 for permissions.
movl $5, %eax # open system call
movl $file_name, %ebx # address of the filename (file_name is an assumed label)
movl $0, %ecx # open for reading
movl $0666, %edx # permissions
int $0x80
The system call above returns a file descriptor in %eax. You can use this number to refer to the file throughout your program.
- read and write are system calls with numbers 3 and 4 respectively.
- The file descriptor should be stored in %ebx.
- The address of the buffer to read into or write from should be stored in %ecx.
- The size of the buffer should be stored in %edx. A read buffer is typically reserved in the .bss section.
read returns either the number of bytes read or an error code in %eax. In the case of write, %eax will contain the number of bytes written or an error code.
The close system call is 6. The only parameter to close is the file descriptor, placed in %ebx.
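For comparison, the same open, write, read, and close sequence can be written in C with the libc wrappers, which end up making the system calls described above. This is a minimal sketch: roundtrip_ok is a hypothetical helper, the 256-byte limit is arbitrary, and the system call numbers in the comments are the 32-bit x86 ones used in this section.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes len bytes to path, reads them back, and returns 1 if they match. */
int roundtrip_ok(const char *path, const char *data, size_t len) {
    char buf[256];
    if (len > sizeof buf) return 0;

    /* open (5): flags select write access, 0666 is the permission value */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) return 0;
    ssize_t written = write(fd, data, len);   /* write (4) */
    close(fd);                                /* close (6) */
    if (written != (ssize_t)len) return 0;

    fd = open(path, O_RDONLY);                /* open (5) with flags value 0 */
    if (fd < 0) return 0;
    ssize_t got = read(fd, buf, len);         /* read (3) */
    close(fd);

    return got == (ssize_t)len && memcmp(buf, data, len) == 0;
}
```

Each wrapper returns the value the kernel would leave in %eax, except that failures are reported as -1 with the error code stored in errno.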
The .bss section of the program is like the .data section, except that it doesn't take up space in the executable. This section can reserve storage, but you can't initialize it. In the .data section, by contrast, you can reserve storage and set an initial value.
For example:
.section .bss
.lcomm my_buffer, 500 # It will reserve 500 bytes of storage which you
# can use as a buffer
- STDIN: 0, it is a read-only file.
- STDOUT: 1, it is a write-only file.
- STDERR: 2, it is a write-only file.
Up to this point, our main focus has been on learning assembly and building some fundamental programming skills in 32-bit mode, since writing 32-bit assembly is simpler.
The concepts are the same in 32-bit and 64-bit modes; the main differences are the register names, the calling convention, and how system calls are made. Check out the 'x86_64' directory and its accompanying instructions to learn assembly in 64-bit mode.
sno | program | topic |
---|---|---|
1 | exit.asm | exit sys call |
2 | hello_world.asm | write sys call |
3 | hello_world_lib.asm | libc function call |
4 | greatest.asm | array(buffer), loops, condition |
5 | add.asm | function(procedure), call stack |
6 | factorial.asm | function, recursion, condition, loops |
7 | power_iter.asm | function, loops, condition |
8 | power_rec.asm | function, recursion, loops, condition |
sno | program | topic |
---|---|---|
1 | sum.asm | buffer, loops, condition, libc function |
2 | add.asm | inline asm |
3 | add_arr.asm | buffer, loops, condition, inline asm |
4 | 2d_arr.asm | nested buffer, loops, condition, inline asm |
5 | malloc.asm | malloc, loops, condition, inline asm |
6 | malloc_2d.asm | malloc for 2d buffer, loops, condition, inline asm |
Once you have mastered x86_64 assembly, you can write inline assembly in C to solve problems on leetcode. The leetcode directory contains some solutions along with their C annotations.
Below is the list of programs:
sno | question | level | topic | assembly topics | Open |
---|---|---|---|---|---|
1. | Candy | hard | array, greedy | malloc, free, loop, conditions | open |
2. | Climbing Stairs | easy | Math, Dynamic Programming, Memoization | loop, conditions | open |
3. | find the duplicate number | medium | array,two pointer, binary search | loop, conditions | open |
4. | Fibonacci numbers | easy | array,two pointer, binary search | loop, conditions | open |
5. | Find the difference | easy | hash, string | byte(char), loops | open |
6. | sqrt(x) | easy | math, binary search | long int <-> int, loops | open |
7. | Min cost climbing stairs | easy | array, dp | loops, conditions,malloc,free | open |
8. | minimum operations to reduce an integer to 0 | medium | dp, bit manipulation | recursion, conditions | open |
9. | missing number | easy | hash, math, array | loop, conditions | open |
10. | plus one | easy | math, array | malloc, free, loop, conditions | open |
11. | single number | easy | array, bit manipulation | bit manipulation, loop | open |
12. | two sum | easy | array, hash table | condition, nested loop | open |
13. | asteroid collision | medium | array, stack | condition, nested loop, malloc | open |
14. | search in a binary search tree | medium | binary tree | condition, recursion, struct | open |
This tool compiles C/C++ code to assembly for various compilers and compiler versions. With it you can see what your C/C++ program looks like in assembly. It can also generate different flavours of assembly, such as the AT&T syntax we studied or the Intel syntax.
The best combination of options for our code is:
- 32-bit:
  - Set the compiler to x86-64 gcc 4.1.2.
  - Set the compiler options to -m32 -O2, where O2 is the optimization level.
  - Disable intel asm syntax in the output dropdown.
- 64-bit:
  - Set the compiler to x86-64 gcc 4.1.2.
  - Set the compiler options to -m64 -O2, where O2 is the optimization level.
  - Disable intel asm syntax in the output dropdown.
You can even compare the assembly generated for different flavours side by side.
This tool simulates how a C++ program might look if it were written in the C programming language.
By learning assembly language, you gain a better understanding of how the operating system works and of how your code is compiled and run behind the scenes.
You can make informed decisions to avoid writing unoptimized code as well as premature optimization.
By analyzing assembly code, you can identify vulnerabilities in software and develop exploits for them.
You can reverse engineer software. Once you understand assembly, reverse engineering becomes much easier.
- x86-64 Machine-Level Programming by Randal E. Bryant and David R. O'Hallaron