Learning assembler on Linux

August 23, 2014 software assembler

Summary

When learning assembler on Linux, you first pick a syntax (AT&T or Intel) and an assembler, such as the GNU Assembler (`as`) or the Netwide Assembler (`nasm`). Code written for one assembler will not assemble with the other, and the way you call the Linux kernel differs between 32-bit and 64-bit x86: the registers, the instruction, and the syscall numbers all change. Programs start at `_start`, not `main`, and must exit explicitly; otherwise execution runs on into whatever follows in memory and the program crashes. Looking up syscall numbers and managing memory without the C standard library is a good way to deepen your understanding.

For entertainment, I’m learning assembler on Linux. Jotting down some things I learn here.

There are two syntaxes, AT&T and Intel (Go uses its own, because Plan 9). They look very different, but once you get over that, the differences are minimal. Linux tradition is mostly AT&T syntax, MS Windows mostly Intel.
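To make the difference concrete, here’s a small sketch in AT&T syntax with the Intel spelling of each instruction in the comment. It assumes GNU as on x86-64 Linux, and the program is made up for illustration: it adds two numbers and exits with the result as its status code.

```asm
# Sketch: AT&T syntax, Intel equivalent in each comment.
# Assumes GNU as on x86-64 Linux; the program itself is a made-up example.
.text

.global _start

_start:
    mov $40, %edi        # Intel: mov edi, 40     (AT&T: source first, destination last)
    add $2, %edi         # Intel: add edi, 2      ($ marks an immediate, % a register)
    mov (%rsp), %rax     # Intel: mov rax, [rsp]  (memory operand, loaded only to show the syntax)
    mov $60, %eax        # Intel: mov eax, 60     (60 is the x86-64 'exit' syscall)
    syscall              # Intel: syscall         (exits with status 42)
```

Assemble and link it with as and ld, run it, and `echo $?` should print 42.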

There’s no standardisation, so each assembler can do things its own way. as, the GNU Assembler, is the most common one on Linux (and what gcc emits by default), but nasm, the Netwide Assembler, is very popular too. Code written for as will not assemble in nasm, and vice versa.

Talking to the Linux kernel is different depending on whether you target a 32-bit (x86) or 64-bit (x86-64) processor:

The registers to use change
The instruction to call changes (int 80h on 32-bit vs syscall on 64-bit)
The syscall numbers change
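To make those three differences concrete, here’s a sketch of a tiny 32-bit program. The message and the file name eat32.s are my own inventions, and it assumes your kernel can run 32-bit binaries (most x86-64 distributions can). On x86-64 the same two calls are syscall numbers 1 and 60, with arguments in rdi, rsi, and rdx, and the syscall instruction.

```asm
# Sketch: write-then-exit for 32-bit x86 Linux, AT&T syntax.
# Build on an x86-64 machine:
#   as --32 -o eat32.o eat32.s
#   ld -m elf_i386 -o eat32 eat32.o
.data

msg:
    .ascii "Hello from 32-bit\n"
    msglen = . - msg

.text

.global _start

_start:
    mov $4, %eax         # 'write' is syscall 4 on 32-bit
    mov $1, %ebx         # arg 1 goes in ebx: stdout (fd 1)
    mov $msg, %ecx       # arg 2 goes in ecx: address of string
    mov $msglen, %edx    # arg 3 goes in edx: length of string
    int $0x80            # 32-bit kernel entry (AT&T spelling of int 80h)

    mov $1, %eax         # 'exit' is syscall 1 on 32-bit
    mov $0, %ebx         # return code 0
    int $0x80
```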

So before you even get started, you need to pick a syntax, an assembler, and a target. I’m using as, with AT&T syntax, on Linux x86-64.

To learn, I’m reading Assembly Language Step-by-Step. It’s definitely helpful, but it’s targeted at a CS 101 class, which makes it slow going. It also uses Intel syntax, with nasm, on 32-bit, which takes a bit of mental translating.

Here is the first program from that book, translated, in case you want to play too:

.data

eatmsg:
    .ascii "Eat at Joe's!\n"
    eatlen = . - eatmsg

.text

.global _start

_start:
    mov $1, %eax        # 'write' syscall
    mov $1, %edi        # write to stdout (fd 1)
    mov $eatmsg, %rsi   # address of string to write
    mov $eatlen, %edx   # length of string to write
    syscall

    mov $60, %eax       # 'exit' syscall
    mov $0, %edi        # return code 0
    syscall

Save as eatsyscall.s and build with:

as -gstabs -o eatsyscall.o eatsyscall.s
ld -o eatsyscall eatsyscall.o

Other bookmarks I keep open:

GNU Assembler manual. Extremely terse, but it’s there.
Kernel calling convention. Because I forget which registers to use (syscall number in RAX, then arguments in RDI, RSI, RDX, R10, R8, and R9 – yes, 10 8 9 at the end, that’s not a typo).
AMD manuals, particularly Part 3 – General Purpose Instructions.
/usr/include/x86_64-linux-gnu/asm/unistd_64.h for the syscall numbers, and man 2 <syscall name> for what to pass them.
Programming from the ground up. This looks promising, and uses the same syntax and assembler as me. I haven’t gotten to reading it yet.
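The register order from the kernel calling convention sticks better once you’ve used a call with more than three arguments. Here’s a sketch that sleeps for one second; picking clock_nanosleep is my own choice, simply because it takes four arguments and so reaches r10.

```asm
# Sketch: a four-argument syscall on x86-64, to show where r10 comes in.
# clock_nanosleep(clockid, flags, request, remain) is syscall 230 on x86-64.
.data

onesec:
    .quad 1              # tv_sec  = 1
    .quad 0              # tv_nsec = 0

.text

.global _start

_start:
    mov $230, %eax       # 'clock_nanosleep' syscall
    mov $1, %edi         # arg 1 (rdi): CLOCK_MONOTONIC
    mov $0, %esi         # arg 2 (rsi): flags = 0, a relative sleep
    mov $onesec, %rdx    # arg 3 (rdx): address of the timespec to sleep for
    mov $0, %r10d        # arg 4 (r10): NULL, we don't want the remaining time
    syscall              # result comes back in rax; the kernel clobbers rcx and r11

    mov $60, %eax        # 'exit' syscall
    mov $0, %edi         # return code 0
    syscall
```

It should pause for about a second and then exit with status 0.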

I’ve already learnt two interesting things, about starting and stopping programs.

Programs don’t start at main, they start at _start. When you build a C program, _start is put in for you, does some setup, then calls main. _start is the symbol the linker ld looks up to know what address to put in the ELF header as the entry point address. For a different example, the Go start symbol (on x86-64 Linux) is _rt0_amd64_linux.
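One piece of that setup is easy to see for yourself. _start is jumped to rather than called, so there is no return address on the stack; per the System V x86-64 ABI, the argument count sits at the top of the stack instead. A minimal sketch (the example is mine, not from the book) that exits with argc as its status code:

```asm
# Sketch: read argc straight off the stack at _start (x86-64 Linux).
.text

.global _start

_start:
    mov (%rsp), %rdi     # argc, which C startup code would pass on to main
    mov $60, %eax        # 'exit' syscall
    syscall              # exit status = argc
```

Run it with two arguments and `echo $?` should print 3, since the program name counts as one.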

Programs have to explicitly exit. If you don’t call the exit (or exit_group) system call, your program keeps on running, tries to get its next instruction from whatever comes right after it in memory, and crashes.
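A harmless way to watch that fall-through is to put more code right after the point where you would normally exit. In this sketch (labels and messages are made up) there is no exit after the first write, so execution carries straight on into the second one.

```asm
# Sketch: nothing stops the CPU at the end of a block of code (x86-64 Linux).
.data

one:
    .ascii "one\n"
    onelen = . - one

two:
    .ascii "two\n"
    twolen = . - two

.text

.global _start

_start:
    mov $1, %eax         # 'write' syscall
    mov $1, %edi         # stdout (fd 1)
    mov $one, %rsi
    mov $onelen, %edx
    syscall
    # No exit here, so execution continues with the next bytes in memory.

next:
    mov $1, %eax         # 'write' syscall again
    mov $1, %edi
    mov $two, %rsi
    mov $twolen, %edx
    syscall

    mov $60, %eax        # 'exit' syscall
    mov $0, %edi         # return code 0
    syscall
```

It prints both lines. Delete the last three instructions and the same thing happens again, except that what comes next is no longer code you wrote, and the program typically dies with a segmentation fault.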

You can call all of the C stdlib functions from assembler, by using gcc to link, or passing the right arguments to ld. Or you can not be so lazy, and do everything yourself!
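For the lazy route, here’s a sketch of what that can look like when gcc does the linking. The file name hello.s and the message are my own. Because gcc links in the C startup code, you write main rather than _start, and you return instead of calling exit.

```asm
# Sketch: calling puts from the C library, linked with gcc (x86-64 Linux).
# Build: gcc -o hello hello.s
.section .rodata

msg:
    .asciz "Hello from libc"

.text

.global main

main:
    sub $8, %rsp             # keep the stack 16-byte aligned for the call
    lea msg(%rip), %rdi      # arg 1 of puts: address of the string
    call puts@PLT            # puts appends the newline itself
    add $8, %rsp
    mov $0, %eax             # return 0 from main
    ret

.section .note.GNU-stack,"",@progbits
```

The rip-relative lea and the @PLT call are there so it links whether or not your gcc builds position-independent executables by default.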

That’s the part I’m most excited about. How am I going to allocate memory, without malloc? No, don’t answer that. The fun is in figuring it out.

Graham King