asm

x86-64 assembly

history: AMD constructed 64-bit extension to x86 first
- marketing term: AMD64
Intel first tried a new ISA (Itanium), which failed
Then Intel copied AMD64
- marketing term: EM64T
  - Extended Memory 64 Technology
- later marketing term: Intel 64
both Intel and AMD have manuals — definitive reference

Title page for Intel 64 and IA-32 Architectures Software Developer's Manual; Combined Volumes 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, and 3D

x86-64 manuals

Intel manuals:
- https://software.intel.com/en-us/articles/intel-sdm
- 25 MB, 5240 pages
- Volume 2: instruction set reference (2591 pages)
AMD manuals:
- https://support.amd.com/en-us/search/tech-docs
- “AMD64 Architecture Programmer’s Manual”

example manual page (1)

Top of the Intel instruction set reference page for INC. Includes information about all the forms of the instruction (with different size register and memory operands), how they are encoded (with information about the opcode and how the argument is represented), and what modes (e.g. 32- vs. 64-bit) they are valid in.

example manual page (2)

Second part of the Intel instruction set reference page for INC. Includes a text description of the instruction and its operation, what flags are affected, and what exceptions are possible.

instruction listing parts (1)

opcode — first part of instruction encoding
- yes, variable length
- “REX”???
- more later (Friday or next week)
instruction — Intel assembly skeleton
r/m32 = 32-bit memory or register value
64-bit mode — does instruction exist in 64-bit mode?
compat/leg mode — in 16-bit/32-bit modes?

instruction listing parts (2)

description + operation (later on page)
- text and pseudocode description
flags affected
- flags — used by jne, etc.
exceptions — how can OS be called from this?
- example: can invalid memory access happen?

recall: x86-64 general purpose registers

diagram from Immae via Wikipedia

overlapping registers (1)

setting 32-bit registers sets whole 64-bit register
extra bits are always zeroes

movq $0x123456789abcdef, %rax
    // Intel: MOVABS RAX, 0x123456789abcdef
xor %eax, %eax
// %rax is 0, not 0x0123456700000000
movl $-1, %ebx
    // Intel: MOV EBX, -1
// %rbx is 0xFFFFFFFF, not -1 (0xFFFFF...FFF)

32-bit instructions are often shorter than 64-bit ones,
so compilers will prefer mov $1234, %ecx to mov $1234, %rcx

overlapping registers (2)

setting 8/16-bit registers doesn’t change rest of 64-bit register:

movq $0x123456789abcdef, %rax
movw $0xaaaa, %ax
// %rax is 0x123456789abaaaa

AT&T versus Intel syntax

AT&T syntax:
movq $42, 100(%rbx,%rcx,4)
Intel syntax:
mov QWORD PTR [rbx+rcx*4+100], 42
effect (pseudo-C):
memory[rbx + rcx * 4 + 100] <- 42

AT&T syntax (1)

movq $42, 100(%rbx,%rcx,4)

destination last
constants start with $
registers start with %

AT&T syntax (2)

movq $42, 100(%rbx,%rcx,4)

operand length: q = 8 bytes
- l = 4; w = 2; b = 1
- can be omitted when implied by context
100(%rbx,%rcx,4):
memory[100 + rbx + rcx * 4]
sub %rax, %rbx: rbx ← rbx - rax

Intel syntax

mov QWORD PTR [RBX + RCX * 4 + 100], 42

destination first
[…] indicates location in memory
QWORD PTR […] for 8 bytes in memory
- DWORD for 4
- WORD for 2
- BYTE for 1
- can be omitted when implied by context

On LEA

LEA = Load Effective Address
uses the syntax of a memory access, but…
just computes the address and puts it in the destination (no memory access):
leaq 4(%rax), %rax same as addq $4, %rax
- almost — doesn’t set condition codes

LEA tricks

leaq (%rax,%rax,4), %rax multiplies %rax by 5
- address-of(memory[rax + rax * 4])
leal (%rbx,%rcx), %eax adds rbx + rcx into eax
- ignores top 32 bits of the 64-bit sum

question

.data
string:
    .asciz "abcdefgh"
.text
    movq $string, %rax // mov RAX, STRING
    movq string, %rdx  // mov RDX, [STRING]
    movb (%rax), %bl   // mov BL, [RAX]
    leal 1(%rbx), %ebx // lea EBX, [RBX+1]
    movb %bl, (%rax)   // mov [RAX], BL
    movq %rdx, 4(%rax) // mov [4+RAX], RDX

What is the final value of string?

"abcdabcd"
"bbcdefgh"
"bbcdabcd"
"abcdefgh"
something else / not enough info

reading objdump disassembly

often, we’ll want to work from binaries to assembly
tool we’ll use on Linux: objdump
from objdump --disassemble:

0000000000001060 <main>:
  1060:  f3 0f 1e fa           endbr64 
  1064:  50                    push   %rax
  1065:  48 8d 3d 98 0f 00 00  lea    0xf98(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
  106c:  e8 df ff ff ff        callq  1050 <puts@plt>
  1071:  31 c0                 xor    %eax,%eax
  1073:  5a                    pop    %rdx
  1074:  c3                    retq

symbol main at address 0x1060

first column: instruction addresses in hexadecimal
(if the executable/library has a fixed address,
these are the addresses where it will be loaded in memory)

after instruction addresses:
machine code as list of byte values in hexadecimal

callq 1050 <puts@plt> = call to address 0x1050
puts@plt is the label of that address

comment after lea shows the address the instruction computes:
0xf98(%rip)=0x2004 (0x4 bytes after the label _IO_stdin_used)

floating point operations

x86 has two ways to do floating point
method one — legacy: x87 floating point instructions
- still common in 32-bit x86
method two — SSE instructions
- work more like what you expect

XMM registers

%xmm0 through %xmm15 (%xmm0 through %xmm7 in 32-bit mode)
each holds 128 bits, interpreted as one of:
- four 32-bit floating point values (addps, etc.)
- two 64-bit floating point values (addpd, etc.)
- 64/32/16/8-bit integers (paddq/d/w/b, etc.)
- a 32-bit floating point value, 96 unused bits (addss, movss, etc.)
- a 64-bit floating point value, 64 unused bits (addsd, movsd, etc.)
more recently: %ymm0 through %ymm15 (256-bit, “AVX”)
- low 128 bits overlap with the corresponding %xmm registers

FP example

multiplyEachElementOfArray:
/* %rsi = array, %rdi length,
   %xmm0 multiplier */
loop:   test %rdi, %rdi
        je done
        movss (%rsi), %xmm1
        mulss %xmm0, %xmm1
        movss %xmm1, (%rsi)
        subq $1, %rdi
        addq $4, %rsi
        jmp loop
done:   ret

label(%rip)

AT&T 0x1234(%rip) / Intel [RIP + 0x1234]
- value in memory 0x1234 bytes after current instruction

thing: .quad 42
…
movq thing(%rip), %rax
- special case in Linux assembler:
  if movq ends at 0x7000 and thing is at 0x5000, then…
  = movq -0x2000(%rip), %rax (not movq 0x5000…)

string instructions (1)

memcpy: // copy %rdx bytes from (%rsi) to (%rdi)
        testq %rdx, %rdx   // is the remaining count zero?
        je done
        movsb
        subq $1, %rdx
        jmp memcpy
done:   ret

movsb (move data from string to string, byte)
mov one byte from (%rsi) to (%rdi)
increment %rsi and %rdi (*)
cannot specify other registers

string instructions (2)

memcpy: // copy %rdx bytes from (%rsi) to (%rdi)
    movq %rdx, %rcx   // rep counts with %rcx
    rep movsb
    ret

rep prefix byte
repeat instruction until %rcx is 0
decrement %rcx each time
cannot specify other registers
cannot use rep with all instructions

string instructions (3)

lodsb, stosb — load/store into string
movsw, movsd — word/dword versions
string comparison instructions
rep movsb is still recommended on modern Intel
- special-cased in processor?

ENDBR64?

partial output of objdump --disassemble:


0000000000001060 <main>:
  1060:  f3 0f 1e fa           endbr64 
  1064:  50                    push   %rax
  1065:  48 8d 3d 98 0f 00 00  lea    0xf98(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
  106c:  e8 df ff ff ff        callq  1050 <puts@plt>
  1071:  31 c0                 xor    %eax,%eax
  1073:  5a                    pop    %rdx
  1074:  c3                    retq

endbr64: no-op instruction that marks valid destinations of indirect branches
why? — we’ll explain (much) later

recall(?): virtual memory

illusion of dedicated memory:

Diagram showing program A and program B addresses on the left, passed through separate mappings set by the OS that point to 'real memory'. The real memory keeps the program A and program B code separate. Some of the arrows are dotted, which a legend indicates means 'kernel-mode only', and some go to 'trigger error' instead of a part of 'real memory'.

segmentation

before virtual memory, there was segmentation

x86 segmentation

addresses you’ve seen are the offsets
but every access uses a segment number!
segment numbers come from registers
default segment register based on instruction type
- CS — code segment number (jump, call, etc.)
- SS — stack segment number (push, pop, etc.)
- DS — data segment number (mov, add, etc.)
- ES — addt’l data segment (string instructions)
- FS, GS — extra segments (never default)
instructions can have a segment override:
- movq $42, %fs:100(%rsi) = move 42 to {segment # in FS:offset 100 + RSI}

Figure from Intel manuals, Vol 3A

x86 segment descriptor

Figure from Intel manuals, Volume 3A

64-bit segmentation

in 64-bit mode:
limits are ignored
base addresses are ignored
… except for %fs, %gs
- when explicit segment override is used
effectively: extra pointer register

thread-local storage

Linux, Windows use %fs, %gs for thread-local storage
variables that have different values in each thread
e.g. for program using multiple cores
to track different values for each core

TLS example (read) (C)

#include <threads.h>
thread_local int thread_local_value = 0;
int get_thread_local() {
    return thread_local_value;
}

0000000000001149 <get_thread_local>:
    1149:       f3 0f 1e fa             
                    endbr64 
    114d:       64 8b 04 25 fc ff ff ff
                    mov    %fs:0xfffffffffffffffc,%eax
    1155:       c3  
                    retq

TLS off    0x0000002df0 vaddr 0x0000003df0 paddr 0x0000003df0 align 2**2
    filesz 0x0000000000 memsz 0x0000000004 flags r--

TLS example (write) (C)

#include <threads.h>
thread_local int thread_local_value = 0;
void set_thread_local(int new_value) {
    thread_local_value = new_value;
}

0000000000001156 <set_thread_local>:
    1156:       f3 0f 1e fa 
                    endbr64 
    115a:       64 89 3c 25 fc ff ff ff
                    mov    %edi,%fs:0xfffffffffffffffc
    1162:       c3
                    retq

Linux x86-64 calling convention (1)

title page of System V Application Binary Interface, AMD64 Architecture Processor Supplement, Draft Version 0.99.7, edited by Michael Matz, Jan Hubička, dated November 17, 2014

Linux x86-64 calling convention (2)

Beginning of section 3.2 of the document from the previous page, titled 'Function Calling Sequence'. It starts 'This section describes the standard function calling sequence, including stack frame layout, register usage, parameter passing and so on.'

Linux x86-64 calling summary

first 6 arguments: %rdi, %rsi, %rdx, %rcx, %r8, %r9
- floating point arguments: %xmm0, %xmm1, etc.
additional arguments: push on stack
return address: push on stack
- call, ret instructions assume this
return value: %rax

calling convention example

int foo(int a, int b, int c, int d, int e, int f, int g, int h);
...
foo(1, 2, 3, 4, 5, 6, 7, 8);

pushq   $8
pushq   $7
movl    $6, %r9d
movl    $5, %r8d
movl    $4, %ecx
movl    $3, %edx
movl    $2, %esi
movl    $1, %edi
call    foo
/* return value in %eax */

the call stack

foo(a,b,c,d,e,f,g,h);

Diagram of stack contents, with addresses decreasing from top to bottom. In order from the top: ..., stack allocations in caller, saved registers (if any), h, g, return address, first stack allocation in foo, ...

backup slides

floating point calling convention

use %xmm registers in order

note: variadic functions

variable number of arguments
- printf, scanf, …
- see man stdarg

same as usual
… but %al contains (an upper bound on) the number of %xmm registers used

vector instructions

%xmm (and related) registers support vector instructions

numbers: .float 1, 2, 3, 4
ones:    .float 1, 3, 5, 7
result:  .float 0, 0, 0, 0
...
movps numbers, %xmm0
movps ones, %xmm1
addps %xmm1, %xmm0
movps %xmm0, result
/* result contains: 1+1=2,2+3=5,3+5=8,4+7=11 */


x87 floating point stack

x87: 8 floating point registers
- %st(0) through %st(7)
arranged as a stack of registers
example: fld 0(%rbx)

          before    after
st(0)     5.0       (value from memory at %rbx)
st(1)     6.0       5.0
st(2)     7.0       6.0
…         …         …
st(6)     10.0      9.0
st(7)     11.0      10.0

x87

not going to talk about x87 more in this course
essentially obsolete with 64-bit x86

exploring assembly

compiling little C programs and looking at the assembly is nice:
gcc -S
- extra stuff like .cfi directives (call frame information, used e.g. for try/catch unwinding)

or disassemble:
gcc -c file.c (or make an executable)
objdump -dr file.o (or on an executable)
- d: disassemble
- r: show (non-dynamic) relocations

assembly without optimizations

compilers do really silly things without optimizations:

int sum(int x, int y) { return x + y; }

sum:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %edx
    movl    -8(%rbp), %eax
    addl    %edx, %eax
    popq    %rbp
    ret

instead of gcc -O version:

sum:
    leal (%rdi,%rsi), %eax
    ret

caller-saved registers

functions may freely trash these
return value register %rax
argument registers:
- %rdi, %rsi, %rdx, %rcx, %r8, %r9
%r10, %r11
MMX/SSE/AVX registers: %xmm0-15, etc.
floating point stack: %st(0)-%st(7)
condition codes (used by jne, etc.)

callee-saved registers

functions must preserve these
%rsp (stack pointer), %rbp (frame pointer, maybe)
%rbx, %r12-%r15

caller/callee-saved

foo:
    pushq %r12 // r12 is callee-saved
    ... use r12 ...
    popq %r12
    ret

...
other_function:
    pushq %r11 // r11 is caller-saved
    ...
    callq foo
    popq %r11

addressing modes (1)

AT&T %reg
Intel REG
AT&T $constant
Intel constant
AT&T displacement(%base, %index, scale)
Intel [base+index*scale+displacement]
- displacement (absolute)
- displacement(%base)
- displacement(,%index, scale)

addressing modes (2)

AT&T jmp *%rax
Intel jmp RAX
- jmp to address specified by RAX
AT&T jmp *(%rax)
Intel jmp [RAX]
- read value from memory at RAX
- PC becomes location in that value
AT&T jmp *(%rax,%rbx,8)
Intel jmp [RAX+RBX*8]

where is the jump?

0xA0000: lea 0x1234(%rip), %rax  # 0xA123B
    (Intel syntax: LEA RAX, [RIP + 0x1234])
0xA0007: add %rbx, %rax # (Intel syntax: ADD RAX, RBX)
0xA000A: jmp *(%rax) # (Intel syntax: JMP [RAX])
...
0xA123B: 0xB0000 (64-bit value)
0xA1243: 0xC0000
...
0xB0000: 0xD0000
0xB0008: 0xE0000
0xB0010: 0xF0000
...
0xC0000: 0x90000

If %rbx initially contains 0x8, then the instruction executed after the jump is at address ________.
