Assignment: RE

(Edit 30 Jan 2017: clarify what is required for question 9.)

Purpose

The purpose of this assignment is to help students understand the process of reverse engineering and to refresh their knowledge of x86 assembly.

Materials to Review

You should understand the basic operation of x86 assembly language and the call stack. You may also find materials from 2150, this guide for 64-bit x86 assembly from CMU and this guide for 32-bit x86 assembly helpful.

A simple C program:

Consider the following two-file C program:

foo1.c:

#include <stdio.h>
#define BUFSIZE 12
int foo(char vector[], int len, int value);
int main() {
   int i, sum, x;
   char  buffer[BUFSIZE];
   x = foo(buffer, BUFSIZE, 5);
   sum = 0;
   for (i = 0; i < BUFSIZE; i++)
      sum += buffer[i];
   printf ("Sum is %d\n", sum);
   return 0;
}

foo2.c:

int foo(char vector[], int len, int value) {
   int i;
   for (i = 0; i < len; i++)
      vector[i] = value;
   return len;
}

The result of compiling this with gcc -O2 foo1.c foo2.c, then disassembling the output with objdump -sRrd, is included in foo-disasm.txt. See below for some notes about interpreting this output format.

Task

Look at the dump of the information in the executable and answer the following questions. When asked for an address, provide the form used in the assembly to access the value, e.g., 0x18(%rsp) or (%rax). If a value is stored directly in a register, indicate that register instead. If a value is located both in registers and in memory, provide both locations. If a value was eliminated by optimizations, say so.

  1. What is the address or register of the local variable i in main()?
  2. What is the address or register of the local variable sum in main()?
  3. What is the address or register of the local variable x in main()?
  4. What is the address or register of the local variable buffer in main()?
  5. What is the address or register of the parameter len in foo()?
  6. What is the address or register of the parameter value in foo()?
  7. What is the address or register of the local variable i in foo()?
  8. What needs to happen for the jne at address 0x40052f to jump to 0x400536?
  9. __libc_start_main is passed several (constant) addresses. What do these appear to represent? (It is sufficient to identify what things they are addresses of.)

Put your answers in a text file and submit it on Collab.

Notes on interpreting the objdump file

General format

The objdump output we provide contains several parts corresponding to several parts of the executable, which are described in more detail below:

  1. Information about the type of executable file. This indicates what architecture it is for, that it is in Linux’s ELF format, and the address at which execution of the program starts.
  2. The actual contents of each “section” of the executable that will be loaded into memory. The sections have names like “.text” and “.dynstr” depending on their purpose.

    These will look something like:

    Contents of section .text:
     4004d0 4883ec28 ba050000 00be0c00 00004889  H..(..........H.
     4004e0 e764488b 04252800 00004889 44241831  .dH..%(...H.D$.1
     4004f0 c0e83a01 0000488d 74240c48 89e031d2  ..:...H.t$.H..1.
    

    The leftmost column indicates the address (in hexadecimal) where this data will be loaded in memory. The next four columns are the hexadecimal values actually placed in memory. These values are written in the order the bytes appear in memory, so the 32-bit value 0x12345678 stored in little-endian order will appear as 78563412. The final column shows the same bytes represented as characters, except that a period (.) is used for bytes which do not correspond to a printable ASCII character.

  3. Disassembled versions of sections that contain executable code.

    These will look something like:

    0000000000400460 <_init>:
      400460:       48 83 ec 08             sub    $0x8,%rsp
      400464:       48 8b 05 8d 0b 20 00    mov    0x200b8d(%rip),%rax        # 600ff8 <_DYNAMIC+0x1d0>
      40046b:       48 85 c0                test   %rax,%rax
      40046e:       74 05                   je     400475 <_init+0x15>
      400470:       e8 3b 00 00 00          callq  4004b0 <__gmon_start__@plt>
      400475:       48 83 c4 08             add    $0x8,%rsp
      400479:       c3                      retq
    

    This indicates that there is a label called _init which has the address 0x400460 when the executable is loaded. Each following line is an instruction. The value before the colon is the memory address, in hexadecimal, of the first byte of the instruction. The hexadecimal values after the colon are the bytes that encode the instruction. Following this is the disassembled instruction itself.

    Within the disassembled instructions, objdump attempts to provide information about addresses in addition to showing the addresses encoded in the instruction. In cases where the address is exactly the address of a label, like for the label __gmon_start__@plt in the example above, the format is address <LABEL> with the address in hexadecimal. In cases where the address does not correspond exactly to a label, the format is address <LABEL+offset>. For example, 400475 <_init+0x15> indicates the address 0x400475, which is 0x15 bytes after the label _init.

    On 64-bit x86, some instructions specify an address relative to %rip. %rip represents the “instruction pointer”, which in 2150 and 3330 we have called the “program counter”. While an instruction executes, %rip holds the address of the following instruction, so 0x200b8d(%rip) means memory 0x200b8d bytes after the end of the current instruction (here, 0x40046b + 0x200b8d). objdump’s disassembly includes a comment indicating what address is computed. In the case of the example above, the address is 0x600ff8, which is 0x1d0 bytes after the label _DYNAMIC.

Note that this is not all the information in the executable and not all the information that objdump is capable of providing.

On dynamic linking

This executable is dynamically linked, so it doesn’t include code for functions in the C standard library like printf. These are loaded at runtime by the dynamic linker, which is contained in /lib64/ld-linux-x86-64.so.2. Linux treats this program as an “interpreter” for dynamically linked executables: it is responsible for loading every such executable.

As part of Linux’s implementation of dynamic linking, there is a Procedure Linkage Table (PLT). This contains “stubs” for each function the executable expects to find in a dynamically linked library, like the C standard library. One of the “stubs” looks like:

00000000004004c0 <__printf_chk@plt>:
  4004c0:       ff 25 6a 0b 20 00       jmpq   *0x200b6a(%rip)        # 601030 <_GLOBAL_OFFSET_TABLE_+0x30>
  4004c6:       68 03 00 00 00          pushq  $0x3
  4004cb:       e9 b0 ff ff ff          jmpq   400480 <_init+0x20>

This stub is called __printf_chk@plt and is loaded into the program’s memory at address 0x4004c0. The first instruction in this function reads the address of a function from memory at 0x601030, then jumps to that function. As indicated by the comment added by objdump, this address is part of the “global offset table”. This is an array of pointers used to find functions like printf, which are loaded every time the executable runs. Using this table allows the same program to work with different implementations of printf, where printf may end up at different locations in memory. For example, in this case the global offset table will eventually contain the address at which __printf_chk, part of the Linux C library’s implementation of printf, is loaded into memory.

By default, the values in this global offset table are initialized to point to the instruction following the jump; for example, 0x601030 initially contains 0x4004c6. This means that the first time the “stub” is called, it will “fall through” to the code after the global offset table jump. This code pushes an indicator of what function was called on the stack, then jumps to part of the dynamic linker. (The dynamic linker’s code is not included in the executable file, and is therefore not present in the objdump output.) The dynamic linker will then locate the actual routine (the implementation of __printf_chk in the standard library, in this case) and update the global offset table to contain its address.

On _start

Execution of the program does not actually start in main. It starts in a function called _start that is provided by the compiler toolchain; its address is the start address specified in the executable’s header. This function calls a special function in the C standard library called __libc_start_main. It is this function that actually calls main and takes care of exiting when main returns.

On %fs

x86 has a feature called “segmentation”. As part of this feature, the processor has several “segment registers” which specify a region of memory — essentially the segment register acts as a pointer. %fs:0x28 specifies to use segment register %fs and access a value 0x28 bytes from the beginning of the memory region it identifies.

On Linux, the %fs segment register is used for “thread-local storage” — to point to a block of data particular to a thread, even in a multithreaded process.

On Windows, the %gs segment register is used for something similar.

The use of a segment register for this purpose instead of a normal register is just to make sure as many registers are available to the program as possible.

Segmentation was originally intended to provide functionality similar to virtual memory. These days, it is rarely used for this purpose, and its primary use is to support thread-local storage, as occurs briefly in the assembly in this assignment. It is still, however, universally present on x86 and is entangled with x86’s implementation of kernel mode and exceptions.