
An Interface to Assembly Languages

This document describes a generic, machine-independent interface to assembly languages. The representations of instructions, labels, and relocatable addresses are left abstract. We expect that a particular compiler will select fixed implementations for labels and relocatable addresses, and that the representation of instructions will change with every machine.

In fact, we hope this interface will have at least three implementations for every target machine: emit assembly language, emit object code, and help emit RTLs for vpo. [The interface may also be used within vpo, which will then emit assembly language or object code.] Therefore, this interface does not attempt to be the best possible interface for assembly; rather, it defines a plausible interface that is consistent with existing assemblers.

The design of the interface is similar to that of the New Jersey Machine-Code Toolkit's library. We imagine that implementations supporting binary emission would in fact use the Toolkit. Nonetheless, this interface attempts to be independent of the Toolkit.

The assembly interface is exported as a struct assembler, making it easy to supply multiple implementations in one compiler. Every assembler is automatically provided with a symbol table. Other components may be added to implementations by type extension.

<asm.h>=
<assembly interface types>
typedef struct asm_symtab  *AsmSymtab;

typedef struct assembler {
  AsmSymtab symtab;             /* symbol table */
  <assembly interface procedures>
} *Assembler;

<common assembly prototypes>
Defines AsmSymtab, Assembler.
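
As an illustration of type extension, here is a minimal sketch of how an implementation might add its own state to the assembler. Everything except struct assembler, symtab, progbeg, and comment is hypothetical: struct text_assembler, its fields, text_assembler_init, and the "#" comment leader are assumptions made for the example, and the struct assembler shown is a cut-down stand-in for the real one. Because the interface procedures take no assembler argument, the sketch assumes the implementation remembers its extension when progbeg is called.

```c
#include <assert.h>
#include <stdio.h>

/* Cut-down stand-ins for the interface types; the real struct assembler
   carries every procedure listed in this document. */
typedef struct asm_symtab *AsmSymtab;
typedef struct assembler {
  AsmSymtab symtab;
  void (*progbeg)(struct assembler *, int argc, const char *argv[]);
  void (*comment)(const char *);
} *Assembler;

/* Hypothetical text-emitting implementation with extra state.  The base
   struct is the first member, so a pointer to the extension and a pointer
   to the base are interconvertible. */
struct text_assembler {
  struct assembler base;
  FILE *out;        /* destination for assembly text */
  unsigned lines;   /* lines written so far */
};

/* The interface procedures take no assembler argument, so this sketch
   remembers the active extension when progbeg is called. */
static struct text_assembler *active;

static void text_progbeg(struct assembler *a, int argc, const char *argv[]) {
  (void)argc; (void)argv;
  active = (struct text_assembler *)a;   /* recover the extension */
}

static void text_comment(const char *s) {
  fprintf(active->out, "# %s\n", s);     /* "#" comment leader is assumed */
  active->lines++;
}

/* Fill in the base record and the extension fields. */
Assembler text_assembler_init(struct text_assembler *t, FILE *out) {
  t->base.symtab = NULL;
  t->base.progbeg = text_progbeg;
  t->base.comment = text_comment;
  t->out = out;
  t->lines = 0;
  return &t->base;
}
```

A client sees only the Assembler and calls through its members, for example a->comment("spill code"); a second implementation (say, a binary emitter) could be substituted by initializing a different extension.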

The three tables below summarize the types and functions exported by this interface.

What the assembler does

Like most assemblers, this one operates with a collection of named "sections," one of which is the "current section." A section identifies a contiguous block of memory in the running process image. Typically, the location at which the section is mapped is not known until link time.

Much of the assembler's job is determining the contents of the sections. Each section has a "location counter," which identifies a current location. The location counter is an integer, and the current location is the location at that offset from the beginning of the current section. Many procedures in this interface deposit data (or instructions) at the current location and advance the location counter.

The assembler also uses names to refer to constants or to locations within sections. To support separate compilation, names may include references to locations defined in other compilations; such references are resolved at link time.




Type            Abbreviation  Meaning
AsmLabel        label         A label.
AsmRelAddr      relAddr       A relocatable address.
AsmScope        scope         A scope for names (imported, exported, local, or common).
AsmSymbol       sym           An assembly-language symbol.
AsmInstruction  instr         A machine instruction.
AsmSymtab       symtab        A symbol table.
Assembler       assembler     An assembler.

Types used in this interface

Relocatable addresses
  Asm_newaddr  (label l, int offset)                          Create address L+k.
  Asm_shiftaddr(relAddr a, int offset)                        Create address a+k.
Symbol-table support
  Asm_symtab     (void)                                       Create new symbol table.
  Asm_sym_insert (Symtab, const char *, scope)                Add a symbol.
  Asm_sym_lookup (Symtab, const char *, scope default_scope)  Look up a symbol; add if not present.
  Asm_symreloc   (Assembler, const char *)                    Find relocatable address corresponding to name.

Other functions defined in this interface



Symbols and names
  import(const char *)                                                Return symbol for imported name.
  export(const char *)                                                Return symbol for exported name.
  local (const char *)                                                Return symbol for local (private) name.
  common(const char *name, int size, int align, const char *section)  Return common symbol.
  lookup(const char *)                                                Look up symbol by name.
  offset(const char *, sym, int)                                      Create a symbol relative to another symbol (deprecated).
  define_symbol_here (sym)                                            Bind the symbol to the current location (e.g., define label).
  define_symbol_const(sym, int)                                       Bind a symbol to a constant.
  function(sym)                                                       Start a function definition.
Sections and the location counter
  section(const char *)                                               Change sections.
  current_section(void)                                               Return the name of the current section.
  org(unsigned)                                                       Set the location counter to the argument.
  align(unsigned n)                                                   Round the location counter to an n-byte boundary.
  addlc(unsigned n)                                                   Add n to the location counter.
Emitting values and instructions
  emit_zeroes(unsigned n)                                             Write n zero bytes.
  emit_instruction(instr i)                                           Emit an instruction.
  emit(long value, int width)                                         Emit value (width bytes wide).
  emita(AsmRelAddr)                                                   Emit a relocatable address.
  emitf32(int sign, int exp, unsigned long mantissa)                  Emit a 32-bit IEEE float.
  emitf64(int sign, int exp, unsigned long mhi, unsigned long mlo)    Emit a 64-bit IEEE float.
  emitf32s(const char *)                                              Emit a 32-bit IEEE float (from string).
  emitf64s(const char *)                                              Emit a 64-bit IEEE float (from string).
Miscellaneous
  progbeg(struct assembler *, int argc, const char **argv)            Initialize the assembler.
  progend(void)                                                       Finalize the assembler.
  comment(const char *)                                               Insert a comment.
  asmtext(const char *)                                               Insert arbitrary text into the assembly language (deprecated).

Functions accessible indirectly through the assembler structure


Labels and relocatable addresses

Labels and relocatable addresses both resolve to integers at link time. We distinguish them because a label can be bound, either to a location or to a value, whereas a relocatable address cannot. Relocatable addresses can nevertheless appear as operands to many machine instructions, so it is appropriate to use them in RTLs.

<assembly interface types>=
typedef struct asm_label   *AsmLabel;
typedef struct relAddr_st  *AsmRelAddr;
Defines AsmLabel, AsmRelAddr.

In most assemblers, a relocatable address can take a more general form.

In this assembler, a relocatable address is equivalent to a label plus a constant, which we normally write L+k. [If you think you need the more general version, perhaps to generate position-independent jump tables, please write to zephyr-investigators@virginia.edu.] Addresses can be created relative to some label or relative to an existing address.

<common assembly prototypes>=
AsmRelAddr Asm_newaddr  (AsmLabel   l, int offset);
AsmRelAddr Asm_shiftaddr(AsmRelAddr a, int offset);
Defines Asm_newaddr, Asm_shiftaddr.
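
To make the L+k model concrete, here is a minimal sketch of one way these two procedures could be implemented. The interface leaves struct asm_label and struct relAddr_st abstract, so the field layouts below (name, label, offset) are assumptions chosen for the example, as is the use of malloc for allocation.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct asm_label  *AsmLabel;
typedef struct relAddr_st *AsmRelAddr;

/* Assumed layouts; the interface keeps both structs abstract. */
struct asm_label  { const char *name; };
struct relAddr_st { AsmLabel label; int offset; };   /* represents L+k */

/* Create the address l+offset. */
AsmRelAddr Asm_newaddr(AsmLabel l, int offset) {
  AsmRelAddr a = malloc(sizeof *a);
  assert(a != NULL);
  a->label = l;
  a->offset = offset;
  return a;
}

/* Create a new address a+offset; the argument a is left unchanged. */
AsmRelAddr Asm_shiftaddr(AsmRelAddr a, int offset) {
  return Asm_newaddr(a->label, a->offset + offset);
}
```

Under this representation, shifting L+8 by -3 yields a fresh address L+5 that shares the label of the original.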

Note that labels are not created directly; instead they are part of assembly-language symbols, as detailed below.

Names and symbols

The assembler works with a single name space. Names can be imported, exported, common (FORTRAN-style), or local to the compilation unit. Every name is associated with a relocatable address. Except in the special case of offset, the k part of the relocatable address is zero, so the label part is directly associated with the name. Imported labels are unbound. Exported and local labels are bound either to locations in relocatable blocks or to integers.

<assembly interface types>+=
typedef enum asm_scope {
  ASM_IMPORTED=1, ASM_EXPORTED, ASM_LOCAL, ASM_COMMON
} AsmScope;
typedef struct asm_symbol {
  AsmScope scope;
  const char *name;    /* name by which known to the assembler */
  AsmRelAddr relAddr;  /* usually with offset k == 0 */
  union {
    struct { int size; short align; } common;
  } u;
} *AsmSymbol;
Defines AsmScope, AsmSymbol.

Note that additional information is associated with common symbols.

Symbols must not be created directly by the user, but only through the procedures provided in this interface.

<assembly interface procedures>=
AsmSymbol (*import)(const char *);
AsmSymbol (*export)(const char *);
AsmSymbol (*local) (const char *);
AsmSymbol (*common)(const char *name, int size, int align, const char *section);
Defines common, export, import, local.

It is an unchecked runtime error to register the same name in different scopes. Multiple calls to import with the same name are OK. It is unspecified whether implementations can handle multiple calls of export or local with the same name.

A common symbol may be declared in multiple compilation units, with multiple sizes and alignments. The linker reserves an area with the largest size and the strictest alignment, and the symbol is bound to the address of that area. The area is reserved in the section specified in the common directive; it is an unchecked (link-time) error to declare a common symbol in different sections. Some assemblers or linkers may restrict the sections in which common symbols may be declared, and some linkers may require that the same size and alignment be used in all declarations of a common symbol. Consult the Processor Supplement for information about restrictions.
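
The merging rule for common symbols can be sketched as follows. This only models what a linker is described as doing; it is not part of the interface. The names struct common_decl and merge_common are hypothetical, and the sketch assumes alignments are powers of two, so that the numerically larger alignment is the stricter one.

```c
#include <assert.h>

/* Mirrors the size and align fields kept for common symbols. */
struct common_decl { int size; short align; };

/* Combine two declarations of the same common symbol: take the largest
   size and the strictest alignment (assumed to be the larger power of two). */
struct common_decl merge_common(struct common_decl a, struct common_decl b) {
  struct common_decl r;
  r.size  = a.size  > b.size  ? a.size  : b.size;
  r.align = a.align > b.align ? a.align : b.align;
  return r;
}
```

For example, a 12-byte declaration aligned to 4 bytes merged with an 8-byte declaration aligned to 8 bytes reserves 12 bytes aligned to 8.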

Symbols that have been registered can be looked up. It is a checked runtime error to look up an unregistered symbol.

<assembly interface procedures>+=
AsmSymbol (*lookup)(const char *);
Defines lookup.

lcc uses an unusual convention for relocatable addresses of the form L+k; it represents them as symbols. So as to touch the lcc back ends as little as possible, we make it possible to create a new symbol that represents an offset from an existing symbol. Such symbols have no labels associated with them. To create p=L+k, we call offset(p, L, k).

<assembly interface procedures>+=
AsmSymbol (*offset)(const char *, AsmSymbol, int);
Defines offset.

offset is deprecated and may be removed from a future version of this interface.

Local symbols can be defined to point at the current location in the current relocatable block, or to be constants.

<assembly interface procedures>+=
void (*define_symbol_here )(AsmSymbol);
void (*define_symbol_const)(AsmSymbol, int);
Defines define_symbol_const, define_symbol_here.

It is an unchecked runtime error to define the same symbol twice.
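
The two kinds of binding can be pictured with the following sketch, which works on labels directly rather than on symbols to stay short. The label representation, the enum binding, the globals cur_section and cur_lc, and the names bind_here and bind_const are all hypothetical. The sketch also chooses to check for double definition with assert; the interface permits this but does not require it.

```c
#include <assert.h>

enum binding { UNBOUND, BOUND_LOCATION, BOUND_CONSTANT };

/* Assumed label representation; the interface keeps struct asm_label abstract. */
struct asm_label {
  enum binding kind;
  const char *section;  /* meaningful when kind == BOUND_LOCATION */
  unsigned    offset;   /* location counter at the point of definition */
  int         value;    /* meaningful when kind == BOUND_CONSTANT */
};

/* Stand-ins for the assembler's current section and location counter. */
static const char *cur_section = "text";
static unsigned    cur_lc      = 0;

/* What define_symbol_here might do to the symbol's label. */
void bind_here(struct asm_label *l) {
  assert(l->kind == UNBOUND);        /* optional check for double definition */
  l->kind = BOUND_LOCATION;
  l->section = cur_section;
  l->offset = cur_lc;
}

/* What define_symbol_const might do to the symbol's label. */
void bind_const(struct asm_label *l, int value) {
  assert(l->kind == UNBOUND);
  l->kind = BOUND_CONSTANT;
  l->value = value;
}
```

An imported symbol's label would simply stay UNBOUND in this model until the linker resolves it.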

Procedures

It may be useful, especially when generating MIPS assembly code, to announce the beginnings of procedures. (Why not also the ends? Why not delete this function from the interface? What about textual assembly languages that require register-save masks and similar goo? Perhaps a better strategy is to require machine-specific extensions to the interface?)

<assembly interface procedures>+=
void (*function)(AsmSymbol);
Defines function.

Sections

The semantic model of a section is that of a relocatable block as defined by the New Jersey Machine-Code Toolkit, which is roughly a sequence of bytes plus the location counter. This interface is substantially simpler than the Toolkit, however, because it does not provide for examining the contents of a section, but only for writing them.

Sections are referred to by name. section switches to a given section, and current_section returns the name of the current section. The exact set of valid section names is determined by the target machine and OS; it is documented in the Processor Supplement. Most targets are likely to support "text" for code and "data" for initialized data.

<assembly interface procedures>+=
void (*section)(const char *);
const char *(*current_section)(void);
Defines current_section, section.

The location counter, lc, is a nonnegative offset into a section, measured in bytes. The location counter is considered part of the section, so section saves the current location counter and restores the proper one for the new section.

The location counter of the current section can be manipulated in various ways:

<assembly interface procedures>+=
void (*org)(unsigned);           /* set lc to argument */
void (*align)(unsigned n);       /* round lc up to an n-byte boundary */
void (*addlc)(unsigned n);       /* add n to lc */
Defines addlc, align, org.
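
The following sketch models per-section location counters. It is a toy, not an implementation of the interface: the procedures are written as ordinary functions rather than members of struct assembler, struct section and the fixed-size table are hypothetical, section names are assumed to be string literals (they are not copied), and no section contents are stored.

```c
#include <assert.h>
#include <string.h>

#define MAX_SECTIONS 8

struct section { const char *name; unsigned lc; };

static struct section sections[MAX_SECTIONS];
static int nsections = 0;
static struct section *current = NULL;

/* Switch sections, creating the section on first use.  Each section keeps
   its own lc, so switching away and back resumes where we left off. */
void section(const char *name) {
  int i;
  for (i = 0; i < nsections; i++)
    if (strcmp(sections[i].name, name) == 0) {
      current = &sections[i];
      return;
    }
  assert(nsections < MAX_SECTIONS);
  sections[nsections].name = name;
  sections[nsections].lc = 0;
  current = &sections[nsections++];
}

const char *current_section(void) { return current->name; }

void org(unsigned n)   { current->lc = n; }    /* set lc to n */
void addlc(unsigned n) { current->lc += n; }   /* add n to lc */

/* Round lc up to the next multiple of n; lc is unchanged if already aligned. */
void align(unsigned n) {
  assert(n > 0);
  current->lc = (current->lc + n - 1) / n * n;
}
```

With this model, advancing the "text" section to offset 6 and then aligning to 4 leaves its counter at 8, and that value is still in place after a detour through "data".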

If advancing the location counter results in unwritten areas in a section, the contents of those areas are undefined. It's also possible to advance the location counter by filling in with zeroes:

<assembly interface procedures>+=
void (*emit_zeroes)(unsigned n); /* write n zero bytes */
Defines emit_zeroes.

Arguably it ought to be possible to get the current value of the location counter, but most assemblers don't let you store and reuse such a value, which makes a C interface problematic. Luckily, the ability to create and define labels makes such an interface unnecessary.

Instructions

emit_instruction emits an instruction at the current location. The definition of struct asm_instruction is machine-dependent and not part of this interface. (In fact, an application like a compiler might use this interface with more than one representation of instructions, in which case casting would be required.) A (machine-dependent) definition of struct asm_instruction might be generated automatically from a SLED description of the instruction set. The implementation of emit_instruction guarantees that it does not retain or use *i after the call emit_instruction(i) returns, so it is permissible, and recommended, to pass the address of a local variable.

<assembly interface procedures>+=
void (*emit_instruction)(AsmInstruction i);
Defines emit_instruction.

<assembly interface types>+=
typedef struct asm_instruction *AsmInstruction;
Defines AsmInstruction.
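
The recommended calling pattern, passing the address of a local variable, might look like the sketch below. The instruction representation is invented for a toy three-address machine; the fields of struct asm_instruction, the opcode OP_ADD, the 32-bit encoding, the encoded buffer, and gen_add are all hypothetical. emit_instruction is written as an ordinary function here rather than as a member of struct assembler.

```c
#include <assert.h>

/* Hypothetical machine-dependent instruction for a toy machine. */
struct asm_instruction { int opcode; int rd, rs1, rs2; };
typedef struct asm_instruction *AsmInstruction;

enum { OP_ADD = 1 };

static unsigned encoded[16];   /* stand-in for the current section's contents */
static unsigned nencoded = 0;

/* Encodes *i immediately and keeps no pointer to it, so the caller's
   instruction may go out of scope as soon as this returns. */
void emit_instruction(AsmInstruction i) {
  assert(nencoded < 16);
  encoded[nencoded++] = (unsigned)i->opcode << 24 | (unsigned)i->rd << 16
                      | (unsigned)i->rs1 << 8 | (unsigned)i->rs2;
}

/* Client side: build the instruction in a local and pass its address. */
void gen_add(int rd, int rs1, int rs2) {
  struct asm_instruction insn;
  insn.opcode = OP_ADD;
  insn.rd = rd;
  insn.rs1 = rs1;
  insn.rs2 = rs2;
  emit_instruction(&insn);
}
```

No heap allocation is needed on the client side, which is the point of the guarantee.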

Values

Integers

We can emit single integers up to the largest width supported. The width is given explicitly in bytes. We can also emit relocatable addresses in the natural pointer size of the target machine.

<assembly interface procedures>+=
void (*emit)(long value, int width);
void (*emita)(AsmRelAddr);
Defines emit, emita.

To support cross-assembly to a target with a larger word size than the host, large constants must be emitted in pieces. This requirement should not be an undue burden, because such constants must be represented using multiple host words anyway.
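
Emitting a large constant in pieces might be sketched like this. The text does not say in what byte order emit writes a value, so the stand-in emit below assumes a little-endian target; the out buffer, the lc variable, and the helper emit64 are hypothetical. The conversion of the halves to long assumes a two's-complement host.

```c
#include <assert.h>

static unsigned char out[32];  /* stand-in for the current section */
static unsigned lc = 0;        /* stand-in for its location counter */

/* Stand-in for emit: writes the low `width` bytes of value, least
   significant byte first (assumed little-endian target). */
void emit(long value, int width) {
  unsigned long v = (unsigned long)value;
  int i;
  assert(width > 0 && lc + (unsigned)width <= sizeof out);
  for (i = 0; i < width; i++) {
    out[lc++] = (unsigned char)(v & 0xff);
    v >>= 8;
  }
}

/* Emit a 64-bit constant supplied as two 32-bit halves.  On a
   little-endian target the low half comes first in memory. */
void emit64(unsigned long hi, unsigned long lo) {
  emit((long)lo, 4);
  emit((long)hi, 4);
}
```

A big-endian target would emit the high half first instead; the client needs to know the target's byte order either way.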

Floats

Since we take cross-compilation as routine, we can't pass host floating-point values. Instead, we encode a floating-point value either as a sign, an exponent, and a mantissa, or as an ASCII string. Passing the mantissa may require two words: mlo for the least significant 32 bits, and mhi for the remaining most significant bits.

<assembly interface procedures>+=
void (*emitf32) (int sign, int exp, unsigned long mantissa);
void (*emitf64) (int sign, int exp, unsigned long mhi, unsigned long mlo);
void (*emitf32s)(const char *);
void (*emitf64s)(const char *);
Defines emitf32, emitf64, emitf32s, emitf64s.

These functions emit IEEE 754 floating-point values of 32 and 64 bits. Compilers wishing to emit infinities or NaNs must use emit to emit the binary representation.
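
As an illustration of what an implementation of emitf32 has to compute, the sketch below packs the three fields into an IEEE 754 single-precision bit pattern. The text does not fix the conventions for the arguments, so the sketch assumes that sign is 0 or 1, that exp is the unbiased exponent, and that mantissa is the 23-bit fraction without the implicit leading 1. The name pack_f32 and the macro F32_POS_INF are hypothetical.

```c
#include <assert.h>

/* Bit pattern of +infinity in IEEE 754 single precision; a compiler would
   pass such patterns to emit with a width of 4, since emitf32 cannot
   express them. */
#define F32_POS_INF 0x7F800000UL

/* Pack sign, unbiased exponent, and 23-bit fraction (assumed conventions)
   into a single-precision bit pattern.  Handles normal numbers only. */
unsigned long pack_f32(int sign, int exp, unsigned long mantissa) {
  assert(sign == 0 || sign == 1);
  assert(exp >= -126 && exp <= 127);
  assert(mantissa < (1UL << 23));
  return ((unsigned long)sign << 31)
       | ((unsigned long)(exp + 127) << 23)   /* exponent bias is 127 */
       | mantissa;
}
```

Under these assumptions 1.0 is pack_f32(0, 0, 0) and 0.15625, which is 1.25 times 2 to the power -3, is pack_f32(0, -3, 0x200000UL).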

Initialization and finalization

These two functions are called at the beginning and end of assembly, respectively. argc and argv may be used to pass information that is machine-dependent or implementation-dependent. [I know of no such use at present, but these kinds of escapes have proven useful in the past.]

<assembly interface procedures>+=
void (*progbeg)(struct assembler *, int argc, const char *argv[]);
void (*progend)(void);
Defines progbeg, progend.

Comments

This function may be used to attempt to insert a comment into the output. Only implementations that emit ASCII assembly language are likely to succeed in the attempt; all implementations are free to ignore calls to comment. It is an unchecked runtime error for a comment to contain a newline, line feed, form feed, etc.

<assembly interface procedures>+=
void (*comment)(const char *);
Defines comment.

Escape hatch

This escape hatch can be used to emit textual assembly language directly. Implementations not based on textual assembly language (e.g., binary emitters) may ignore this information entirely, or they may try to glean something from the strings. Use of this directive is deprecated.

<assembly interface procedures>+=
void (*asmtext)(const char *);
Defines asmtext.

Symbol-table support

We export a simple symbol-table implementation, which should be useful in every machine-dependent assembler. Asm_symtab creates a new symbol table. Asm_sym_insert inserts a symbol, complaining if it already exists with a different scope. Asm_sym_lookup seeks a symbol, inserting it with the default_scope if it doesn't exist. There's no such thing as an undefined symbol until you reach the linking stage!

<common assembly prototypes>+=
extern AsmSymtab Asm_symtab (void);
extern AsmSymbol Asm_sym_insert (AsmSymtab, const char *, AsmScope);
extern AsmSymbol Asm_sym_lookup (AsmSymtab, const char *, AsmScope default_scope);
Defines Asm_sym_insert, Asm_sym_lookup, Asm_symtab.
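
The insert and lookup semantics described above can be sketched with a linked list. This is not the exported implementation: struct asm_symtab is abstract in the interface, so its layout here, the next link added to the symbol, and the helper symtab_find are assumptions, and the symbol is reduced to the two fields the sketch needs. "Complaining" is modeled with assert, and names are not copied.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum asm_scope {
  ASM_IMPORTED=1, ASM_EXPORTED, ASM_LOCAL, ASM_COMMON
} AsmScope;

/* Reduced symbol: the real struct asm_symbol also carries relAddr and u. */
typedef struct asm_symbol {
  AsmScope scope;
  const char *name;
  struct asm_symbol *next;   /* hypothetical chain link, not in the interface */
} *AsmSymbol;

typedef struct asm_symtab { AsmSymbol head; } *AsmSymtab;

AsmSymtab Asm_symtab(void) {
  AsmSymtab t = malloc(sizeof *t);
  assert(t != NULL);
  t->head = NULL;
  return t;
}

/* Return the symbol called name, or NULL if there is none. */
static AsmSymbol symtab_find(AsmSymtab t, const char *name) {
  AsmSymbol s;
  for (s = t->head; s != NULL; s = s->next)
    if (strcmp(s->name, name) == 0)
      return s;
  return NULL;
}

AsmSymbol Asm_sym_insert(AsmSymtab t, const char *name, AsmScope scope) {
  AsmSymbol s = symtab_find(t, name);
  if (s != NULL) {
    assert(s->scope == scope);   /* complain about a scope mismatch */
    return s;
  }
  s = malloc(sizeof *s);
  assert(s != NULL);
  s->scope = scope;
  s->name = name;                /* not copied; caller keeps it alive */
  s->next = t->head;
  t->head = s;
  return s;
}

/* Find name, inserting it with default_scope if it is not yet present. */
AsmSymbol Asm_sym_lookup(AsmSymtab t, const char *name, AsmScope default_scope) {
  AsmSymbol s = symtab_find(t, name);
  return s != NULL ? s : Asm_sym_insert(t, name, default_scope);
}
```

Note that Asm_sym_lookup ignores default_scope when the name already exists, which is why a lookup never produces an undefined symbol. A production table would more likely hash names than search a list.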

For convenience only, we provide a generic routine for mapping names to relocatable addresses by looking up names in the assembler's symbol table.

<common assembly prototypes>+=
extern AsmRelAddr Asm_symreloc (Assembler, const char *);
Defines Asm_symreloc.
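
A plausible reading of Asm_symreloc is "look the name up in the assembler's symbol table and return the symbol's relocatable address," sketched below. The text does not say which scope is given to a name seen for the first time; the sketch assumes ASM_IMPORTED. The fixed-size table, the layout of struct relAddr_st, and the reduced struct assembler are likewise assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <string.h>

typedef enum asm_scope {
  ASM_IMPORTED=1, ASM_EXPORTED, ASM_LOCAL, ASM_COMMON
} AsmScope;

/* Assumed layout: a label name plus a constant offset k. */
typedef struct relAddr_st { const char *label; int offset; } *AsmRelAddr;

typedef struct asm_symbol {
  AsmScope scope;
  const char *name;
  AsmRelAddr relAddr;
} *AsmSymbol;

#define MAXSYMS 16
typedef struct asm_symtab {        /* toy fixed-size table */
  struct asm_symbol syms[MAXSYMS];
  struct relAddr_st addrs[MAXSYMS];
  int n;
} *AsmSymtab;

typedef struct assembler { AsmSymtab symtab; } *Assembler;   /* reduced */

/* Find name, adding it with default_scope and address name+0 if absent. */
AsmSymbol Asm_sym_lookup(AsmSymtab t, const char *name, AsmScope default_scope) {
  int i;
  for (i = 0; i < t->n; i++)
    if (strcmp(t->syms[i].name, name) == 0)
      return &t->syms[i];
  assert(t->n < MAXSYMS);
  t->addrs[t->n].label = name;
  t->addrs[t->n].offset = 0;
  t->syms[t->n].scope = default_scope;
  t->syms[t->n].name = name;
  t->syms[t->n].relAddr = &t->addrs[t->n];
  return &t->syms[t->n++];
}

/* Map a name to its relocatable address via the assembler's symbol table.
   Treating unknown names as imported is an assumption of this sketch. */
AsmRelAddr Asm_symreloc(Assembler a, const char *name) {
  return Asm_sym_lookup(a->symtab, name, ASM_IMPORTED)->relAddr;
}
```

Repeated calls with the same name return the same address, so a client can compare the results by pointer in this model.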

