<!-- this file was generated automatically by noweave; better not edit it-->

SYM: BDC Symbol Table and Identifiers

Jack Davidson Jonathan Hill

Department of Computer Science
University of Virginia
Charlottesville, VA

Table of Contents

Introduction

This document discusses basic concepts of identifier recognition, classification, and storage.

Our compiler will maintain a simple symbol table, adequate for the purposses of our toy C language. For our VPOi version of the compiler, some symbol table activities can be handled by the VPOi-ASM symbol table functionality. However, we will still utilize our symbol tables in tandem, to minimize the effective changes to our compiler code in porting our output from assembly language to RTLs.

The identifier (symbol) concept is defined in this module, and so we discuss identifiers as defined by our toy language within this document.

This module contains two basic ADT's. A string table, and an identifier table. Both are hash table implementations. The identifier table (symbol table) relies on the string table ADT. We will discuss each separately.

Identifier Concepts

Scope and Block Level

Scope

Scope is generally the nesting depth of a data definition. In our toy-c language, there are only three scopes. Therefore, we do not maintain an integer scope, instead having three distinct, non-enumerated scopes. These are local, global, and parameter scope.
<defines>= (U->) [D->]
/* scopes *** do not rearrange *** */
#define LOCAL 0
#define PARAM 1
#define GLOBAL 2
Defines GLOBAL, LOCAL, PARAM (links are to index).

Global scope is given to any variable declared outside all function definitions. Local scope is given to a variable declared at the begining of a function body definition (as in C.) Parameter scope is given to formal parameters of a function definition.

Block Level

Block level is distinct from scope in that it is an integer representing the nesting depth itself. Global scope is at block level 2. Formal and actual parameters are at block level 3, and block level 4 is the local function block level.

As in C, in the case of a local variable sharing a name with a global variable, the local definition overrides the global within the function body. Parameters and locals should not have identical names.

Block level is maintained by a global variable in sym.c.

<blocklevel>= (U->)
int level;                              /* current block level */
Defines level (links are to index).

Whenever we enter or leave a block, we call an associated function.

<prototypes>= (U->) [D->]
void enterblock();
void leaveblock();

Enterblock simply increments the block level

<functions>= (U->) [D->]
/*
 * enterblock - enter a new block
 */
void enterblock()
{
   level++;
}
Defines enterblock (links are to index).

while leaveblock is responsible for first clearing the symbol table of any entries associated with the block level we are exiting

<functions>+= (U->) [<-D->]
/*
 * leaveblock - exit a block
 */
void leaveblock()
{
   IDENT **i, *p, *tmp;

   if (level > 0) {
      for (i = id_table; i < &id_table[ITABSIZE]; i++) {
         for (p = *i; p; p = tmp)
            if (p->i_blevel < level)
               break;
            else {
               tmp = p->i_link;
               cfree (p);
               }
         *i = p;
         }
      level--;
      }
}
Defines leaveblock (links are to index).

before decreasing the block level. These functions will be met in our grammar (cgram and cgram-vpoi) as well as our semantic analysis procedures (sem and sem-vpoi).

A discussion of the symbol table data structure appears in a later section of this document.

Data type

Data type represents traditional C-like data type, as well as lesser attributes of that type. There are two essential data types and several attributes a symbol may represent.
<defines>+= (U->) [<-D]
/* internal types *** do not rearrange *** */
#define T_INT   1               /* integer */
#define T_DOUBLE (1<<1)         /* double  */
#define T_PROC (1<<2)           /* procedure */
#define T_ARRAY (1<<3)          /* array */
#define T_PTR (1<<4)            /* pointer */

Defines T_ARRAY, T_DOUBLE, T_INT, T_PROC, T_PTR (links are to index).

A double is a double-word floating point. An integer is a single word integral value. These are the two allowed standard data types. A symbol might also represent an array, a procedure name (where the data type is the return type) and a pointer. A pointer may refer to a string, or a reference parameter, etc. Our compiler assumes character data to be of integer type. Literal strings are effectively handled by the assembly interface (discussed in 'sem' documentation.)

In general we will encounter these defines in a data type known as a TWORD (type word) which is equivalent to an integer type.

<typedefs>= (U->) [D->]
typedef unsigned int TWORD;
Defines TWORD (links are to index).

The String Table ADT

This section describes the string table, which is utilized by the identifier table ADT. The string table ADT stores unique strings in a hash table.

Data Structure

The string table data structure is as follows:
<stringtableStructure>= (U->)
struct s_chain {
   char *s_ptr;
   struct s_chain *s_next;
   } *str_table[STABSIZE] = {0};                /* string hash table */

This is a simple array table of linked list. Each member of a linked list is a string that has collided in the same hash table array entry. It is declared in sym.c as a global data structure, as can be seen from the trailing global definition in the structure definition.

The hash table size is set by the following macro definition:

<stabsize>= (U->)
# define STABSIZE 119                   /* hash table size for strings */

Functions

The symbol table has two functions. The first allows us to lookup a string in the table.
<prototypes>+= (U->) [<-D->]
char *slookup(char str[]);
Defines slookup (links are to index).

If the string is not found, it is installed in the table. In either case, a pointer to the string in the table is returned.

<functions>+= (U->) [<-D->]
/*
 * slookup - lookup str in string table, install if necessary, return ptr 
 */
char *slookup(char str[])
{
   struct s_chain *p;
   int i, k;
   
   for (k = i = 0; i < 5; i++)  /* simple hash function */
      if (str[i])
         k += str[i];
      else
         break;
   
   k %= STABSIZE;
   for (p = str_table[k]; p; p = p->s_next)
      if (strcmp (str, p->s_ptr) == 0)
         return (p->s_ptr);
   p = (struct s_chain *) alloc (sizeof(struct s_chain));
   p->s_next = str_table[k];
   str_table[k] = p;
   p->s_ptr = (char *) alloc ((unsigned) strlen (str) + 1);
   p->s_ptr = strcpy (p->s_ptr, str);
   return (p->s_ptr);
}

Defines slookup (links are to index).

The function begins by using the sum of the ASCII for the string as a hash function. This is then set to an index into the string table size, by modulus, and the string is searched for on the linked list at that hash table entry. If the string is not found, a place for it is allocated at the head of the string node linked list, where it is inserted. The address of the string in the node is returned. If the string was previously in the table, the address of the string in the node in which it was found is returned.

The second function allows us to store the string table to a file.

<prototypes>+= (U->) [<-D->]
void sdump(FILE *);

The internals of sdump are trivial.

<functions>+= (U->) [<-D->]
/*
 * sdump - dump string table to f
 */
void sdump(FILE *f)
{
   struct s_chain **s, *p;

   fprintf (f, "Dumping string table\n");
   for (s = str_table; s < &str_table[STABSIZE]; s++)
      for (p = *s; p; p = p->s_next)
         fprintf(f, "%s\n", p->s_ptr);
}
Defines sdump (links are to index).

Symbol Table (Identifier Table) ADT

The symbol table maintains type, scope, block level, and width data for each separate identifier. Notice that string and block level make a distinct data entry in the symbol table. A string alone may be used by many identifier entries (signifying variables of differing block level with the same name.)

Data Structure

The symbol table data structure follows:
<identStructure>= (U->)
/* symbol table entry */
  
struct id_entry {  
   IDENT *i_link;               /* pointer to next entry on hash chain */
   char *i_name;                /* pointer to name in string table */
   TWORD i_type;                /* type code */
   int i_blevel;                /* block level */
   int i_defined;               /* non-zero if identifier is defined */
   int i_width;                 /* number of words occupied */
   int i_scope;                 /* scope */
   int i_offset;                /* offset in activation frame */
  /*...added for vpoi...*/
   AsmSymbol i_sym;
   };

The identifier node, pointed to by i_link, stores a linked list of identifiers that have collided on the particular identifier array entry. Then name is a character pointer. It points to the string as stored in the string hash table. The addition of a block level entry make the identifier unique. Type and scope information are also stored. We store the size of the data type in bytes in the i_width field.

The offset field is used to store the stack offset of a symbol that is of parameter or local scope. Stack offset is given at compile time in the semantic routines. See ``sem'' documentation for details on offset calculation.

Change for VPOi

The final component, that of a VPOi-ASM symbol table node allows us to utilize our symbol table ADT, but store VPOi-ASM required symbol structures for the VPOi version of this compiler.

Although we will not ever query VPOi's symbol table, we will need to register each of our symbols with it in order for VPOi to output the correct RTLs. For details of VPOi storage of symbols see ``sem'' documentation. For details of usage of VPOi symbol information, see ``cgen'' documentation.

Its More Common Name

The id_entry data structure is much more commonly referred to as the IDENT node structure, as the following typedef provides:
<typedefs>+= (U->) [<-D]
typedef struct id_entry IDENT;

Functions

Our symbol table is designed to handle languages with local variables declared within nested blocks recursively, although we will not encounter such recursiveness in the C-like grammar of our toy language.

The lookup function

<prototypes>+= (U->) [<-D->]
IDENT *lookup(char *, int);
Defines lookup (links are to index).

returns the pointer to the symbol table entry node that identifies a symbol with the string and block level passed as arguments to the function. If no such identifier exists, this function returns a null pointer.

<functions>+= (U->) [<-D->]
/*
 * lookup - lookup name, return ptr; use default scope if blev == 0
 */
IDENT *lookup(char *name, int blev)
{
   IDENT *p;

   for (p = id_table[hash (name) % ITABSIZE]; p; p = p->i_link)
      if (name == p->i_name && (blev == 0 || blev == p->i_blevel))
         return (p);
   return (NULL);
}
Defines lookup (links are to index).

The install function

<prototypes>+= (U->) [<-D->]
IDENT *install(char *, int);
Defines install (links are to index).

places a new identifier on symbol table. We pass the string and block level of the symbol to the symbol table. The function returns the address of the IDENT node for the symbol in the hash table. Passing a negative block level to the function is an indication to assign the global block level to the identifier.

<functions>+= (U->) [<-D->]
/*
 * install - install name with block level blev, return ptr 
 */
IDENT *install(char *name, int blev)
{
   IDENT *ip, **q;
   
   if (blev < 0)
      blev = level;
   ip = (IDENT *) alloc (sizeof(IDENT));
   ip->i_name = name;           /* set fields of symbol table */
   ip->i_blevel = blev;
   for (q = &id_table[hash (name) % ITABSIZE]; *q; q = &((*q)->i_link))
      if (blev >= (*q)->i_blevel)
         break;
   ip->i_link = *q;
   *q = ip;
   return (ip);
}

Defines install (links are to index).

The identifier node will inserted in a collision linked list in order of decreasing block level. Duplicate entries are not checked for, save that they will be stored consecutively.

A function to store a symbol table to a file also exists.

<prototypes>+= (U->) [<-D->]
void dump(int, FILE *);

The symbol table will dump all symbols with block level greater than an integer parameter to a File specified by parameter.

<functions>+= (U->) [<-D->]
/*
 *  dump - dump identifiers with block level >= blev to f
 */
void dump(int blev, FILE *f)
{
   IDENT **i, *p;

   fprintf (f, "Dumping identifier table\n");
   for (i = id_table; i < &id_table[ITABSIZE]; i++)
      for (p = *i; p; p = p->i_link)
         if (p->i_blevel >= blev)
            fprintf (f, "%s\t%d\t%d\t%d\n", p->i_name, p->i_blevel, 
             p->i_type, p->i_defined);
}
Defines dump (links are to index).

The Global Symbol Table

The size of the hash table is given by a macro definition.
<identtabsize>= (U->)
# define ITABSIZE 37                    /* hash table size for# identifiers */

We declare a global data structure of identifier node entries to be our identifier hash table

<globals>= (U->)
IDENT *id_table[ITABSIZE] = {0};        /* identifier hash table */
Defines id_table (links are to index).

Miscellanious Functions

The hash table function utilized by the identifier table,

<prototypes>+= (U->) [<-D->]
int hash(char *);

, uses the address of the string as a hash function.

<functions>+= (U->) [<-D->]
/*
 * hash - hash name, turn address into hash number
 */
int hash(char *s)
{

   return((int ) s);
}   
Defines hash (links are to index).

simply converts the strings pointer to an integer and returns it.

Our code provides an allocation function

<prototypes>+= (U->) [<-D]
char *alloc(unsigned);
Defines alloc (links are to index).

which will return a character pointer from a call to the calloc routine.

<functions>+= (U->) [<-D]
/*
 * alloc - alloc space
 */
char *alloc(unsigned n)
{
   char *p;

   if ((p = calloc (1, n)) == NULL) {
      yyerror ("csem: out of space");
      exit (1);
      }
   return (p);
}
Defines alloc (links are to index).

Includes

The symbol table Functions will require access to functions and data structures from the following files:
<includes>= (U->)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "asm.h"
#include "rtl.h"
#include "sym.h"

Note that among these, sym.h is generated from this documentation file.

Code Files in this Document

This document holds the code for sym.c and sym.h.

<sym.c>=
/* ==========================================================================
   = Symbol Table ADT
   =   for storing IDENT nodes.
   ==========================================================================
*/
<includes>
<stabsize>
<identtabsize>
<blocklevel>
<stringtableStructure>
<globals>
<functions>
<sym.h>=
/*=========================================================================
  = IDENT info 
  =  and Symbol Table ADT 
  =     for string IDENT nodes.
  =========================================================================
*/ 
#ifndef SYMH
#define SYMH
<defines>
<typedefs>
<identStructure>
<prototypes>
#endif 
Defines SYMH (links are to index).