Computer Architecture Study Guide

(Spring 1996) What is Amdahl's law, both the equation and in common terms?

Amdahl's law states that the benefit of speeding up a particular aspect of a computer is limited by the amount of time that it is utilized. In general terms,

Speedupoverall = Execution timeold
-------------
Execution timenew

If the component that is optimized is used Fractionenhanced of the time and the old execution time is sped up by Speedupenhanced, then

Execution timenew = (1 - Fractionenhanced)×Execution timeold + Fractionenhanced×Execution timeold
-------------------------
Speedupenhanced

If we substitute this into the overall speedup equation and cancel Execution timeold, then we get

1
Speedupoverall = --------------------------------
1 - Fractionenhanced + Fractionenhanced
------------
Speedupenhanced

Suppose that 1/n of a program must be executed sequentially, and the rest can be parallelized. What is the best speedup that you can achieve through parallel execution?

Given the fraction of the program that can be parallelized, an infinite amount of parallelization will reduce the computation time to one clock cycle, which will be dominated by the rest of the computation. From Amdahl's law,

1
Speedupoverall Speedup Figure ---------------------
1 - (n-1)/n + (n-1)/n
-------------
Infinity
        1
= --------- Approx. Equal toO(n)
 1 - (n-1)/n

What are three major styles of instruction set architectures?

Early computers used a stack architecture, where the operands are implicitly on the top of the stack and the result of a computation is deposited on to the stack. Accumulator architectures have the accumulator as an implicit operand, and results are stored there. The most common type of architecture today is general purpose register. (The reasons for this are that register memory acts like a fast cache, and compilers have an easier time generating and optimizing code with registers.)


What is the difference between "Little Endian" and "Big Endian"?

These terms refer to the ordering of the two bytes in a word. Little Endian means that the byte with the lower significant bits are stored lower in memory, and Big Endian means that the byte with the higher significant bits are stored lower in memory. PCs are little endian, which means you have to shift right to divide.


How fast does integrated circuit density, DRAM, and disk density grow?

DRAM grows at about the rate of 60% a year. ICs and disks grow at about the rate of 25% a year. Programs need new address bits at about the rate of 1 a year.


Compare and contrast RISC based and CISC based architectures.

Central to the RISC concept is the fact that the top ten instructions (usually loads, stores, and branches) account for two-thirds of all executed instructions. RISC instructions are easier to implement, and have a fixed length that makes instruction decoding easier and faster. RISC architectures typically run more than twice as fast as a similar CISC machine. They usually have a lot of registers, and were designed to optimize code for pipelined execution.

CISC, on the other hand, usually has 20% less instructions for a given piece of compiled code, and in general require half the memory of RISC architectures, both resulting in less cache misses and less bus traffic. They also typically have half as many registers, which makes context switches faster. However, CISC instructions have varying execution times which makes pipelining more difficult.


If an architecture has separate floating point and integer registers instead of a large set of general purpose registers, what are the issues from both the hardware and software viewpoints.?

Probably the biggest benefit is that a superscalar architecture will have no dependences between the floating point and integer registers. The instruction issue hardware may be a little more complicated, as well as the data path in general. Splitting the instruction types also has the benefit of allowing the floating point registers to be larger for higher precision. From the software side, compilers would have to take the restriction into account, which may limit some optimizations of the code.


What are the common addressing modes?


Explain the various methods of encoding an instruction set.

Machines like the VAX allowed a variable length for the instructions, partly as a result of the many different types of instructions that the architecture had. Today, though, most machines (like the SPARC) use a fixed-length encoding, because it makes pipelining easier since the fields of the instruction are always in the same bit locations. Lastly, machines such as the 80x86 line use a hybrid approach in which a limited number of encoding lengths are used. This provides some of the benefits of fixed length encoding, but also provides good code density.


What is microcode? Explain what it does, how it does it, and the benefits of using it.

Microcode is code for a small finite state machine that controls the register file, ALU, and bus by manipulating the control lines of the CPU. By specifying a set of instructions, control gates can be opened and closed in a certain order to cause the machine to execute a program. It is written at a lower level than the usual instructions, and provides the implementation for many of the complicated instructions that can not be hard-wired easily. Another benefit is that optimizations or corrections to an instruction set can be performed after the die has been cast (no pun intended) by downloading and reprogramming the microcode. Likewise, programmers that have access to the microcode can customize the execution of the instructions for a particular application. Lastly, programs written for an older machine can be reused by simulating the old machine's instruction set, albeit at a performance cost. (See Structured Computer Organization, by Andrew Tenenbaum for a good discussion.)

Where is it stored?

It is stored in "micromemory" (or "control store"), which is separate from main memory.

(Spring 1994) What is Horizontal/Vertical/Diagonal microcode?

Horizontal microcode has one bit for each control line. Vertical microcode uses an encoded opcode to represent the instruction. Horizontal microcode is that is allows more parallelism, but is less compact than vertical. Diagonal microcode is a cross between the two in which an instruction is composed of more than one opcode. Horizontal microcode also has the disadvantage that it is hard to drive such a large number of control lines at once, which sometimes results in noise in the signal.

(Spring 1994) Why isn't it in use today?

Today hardwired control is used, since simulators can allow designers to analyze a silicon implementation in software before the chip is made. Also, many modern architectures are RISC-based, where the instructions are simple enough to be implemented in hardware. (Microcode makes implementing ISAs with complex instructions easier, and unused opcodes can be turned into useful instructions in later architectures.) In general, hardware tends to be faster.


(Spring 1995) How is the data path of a CPU controlled?

This answer assumes that "data path" means the flow of data through the CPU. Some people refer to the ALU only as the "datapath"

Control Figure

A CPU has a control unit, which is a finite state machine that takes as input the instruction register, the status register, and the current major state of the cycle. Given this information, the control unit manipulates the control lines, thereby coordinating moving of data, memory access, etc.

In many CISC architectures the control unit has internal registers, a ROM, and internal control logic that form a microcode engine. The microcode in the ROM is used by the microcode engine to cause the CPU instructions to be executed, which includes control of the data path.

Most RISC architectures, however, use hardwired control in the control unit instead of microcode. Hardwired control takes the inputs and runs them through a boolean logic circuit that generates the output.

Other alternatives to implementing the control unit are to create a very simple computer that interprets the inputs as a jump address into a subroutine library, or to use a large read only memory in which the inputs are an address to the memory whose data becomes the control signals.


(Winter 2000) What is the "Von-Neumann Bottleneck"?

The Von-Neumann bottleneck refers to the contention for access to memory that can be caused by an architecture in which instruction and data are read, modified, and written back to memory by a CPU. Typically there is a single data path to memory, which is simply not sufficient for applications which require large amounts of data very quickly.

(Winter 2000) What is the "Flynn Bottleneck"?

The Flynn Bottleneck is the limitation on memory that prevents more than a few instructions to be fetched and decoded in a given clock cycle. It hinders high clock rates and multiple issue of instructions.

(Winter 2000) How do people try to work around memory bottlenecks?

The common technique is to try to bridge the gaps. For instance, between CPU and main memory, one will find one or two high-speed caches. Likewise, one will find a TLB (translation lookaside buffer) in order to speed up the translation of virtual to physical addresses.

(Winter 2000) Assuming you have a cache of infinite size, that has a hit-ratio of 100%. Is the Von Neumann bottleneck still a problem?


What is pipelining?

Pipelining is a technique in which multiple instructions are overlapped in execution. Execution is broken down into several stages, and instructions pass from one stage to the next much like an assembly line in a factory.

What are the major stages of a pipeline?

What is fixed-field decoding?

In a fixed field instruction encoding, the bit locations of the operands are known, so it is possible to do the decoding and register lookup in parallel during the ID stage of the pipeline, thereby saving some time.

What are pipeline latches?

Latches are registers located between pipeline stages, used to pass information between stages. Without them, data would be overwritten in the next stage during advancement of instructions.


What is a pipeline hazard? What kinds of hazards are there?

Pipeline hazards are conflicts between instructions in a pipeline that prevent the next instruction in the stream from executing normally. There are three major types of hazards:

  1. Structural: These result from limitations on the hardware of the system, so that resources can not be used in all possible combinations of pipelined instructions.
  2. Data: These occur when an instruction depends on the result of a previous instruction that is currently being executed further ahead in the pipeline.
  3. Control: These arise from changes to the PC (i.e. jumps, branches, or interrupts), where the next instruction to be executed is unknown.

What are some common ways of reducing structural hazards?

One method of dealing with contention for a resource is to duplicate the resource (e.g. have 2 ALU's or multiple ports to memory). Memory can be a structural hazard, and one can deal with it by splitting the cache into one for instructions and another for data, or by using instruction buffers.

What are some common ways of reducing data hazards?

One common way is to use cycle splitting, in which registers are written in the first half of a cycle and read from during the second half. This allows registers to be read in ID and written to in WB during the same cycle. Another method of reducing data hazards is to do data forwarding (also called short circuiting), in which the result of the EX stage, MEM, or WB stage is sent to another stage in the pipeline that needs it. This method requires a little extra hardware, and can not prevent all data hazards from occurring.

What are some common ways of reducing control hazards?

There are many ways to reduce control hazards; here are a few. One way is to move the zero test and branch target calculations into the ID stage, so that the result of the branch will be known sooner. Another way is to use branch delay slots, in which instructions that would be executed regardless of the outcome of the branch are placed directly after the branch. One can also use a branch prediction buffer to store a few bits of data about the last time the branch was taken, and then speculatively execute the next instruction, backing out if necessary. An optimization on a branch prediction buffer is a branch target buffer, which stores the address of the target of the branch in the prediction buffer, thereby saving a memory reference.


Explain the various types of data hazards.


What is a pipeline interlock?

A pipeline interlock detects hazards by comparing some set of pipeline registers and stalls the pipeline until the hazard has cleared. A common example is a load instruction followed by an instruction that uses the result of the load. An interlock inserts a bubble by recycling the results of the units of the stalled instruction back to their input latches, and insert a NOOP into the pipeline.


How do interrupts interact with pipelining?

Interrupts can be problematic because the pipeline may be full of half-completed instructions that need to be dealt with. The ideal objective is to not modify the state of the machine after the interrupt.

If speculative execution is used, then interrupt raised by speculative instructions should be ignored. Also, poison bits are used to tag registers whose results can not be used because an interrupt occurred while executing speculatively.

Lastly, dynamic scheduling of instructions can become very problematic, since it becomes possible that an instruction is completed before a previous instruction, which then causes an interrupt. Rolling back previously completed instructions in such a case is complex.

What are precise interrupts?

A machine has precise interrupts if the instructions just before an interrupt are completed and the ones after can be restarted from scratch. This is an important quality to have for machines with demand paging or support for IEEE arithmetic trap handlers. It can be implemented in hardware or with software support, but can be difficult in the face of long floating point instructions whose operands were overwritten by another instruction, as well as out-of-order execution.

Because of the difficulties and inefficiencies of precise interrupts on RISC machines with pipelining, some processors provide two modes -- one for precise interrupts and one for imprecise interrupts.

What are some ways of dealing with interrupts?

There are several ways:


(Winter 1998) What happens on an interrupt?

(Winter 1998) How do you return to the restart address?


What is a basic block and its importance?

A basic block is a straight-line sequence of instructions without branches. We are interested in them because this is where pipelining works best, and we would like to make them as large as possible to increase parallelism.

How can the size of a basic block inside a loop be increased?

We can do loop unrolling, where two or more iterations are expanded into the body of the loop, and the loop variable is modified to take this into account.


(Winter 1999) What is loop unrolling?

Loop unrolling is the replication of the body of a loop inside the loop, thereby lowering the number of iterations to be performed. For example:

for (i = 0; i < 100; i++)
  g ();
becomes
for (i = 0; i < 100; i += 2)
{
  g ();
  g ();
}

(Winter 1999) Why do we do loop unrolling?

To save branches, which are expensive on pipelined machines. It can also decrease the number of times we test the loop condition.

(Winter 1999) Why can it slow things down sometimes?

Unrolled code might require more cache rows, and result in cache misses fetching the instructions, where before it was all executing out of cache.


What is register renaming?

Register renaming is a way of preventing name dependences between instructions (in particular WAR and WAW hazards). A name dependence occurs if two instructions use the same register or memory location, but not data is transmitted between them. A table is set up mapping the logical registers to a set of physical registers. Once an instruction is completed, the results are saved in a temporary register, and instructions further down the pipeline that reference the result of the computation are changed to refer to the temporary register until the first instruction completes. If there are no available registers for a given instruction result, a stall occurs.


What is dynamic scheduling and its benefits?

Dynamic scheduling removes the restriction that instructions must be executed in order by allowing the hardware to rearrange instructions. The benefits are a more simplified compiler, the ability to handle dependencies that are unknown at compile time, and the ability to execute code that was written with a different pipeline in mind.

What is scoreboarding?

Scoreboarding is a technique for allowing pipelined instructions to execute out of order when there are sufficient resources and no data dependencies. The scoreboard is a centralized table that keeps track of the status of instructions, functional units, and registers. Instructions are issued when everything is ready, or stalls them if hazards exist.

What is Tomasulo's algorithm?

Tomasulo's algorithm is a method of implementing dynamic scheduling. Register renaming is used to prevent WAW and WAR hazards, and each execution unit has a number of reservation stations to temporarily hold instructions and operands. It was designed to handle architectures with a small number of floating point registers, long memory latency (this was before caches), underutilized functional units, and name dependencies.


What is dynamic branch prediction?

Dynamic branch prediction is a method of combining history information about branches with heuristic algorithms to predict a branch. This is supported in hardware with a branch prediction buffer or a branch target buffer.

What is branch prediction buffer?

A branch prediction buffer uses a small associative array that is indexed by the lower bits of the branch instruction address, and contains information about recent executions of the branch. An address is presented to the buffer, and if it is there, the associated bits are used to predict whether the branch is taken or not.

What is a branch target buffer?

In a normal branch prediction scheme, we must wait until the instruction decode stage to find the destination of the branch. If however, we were to store the destination address of the branch after it is computed, then a subsequent prediction could supply this address instead of the prediction bits at the end of the instruction fetch stage. A branch target buffer is a table of destination program counters, indexed by the current program counter.

During an instruction fetch, the current PC is compared to the entries in the buffer, and if an entry exists, then we know that the current instruction is a branch. Knowing this, we can then change the PC to the associated prediction in the buffer.


What does it mean for a processor to be superscalar?

Superscalar processors issue varying numbers of instructions each clock cycle, which in turn implies multiple pipelines. They may be statically scheduled by a compiler or dynamically scheduled using techniques based on scoreboarding or Tomasulo's algorithm. The other type of multiple-issue processor is one that uses VLIW. The major benefits of superscalar architectures over VLIW architectures are that there is little impact on code, and unscheduled programs or those compiled for older implementations can be run.

What does VLIW mean?

Very Long Instruction Word processors issue a fixed number of instructions each cycle that are encoded into one very large instruction. These instructions are scheduled and packaged together statically by the compiler. The other type of multiple-issue processor is a superscalar processor. The benefit of VLIW over superscalar architectures is the simpler hardware.


Assuming ideal conditions, if I have a pipelined machine with n stages, how fast does a pipelined instruction execute compared to the same instruction on an identical machine that isn't pipelined?

Let T be the execution time on an unpipelined machine. Under ideal conditions, the time between completion of consecutive instructions is T/n. "Ideal conditions" implies that each pipeline stage does exactly 1/n of the work of an instruction, and that there are no stalls.

Let's say we have a non-pipelined computer that previously took 2.3 microseconds to execute an instruction, and now it has a pipeline with three stage that take .5, .8, and .9 microseconds each. What is the speedup?

It's the old time divided by the new time. Assuming that all three stages can execute in parallel without conflicts, the speedup would be 2.3/.9.

Suppose I had a 5 stage pipeline, where each stage takes 20ns. If I add 2 more stages to the pipeline that take 20ns each, how does that affect my execution time?

Since each stage takes 20ns, if there are no stalls, the best that can happen is that an instruction exits the pipeline every 20ns, and thus we have no gain over our initial pipeline. In fact, we may make the execution time for a given instruction longer, since we pay a time penalty for the latches between 2 pipe stages.

The implication of this is that adding pipe stages doesn't buy you anything unless you can make the time of the longest stage (which will define the time of a pipe cycle) faster.


What is software pipelining?

Software pipelining is the reorganization of a loop so that each iteration of the software pipelined code is made from instructions chosen from different iterations of the original loop. Instructions from sequential loop iterations are interleaved to spread out the dependencies between instructions in a single loop in a manner similar to what Tomasulo's algorithm does in hardware.


What is trace scheduling?

Trace scheduling is a method used with VLIW architectures to increase the available parallelism. By considering the likelihood of branches, trace scheduling allows more than one basic block to be considered when looking for potential parallel instructions. In the first step, a sequence of code is selected by the compiler (only one side of a branch is used) based on the maximum likelihood of occurrence. Then the trace is compacted into a small number of VLIW instructions while respecting data and structural hazards. Bookkeeping code is inserted to maintain the correct data dependencies.


What is delayed branching?

Because a CPU can not know which instruction will be executed after a branch, pipelined processors will often execute an instruction placed immediately after a branch. Compilers can then schedule some useful instruction that needs to be executed regardless of the branch outcome.


What are conditional instructions?

Conditional instructions are a type of speculation in which an associated condition is evaluated during the execution of an instruction. If the evaluation fails, the instruction turns into a NOOP, otherwise it completes normally. An example is a conditional move.


What's a TLB?

See the answer in the operating systems section.

How are virtual addresses translated into physical addresses using a TLB?

Refer to the figure below for a fully associative TLB. First the high order bits of the virtual address are sent to all the tags in the TLB and they are compared in parallel. The memory access type is also checked for a violation. If the tag matches a line in the cache, then the physical frame is sent out of the multiplexer, after which the offset is added to form the physical address.

TLB Figure

Why must the TLB be flushed on a context switch?

Since the TLB translates virtual addresses to physical addresses and processes don't share physical memory in general, the virtual address translations in the TLB will not be valid for the new process.


(Winter 2000) Sketch the first steps of a memory access in a system that has a L1 cache and a TLB.

Two approaches are possible. If we have a virtual memory L1 cache, then the cache is first accessed to see if the requested word is in it, then the TLB is accessed - in case of a miss - in order to fetch the corresponding physical frame or disk block. These two operations can also be performed in parallel. If we have a physical memory cache, we need to perform the translation virtual->physical first, by using the TLB and then accessing the cache.


(Spring 1995) What's a cache?

A cache is a small amount of high-speed memory close the the CPU that temporarily stores recently accessed data. When the CPU needs a certain piece of data, the cache is checked first, and if it is there that copy is used, otherwise main memory is accessed. The idea that makes caches successful is locality of reference, that recently used data or data that is physically close to previously used data are more likely to be used in the near future.

(Winter 1997) What is a cache line?

It is a group of several words that have been brought from a contiguous chunk of main memory.

(Spring 1996, Winter 1997) Explain how a cache hit or miss is determined, given a memory address. Feel free to use diagrams.

A memory address is broken up into three parts: the tag, the index and the offset. The size of the index is the log of the number of lines per set in the cache, the size of the offset is the log of the number addressable data pieces per line. The rest of the memory address is the tag. If we have a (say) 2-way set associative cache, we look for the cache line in each set corresponding to the index of our memory address. There is one such line for each set. We then compare the tags of each line to the tag of our memory address. If they match, we have a cache hit. On a hit, we get the data within the line that we need by indexing into the line by the offset.

The following diagram shows the decomposition of memory and caches:

Cache Figure


What are the types of cache misses?

Describe some methods to decrease the number of cache misses.

What are some methods to improve the miss penalty for a cache?


(Winter 2000) What is a virtual cache? Are there any problems with using one?

A virtual cache uses virtual addresses instead of physical addresses to speed up the hit time. The problem is that the cache must either be flushed on a context switch, or the PID must be stored with the address. Storing the PID will not solve another problem -- the use of address aliasing, where two virtual addresses refer to the same physical location. Also, I/O usually needs physical addresses, so some translation is needed.

(Winter 2000) Is there a way of working around this problem?

One way is to use the page offset instead of the tag to index the cache, since it is the same in both virtual and physical addresses. While the indexing is happening the virtual address translation can happen in parallel, and then the tag can be matched.


Consider two equally-sized caches, one of which is direct mapped and the other is two way set-associative. There are a total of 256 lines, with 8 words per line, and 4 bytes per word. The machine has 32-bit addresses and is byte addressable, with a word of 4 bytes. How many bits are used for the tag, index, and offset?

In both caches we need 5 bits for the offset since there are 8 words/line × 4 bytes/word = 32 bytes/line. In the two way set-associative cache, we have 128 available lines that we wish to index into, which will require 7 bits. This leaves 32 - 7 - 5 = 20 bits for the tag. The direct-mapped cache will have 256 lines, which results in 8 index bits and 19 tag bits.

Construct some example code in which the set-associative cache will work better.

for (i = 0;i <= 256; i++){
  A[i]++;
  B[i]++;
}

If we align the arrays so that their memory address indexes and offsets match, then the set-associative cache will have fewer cache misses, as opposed to a cache miss on every increment instruction for the direct-mapped case.

Assuming sequential access of one array, what will be the hit rate for either cache?

Since there are eight words in a line, there will be a cache miss every 1/8 memory accesses. So the hit rate would be 7/8.


What are the four major policy decisions that need to be made at each level of the memory hierarchy?

Compare and contrast write back and write through.

With write back, writes occur at the speed of local memory, and only one write to the next lowest level is needed. Since few writes to the next level are necessary, there is less use of memory bandwidth. With write through, a read miss and subsequent replacement of a segment of memory can not cause a write to occur, and is easier to implement. Write through also keeps the local copy the same as the one lower in memory (coherent), which is useful for I/O and multiprocessor system with shared memory.


What is SRAM and DRAM? What is the difference between the two?

SRAM is faster and does not require a refresh after a read, but consumes a lot of power. DRAM, on the other hand has to be periodically refreshed, and every read must be followed by a write. (Fortunately most chips do refreshes internally as long as they are set to the proper mode and clocked periodically.) DRAM only consumes power when it is being read, written, or refreshed, which makes it run cooler. The reason we use DRAM is because it is cheaper and about four times as dense as SRAM.

How is memory physically laid out?

Memory is organized as a rectangular matrix that is addressed by row and column. As DRAMs grew in size, the number of pins became an issue, so that nowadays the pins are multiplexed so that the row addresses are sent and then by the column addresses.

Briefly summarize the memory hierarchy, with rough figures for sizes and speeds.


How long does a memory access take?

Average access time = Hit time + Miss rate × Miss Penalty

What are some ways to speed up access to main memory?

(Winter 1999) Why aren't memory speeds increasing as quickly as processor speeds?


What is the cache coherency problem?

When a cache value and the next level in the hierarchy do not have the same value for a given address, they are said to be incoherent. For CPU caches, incoherency can also occur from the lower side if an I/O device writes to main memory but not the cache. A good example is the use of a disk cache that uses write back. If the power goes out, the disk will not reflect the true state of the system.

What's a snoopy cache?

Snoopy caches are used in multiprocessor, local cache architectures to maintain cache coherency in situations where the local caches may contain copies of data. Snoopy caches "snoop" the bus, and if data is modified that happens to have a copy in the local cache, the local copy is updated. The alternative to such a method is to use a directory based coherency in which the sharing information for data is kept in a central location.


(Winter 1997) What is memory-mapped I/O? What are the benefits and downsides?

The benefit given did not seem to satify the committee. Maybe there is a speed gain associated with mapping directly to memory?

With memory-mapped I/O, portions of the address space are mapped to hardware device registers, and reads and writes to those addresses cause data to be transferred. Each I/O device has associated with it a distinct set of addresses. The benefit of this method is that the semantics for transferring data to an I/O device are much simpler that doing a full read and write command. For the process, the I/O is transparent. The major problem with the method is that the computer can not use that memory location (can't be paged, e.g.) The PC COM port is a good example.

(Winter 1997) How does DMA compare to memory-mapped I/O?

DMA is a hardware method of relieving the CPU of the task of doing I/O. The DMA controller is set up to transfer a block of data into a portion of memory, and when it completes it sends an interrupt to the CPU. This is much better than if the CPU has to do it, resulting in an interrupt every time a word is transferred.


What is the problem with DMAs and virtual memory?

DMA typically uses physical addressing, so transferring a buffer larger than a page can cause problems. Likewise the operating system could remap pages while the DMA controller is writing. One way of dealing with this is to have the OS lock the pages in memory and perform the translation for the DMA controller, then unlock the pages. Another solution is to use virtual DMA, where the addresses are virtual instead of physical.


What is error correcting memory?

Error correcting memory is memory that can detect errors and correct them. Most error correcting memories can detect and correct up to a few bits in a word.

What can cause transient errors in memory?

A gamma ray could hit the casing of our computer and cause the emission of an alpha particle, which could then collide with our memory and cause a bit to flip.


What are the types of buses, and when are they better?


What is Ripple Carry Addition?

With ripple carry addition, the low order bit pair are added, and the carry is sent to the next pair, which is then added, etc. The problem with this method is that it is very slow.

What is Carry Lookahead Addition?

The idea behind carry lookahead addition is that in some situations the carry bit is known in advance (namely, 1+1+C, whatever C is). Some common terms may then be factored out, thereby simplifying the resulting implementation of multiple-bit addition.

How can multiplication be done on two integers?

The result is repeatedly shifted and the multiplicand is added if the corresponding digit in the multiplier is a 1. Some clever implementations same some space by recirculating the bits of the multiplicand and multiplier back in to the vacated space as the computation progresses.

Multiply
Figure

What is the basic algorithm for division?

Division is almost the opposite of multiplication. The dividend is shifted to the left and the divisor is subtracted. If the result is positive, then the quotient has the corresponding bit set as a 1 and the remainder replaces the dividend. If the result is negative, the quotient gets a 0, and the dividend is restored to the value before the subtract. As in multiplication, some clever implementations save space by recirculating bits.

Divide Figure


What is a Harvard architecture, and what are the benefits of such an architecture?

A Harvard architecture originally meant a computer that had separate memories for data and instructions. Today the term is typically used to refer to architectures with split instruction and data caches. Such a design would reduce the structural hazards that might arise from contention for access to the cache.


Explain the Flynn Taxonomy of parallel computer architectures.


What is NUMA?

NUMA stands for non-uniform memory access, and is a synonym for distributed shared-memory. With NUMA, the address space is shared between processors, and a memory access may take a nonuniform time depending on what the address is.


Explain the difference between circuit switching and packet switching.

Circuit switching is like the original telephone system. A connection is made that lasts the duration of the communication. It is easy to implement, but wastes a lot of bandwidth. Packet switching divides the information into packets containing destination information, which are then sent out into the network and forwarded to their destinations. This method requires extra routing effort, congestion control, and reordering of packets, but uses more bandwidth.

What is the difference between shared and switched media?

Shared media such as ethernet must have some sort of coordinating mechanism that only allows two computers to communicate at a time. Switched media such as ATM allow communication directly from multiple disjoint sources to destinations.


What is a vector computer?

Vector processors operate upon large vectors of values at a time. They are best suited for number-crunching tasks in which there is little interaction between the data. With vector processors, the computation is independent of the previous results, which allows very deep pipelines. Because each instruction operates on a set of data, loops and their associated control hazards are nonexistent, and the instruction bandwidth is reduced. Memory access have a pattern, so that the high memory latency is amortized.

What is the vector stride?

The vector stride is the distance separating elements that are to be merged into a single vector. (e.g. Data accessed in a column-major fashion in a row-major memory layout would have a vector stride equal to the size of the row.) In a vector machine load/store instructions are often augmented with a stride parameter.


Back Back to the Qualifying Exam Study Guide.

Last changed January 30, 1997.

David Coppit, david@coppit.org