Pentium 4 (Partially) Previewed

Intel Lifts Veil on Hyperpipelined CPU—But Not All the Way

By Peter N. Glaskowsky [8/28/00-01]

Intel has released a few more details of its next IA-32 processor, formerly code-named Willamette and now slated to be sold as the Pentium 4. At the Intel Developer Forum last week, Intel’s CEO Craig Barrett and vice president Albert Yu provided interesting insights into the design goals and microarchitecture of the Pentium 4’s new core. The announcement answered some questions but raised others. For example, the P4 will be announced at “at least” 1.4GHz, according to Intel, but the company has said nothing about the P4’s performance relative to the Pentium 3, currently shipping at 1.13GHz.

The higher operating frequency of the new part is made possible by a hyperpipelined core with 20 stages. Intel calls this new architecture NetBurst rather than P7, or some other sequential code-name of the type used in the past. As Figure 1 shows, the NetBurst pipeline is twice as deep as that of the P6, which in turn had twice the depth of the P5’s. Increasing pipeline depth increases logic complexity and branch penalties, but it also allows clock speeds to increase. We expect the new core to reach 2GHz—a speed demonstrated at IDF—before it moves to a 0.13-micron process in 2001.

**Figure 1.** The new hyperpipelined NetBurst microarchitecture of the Pentium 4 allows its clock rate to be increased significantly over that of the Pentium 3. This figure, based on information provided by Intel, shows the portion of each pipeline involved in ALU operations under branch mispredictions.

The two Drive stages shown in Figure 1 represent time required to move signals across the chip. No other work is done during these stages. As far as we know, NetBurst is the first pipeline with dedicated stages for wire delays. Although the new pipeline has one execution stage, the P4’s two ALUs execute many operations in one-half of a clock period. Shifts and some other operations still spend one full clock period in the ALU, and these operations must start at the beginning of the period. Since the rest of the pipeline can process only two ALU operations per clock, the faster ALUs don’t increase peak throughput, but they do boost sustained throughput. When two ALU operations are ready for execution, one of which depends on the results of the other, the Pentium 4 can complete the first operation in the first half of a clock period, then complete the dependent operation in the second half. In the same situation, the P3 core would take two clock periods to execute the two operations. Intel says that under some conditions the P4 ALUs can execute a total of four operations per clock cycle, but the company has not explained the circumstances in which this can be accomplished.
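The dependent-operation case can be sketched with a toy latency model. The sketch below assumes simple ALU operations only (no shifts), ignores everything outside the ALU, and uses hypothetical names; it illustrates the half-clock timing described in the text and is not a model of Intel's scheduler.

```python
from fractions import Fraction

def chain_latency_clocks(num_dependent_ops, alu_latency_clocks):
    """Clocks needed to execute a chain of simple ALU operations in which
    each operation consumes the result of the previous one."""
    return num_dependent_ops * alu_latency_clocks

# Latencies as described in the text (simple ALU operations only).
P3_ALU_LATENCY = Fraction(1)      # one full clock per operation
P4_ALU_LATENCY = Fraction(1, 2)   # one-half clock per operation

p3_clocks = chain_latency_clocks(2, P3_ALU_LATENCY)
p4_clocks = chain_latency_clocks(2, P4_ALU_LATENCY)
print(f"Two dependent operations: P3 = {p3_clocks} clocks, P4 = {p4_clocks} clock")
```

In this model the benefit shows up only on dependency chains, which is why the faster ALUs help sustained rather than peak throughput.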

We will have to wait to see how the deeper pipeline affects overall architectural efficiency, however, as Intel has not released any P4 performance data. Deep pipelines typically complete fewer instructions per clock period (IPC), especially on highly conditional code, such as that often found in PC productivity software.

We expect NetBurst to achieve a lower average IPC, but the actual effect will vary, depending on the software being executed. Multimedia algorithms, for example, are less sensitive to pipeline depth, since they have a higher ratio of calculations to branches, and branches are more predictable in such programs. The new Intel architecture should shine on multimedia tasks, especially with the other enhancements Intel has made to the Pentium 4.

**SSE2 Doubles Multimedia Throughput**

The most significant improvement in the P4 for some media-processing applications is not its higher clock speed but the improved Streaming SIMD Extensions engine, now called SSE2. The adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of the SSE engine in the Pentium 3. The greater width allows the P4 to process twice as much data per instruction and enables SIMD processing of 64-bit values, including double-precision floating-point numbers. The new engine still can’t perform single-cycle multiply-add operations, however; these take two cycles.

Another improvement over the P3’s SSE unit lies in SSE2’s higher load/store bandwidth. In each clock period, the P4 can complete one 128-bit load plus one 128-bit store between the L1 data cache and the XMM registers. The new chip thus has twice the peak load/store bandwidth of the Pentium 3. The increased throughput reduces the likelihood of data starvation and doubles the effective performance of the SSE2 engine on many common multimedia algorithms.

Intel added 144 new SSE2 instructions to the P4, although almost all of these are just 128-bit versions of existing MMX and SSE instructions that operate on 64 bits of data. There are some new cache and memory-management instructions, but cryptographers will be especially pleased with new packed 32×32 unsigned multiply and packed 64-bit add and subtract operations. These new instructions will boost throughput on cryptographic algorithms that process very large integer values, sometimes as long as 2,048 bits. Intel says the new instructions will also accelerate video, speech, and image processing.
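To see why 32×32 unsigned multiplies and 64-bit adds matter for large-integer arithmetic, consider schoolbook multiplication over 32-bit limbs. The following is a plain-Python sketch with hypothetical helper names; it does not use SSE2 and only shows the limb-level pattern (a 32×32 product accumulated into a 64-bit sum) that such instructions would accelerate.

```python
MASK32 = 0xFFFFFFFF

def to_limbs(value, count):
    """Split a non-negative integer into `count` 32-bit limbs, least significant first."""
    return [(value >> (32 * i)) & MASK32 for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from 32-bit limbs, least significant first."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))

def multiply_limbs(a, b):
    """Schoolbook multiplication of two limb arrays.

    Each inner step forms one 32x32 -> 64-bit unsigned product and adds it
    to a running sum that always fits in 64 bits.
    """
    result = [0] * (len(a) + len(b))
    for i, a_limb in enumerate(a):
        carry = 0
        for j, b_limb in enumerate(b):
            total = result[i + j] + a_limb * b_limb + carry
            result[i + j] = total & MASK32   # low 32 bits stay in place
            carry = total >> 32              # high 32 bits move up one limb
        result[i + len(b)] = carry
    return result

# Two 2,048-bit operands, i.e. 64 limbs each.
x = (1 << 2047) + 12345
y = (1 << 2048) - 1
product = from_limbs(multiply_limbs(to_limbs(x, 64), to_limbs(y, 64)))
print(product == x * y)
```

A 2,048-bit by 2,048-bit multiply done this way needs 64 × 64 = 4,096 limb products, so executing several of them per instruction has an obvious payoff.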

All the new instructions operate on the existing set of XMM registers, eliminating any concerns about software support. Intel says all SSE-compatible operating systems will support the P4.

The improved SSE2 architecture, combined with the P4’s faster clock speed, pushes the new CPU’s multimedia performance ahead of that of the IBM/Motorola PowerPC G4. The G4 can execute eight floating-point operations per clock period (four single-precision multiply/adds), or 4 GFLOPS (peak) at 500MHz. Although the P4 does just half as much work per clock period, a 1.5GHz clock speed yields 6 GFLOPS of peak floating-point throughput. The difference is attributable to the P4’s deeper pipeline. Today’s G4 has only five pipeline stages and achieves only one-third the clock speed of the Pentium 4.
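The peak figures follow from multiplying operations per clock by clock rate. A minimal sketch using only the numbers quoted above (the function name is ours):

```python
def peak_gflops(flops_per_clock, clock_ghz):
    """Peak floating-point throughput in GFLOPS."""
    return flops_per_clock * clock_ghz

g4_peak = peak_gflops(8, 0.5)   # eight operations per clock at 500MHz
p4_peak = peak_gflops(4, 1.5)   # half as much work per clock, at 1.5GHz

print(f"G4: {g4_peak:.1f} GFLOPS, P4: {p4_peak:.1f} GFLOPS")
```

These are peak rates; sustained throughput on real code depends on memory bandwidth and instruction mix.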

**Trace Cache Stores Micro-Ops**

Significant changes were also made to the P4’s caches, as shown in Figure 2. The L2 cache—though still 256K in size—is now twice as fast, delivering 256 bits of data in each clock period. The L2 cache now has 128-byte lines—four times as wide as those in the P3’s L2 cache—but each line has two valid bits, one for each half line. These bits allow accesses to cachable memory to be a more manageable 64 bytes. Like the deeper pipeline, this change favors multimedia processing over typical productivity software, which makes heavier use of short (often byte-wide) loads and stores. As with the P3’s L2 cache, the P4’s L2 cache holds both instructions and data.

![Figure 2. The NetBurst core features an execution trace cache and multiple execution units. Almost all of the elements of the new core design are substantially changed over the corresponding elements in the previous P6 core.](image-url)

As noted above, the L1 data cache, which is 8K in size, was improved to provide twice the throughput of the P3’s L1 D-cache for SSE2 operations. Intel says the improvement in bandwidth more than offsets the reduced capacity of this cache when compared with that of the Pentium 3, which has a 16K L1 D-cache. Latency was also improved; according to Intel, average cache latency in a 1.4GHz P4 is just 55% of that in a 1GHz P3, or 77% as measured in clock cycles (0.55 × 1.4).

The Pentium 4 also uses a technique called data speculation to boost the effective speed of the L1 data cache. ALU micro-operations that depend on a preceding load instruction enter the execution pipeline before the CPU has determined if the load hit in the L1 D-cache. If the load hits in the L1, the ALU operation proceeds normally. If the load misses in the L1, the ALU operation is aborted. A "replay" feature allows the aborted micro-operation to be repeated when the data is available.
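The trade-off behind data speculation can be illustrated with a toy timing model. All cycle counts below are illustrative assumptions rather than Intel figures, and the function name is hypothetical; the model also ignores the execution slot wasted by an aborted micro-op.

```python
def dependent_op_finish_cycle(load_hits_l1, speculate,
                              l1_latency=2, hit_check=1, miss_latency=10):
    """Cycle at which an ALU micro-op that consumes a load's result completes."""
    if speculate:
        # Issue the ALU op as soon as L1 data could arrive, before the
        # hit/miss outcome is known.
        if load_hits_l1:
            return l1_latency + 1
        # Miss: the speculative result is discarded and the micro-op is
        # replayed once the data actually arrives.
        return miss_latency + 1
    # Without speculation, wait for the hit/miss check before issuing.
    if load_hits_l1:
        return l1_latency + hit_check + 1
    return miss_latency + 1

for hit in (True, False):
    spec = dependent_op_finish_cycle(hit, speculate=True)
    safe = dependent_op_finish_cycle(hit, speculate=False)
    print(f"L1 hit={hit}: speculative={spec}, non-speculative={safe}")
```

Under these assumptions speculation saves the hit-check delay on every L1 hit and costs nothing extra in latency on a miss, which is why it pays off when most loads hit.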

The L1 instruction cache is even more dramatically changed. Where the P3's L1 I-cache holds x86 instructions mapped to the system physical-address space, the P4 takes a completely different approach. The new chip has an execution trace cache that holds approximately 12,000 micro-ops, decoded subinstructions derived from the x86 instructions in the original program stream. Instruction-address references within this cache are adjusted to point to the address of the target micro-op in the trace cache, creating a new address space that is local to the CPU core and not externally visible.
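The basic idea of caching decoded micro-ops can be sketched as follows. This is a deliberately simplified toy: the class, the decoder, and the two-micro-ops-per-instruction rule are all hypothetical, and the sketch omits capacity limits, eviction, and the remapping of branch targets into trace-cache addresses.

```python
class ToyTraceCache:
    """Caches decoded micro-ops so repeated code skips the decoder."""

    def __init__(self, decoder):
        self._decoder = decoder      # slow path: x86 instruction -> micro-ops
        self._micro_ops = {}         # instruction address -> decoded micro-ops
        self.decode_count = 0

    def fetch(self, address, instruction):
        if address not in self._micro_ops:
            self.decode_count += 1
            self._micro_ops[address] = self._decoder(instruction)
        return self._micro_ops[address]


def toy_decoder(instruction):
    # Hypothetical decoding: pretend every instruction becomes two micro-ops.
    return [f"{instruction} [uop {i}]" for i in range(2)]


loop_body = {0x100: "add eax, ebx", 0x102: "dec ecx", 0x103: "jnz 0x100"}
cache = ToyTraceCache(toy_decoder)
executed = 0
for _ in range(100):                 # run the three-instruction loop 100 times
    for address, instruction in loop_body.items():
        executed += len(cache.fetch(address, instruction))

print(f"{executed} micro-ops delivered, {cache.decode_count} decodes")
```

The loop delivers hundreds of micro-ops but invokes the decoder only once per distinct instruction, which is the effect the trace cache is meant to achieve for hot code.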

The goal of the new L1 architecture was to remove IA-32 instruction decoding from the main execution loop. Where decoding took 3 of the 10 stages in the P6 pipeline, it is not required at all in the NetBurst pipeline. Locating and fetching the next micro-op, however, still accounts for 4 of the 20 NetBurst stages. This leaves 16 pipeline stages to do the work of just 5 stages in the P6 pipeline, suggesting the actual ratio of pipeline expansion is closer to 3:1 than 2:1.
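The 3:1 estimate is simple arithmetic on the stage counts quoted above; the variable names here are ours.

```python
NETBURST_STAGES = 20
NETBURST_FETCH_STAGES = 4    # locating and fetching the next micro-op
P6_COMPARABLE_STAGES = 5     # P6 stages doing the equivalent work, per the text

netburst_remaining = NETBURST_STAGES - NETBURST_FETCH_STAGES
expansion = netburst_remaining / P6_COMPARABLE_STAGES
print(f"{netburst_remaining} stages vs. {P6_COMPARABLE_STAGES}: about {expansion:.1f}:1")
```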

Intel says that up to 126 different micro-ops can be in flight in the new core at one time; of these, up to 48 may be loads, and up to 24 may be stores. This is three times as many loads and twice as many stores as can be pending in the P6, and more than three times as many total pending micro-ops. Having more pending instructions is one way to recoup, in part, the IPC lost by the deeper pipeline.

Another way to improve average IPC is to improve branch prediction. Although the company has not described the branch-prediction algorithm in the P4, Intel says the target array is now eight times larger, with 4K entries. According to the company, the new branch-prediction scheme removes one-third of the mispredictions generated by the P3 branch predictor.
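A common first-order model shows how pipeline depth and prediction accuracy interact. In the sketch below, the base CPI, branch fraction, and baseline misprediction rate are hypothetical values chosen only for illustration, and the misprediction penalty is assumed to equal the pipeline depth; only the one-third reduction in mispredictions comes from Intel's claim.

```python
def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, penalty_clocks):
    """Average CPI when each mispredicted branch costs `penalty_clocks`."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_clocks

BASE_CPI = 1.0           # hypothetical
BRANCH_FRACTION = 0.2    # hypothetical: one instruction in five is a branch
MISPREDICT_RATE = 0.10   # hypothetical baseline predictor

shallow = cycles_per_instruction(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 10)
deep = cycles_per_instruction(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)
# Same deep pipeline, with one-third of the mispredictions removed.
deep_improved = cycles_per_instruction(
    BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE * 2 / 3, 20)

for label, cpi in [("10-stage", shallow),
                   ("20-stage", deep),
                   ("20-stage, better predictor", deep_improved)]:
    print(f"{label}: IPC = {1 / cpi:.2f}")
```

With these made-up inputs, the better predictor recovers part, but not all, of the IPC lost to the longer pipeline. The real outcome depends heavily on the code being run.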

**Processor Bus Boosted to 400MHz**

Not all improvements to the Pentium 4 are internal. For some applications, the 400MHz bus of the new chip (four data phases per clock period at 100MHz) is more important than any microarchitectural innovation. The Pentium 4 will have three times the peak bandwidth of the Pentium 3, and, with longer 64-byte transfers to match the half-line length in the L2 cache (plus other protocol enhancements), the new chip will see even more of an increase in sustained throughput. Intel says the data-buffering capacity of the new bus interface was “significantly increased” over that of the P3, from which we infer that most of the buffers are twice as wide—to accommodate the wider cache lines—but that some buffers may not be as deep.

The new bus uses Assisted Gunning Transceiver Logic (AGTL+) signaling levels, as does the bus interface used on the Pentium II and Pentium III Xeon (see MPR 7/13/98-01, “Xeon Replaces Pentium Pro”). Data signals are accompanied by two strobe signals that operate at twice the bus clock rate and 180° out of phase with each other. The falling edges of these strobe signals—four per clock period—indicate each data phase. (Addresses on the processor bus are transferred with two phases per clock period.) This strobe scheme reduces the adverse effects of skew, but P4 motherboards still require length-matched bus traces.
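The strobe arrangement can be checked with a little arithmetic. The sketch below assumes idealized 50%-duty strobes and a hypothetical function name; it computes where the falling edges land within one bus clock period.

```python
from fractions import Fraction

def falling_edges(strobe_periods_per_clock, phase_offset, duty=Fraction(1, 2)):
    """Falling-edge times within one bus clock, as fractions of the bus clock period.

    Assumes an idealized strobe that rises at `phase_offset` (a fraction of
    the strobe's own period) and falls `duty` of a strobe period later.
    """
    strobe_period = Fraction(1, strobe_periods_per_clock)
    return [((k + phase_offset + duty) * strobe_period) % 1
            for k in range(strobe_periods_per_clock)]

strobe_a = falling_edges(2, Fraction(0))      # twice the bus clock rate
strobe_b = falling_edges(2, Fraction(1, 2))   # 180 degrees out of phase
data_phases = sorted(strobe_a + strobe_b)

BUS_CLOCK_NS = 10                             # 100MHz bus clock
print([float(t * BUS_CLOCK_NS) for t in data_phases])
```

The two strobes together yield four evenly spaced falling edges per bus clock, one for each data phase.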

Intel also announced the 850 chip set, which matches the 3.2GB/s bandwidth of the processor bus to two channels of Rambus RDRAM. The 850 is essentially equivalent to Intel's 840 workstation chip set, except for two changes: support for the P4 bus and inclusion of the ICH2 I/O controller hub announced recently (see MPR 6/5/00-02, "Intel Expands 820 I/O Options"). The 850 MCH is packaged in a 615-contact BGA and dissipates 8W. A heat sink is required for all system configurations.
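The bandwidth match can be reproduced with a short calculation. The 100MHz clock, four data phases, and 3.2GB/s figure come from the article; the 64-bit (8-byte) data-bus width, the 133MHz Pentium 3 bus, and the 16-bit, 800M-transfers-per-second RDRAM channel are our assumptions, labeled as such in the code.

```python
def bandwidth_gb_per_s(clock_mhz, transfers_per_clock, width_bytes):
    """Peak bandwidth in GB/s (decimal gigabytes)."""
    return clock_mhz * 1e6 * transfers_per_clock * width_bytes / 1e9

# Assumption: both processor buses carry 64 bits (8 bytes) of data.
p4_bus = bandwidth_gb_per_s(100, 4, 8)
# Assumption: Pentium 3 bus at 133MHz with one transfer per clock.
p3_bus = bandwidth_gb_per_s(133, 1, 8)
# Assumption: each RDRAM channel is 16 bits wide at 400MHz, two transfers per clock.
rdram_total = 2 * bandwidth_gb_per_s(400, 2, 2)

print(f"P4 bus: {p4_bus:.1f} GB/s, two RDRAM channels: {rdram_total:.1f} GB/s")
print(f"P4 bus vs. P3 bus: {p4_bus / p3_bus:.1f}x")
```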

The 850 supports up to 32 RDRAMs on each channel, giving it a maximum memory capacity of 2G, using the newest 288Mb RDRAM chips. This memory capacity is sufficient for today's leading-edge desktop applications, but few users will be able to afford such an array, which we estimate will cost at least $8,000.
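The 2G figure can be checked as follows. The sketch assumes that one bit in nine of each 288Mb device is reserved for ECC, leaving 256Mb of data per chip; that split is our assumption, not something stated above.

```python
CHANNELS = 2
DEVICES_PER_CHANNEL = 32
DEVICE_MBITS = 288
DATA_MBITS = DEVICE_MBITS * 8 // 9   # assumption: 256Mb data + 32Mb ECC per device

total_mbytes = CHANNELS * DEVICES_PER_CHANNEL * DATA_MBITS // 8
print(f"{total_mbytes} MB = {total_mbytes // 1024} GB")
```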

Intel described a six-layer stackup for P4/850 motherboards. It is unclear whether four-layer construction will be possible on these systems. If not, P4 motherboards will also be significantly more expensive than current P3 motherboards, adding about $200 to the cost of the system.

Intel recently said it is developing a chip set for the P4 using SDRAM or possibly double-data-rate (DDR) SDRAM, but this chip set won’t arrive until next year. We question the commercial viability of an SDRAM-only chip set for Pentium 4, however. It would be difficult to justify combining Intel’s most expensive CPU with the slowest available memory, no matter how great the cost savings.

VIA, for its part, has said publicly that it plans to develop alternative DDR-based chip sets for P4, even in the absence of an Intel license to do so—a decision sure to stir up some controversy. We expect Intel to refrain from taking legal action against VIA until it can measure the effects of the VIA offering on P4 processor sales. If VIA’s core-logic support boosts processor sales beyond what Intel can achieve on its own, we believe Intel will allow VIA to continue selling those products.

**Large Die Leads to Power Concerns**

The combination of features found in the Pentium 4, however, has given the chip 42 million transistors and, according to some reports, a 217mm² die size. The Coppermine die, by comparison, is 106mm² and has just 28 million transistors. The complexity of the NetBurst core accounts for the majority of the new transistors.

Along with the bigger die comes higher power consumption. In IDF presentations, Intel said the 1.4GHz P4 consumes 39A at 1.7V (66.3W). The company recommended motherboards be designed to provide 52A of supply current at the same voltage. Handling these power levels will require highly effective heat sinks.
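The power figures are a direct product of current and voltage. The sketch below reproduces the quoted 66.3W and applies the same arithmetic to the recommended 52A supply; the second result is our calculation, not a number Intel provided.

```python
def power_watts(current_amps, voltage_volts):
    """Electrical power P = I * V."""
    return current_amps * voltage_volts

quoted = power_watts(39, 1.7)         # figure quoted by Intel
design_target = power_watts(52, 1.7)  # implied by the recommended supply current

print(f"Quoted: {quoted:.1f}W, supply design target: {design_target:.1f}W")
```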

At IDF, Intel demonstrated sophisticated new heat sinks designed to provide the required cooling. One type is created by attaching a thick copper baseplate to a thin aluminum sheet folded into many pleats. These folded fins achieve a much higher ratio of fin surface area to volume than can be achieved in an extruded-aluminum heat sink. Because of their high fin density, these units require a fan to push air through the fins. The heat sink shown at IDF weighs 450g (about one pound); most of this weight is in the copper baseplate.

Another type also uses a copper baseplate as a heat spreader. This plate is welded to a thicker aluminum plate. Fins are then skived (carved) from the aluminum by a sharpened wedge, creating thin, curved fins that remain attached along one edge to the aluminum/copper baseplate. This method of construction avoids the thermal inefficiency of bonding fins to a baseplate by welding or by using an adhesive.

Skived heat sinks are significantly more efficient than extruded or folded-fin heat sinks, allowing many systems to operate safely without an extra fan on the CPU heat sink. Dispensing with this fan increases system reliability. We expect many OEMs to continue using conventional heat sinks along with fans, however, because of the current high cost of skived heat sinks.

Intel also described how it plans to handle the weight of P4 heat sinks. Because the P4 is a socketed chip (using a 423-pin PGA with 100-mil interstitial pin spacing, essentially a 71-mil-square grid rotated 45 degrees) and not a module in a card slot, it cannot support a heat sink without help. Intel proposes to add four mounting holes to the ATX motherboard specification. These holes are located at the corners of a rectangle centered on the CPU socket. Through these holes, the heat-sink assembly can be attached directly to mounting studs built into the system chassis. This scheme is similar to that used with Xeon processor modules, which are even heavier.

**Pricing, Performance Remain to Be Seen**

The P4’s new architecture is likely to produce the highest clock speeds in the PC-processor industry, but if application-level performance doesn’t scale with the clock speed, some will say the P4 was designed to look better than it is. Many features of the new chip, such as its faster L2 cache, faster bus interface, and SSE2 engine, could have been added to the basic P3 design. These features would not have boosted the clock speed of the P3 core but would nonetheless have improved application performance, possibly with less impact on die size and cost.

The Pentium 4, the necessary Intel core logic, and systems using the new chip are due out later this year. We expect to see the P4 announcement before Comdex, along with a few system announcements from Intel’s OEM partners. The Pentium 4 is likely to take over the high end of Intel’s desktop product line, commanding a price that should be around $1,000 for the fastest versions. Systems should cost about as much as today’s PC workstations using the Pentium 3 and the 840 chip set, which also uses two channels of RDRAM: typically $3,000 or more, depending on configuration.

At these prices, Intel shouldn’t expect to sell many Pentium 4s, but users who get one of these rare and expensive machines will be fortunate indeed.

**Price & Availability**

Intel has not released pricing for the Pentium 4, which will be announced later this year. For more information, visit the Intel Web site at www.intel.com.

To subscribe to Microprocessor Report, phone 408.328.3900 or visit www.MDRonline.com